Alexey,

I see your point, and it really looks like your use case should be an option
of AlwaysFailoverSpi (which is the default one). But currently it does not
fail over a job once that job has already been tried on all nodes. So you
will have to implement your own failover SPI (it should be pretty simple:
just pick a random node from the topology each time a job is being failed
over).
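
In case it helps, here is a rough, untested sketch of such an SPI (the class
name RandomNodeFailoverSpi and the bare-bones spiStart()/spiStop() are just
placeholders; a real implementation would likely mirror AlwaysFailoverSpi,
e.g. add a failover-attempts limit):

    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;

    import org.apache.ignite.cluster.ClusterNode;
    import org.apache.ignite.spi.IgniteSpiAdapter;
    import org.apache.ignite.spi.IgniteSpiException;
    import org.apache.ignite.spi.failover.FailoverContext;
    import org.apache.ignite.spi.failover.FailoverSpi;

    public class RandomNodeFailoverSpi extends IgniteSpiAdapter
            implements FailoverSpi {
        /** Re-routes a failed job to a random node from the topology,
         *  possibly the one it just failed on. */
        @Override
        public ClusterNode failover(FailoverContext ctx, List<ClusterNode> top) {
            if (top.isEmpty())
                return null; // No nodes to fail over to -> the task fails.

            return top.get(ThreadLocalRandom.current().nextInt(top.size()));
        }

        @Override
        public void spiStart(String gridName) throws IgniteSpiException {
            // No-op: nothing to initialize in this sketch.
        }

        @Override
        public void spiStop() throws IgniteSpiException {
            // No-op.
        }
    }

You would then register it with IgniteConfiguration.setFailoverSpi(new
RandomNodeFailoverSpi()) before calling Ignition.start(cfg). Note that this
sketch has no attempt limit, so a permanently failing job would be retried
forever (AlwaysFailoverSpi caps this with its maximumFailoverAttempts
property), and you may want a similar counter.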

As for the global nature of the SPI, you're right, but its failover() method
takes a FailoverContext, which has information about the failed job (task
name, attributes, exception, etc.), so you can make the retry decision per
job based on this information.
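
For example, inside failover() you could do something like this (again just a
sketch; RetryableToolException is a hypothetical exception your job would
throw when the external tool's exit code indicates a transient failure):

    @Override
    public ClusterNode failover(FailoverContext ctx, List<ClusterNode> top) {
        // Per-job decision: only retry jobs that failed in a retryable way.
        Throwable err = ctx.getJobResult().getException();

        if (!(err instanceof RetryableToolException))
            return null; // Returning null fails the task instead of retrying.

        return top.get(ThreadLocalRandom.current().nextInt(top.size()));
    }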

Hope this helps.

Thanks!

On Mon, Jun 29, 2015 at 1:08 PM, Aleksei Valikov <[email protected]>
wrote:

> Hi,
>
> this is basically a copy of
>
>
> http://stackoverflow.com/questions/31124341/how-to-retry-failed-job-on-any-node-with-apache-ignite-gridgain
>
> I'm experimenting with fault tolerance
> <https://apacheignite.readme.io/v1.1/docs/fault-tolerance> in Apache
> Ignite.
>
> What I can't figure out is how to retry a failed job on any node. I have a
> use case where my jobs will be calling a third-party tool as a system
> process via ProcessBuilder to do some calculations. In some cases the tool
> may fail, but in most cases it's OK to retry the job on any node -
> including the one where it previously failed.
>
> At the moment Ignite seems to reroute the job to another node which did
> not have this job before. So after a while there are no new nodes left to
> try and the task fails.
>
> What I'm looking for is how to retry a job on any node.
>
> Here's a test to demonstrate my problem.
>
> Here's my randomly failing job:
>
> public static class RandomlyFailingComputeJob implements ComputeJob {
>     private static final long serialVersionUID = -8351095134107406874L;
>     private final String data;
>
>     public RandomlyFailingComputeJob(String data) {
>         Validate.notNull(data);
>         this.data = data;
>     }
>
>     public void cancel() {
>     }
>
>     public Object execute() throws IgniteException {
>         final double random = Math.random();
>         if (random > 0.5) {
>             throw new IgniteException();
>         } else {
>             return StringUtils.reverse(data);
>         }
>     }}
>
> And below is the task:
>
> public static class RandomlyFailingComputeTask extends
>         ComputeTaskSplitAdapter<String, String> {
>     private static final long serialVersionUID = 6756691331287458885L;
>
>     @Override
>     public ComputeJobResultPolicy result(ComputeJobResult res,
>             List<ComputeJobResult> rcvd) throws IgniteException {
>         if (res.getException() != null) {
>             return ComputeJobResultPolicy.FAILOVER;
>         }
>         return ComputeJobResultPolicy.WAIT;
>     }
>
>     public String reduce(List<ComputeJobResult> results)
>             throws IgniteException {
>         final Collection<String> reducedResults = new ArrayList<String>(
>                 results.size());
>         for (ComputeJobResult result : results) {
>             reducedResults.add(result.<String> getData());
>         }
>         return StringUtils.join(reducedResults, ' ');
>     }
>
>     @Override
>     protected Collection<? extends ComputeJob> split(int gridSize,
>             String arg) throws IgniteException {
>         final String[] args = StringUtils.split(arg, ' ');
>         final Collection<ComputeJob> computeJobs = new ArrayList<ComputeJob>(
>                 args.length);
>         for (String data : args) {
>             computeJobs.add(new RandomlyFailingComputeJob(data));
>         }
>         return computeJobs;
>     }
> }
>
> Test code:
>
>     final Ignite ignite = Ignition.start();
>     final String original = "The quick brown fox jumps over the lazy dog";
>
>     final String reversed = StringUtils.join(
>             ignite.compute().execute(new RandomlyFailingComputeTask(),
>                     original), ' ');
>
> As you can see, a failed job should always be failed over. Since the
> probability of failure != 1, I expect the task to successfully terminate at
> some point.
>
> With the probability threshold of 0.5 and a total of 3 nodes this hardly
> ever happens. I'm getting an exception like class
> org.apache.ignite.cluster.ClusterTopologyException: Failed to failover a
> job to another node (failover SPI returned null). After some debugging
> I've found out that this is because I eventually run out of nodes: all of
> them have already been tried.
>
> I understand that I can write my own FailoverSpi to handle this.
>
> But this just doesn't feel right.
>
> First, it seems to be overkill to do this.
> And then, the SPI is a kind of global thing: I'd like to decide per job
> whether it should be retried or failed over. This may, for instance, depend
> on the exit code of the third-party tool I'm invoking. So configuring
> failover via the global SPI doesn't seem right.
>
> I'd appreciate any pointers.
>
> Many thanks and best wishes,
>
> Alexey
>
