[ https://issues.apache.org/jira/browse/IGNITE-9447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602430#comment-16602430 ]

Pavel Kuznetsov edited comment on IGNITE-9447 at 9/3/18 8:10 PM:
-----------------------------------------------------------------

We decided not to use a distributed mutex, just a lookup through the topology 
history. So the method nodesStarted() should return true if a topology that 
meets the requirements existed at any moment between the node joining and the 
method being executed.
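
A minimal sketch of what that topology-history lookup could look like. The 
field joinedTopVer (the topology version at which the local node joined) is 
assumed for illustration; args.nodes() is the expected cluster size already 
used by the benchmark:

{noformat}
/**
 * Returns true if some topology version between the local node's join and the
 * current version contained the expected number of nodes.
 */
private boolean nodesStarted() {
    IgniteCluster cluster = ignite().cluster();

    long curTopVer = cluster.topologyVersion();

    // Scan the discovery topology history. cluster.topology(topVer) returns the
    // node snapshot for that version, or null if the version fell out of history.
    for (long topVer = joinedTopVer; topVer <= curTopVer; topVer++) {
        Collection<ClusterNode> top = cluster.topology(topVer);

        if (top != null && top.size() >= args.nodes())
            return true;
    }

    return false;
}
{noformat}

Note that the discovery history depth is bounded, so this check should run 
reasonably soon after the node joins.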


was (Author: pkouznet):
We decided not to use a distributed mutex, just a lookup through the topology 
history

> Benchmarks hang intermittently due to a distributed race condition.
> -----------------------------------------------------------------
>
>                 Key: IGNITE-9447
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9447
>             Project: Ignite
>          Issue Type: Bug
>          Components: sql
>            Reporter: Pavel Kuznetsov
>            Assignee: Pavel Kuznetsov
>            Priority: Minor
>
> If we run more than one yardstick driver, the benchmark hangs intermittently.
> Yardstick's base driver class 
> org.apache.ignite.yardstick.IgniteAbstractBenchmark has logic that waits for 
> all the nodes in the cluster to start.
> {noformat}
> final CountDownLatch nodesStartedLatch = new CountDownLatch(1);
>
> // Count the local latch down as soon as the expected number of nodes has joined.
> ignite().events().localListen(new IgnitePredicate<Event>() {
>     @Override public boolean apply(Event gridEvt) {
>         if (nodesStarted())
>             nodesStartedLatch.countDown();
>
>         return true;
>     }
> }, EVT_NODE_JOINED);
>
> if (!nodesStarted()) {
>     println(cfg, "Waiting for " + (args.nodes() - 1) + " nodes to start...");
>
>     nodesStartedLatch.await();
> }
> {noformat}
> This code is executed on every driver node.
> If we want to close the local Ignite instance as soon as the cluster is ready 
> (i.e. contains the expected number of nodes), we sometimes get a deadlock:
> 1) The cluster contains N-1 nodes; they are all waiting for the Nth node.
> 2) The Nth node connects and the cluster receives the join event, but the 
> waitForNodes code on the Nth node has not executed yet.
> 3) The N-1 nodes receive this event and stop waiting.
> 4) The N-1 nodes think the cluster is ready and call ignite.close() on their 
> local instances.
> 5) The Nth node starts waiting for the cluster to contain the expected number 
> of nodes, but N-1 of them have already closed their instances.
> 6) The Nth node waits forever.
> We can avoid this problem by using a distributed CountDownLatch, as sketched 
> below.
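> For illustration, a sketch of that distributed-latch approach using Ignite's 
> built-in IgniteCountDownLatch (the latch name "driversReady" is illustrative, 
> and this assumes every expected node runs the same code on startup):
> {noformat}
> // The latch state lives in the cluster, so a node that arrives late still sees
> // the count reach zero instead of waiting for join events it can never observe.
> IgniteCountDownLatch latch = ignite().countDownLatch(
>     "driversReady", // cluster-wide latch name (illustrative)
>     args.nodes(),   // expected number of participating nodes
>     false,          // do not auto-delete the latch when it reaches zero
>     true);          // create the latch if it does not exist yet
>
> latch.countDown(); // this node has arrived
>
> latch.await();     // returns once all args.nodes() nodes have counted down
> {noformat}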


