[ https://issues.apache.org/jira/browse/CASSANDRA-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200947#comment-13200947 ]

Peter Schuller commented on CASSANDRA-3832:
-------------------------------------------

Meanwhile, MigrationStage is stuck like this:

{code}
"MigrationStage:1" daemon prio=10 tid=0x00007fb5b450e800 nid=0x3395 waiting on condition [0x0000000043479000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000005032ed688> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2116)
        at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:61)
        at org.apache.cassandra.service.MigrationManager$1.runMayThrow(MigrationManager.java:119)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
{code}

The GossipStage submits the job to the migration stage on the local node and 
waits for the result. The migration stage in turn sends a message and waits 
synchronously for the response.

The migration request runs on the migration stage of the remote node, which is 
presumably stuck with its own task on the migration stage.

In effect, we are causing a distributed deadlock (or a near-deadlock, I'm not 
sure; I suppose we might get unstuck eventually, since things do time out after 
the RPC timeout).
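
To make the chain of waits concrete, here is a minimal sketch of the pattern 
using plain java.util.concurrent executors. It is not Cassandra code: the class 
name, the method names, the three single-threaded "stages" and the 200 ms 
RPC_TIMEOUT_MS are all assumptions made up for illustration. The remote 
migration stage is held busy by a latch, standing in for "stuck with its own 
task".

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative model only: none of these names are Cassandra APIs.
class StageStallSketch {
    // Hypothetical stand-in for the RPC timeout.
    static final long RPC_TIMEOUT_MS = 200;

    static final class Result {
        final boolean requestTimedOut;   // did the remote request hit the timeout?
        final long queuedGossipDelayMs;  // how long the next gossip message waited

        Result(boolean requestTimedOut, long queuedGossipDelayMs) {
            this.requestTimedOut = requestTimedOut;
            this.queuedGossipDelayMs = queuedGossipDelayMs;
        }
    }

    // Runs on the local migration stage: ask the remote stage, wait synchronously.
    static boolean requestRemoteSchema(ExecutorService remoteMigrationStage) throws Exception {
        Future<String> reply = remoteMigrationStage.submit(() -> "schema");
        try {
            reply.get(RPC_TIMEOUT_MS, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            reply.cancel(false);
            return true; // the remote stage never got to our request in time
        }
    }

    // Runs on the gossip stage: submit to the local migration stage, block on the future.
    static boolean handleGossip(ExecutorService localMigrationStage,
                                ExecutorService remoteMigrationStage) throws Exception {
        Callable<Boolean> job = () -> requestRemoteSchema(remoteMigrationStage);
        return localMigrationStage.submit(job).get(); // unbounded wait
    }

    static Result run() throws Exception {
        ExecutorService gossipStage = Executors.newSingleThreadExecutor();
        ExecutorService localMigrationStage = Executors.newSingleThreadExecutor();
        ExecutorService remoteMigrationStage = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        try {
            // The remote migration stage is occupied by its own task.
            Callable<Void> stuck = () -> { release.await(); return null; };
            remoteMigrationStage.submit(stuck);

            long start = System.nanoTime();
            Callable<Boolean> firstMessage =
                () -> handleGossip(localMigrationStage, remoteMigrationStage);
            // A second gossip message that only records how long it sat in the queue.
            Callable<Long> secondMessage =
                () -> TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

            Future<Boolean> first = gossipStage.submit(firstMessage);
            Future<Long> second = gossipStage.submit(secondMessage);
            return new Result(first.get(), second.get());
        } finally {
            release.countDown();
            gossipStage.shutdownNow();
            localMigrationStage.shutdownNow();
            remoteMigrationStage.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        Result r = run();
        System.out.println("request timed out: " + r.requestTimedOut);
        System.out.println("next gossip message delayed by ~" + r.queuedGossipDelayMs + " ms");
    }
}
```

In this model the second gossip message cannot run until the first one's 
nested wait has timed out, which is the kind of backlog seen on the real 
gossip stage. Having the gossip handler hand off the job without blocking on 
the future would avoid the stall in the sketch; whether that is the right fix 
for the real code is a separate question.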

                
> gossip stage backed up due to migration manager future de-ref 
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-3832
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3832
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>
> This is just bootstrapping a ~180 node trunk cluster. After a while, a
> node I was on was stuck thinking all nodes were down, because the
> gossip stage was backed up, because it was spending a long time
> (multiple seconds or more; I suppose the RPC timeout, maybe) doing the
> following. Cluster-wide restart -> back to normal. I have not
> investigated further.
> {code}
> "GossipStage:1" daemon prio=10 tid=0x00007f9d5847a800 nid=0xa6fc waiting on condition [0x000000004345f000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x00000005029ad1c0> (a java.util.concurrent.FutureTask$Sync)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
>       at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
>       at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>       at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:364)
>       at org.apache.cassandra.service.MigrationManager.rectifySchema(MigrationManager.java:132)
>       at org.apache.cassandra.service.MigrationManager.onAlive(MigrationManager.java:75)
>       at org.apache.cassandra.gms.Gossiper.markAlive(Gossiper.java:802)
>       at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:918)
>       at org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:68)
> {code}
