[ https://issues.apache.org/jira/browse/CASSANDRA-8620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14369449#comment-14369449 ]

Philip Thompson commented on CASSANDRA-8620:
--------------------------------------------

Okay, thank you. Several LCS bugs were fixed between 2.1.2 and 2.1.3, which may
have resolved your issue. If you run into it again, please re-open this ticket.

> Bootstrap session hanging indefinitely
> --------------------------------------
>
>                 Key: CASSANDRA-8620
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8620
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Debian 7, Oracle JDK 1.7.0_51, AWS + GCE
>            Reporter: Adam Horwich
>
> Hi! We have been running a relatively small 2.1.2 cluster over 2 DCs for a 
> few months with ~100GB load per node and a RF=3 and over the last few weeks 
> have been trying to scale up capacity. 
> We've been recently seeing scenarios in which the Bootstrap or Unbootstrap 
> streaming process hangs indefinitely for one or more sessions on the receiver 
> without stacktrace or exception. This does not happen every time, and we do 
> not get into this state with the same sender every time. When the receiver is 
> in a hung state, the following can be found in the thread dump:
> The Stream-IN thread for one or more sessions is blocked in the following state:
> Thread 24942: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Compiled frame)
>  - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Compiled frame)
>  - java.util.concurrent.ArrayBlockingQueue.take() @bci=20, line=374 (Compiled frame)
>  - org.apache.cassandra.streaming.compress.CompressedInputStream.read() @bci=31, line=89 (Compiled frame)
>  - java.io.DataInputStream.readUnsignedShort() @bci=4, line=337 (Compiled frame)
>  - org.apache.cassandra.utils.BytesReadTracker.readUnsignedShort() @bci=4, line=140 (Compiled frame)
>  - org.apache.cassandra.utils.ByteBufferUtil.readShortLength(java.io.DataInput) @bci=1, line=317 (Compiled frame)
>  - org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(java.io.DataInput) @bci=2, line=327 (Compiled frame)
>  - org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(java.io.DataInput) @bci=5, line=397 (Compiled frame)
>  - org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(java.io.DataInput) @bci=2, line=381 (Compiled frame)
>  - org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(java.io.DataInput, org.apache.cassandra.db.ColumnSerializer$Flag, int, org.apache.cassandra.io.sstable.Descriptor$Version) @bci=10, line=75 (Compiled frame)
>  - org.apache.cassandra.db.AbstractCell$1.computeNext() @bci=25, line=52 (Compiled frame)
>  - org.apache.cassandra.db.AbstractCell$1.computeNext() @bci=1, line=46 (Compiled frame)
>  - com.google.common.collect.AbstractIterator.tryToComputeNext() @bci=9, line=143 (Compiled frame)
>  - com.google.common.collect.AbstractIterator.hasNext() @bci=61, line=138 (Compiled frame)
>  - org.apache.cassandra.io.sstable.SSTableWriter.appendFromStream(org.apache.cassandra.db.DecoratedKey, org.apache.cassandra.config.CFMetaData, java.io.DataInput, org.apache.cassandra.io.sstable.Descriptor$Version) @bci=320, line=283 (Compiled frame)
>  - org.apache.cassandra.streaming.StreamReader.writeRow(org.apache.cassandra.io.sstable.SSTableWriter, java.io.DataInput, org.apache.cassandra.db.ColumnFamilyStore) @bci=26, line=157 (Compiled frame)
>  - org.apache.cassandra.streaming.compress.CompressedStreamReader.read(java.nio.channels.ReadableByteChannel) @bci=258, line=89 (Compiled frame)
>  - org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=69, line=48 (Interpreted frame)
>  - org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=4, line=38 (Interpreted frame)
>  - org.apache.cassandra.streaming.messages.StreamMessage.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=37, line=55 (Interpreted frame)
>  - org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run() @bci=24, line=245 (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)
> Debug logging shows that the receiver is still reading the file it is 
> receiving from the sender and has not yet sent an ACK.
> The receiver is waiting for more data to finish writing its row, and the 
> sender is not sending any more data. On both the receiver and sender there is 
> a large amount of data (~5MB) stuck in the Recv-Q (receiver) and Send-Q 
> (sender). 
> We've been trying to diagnose this issue internally, but it's difficult to 
> create a reliably reproducible scenario. We have so far found that restarting 
> all nodes in the cluster and ensuring that a cleanup has been performed helps 
> mitigate the problem (though a cleanup without restart can still result in a 
> hung state). However, it's unclear to me why either of these would affect the 
> streaming process in the way we've observed. One theory is that the 
> calculated section sizes are inaccurate.
> We tried setting a timeout on the dataBuffer read (changing take() to 
> poll()), which forced a retry of the file, but the same transfer failed 
> again until retries were exhausted.
> From a heap dump taken in the deadlocked state we confirmed the following:
> ByteTracker.bytesRead = 100477411
> Length of the section being read (which happens to be the last section) = 100477411
> We also wrote all the data in the buffer to a text file and observed that the 
> data in the buffer ended halfway through a row.
> The stream reader checks the value of bytes read after importing each row 
> (StreamReader line 97) to detect the end of the section; however, it does not 
> expect the data to end halfway through a row, which is what happens in 
> OnDiskAtom.deserializeFromSSTable in any of the various deserialization calls.
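The take-to-poll experiment described in the quoted report can be sketched as follows. This is a hypothetical, simplified stand-in, not Cassandra's actual CompressedInputStream: it models a stream whose read() drains decompressed chunks from an ArrayBlockingQueue, and shows how replacing the unbounded take() (the frame the thread dump shows parked forever) with a bounded poll() turns a stalled sender into an exception the streaming session can retry.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of a blocking-queue-backed stream reader.
 * A producer thread (the network reader) enqueues decompressed
 * chunks; read() consumes them byte by byte.
 */
public class TimedQueueInputStream extends InputStream {
    private final BlockingQueue<byte[]> dataBuffer;
    private final long timeoutMillis;
    private byte[] current = new byte[0];
    private int pos = 0;

    public TimedQueueInputStream(BlockingQueue<byte[]> dataBuffer, long timeoutMillis) {
        this.dataBuffer = dataBuffer;
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public int read() throws IOException {
        while (pos >= current.length) {
            try {
                // dataBuffer.take() would park this thread indefinitely,
                // as in the reported hang; poll() bounds the wait so a
                // silent sender surfaces as an error instead.
                byte[] next = dataBuffer.poll(timeoutMillis, TimeUnit.MILLISECONDS);
                if (next == null)
                    throw new EOFException("no data within " + timeoutMillis
                            + " ms; sender may have stalled");
                current = next;
                pos = 0;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while waiting for stream data", e);
            }
        }
        return current[pos++] & 0xFF;
    }
}
```

As the reporter observed, a timeout alone only converts the hang into a failed-then-retried transfer; if the sender's section accounting is wrong, each retry stalls at the same point until retries are exhausted.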



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)