[ https://issues.apache.org/jira/browse/CASSANDRA-8620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14369449#comment-14369449 ]
Philip Thompson commented on CASSANDRA-8620:
--------------------------------------------

Okay, thank you. Several LCS bugs were fixed between 2.1.2 and 2.1.3, which may have resolved your issue. If you run into it again, please re-open this ticket.

> Bootstrap session hanging indefinitely
> --------------------------------------
>
>                 Key: CASSANDRA-8620
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8620
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Debian 7, Oracle JDK 1.7.0_51, AWS + GCE
>            Reporter: Adam Horwich
>
> Hi! We have been running a relatively small 2.1.2 cluster across two DCs for a few months, with ~100GB load per node and RF=3, and over the last few weeks we have been trying to scale up capacity.
> We have recently been seeing scenarios in which the bootstrap or unbootstrap streaming process hangs indefinitely for one or more sessions on the receiver, with no stack trace or exception. This does not happen every time, and we do not get into this state with the same sender every time. When the receiver is in the hung state, the following can be found in the thread dump.
> The Stream-IN thread for one or more sessions is blocked in the following state:
> Thread 24942: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Compiled frame)
>  - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Compiled frame)
>  - java.util.concurrent.ArrayBlockingQueue.take() @bci=20, line=374 (Compiled frame)
>  - org.apache.cassandra.streaming.compress.CompressedInputStream.read() @bci=31, line=89 (Compiled frame)
>  - java.io.DataInputStream.readUnsignedShort() @bci=4, line=337 (Compiled frame)
>  - org.apache.cassandra.utils.BytesReadTracker.readUnsignedShort() @bci=4, line=140 (Compiled frame)
>  - org.apache.cassandra.utils.ByteBufferUtil.readShortLength(java.io.DataInput) @bci=1, line=317 (Compiled frame)
>  - org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(java.io.DataInput) @bci=2, line=327 (Compiled frame)
>  - org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(java.io.DataInput) @bci=5, line=397 (Compiled frame)
>  - org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(java.io.DataInput) @bci=2, line=381 (Compiled frame)
>  - org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(java.io.DataInput, org.apache.cassandra.db.ColumnSerializer$Flag, int, org.apache.cassandra.io.sstable.Descriptor$Version) @bci=10, line=75 (Compiled frame)
>  - org.apache.cassandra.db.AbstractCell$1.computeNext() @bci=25, line=52 (Compiled frame)
>  - org.apache.cassandra.db.AbstractCell$1.computeNext() @bci=1, line=46 (Compiled frame)
>  - com.google.common.collect.AbstractIterator.tryToComputeNext() @bci=9, line=143 (Compiled frame)
>  - com.google.common.collect.AbstractIterator.hasNext() @bci=61, line=138 (Compiled frame)
>  - org.apache.cassandra.io.sstable.SSTableWriter.appendFromStream(org.apache.cassandra.db.DecoratedKey, org.apache.cassandra.config.CFMetaData, java.io.DataInput, org.apache.cassandra.io.sstable.Descriptor$Version) @bci=320, line=283 (Compiled frame)
>  - org.apache.cassandra.streaming.StreamReader.writeRow(org.apache.cassandra.io.sstable.SSTableWriter, java.io.DataInput, org.apache.cassandra.db.ColumnFamilyStore) @bci=26, line=157 (Compiled frame)
>  - org.apache.cassandra.streaming.compress.CompressedStreamReader.read(java.nio.channels.ReadableByteChannel) @bci=258, line=89 (Compiled frame)
>  - org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=69, line=48 (Interpreted frame)
>  - org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=4, line=38 (Interpreted frame)
>  - org.apache.cassandra.streaming.messages.StreamMessage.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=37, line=55 (Interpreted frame)
>  - org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run() @bci=24, line=245 (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)
> Debug logging shows that the receiver is still reading the file it is receiving from the sender and has not yet sent an ACK.
> The receiver is waiting for more data to finish writing its row, but the sender is not sending any more data. On both the receiver and the sender there is a large amount of data (~5MB) stuck in the Recv-Q (receiver) and Send-Q (sender).
> We have been trying to diagnose this issue internally, but it is difficult to create a reliably reproducible scenario. So far we have found that restarting all nodes in the cluster and ensuring that a cleanup has been performed helps mitigate the problem (though a cleanup without a restart can still result in a hung state). However, it is unclear to me why either of these would affect the streaming process in the way we have observed. One theory is that the calculated section sizes are inaccurate.
> We tried setting a timeout on the dataBuffer read (moving take to poll), which forced a retry on the file, but the same transfer failed again until retries were exhausted. A sketch of that experiment follows.
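> In outline, the change we experimented with looked like the following (a minimal sketch only; the queue and method names are simplified stand-ins for CompressedInputStream's internals, not Cassandra's actual code):
> {code:java}
> import java.io.IOException;
> import java.util.concurrent.ArrayBlockingQueue;
> import java.util.concurrent.BlockingQueue;
> import java.util.concurrent.TimeUnit;
>
> // Sketch of the dataBuffer consumer; names are hypothetical stand-ins.
> class DataBufferReadSketch {
>     private final BlockingQueue<byte[]> dataBuffer = new ArrayBlockingQueue<>(16);
>
>     // Current behaviour: take() parks the Stream-IN thread forever once the
>     // sender stops producing -- the BLOCKED state in the dump above.
>     byte[] nextChunkBlocking() throws InterruptedException {
>         return dataBuffer.take();
>     }
>
>     // What we tried: a bounded poll(), so a stalled stream fails the file
>     // transfer (triggering a retry) instead of hanging indefinitely.
>     byte[] nextChunkWithTimeout() throws IOException, InterruptedException {
>         byte[] chunk = dataBuffer.poll(60, TimeUnit.SECONDS);
>         if (chunk == null)
>             throw new IOException("no streamed data within timeout; failing file so it can be retried");
>         return chunk;
>     }
> }
> {code}
> The timeout only converts the hang into a failed transfer, though; each retry stalled at the same point, which is what made us suspect missing bytes in the stream rather than a transient network stall.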
> From a heap dump taken in the deadlocked state we confirmed the following:
> BytesReadTracker.bytesRead = 100477411
> Length of the section being read (which happens to be the last section) = 100477411
> We also wrote all the data in the buffer to a text file and observed that the data in the buffer ended halfway through a row.
> The stream reader checks the value of bytes read after importing each row (StreamReader line 97) to detect the end of the section; however, it does not expect the data to finish halfway through a row. This happens in OnDiskAtom.deserializeFromSSTable in any of the various deserialization calls, as outlined below.
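> A simplified skeleton of that read loop (not the real method signature; in the actual code the input is wrapped in a BytesReadTracker, and the names here are stand-ins):
> {code:java}
> import java.io.DataInput;
> import java.io.IOException;
>
> class SectionReadLoopSketch {
>     long totalSize; // section length calculated by the sender
>     long bytesRead; // in the real code, advanced by BytesReadTracker on every read
>
>     void readSection(DataInput in) throws IOException {
>         // End-of-section is only tested here, between whole rows
>         // (the check at StreamReader line 97).
>         while (bytesRead < totalSize) {
>             // In our heap dump, bytesRead reached totalSize *inside* this
>             // call: the row straddled the end of the section, so the row
>             // deserializer (OnDiskAtom.deserializeFromSSTable and friends)
>             // stayed blocked in CompressedInputStream.read() waiting for
>             // bytes that never arrive.
>             writeRow(in);
>         }
>     }
>
>     void writeRow(DataInput in) throws IOException {
>         // stand-in for StreamReader.writeRow -> SSTableWriter.appendFromStream,
>         // which consumes input (advancing bytesRead) until the row is complete
>     }
> }
> {code}
> If the calculated section size falls short of the real row boundary, nothing in this loop can ever observe the overrun, which would explain the indefinite hang.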