[ https://issues.apache.org/jira/browse/CASSANDRA-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benedict updated CASSANDRA-8815:
--------------------------------
    Attachment: 8815.txt

Yep, nice spot. Attached a patch that calls files.clear() at the end, and that ensures it reaches that spot by catching any possible exceptions during cleanup.

> Race in sstable ref counting during streaming failures
> --------------------------------------------------------
>
>                 Key: CASSANDRA-8815
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8815
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: sankalp kohli
>            Assignee: Benedict
>             Fix For: 2.0.13
>
>         Attachments: 8815.txt
>
>
> We have seen a machine in Prod whose read threads are all blocked (spinning)
> trying to acquire the reference lock on sstables. There are also some
> stream sessions which are doing the same.
> Looking at the heap dump, we could see that a live sstable which is part
> of the View has a ref count = 0. This sstable is not compacting, nor is it
> part of any failed compaction.
> Looking through the code, we could see that if the ref count goes to zero
> while the sstable is part of the View, all reader threads will spin forever.
> Looking further through the streaming code, we could see that if
> StreamTransferTask.complete is called after closeSession has been called due
> to an error in OutgoingMessageHandler, it will double decrement the ref count
> of an sstable.
> This race can happen, and an exception in the logs shows that closeSession
> was triggered by OutgoingMessageHandler.
> The fix for this is very simple, I think. In StreamTransferTask.abort, we can
> remove a file from "files" before decrementing the ref count. This will avoid
> this race.
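
The remove-before-decrement idea from the description can be sketched roughly as follows. This is a simplified illustration, not the Cassandra code and not the attached 8815.txt patch (which, per the comment above, calls files.clear() at the end of cleanup). The class names RefCounted and TransferTask, their methods, and the integer sequence keys are all hypothetical stand-ins. The assumption is that each tracked file holds exactly one reference, and that whichever of abort() or complete() removes the entry from the map is the only one allowed to release it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class StreamRefSketch {
    /** Hypothetical stand-in for an sstable with a reference count. */
    static final class RefCounted {
        // Starts at 1 to model the reference held by the live View.
        final AtomicInteger refs = new AtomicInteger(1);
        void acquire() { refs.incrementAndGet(); }
        void release() { refs.decrementAndGet(); }
    }

    /** Hypothetical stand-in for a transfer task holding one reference per file. */
    static final class TransferTask {
        private final Map<Integer, RefCounted> files = new ConcurrentHashMap<>();

        void addFile(int sequence, RefCounted sstable) {
            sstable.acquire();
            files.put(sequence, sstable);
        }

        /** Completion of one file: release only if this call removed the entry. */
        void complete(int sequence) {
            RefCounted sstable = files.remove(sequence);
            if (sstable != null)
                sstable.release();
        }

        /** Abort: remove each entry before releasing it, so complete() cannot release it again. */
        void abort() {
            for (Integer sequence : files.keySet()) {
                RefCounted sstable = files.remove(sequence);
                if (sstable != null)
                    sstable.release();
            }
        }
    }

    public static void main(String[] args) {
        RefCounted sstable = new RefCounted();
        TransferTask task = new TransferTask();
        task.addFile(0, sstable);   // refs: 2 (View + transfer)
        task.abort();               // refs: 1
        task.complete(0);           // entry already removed, so no second decrement
        System.out.println("refs after abort + complete: " + sstable.refs.get()); // prints 1
    }
}
```

Because ConcurrentHashMap.remove is atomic, only one caller can obtain a given entry, so the count cannot drop below the View's own reference through this path even if abort() and complete() run concurrently.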