[ 
https://issues.apache.org/jira/browse/CASSANDRA-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-8815:
--------------------------------
    Attachment: 8815.txt

Yep, nice spot. Attached a patch that calls files.clear() at the end of the 
cleanup, and ensures that call is always reached by catching any exceptions 
thrown during cleanup.
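
For readers without the attachment, here is a minimal sketch of the idea, not the 
patch itself. RefCountedFile and TransferTaskSketch are hypothetical stand-ins, 
not Cassandra classes, and returning a failure count from abort() is an 
assumption made only to keep the example small. It shows each release being 
guarded so that files.clear() always runs, which makes a second abort a no-op.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an sstable with a reference count.
class RefCountedFile {
    final AtomicInteger refs = new AtomicInteger(1);
    private final boolean failOnRelease;

    RefCountedFile(boolean failOnRelease) { this.failOnRelease = failOnRelease; }

    void release() {
        if (failOnRelease) throw new IllegalStateException("simulated cleanup failure");
        refs.decrementAndGet();
    }
}

// Hypothetical sketch of a transfer task; not the real StreamTransferTask.
class TransferTaskSketch {
    private final Map<Integer, RefCountedFile> files = new HashMap<>();

    synchronized void add(int sequence, RefCountedFile file) { files.put(sequence, file); }

    // Releases every tracked file, then forgets them all.
    // Returns how many releases failed.
    synchronized int abort() {
        int failures = 0;
        for (RefCountedFile file : files.values()) {
            try {
                file.release();
            } catch (RuntimeException e) {
                // Count the failure but keep going so clear() is always reached.
                failures++;
            }
        }
        files.clear();
        return failures;
    }

    synchronized int tracked() { return files.size(); }
}
```

Because the map is empty after the first abort(), a later call finds nothing to 
release and cannot decrement a ref count a second time.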

>  Race in sstable ref counting during streaming failures 
> --------------------------------------------------------
>
>                 Key: CASSANDRA-8815
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8815
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: sankalp kohli
>            Assignee: Benedict
>             Fix For: 2.0.13
>
>         Attachments: 8815.txt
>
>
> We have seen a machine in Prod where all read threads are blocked (spinning) 
> trying to acquire the reference lock on sstables. Some stream sessions are 
> doing the same. 
> Looking at the heap dump, we could see that a live sstable which is part 
> of the View has a ref count = 0. This sstable is not compacting, nor is it 
> part of any failed compaction. 
> Looking through the code, we could see that if the ref count goes to zero 
> while the sstable is part of the View, all reader threads will spin forever. 
> Looking further through the streaming code, we could see that if 
> StreamTransferTask.complete is called after closeSession has been called due 
> to an error in OutgoingMessageHandler, it will double decrement the ref count 
> of an sstable. 
> This race can happen, and the exceptions in our logs show that closeSession 
> was triggered by OutgoingMessageHandler. 
> I think the fix for this is very simple. In StreamTransferTask.abort, we can 
> remove a file from "files" before decrementing the ref count. This will avoid 
> this race. 
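
The remove-before-decrement suggestion quoted above can be sketched roughly as 
follows. This is an illustration under assumptions, not Cassandra code: 
RemoveThenReleaseSketch, track and releaseOnce are hypothetical names, and a 
plain AtomicInteger stands in for the sstable ref count. The point is that 
ConcurrentMap.remove hands an entry to exactly one caller, so complete and abort 
cannot both decrement the same ref count.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names are illustrative, not Cassandra's classes.
class RemoveThenReleaseSketch {
    // sequence number -> reference count of the file being streamed
    private final ConcurrentMap<Integer, AtomicInteger> files = new ConcurrentHashMap<>();

    void track(int sequence, AtomicInteger refCount) { files.put(sequence, refCount); }

    // Called when one file finishes transferring.
    void complete(int sequence) { releaseOnce(sequence); }

    // Called when the session is closed after an error.
    void abort() {
        for (Integer sequence : files.keySet()) releaseOnce(sequence);
    }

    private void releaseOnce(int sequence) {
        // remove() returns the entry to one caller only; every other caller gets null.
        AtomicInteger refCount = files.remove(sequence);
        if (refCount != null) refCount.decrementAndGet();
    }
}
```

With this ordering, whichever of complete or abort removes the entry first does 
the single decrement, and the other call finds nothing left to release.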



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
