[jira] [Commented] (CASSANDRA-8815) Race in sstable ref counting during streaming failures

2015-02-18 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325709#comment-14325709
 ] 

Benedict commented on CASSANDRA-8815:
-

Agreed, although I would prefer a separate ticket for that. If you file the 
ticket and patch, feel free to mark me as reviewer and I'll commit alongside 
this.

  Race in sstable ref counting during streaming failures 
 

 Key: CASSANDRA-8815
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8815
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: sankalp kohli
Assignee: Benedict
 Fix For: 2.0.13

 Attachments: 8815.txt


 We have seen a machine in Prod where all read threads are blocked (spinning) 
 while trying to acquire the reference lock on sstables. Some stream sessions 
 are doing the same. 
 Looking at the heap dump, we could see that a live sstable which is part 
 of the View has a ref count = 0. This sstable is not compacting, nor is it 
 part of any failed compaction. 
 Looking through the code, we could see that if the ref count goes to zero 
 while the sstable is part of the View, all reader threads will spin forever. 
 Looking further through the streaming code, we could see that if 
 StreamTransferTask.complete is called after closeSession has been called due 
 to an error in OutgoingMessageHandler, it will double decrement the ref count 
 of an sstable. 
 This race can happen, and the exception in the logs shows that closeSession 
 was triggered by OutgoingMessageHandler. 
 The fix for this is very simple, I think. In StreamTransferTask.abort, we can 
 remove a file from "files" before decrementing the ref count. This will avoid 
 this race. 
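
 The double decrement described above can be illustrated with a minimal
 sketch. This is not the actual StreamTransferTask code: the classes
 RefCounted and TransferTaskSketch, and the methods abortRacy and abortFixed,
 are hypothetical stand-ins, and the assumption is that complete() removes the
 entry from "files" and releases it while the racy abort() releases without
 removing.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an sstable with a reference count.
class RefCounted {
    // Starts at 1: the reference held because the sstable is in the live View.
    final AtomicInteger refs = new AtomicInteger(1);
    void acquire() { refs.incrementAndGet(); }
    void release() { refs.decrementAndGet(); }
}

// Hypothetical, simplified transfer task (assumption, not the real class).
class TransferTaskSketch {
    private final Map<Integer, RefCounted> files = new HashMap<>();

    synchronized void addFile(int sequence, RefCounted sstable) {
        sstable.acquire();               // the task takes its own reference
        files.put(sequence, sstable);
    }

    // Racy variant: releases each reference but leaves the entries in "files",
    // so a later complete() finds the entry and releases it a second time.
    synchronized void abortRacy() {
        for (RefCounted sstable : files.values())
            sstable.release();
    }

    // Proposed ordering: remove the entry first, then release the reference.
    synchronized void abortFixed() {
        Iterator<RefCounted> it = files.values().iterator();
        while (it.hasNext()) {
            RefCounted sstable = it.next();
            it.remove();
            sstable.release();
        }
    }

    synchronized void complete(int sequence) {
        RefCounted sstable = files.remove(sequence);
        if (sstable != null)
            sstable.release();
    }
}
```

 With abortRacy() followed by complete(), the count drops from 2 to 0 even
 though the View still holds its reference. With abortFixed(), the later
 complete() finds nothing in "files" and the count stays at 1.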



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8815) Race in sstable ref counting during streaming failures

2015-02-18 Thread sankalp kohli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326690#comment-14326690
 ] 

sankalp kohli commented on CASSANDRA-8815:
--

[~benedict] Can you please commit this or ask someone?



[jira] [Commented] (CASSANDRA-8815) Race in sstable ref counting during streaming failures

2015-02-17 Thread sankalp kohli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324637#comment-14324637
 ] 

sankalp kohli commented on CASSANDRA-8815:
--

+1
Looks good. 



[jira] [Commented] (CASSANDRA-8815) Race in sstable ref counting during streaming failures

2015-02-17 Thread Richard Low (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325007#comment-14325007
 ] 

Richard Low commented on CASSANDRA-8815:


I think we should add some assertions that would avoid the bad effect of this. 
I'll prepare a patch and put it here.
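
One possible shape for such a check, as a rough sketch only: the class
CheckedRefCount, its "live" flag, and its method names are hypothetical and
not taken from any patch on this ticket. The idea is that release() refuses to
drop the last reference while the sstable is still marked live, and that
tryAcquire() reports failure on a zero count instead of retrying forever. An
explicit exception is used here instead of a Java assert so the check does not
depend on the -ea flag.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical reference counter with fail-fast checks (an assumption,
// not the actual Cassandra reader implementation).
class CheckedRefCount {
    private final AtomicInteger refs = new AtomicInteger(1);
    // Hypothetical flag: true while the sstable is part of the live View.
    private volatile boolean live = true;

    void markObsolete() { live = false; }

    int count() { return refs.get(); }

    // Returns false once the count has reached zero, so callers can
    // stop retrying instead of spinning on a dead reference.
    boolean tryAcquire() {
        while (true) {
            int n = refs.get();
            if (n <= 0)
                return false;
            if (refs.compareAndSet(n, n + 1))
                return true;
        }
    }

    // Refuses an unbalanced release: the count is left untouched and the
    // caller gets an exception pointing at the faulty code path.
    void release() {
        while (true) {
            int n = refs.get();
            if (n <= 0 || (n == 1 && live))
                throw new IllegalStateException(
                    "unbalanced release: count=" + n + ", live=" + live);
            if (refs.compareAndSet(n, n - 1))
                return;
        }
    }
}
```

With a check like this, a double decrement would surface as an exception in
the streaming code path rather than as read threads spinning on a live
sstable whose count has reached zero.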



[jira] [Commented] (CASSANDRA-8815) Race in sstable ref counting during streaming failures

2015-02-16 Thread sankalp kohli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323420#comment-14323420
 ] 

sankalp kohli commented on CASSANDRA-8815:
--

This can also be fixed by adding files.clear() as the last line of 
StreamTransferTask.abort(), 
or by adding if (aborted) return; at the start of the complete method.
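
Both options can be sketched together as follows. This is only an
illustration under assumptions: GuardedTransferTask is a hypothetical class,
the reference count is modelled as a plain AtomicInteger, and the "aborted"
field is assumed to be a boolean set by abort().

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified transfer task showing both guards.
class GuardedTransferTask {
    private final Map<Integer, AtomicInteger> files = new HashMap<>();
    private boolean aborted = false;

    synchronized void addFile(int sequence, AtomicInteger refCount) {
        refCount.incrementAndGet();      // the task takes its own reference
        files.put(sequence, refCount);
    }

    synchronized void abort() {
        if (aborted)
            return;                      // a second abort must not release again
        aborted = true;
        for (AtomicInteger refCount : files.values())
            refCount.decrementAndGet();
        files.clear();                   // option 1: nothing left for complete() to release
    }

    synchronized void complete(int sequence) {
        if (aborted)
            return;                      // option 2: abort() already released everything
        AtomicInteger refCount = files.remove(sequence);
        if (refCount != null)
            refCount.decrementAndGet();
    }
}
```

Either guard on its own prevents the second decrement in this sketch. Having
both is redundant but harmless.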

  Race in sstable ref counting during streaming failures 
 

 Key: CASSANDRA-8815
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8815
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: sankalp kohli
Assignee: Benedict



[jira] [Commented] (CASSANDRA-8815) Race in sstable ref counting during streaming failures

2015-02-16 Thread sankalp kohli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323401#comment-14323401
 ] 

sankalp kohli commented on CASSANDRA-8815:
--

This is similar to CASSANDRA-7704

  Race in sstable ref counting during streaming failures 
 

 Key: CASSANDRA-8815
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8815
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: sankalp kohli



[jira] [Commented] (CASSANDRA-8815) Race in sstable ref counting during streaming failures

2015-02-16 Thread sankalp kohli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323402#comment-14323402
 ] 

sankalp kohli commented on CASSANDRA-8815:
--

cc [~benedict]

  Race in sstable ref counting during streaming failures 
 

 Key: CASSANDRA-8815
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8815
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: sankalp kohli
Assignee: sankalp kohli
