[jira] [Comment Edited] (CASSANDRA-7904) Repair hangs

2015-11-27 Thread Anuj Wadehra (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030201#comment-15030201
 ] 

Anuj Wadehra edited comment on CASSANDRA-7904 at 11/27/15 7:58 PM:
---

[#Michael Shuler] First of all, I would like to express my discomfort with the 
way a few things are happening. In March 2015 the Apache website described 
2.0.14 as the most stable version recommended for production, and in November 
(just seven months later) 2.0.x gets little community support (it is 
effectively EOL). It takes months to roll out a Cassandra upgrade to all your 
clients, and by the time they are all on the latest Cassandra version you learn 
that virtually no community help is available for it. I understand the 
fast-paced environment, but can we revisit the strategy so that once a version 
is declared "stable" it is supported by the community for at least one year?

Coming back to the issue, I see a similar problem in the 2.2.0 code too, and we 
are still facing it. Merkle tree requests across DCs are getting lost and then 
repair hangs. With DEBUG logging enabled, starting repair on a node produced 
"error writing to /X.X.X.X
java.io.IOException: Connection timed out" for 2 nodes in the remote DC, yet 
Merkle trees were received from those 2 nodes. We got no error for the 3rd node 
in the remote DC, and strangely that was the node which never received the 
Merkle tree request. Moreover, we observed that hinted handoff started for the 
3rd node from the node being repaired, but hint replay timed out as well.
Restarting the node and then running repair often succeeds. There may be an 
issue with the open TCP connections used by repair: restarting Cassandra 
creates new connections, and repair then completes.
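There may be a simple way to test the reachability half of that theory: probe 
the inter-node messaging port from the repairing node while repair is hung. If 
a fresh connection succeeds while repair is still stuck, that points at the 
already-established connections rather than the network path. Below is a 
minimal, hedged sketch; it assumes the default storage_port of 7000 and takes 
the (masked) remote address as an argument.

{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Minimal reachability probe for the inter-node messaging (storage) port.
// Illustrative only; assumes the default storage_port of 7000.
public class StoragePortProbe
{
    public static void main(String[] args)
    {
        if (args.length < 1)
        {
            System.out.println("usage: StoragePortProbe <remote-node-address>");
            return;
        }
        String remote = args[0];
        int storagePort = 7000;
        try (Socket s = new Socket())
        {
            // Fail fast instead of waiting for the OS-level TCP timeout that
            // surfaces as "java.io.IOException: Connection timed out".
            s.connect(new InetSocketAddress(remote, storagePort), 5000);
            System.out.println("TCP connect to " + remote + ":" + storagePort + " succeeded");
        }
        catch (IOException e)
        {
            System.out.println("TCP connect to " + remote + ":" + storagePort + " failed: " + e);
        }
    }
}
{code}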

Please find attached the logs of node 10.X.15.115. The Merkle tree request 
never reached 10.X.14.115 and thus no response was received; absolutely nothing 
was logged on 10.X.14.115.

Can we reopen the ticket?







was (Author: eanujwa):
[#Michael Shuler] First of all, I would like to express my discomfort with the 
way a few things are happening. In March 2015 the Apache website described 
2.0.14 as the most stable version recommended for production, and in November 
(just seven months later) 2.0.x gets little community support (it is 
effectively EOL). It takes months to roll out a Cassandra upgrade to all your 
clients, and by the time they are all on the latest Cassandra version you learn 
that virtually no community help is available for it. I understand the 
fast-paced environment, but can we revisit the strategy so that once a version 
is declared "stable" it is supported by the community for at least one year?

Coming back to the issue, I see a similar problem in the 2.2.0 code too, and we 
are still facing it. Merkle tree requests across DCs are getting lost and then 
repair hangs. With DEBUG logging enabled, starting repair on a node produced 
"error writing to /X.X.X.X
java.io.IOException: Connection timed out" for 2 nodes in the remote DC, yet 
Merkle trees were received from those 2 nodes. We got no error for the 3rd node 
in the remote DC, and strangely that was the node which never received the 
Merkle tree request. Moreover, we observed that hinted handoff started for the 
3rd node from the node being repaired, but hint replay timed out as well.

Please find attached the logs of node 10.X.15.115. The Merkle tree request 
never reached 10.X.14.115 and thus no response was received; absolutely nothing 
was logged on 10.X.14.115.

Can we reopen the ticket?






> Repair hangs
> 
>
> Key: CASSANDRA-7904
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
> Project: Cassandra
>  Issue Type: Bug
> Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, 
> java version "1.7.0_45"
>Reporter: Duncan Sands
> Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, 
> ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres.  Not used on the weekend, so 
> repair is run on all nodes (in a staggered fashion) on the weekend.  Nodetool 
> options: -par -pr.  There is usually some overlap in the repairs: repair on 
> one node may well still be running when repair is started on the next node.  
> Repair hangs for some of the nodes almost every weekend.  It hung last 
> weekend, here are the details:
> In the whole cluster, only one node had an exception since C* was last 
> restarted.  This node is 192.168.60.136 and the exception is harmless: a 
> client disconnected abruptly.
> tpstats
>   4 nodes have a non-zero value for "active" or "pending" in 
> AntiEntropySessions.  These nodes all have Active => 1 and Pending => 1.  The 
> nodes are:
>   192.168.21.13 (data centre R)
>   192.168.60.134 (data centre A)
>   192.168.60.136 (data centre A)
>   172

[jira] [Comment Edited] (CASSANDRA-7904) Repair hangs

2015-11-27 Thread Anuj Wadehra (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030201#comment-15030201
 ] 

Anuj Wadehra edited comment on CASSANDRA-7904 at 11/27/15 7:55 PM:
---

[#Michael Shuler] First of all, I would like to express my discomfort with the 
way a few things are happening. In March 2015 the Apache website described 
2.0.14 as the most stable version recommended for production, and in November 
(just seven months later) 2.0.x gets little community support (it is 
effectively EOL). It takes months to roll out a Cassandra upgrade to all your 
clients, and by the time they are all on the latest Cassandra version you learn 
that virtually no community help is available for it. I understand the 
fast-paced environment, but can we revisit the strategy so that once a version 
is declared "stable" it is supported by the community for at least one year?

Coming back to the issue, I see a similar problem in the 2.2.0 code too, and we 
are still facing it. Merkle tree requests across DCs are getting lost and then 
repair hangs. With DEBUG logging enabled, starting repair on a node produced 
"error writing to /X.X.X.X
java.io.IOException: Connection timed out" for 2 nodes in the remote DC, yet 
Merkle trees were received from those 2 nodes. We got no error for the 3rd node 
in the remote DC, and strangely that was the node which never received the 
Merkle tree request. Moreover, we observed that hinted handoff started for the 
3rd node from the node being repaired, but hint replay timed out as well.

Please find attached the logs of node 10.X.15.115. The Merkle tree request 
never reached 10.X.14.115 and thus no response was received; absolutely nothing 
was logged on 10.X.14.115.

Can we reopen the ticket?







was (Author: eanujwa):
[#Michael Shuler] First of all, I would like to express my discomfort with the 
way a few things are happening. In March 2015 the Apache website described 
2.0.14 as the most stable version recommended for production, and in November 
(just seven months later) the message from the community is that 2.0.x is 
effectively EOL. It takes months to roll out a Cassandra upgrade to all your 
clients, and by the time they are all on the latest Cassandra version you learn 
that virtually no community help is available for it. I understand the 
fast-paced environment, but can we revisit the strategy so that once a version 
is declared "stable" it is supported by the community for at least one year?

Coming back to the issue, I see a similar problem in the 2.2.0 code too, and we 
are still facing it. Merkle tree requests across DCs are getting lost and then 
repair hangs. With DEBUG logging enabled, starting repair on a node produced 
"error writing to /X.X.X.X
java.io.IOException: Connection timed out" for 2 nodes in the remote DC, yet 
Merkle trees were received from those 2 nodes. We got no error for the 3rd node 
in the remote DC, and strangely that was the node which never received the 
Merkle tree request. Moreover, we observed that hinted handoff started for the 
3rd node from the node being repaired, but hint replay timed out as well.

Please find attached the logs of node 10.X.15.115. The Merkle tree request 
never reached 10.X.14.115 and thus no response was received; absolutely nothing 
was logged on 10.X.14.115.

Can we reopen the ticket?






> Repair hangs
> 
>
> Key: CASSANDRA-7904
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
> Project: Cassandra
>  Issue Type: Bug
> Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, 
> java version "1.7.0_45"
>Reporter: Duncan Sands
> Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, 
> ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres.  Not used on the weekend, so 
> repair is run on all nodes (in a staggered fashion) on the weekend.  Nodetool 
> options: -par -pr.  There is usually some overlap in the repairs: repair on 
> one node may well still be running when repair is started on the next node.  
> Repair hangs for some of the nodes almost every weekend.  It hung last 
> weekend, here are the details:
> In the whole cluster, only one node had an exception since C* was last 
> restarted.  This node is 192.168.60.136 and the exception is harmless: a 
> client disconnected abruptly.
> tpstats
>   4 nodes have a non-zero value for "active" or "pending" in 
> AntiEntropySessions.  These nodes all have Active => 1 and Pending => 1.  The 
> nodes are:
>   192.168.21.13 (data centre R)
>   192.168.60.134 (data centre A)
>   192.168.60.136 (data centre A)
>   172.18.68.138 (data centre Z)
> compactionstats:
>   No compactions.  All nodes have:
> pending tasks: 0
> Active compaction remaining time :n/a
> netstats:
>   All except one node have no

[jira] [Comment Edited] (CASSANDRA-7904) Repair hangs

2015-11-27 Thread Anuj Wadehra (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030201#comment-15030201
 ] 

Anuj Wadehra edited comment on CASSANDRA-7904 at 11/27/15 7:53 PM:
---

[#Michael Shuler] First of all, I would like to express my discomfort with the 
way a few things are happening. In March 2015 the Apache website described 
2.0.14 as the most stable version recommended for production, and in November 
(just seven months later) the message from the community is that 2.0.x is 
effectively EOL. It takes months to roll out a Cassandra upgrade to all your 
clients, and by the time they are all on the latest Cassandra version you learn 
that virtually no community help is available for it. I understand the 
fast-paced environment, but can we revisit the strategy so that once a version 
is declared "stable" it is supported by the community for at least one year?

Coming back to the issue, I see a similar problem in the 2.2.0 code too, and we 
are still facing it. Merkle tree requests across DCs are getting lost and then 
repair hangs. With DEBUG logging enabled, starting repair on a node produced 
"error writing to /X.X.X.X
java.io.IOException: Connection timed out" for 2 nodes in the remote DC, yet 
Merkle trees were received from those 2 nodes. We got no error for the 3rd node 
in the remote DC, and strangely that was the node which never received the 
Merkle tree request. Moreover, we observed that hinted handoff started for the 
3rd node from the node being repaired, but hint replay timed out as well.

Please find attached the logs of node 10.X.15.115. The Merkle tree request 
never reached 10.X.14.115 and thus no response was received; absolutely nothing 
was logged on 10.X.14.115.

Can we reopen the ticket?







was (Author: eanujwa):
[#Michael Shuler]
First of all, I would like to express my discomfort with the way a few things 
are happening. In March 2015 the Apache website described 2.0.14 as the most 
stable version recommended for production, and in November (just seven months 
later) the message from the community is that 2.0.x is effectively EOL. It 
takes months to roll out a Cassandra upgrade to all your clients, and by the 
time they are all on the latest Cassandra version you learn that virtually no 
community help is available for it. I understand the fast-paced environment, 
but can we revisit the strategy so that once a version is declared "stable" it 
is supported by the community for at least one year?

Coming back to the issue, I see a similar problem in the 2.2.0 code too, and we 
are still facing it. Merkle tree requests across DCs are getting lost and then 
repair hangs. With DEBUG logging enabled, starting repair on a node produced 
"error writing to /X.X.X.X
java.io.IOException: Connection timed out" for 2 nodes in the remote DC, yet 
Merkle trees were received from those 2 nodes. We got no error for the 3rd node 
in the remote DC, and strangely that was the node which never received the 
Merkle tree request. Moreover, we observed that hinted handoff started for the 
3rd node from the node being repaired, but hint replay timed out as well.

Please find attached the logs of node 10.X.15.115. The Merkle tree request 
never reached 10.X.14.115 and thus no response was received; absolutely nothing 
was logged on 10.X.14.115.

Can we reopen the ticket?






> Repair hangs
> 
>
> Key: CASSANDRA-7904
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
> Project: Cassandra
>  Issue Type: Bug
> Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, 
> java version "1.7.0_45"
>Reporter: Duncan Sands
> Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, 
> ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres.  Not used on the weekend, so 
> repair is run on all nodes (in a staggered fashion) on the weekend.  Nodetool 
> options: -par -pr.  There is usually some overlap in the repairs: repair on 
> one node may well still be running when repair is started on the next node.  
> Repair hangs for some of the nodes almost every weekend.  It hung last 
> weekend, here are the details:
> In the whole cluster, only one node had an exception since C* was last 
> restarted.  This node is 192.168.60.136 and the exception is harmless: a 
> client disconnected abruptly.
> tpstats
>   4 nodes have a non-zero value for "active" or "pending" in 
> AntiEntropySessions.  These nodes all have Active => 1 and Pending => 1.  The 
> nodes are:
>   192.168.21.13 (data centre R)
>   192.168.60.134 (data centre A)
>   192.168.60.136 (data centre A)
>   172.18.68.138 (data centre Z)
> compactionstats:
>   No compactions.  All nodes have:
> pending tasks: 0
> Active compaction remaining time :n/a
> netstats:
>   All except one node

[jira] [Comment Edited] (CASSANDRA-7904) Repair hangs

2015-11-27 Thread Anuj Wadehra (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030201#comment-15030201
 ] 

Anuj Wadehra edited comment on CASSANDRA-7904 at 11/27/15 7:52 PM:
---

[#Michael Shuler]
First of all, I would like to express my discomfort with the way a few things 
are happening. In March 2015 the Apache website described 2.0.14 as the most 
stable version recommended for production, and in November (just seven months 
later) the message from the community is that 2.0.x is effectively EOL. It 
takes months to roll out a Cassandra upgrade to all your clients, and by the 
time they are all on the latest Cassandra version you learn that virtually no 
community help is available for it. I understand the fast-paced environment, 
but can we revisit the strategy so that once a version is declared "stable" it 
is supported by the community for at least one year?

Coming back to the issue, I see a similar problem in the 2.2.0 code too, and we 
are still facing it. Merkle tree requests across DCs are getting lost and then 
repair hangs. With DEBUG logging enabled, starting repair on a node produced 
"error writing to /X.X.X.X
java.io.IOException: Connection timed out" for 2 nodes in the remote DC, yet 
Merkle trees were received from those 2 nodes. We got no error for the 3rd node 
in the remote DC, and strangely that was the node which never received the 
Merkle tree request. Moreover, we observed that hinted handoff started for the 
3rd node from the node being repaired, but hint replay timed out as well.

Please find attached the logs of node 10.X.15.115. The Merkle tree request 
never reached 10.X.14.115 and thus no response was received; absolutely nothing 
was logged on 10.X.14.115.

Can we reopen the ticket?







was (Author: eanujwa):
[#mshuler]
First of all, I would like to express my discomfort with the way a few things 
are happening. In March 2015 the Apache website described 2.0.14 as the most 
stable version recommended for production, and in November (just seven months 
later) the message from the community is that 2.0.x is effectively EOL. It 
takes months to roll out a Cassandra upgrade to all your clients, and by the 
time they are all on the latest Cassandra version you learn that virtually no 
community help is available for it. I understand the fast-paced environment, 
but can we revisit the strategy so that once a version is declared "stable" it 
is supported by the community for at least one year?

Coming back to the issue, I see a similar problem in the 2.2.0 code too, and we 
are still facing it. Merkle tree requests across DCs are getting lost and then 
repair hangs. With DEBUG logging enabled, starting repair on a node produced 
"error writing to /X.X.X.X
java.io.IOException: Connection timed out" for 2 nodes in the remote DC, yet 
Merkle trees were received from those 2 nodes. We got no error for the 3rd node 
in the remote DC, and strangely that was the node which never received the 
Merkle tree request. Moreover, we observed that hinted handoff started for the 
3rd node from the node being repaired, but hint replay timed out as well.

Please find attached the logs of node 10.X.15.115. The Merkle tree request 
never reached 10.X.14.115 and thus no response was received; absolutely nothing 
was logged on 10.X.14.115.

Can we reopen the ticket?






> Repair hangs
> 
>
> Key: CASSANDRA-7904
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
> Project: Cassandra
>  Issue Type: Bug
> Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, 
> java version "1.7.0_45"
>Reporter: Duncan Sands
> Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, 
> ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres.  Not used on the weekend, so 
> repair is run on all nodes (in a staggered fashion) on the weekend.  Nodetool 
> options: -par -pr.  There is usually some overlap in the repairs: repair on 
> one node may well still be running when repair is started on the next node.  
> Repair hangs for some of the nodes almost every weekend.  It hung last 
> weekend, here are the details:
> In the whole cluster, only one node had an exception since C* was last 
> restarted.  This node is 192.168.60.136 and the exception is harmless: a 
> client disconnected abruptly.
> tpstats
>   4 nodes have a non-zero value for "active" or "pending" in 
> AntiEntropySessions.  These nodes all have Active => 1 and Pending => 1.  The 
> nodes are:
>   192.168.21.13 (data centre R)
>   192.168.60.134 (data centre A)
>   192.168.60.136 (data centre A)
>   172.18.68.138 (data centre Z)
> compactionstats:
>   No compactions.  All nodes have:
> pending tasks: 0
> Active compaction remaining time :n/a
> netstats:
>   All except one node have no

[jira] [Comment Edited] (CASSANDRA-7904) Repair hangs

2015-11-17 Thread Mark Jumaga (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009271#comment-15009271
 ] 

Mark Jumaga edited comment on CASSANDRA-7904 at 11/17/15 7:20 PM:
--

Thanks for the notification about leaving old versions behind, Michael. 
I think the point of the discussion here is that the bug, which is also causing 
us serious problems, is not fixed in ReleaseVersion 2.0.14.459, contrary to the 
resolved status of CASSANDRA-7909.
One more thing: this causes repair to hang, and while it eventually recovers, 
my team cannot tell whether that recovery actually completed its work. The next 
key range is listed, but in a vnode configuration the specific range whose 
communication failed may in fact be left behind rather than restarted. It is 
frankly impossible to know whether the repair completed successfully. The hang 
also pushes the actual repair in our case from hours to days, making repair on 
vnodes a more difficult endeavor than it already is.



was (Author: markj):
Thanks for the notification about leaving old versions behind, Michael. 
I think the point of the discussion here is that the bug, which is also causing 
us serious problems, is not fixed in ReleaseVersion 2.0.14.459, contrary to the 
resolved status of CASSANDRA-7909.


> Repair hangs
> 
>
> Key: CASSANDRA-7904
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
> Project: Cassandra
>  Issue Type: Bug
> Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, 
> java version "1.7.0_45"
>Reporter: Duncan Sands
> Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, 
> ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres.  Not used on the weekend, so 
> repair is run on all nodes (in a staggered fashion) on the weekend.  Nodetool 
> options: -par -pr.  There is usually some overlap in the repairs: repair on 
> one node may well still be running when repair is started on the next node.  
> Repair hangs for some of the nodes almost every weekend.  It hung last 
> weekend, here are the details:
> In the whole cluster, only one node had an exception since C* was last 
> restarted.  This node is 192.168.60.136 and the exception is harmless: a 
> client disconnected abruptly.
> tpstats
>   4 nodes have a non-zero value for "active" or "pending" in 
> AntiEntropySessions.  These nodes all have Active => 1 and Pending => 1.  The 
> nodes are:
>   192.168.21.13 (data centre R)
>   192.168.60.134 (data centre A)
>   192.168.60.136 (data centre A)
>   172.18.68.138 (data centre Z)
> compactionstats:
>   No compactions.  All nodes have:
> pending tasks: 0
> Active compaction remaining time :n/a
> netstats:
>   All except one node have nothing.  One node (192.168.60.131, not one of the 
> nodes listed in the tpstats section above) has (note the Responses Pending 
> value of 1):
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 4233
> Mismatch (Blocking): 0
> Mismatch (Background): 243
> Pool Name    Active   Pending   Completed
> Commands     n/a      0         34785445
> Responses    n/a      1         38567167
> Repair sessions
>   I looked for repair sessions that failed to complete.  On 3 of the 4 nodes 
> mentioned in tpstats above I found that they had sent merkle tree requests 
> and got responses from all but one node.  In the log file for the node that 
> failed to respond there is no sign that it ever received the request.  On 1 
> node (172.18.68.138) it looks like responses were received from every node, 
> some streaming was done, and then... nothing.  Details:
>   Node 192.168.21.13 (data centre R):
> Sent merkle trees to /172.18.33.24, /192.168.60.140, /192.168.60.142, 
> /172.18.68.139, /172.18.68.138, /172.18.33.22, /192.168.21.13 for table 
> brokers, never got a response from /172.18.68.139.  On /172.18.68.139, just 
> before this time it sent a response for the same repair session but a 
> different table, and there is no record of it receiving a request for table 
> brokers.
>   Node 192.168.60.134 (data centre A):
> Sent merkle trees to /172.18.68.139, /172.18.68.138, /192.168.60.132, 
> /192.168.21.14, /192.168.60.134 for table swxess_outbound, never got a 
> response from /172.18.68.138.  On /172.18.68.138, just before this time it 
> sent a response for the same repair session but a different table, and there 
> is no record of it receiving a request for table swxess_outbound.
>   Node 192.168.60.136 (data centre A):
> Sent merkle trees to /192.168.60.142, /172.18.68.139, /192.168.60.136 for 
> table rollups7200, never got a response from /172.18.68.139.  This repair 
> session

[jira] [Comment Edited] (CASSANDRA-7904) Repair hangs

2015-11-11 Thread Anuj Wadehra (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001053#comment-15001053
 ] 

Anuj Wadehra edited comment on CASSANDRA-7904 at 11/11/15 8:58 PM:
---

[#Aleksey Yeschenko] I am sorry, but I think this issue must be reopened. We 
are facing it in 2.0.14. You marked it as a duplicate of CASSANDRA-7909, which 
was fixed in 2.0.11, so the issue should not be present in 2.0.14.

We have 2 DCs with 3 nodes each, at remote locations with 10 Gbps connectivity. 
We are able to complete repair on 5 nodes. On only one node in DC2 we are 
unable to complete repair (-par -pr), as it always hangs: the node sends Merkle 
tree requests, but one or more nodes in DC1 (remote) never show that they sent 
the Merkle tree reply to the requesting node. Repair hangs indefinitely.

After increasing request_timeout_in_ms on the affected node, we were able to 
run repair successfully on one of two occasions.

I analyzed some code in OutboundTcpConnection.java of 2.0.14 and see multiple 
possible issues there (a simplified sketch of this control flow follows the 
list):
1. The scenario where 2 consecutive Merkle tree requests fail is not handled. 
No exception is printed in the logs in such a case, tpstats does not show the 
repair messages as dropped, and repair hangs indefinitely.
2. Only an IOException leads to a retry of a request. If a RuntimeException 
occurs, no retry is done and the exception is logged at DEBUG instead of ERROR. 
Repair should hang here too.
3. Given that the isTimeOut method always returns false for a non-droppable 
message such as a Merkle tree request (verb=REPAIR_MESSAGE), why does 
increasing the request timeout solve the problem for so many people 
([#Duncan Sands], [#Razi Khaja] and me)? Is the logic broken?
4. Increasing the request timeout can only be a temporary workaround, not a 
fix. A root cause analysis and a permanent fix are needed.
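To make points 1-3 concrete, here is a heavily simplified, hedged sketch of the 
kind of write/retry loop being described. It is not the actual 
OutboundTcpConnection source; the class, field and helper names are 
illustrative only.

{code}
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Queue;

// Simplified sketch of the write/retry pattern discussed above (illustrative,
// not the real Cassandra code).
class OutboundConnectionSketch
{
    private static final long TIMEOUT_MS = 10000;

    void writeLoop(Queue<QueuedMessage> backlog, DataOutputStream out)
    {
        QueuedMessage qm;
        while ((qm = backlog.poll()) != null)
        {
            // Point 3: a non-droppable verb (e.g. a Merkle tree request) is
            // never treated as timed out; only droppable messages are
            // discarded here, and they are discarded silently.
            if (qm.droppable && System.currentTimeMillis() - qm.createdAt > TIMEOUT_MS)
                continue;

            try
            {
                qm.serializeTo(out);
            }
            catch (IOException e)
            {
                // Point 2: an IOException triggers a reconnect and one retry.
                reconnectAndRetryOnce(qm);
            }
            catch (RuntimeException e)
            {
                // Any other failure is logged at DEBUG and the message is
                // simply lost; the repair coordinator never finds out.
                debug("error writing message", e);
            }
        }
    }

    // Hypothetical helpers, stubbed for the sketch.
    void reconnectAndRetryOnce(QueuedMessage qm) { }
    void debug(String msg, Exception e) { }

    static class QueuedMessage
    {
        boolean droppable;
        long createdAt = System.currentTimeMillis();
        void serializeTo(DataOutputStream out) throws IOException { }
    }
}
{code}

If the write path looks anything like this, a message that fails twice in a 
row, or fails with something other than an IOException, disappears without an 
ERROR log, which would explain why the hang is so hard to diagnose from the 
logs.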
 


was (Author: eanujwa):
[#Aleksey Yeschenko] I am sorry, but I think this issue must be reopened. We 
are facing it in 2.0.14. You marked it as a duplicate of CASSANDRA-7909, which 
was fixed in 2.0.11, so the issue should not be present in 2.0.14.

We have 2 DCs at remote locations with 10 Gbps connectivity. On only one node 
in DC2 we are unable to complete repair (-par -pr), as it always hangs: the 
node sends Merkle tree requests, but one or more nodes in DC1 (remote) never 
show that they sent the Merkle tree reply to the requesting node. Repair hangs 
indefinitely.

After increasing request_timeout_in_ms on the affected node, we were able to 
run repair successfully on one of two occasions.

I analyzed some code in OutboundTcpConnection.java of 2.0.14 and see multiple 
possible issues there:
1. The scenario where 2 consecutive Merkle tree requests fail is not handled. 
No exception is printed in the logs in such a case, tpstats does not show the 
repair messages as dropped, and repair hangs indefinitely.
2. Only an IOException leads to a retry of a request. If a RuntimeException 
occurs, no retry is done and the exception is logged at DEBUG instead of ERROR. 
Repair should hang here too.
3. Given that the isTimeOut method always returns false for a non-droppable 
message such as a Merkle tree request (verb=REPAIR_MESSAGE), why does 
increasing the request timeout solve the problem for so many people 
([#Duncan Sands], [#Razi Khaja] and me)? Is the logic broken?
4. Increasing the request timeout can only be a temporary workaround, not a 
fix. A root cause analysis and a permanent fix are needed.
 

> Repair hangs
> 
>
> Key: CASSANDRA-7904
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
> Project: Cassandra
>  Issue Type: Bug
> Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, 
> java version "1.7.0_45"
>Reporter: Duncan Sands
> Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, 
> ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres.  Not used on the weekend, so 
> repair is run on all nodes (in a staggered fashion) on the weekend.  Nodetool 
> options: -par -pr.  There is usually some overlap in the repairs: repair on 
> one node may well still be running when repair is started on the next node.  
> Repair hangs for some of the nodes almost every weekend.  It hung last 
> weekend, here are the details:
> In the whole cluster, only one node had an exception since C* was last 
> restarted.  This node is 192.168.60.136 and the exception is harmless: a 
> client disconnected abruptly.
> tpstats
>   4 nodes have a non-zero value for "active" or "pending" in 
> AntiEntropySessions.  These nodes all have Active => 1 and Pending => 1.  The 
> nodes are:
>   192.168.21.13 (data centre R)
>   192.168.60.134 (data centre A)
>   192.168.60.136 (data centre A)
>   172.18.68.138 (data centre Z)
> compactionstats:
>   No compactions.  All nodes have:
> pending tasks: 0
> Active comp

[jira] [Comment Edited] (CASSANDRA-7904) Repair hangs

2015-11-11 Thread Anuj Wadehra (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001053#comment-15001053
 ] 

Anuj Wadehra edited comment on CASSANDRA-7904 at 11/11/15 8:51 PM:
---

[#Aleksey Yeschenko] I am sorry, but I think this issue must be reopened. We 
are facing it in 2.0.14. You marked it as a duplicate of CASSANDRA-7909, which 
was fixed in 2.0.11, so the issue should not be present in 2.0.14.

We have 2 DCs at remote locations with 10 Gbps connectivity. On only one node 
in DC2 we are unable to complete repair (-par -pr), as it always hangs: the 
node sends Merkle tree requests, but one or more nodes in DC1 (remote) never 
show that they sent the Merkle tree reply to the requesting node. Repair hangs 
indefinitely.

After increasing request_timeout_in_ms on the affected node, we were able to 
run repair successfully on one of two occasions.

I analyzed some code in OutboundTcpConnection.java of 2.0.14 and see multiple 
possible issues there:
1. The scenario where 2 consecutive Merkle tree requests fail is not handled. 
No exception is printed in the logs in such a case, tpstats does not show the 
repair messages as dropped, and repair hangs indefinitely.
2. Only an IOException leads to a retry of a request. If a RuntimeException 
occurs, no retry is done and the exception is logged at DEBUG instead of ERROR. 
Repair should hang here too.
3. Given that the isTimeOut method always returns false for a non-droppable 
message such as a Merkle tree request (verb=REPAIR_MESSAGE), why does 
increasing the request timeout solve the problem for so many people 
([#Duncan Sands], [#Razi Khaja] and me)? Is the logic broken?
4. Increasing the request timeout can only be a temporary workaround, not a 
fix. A root cause analysis and a permanent fix are needed.
 


was (Author: eanujwa):
[#Aleksey Yeschenko] I am sorry, but I think this issue must be reopened. We 
are facing it in 2.0.14. You marked it as a duplicate of CASSANDRA-7909, which 
was fixed in 2.0.11, so the issue should not be present in 2.0.14.

We have 2 DCs at remote locations with 10 Gbps connectivity. On only one node 
in DC2 we are unable to complete repair (-par -pr), as it always hangs: the 
node sends Merkle tree requests, but one or more nodes in DC1 (remote) never 
show that they sent the Merkle tree reply to the requesting node. Repair hangs 
indefinitely.

After increasing request_timeout_in_ms on the affected node, we were able to 
run repair successfully on one of two occasions.

I analyzed some code in OutboundTcpConnection.java of 2.0.14 and see multiple 
possible issues there:
1. The scenario where 2 consecutive Merkle tree requests fail is not handled. 
No exception is printed in the logs in such a case, tpstats does not show the 
repair messages as dropped, and repair hangs indefinitely.
2. Only an IOException leads to a retry of a request. If a RuntimeException 
occurs, no retry is done and the exception is logged at DEBUG instead of ERROR. 
Repair should hang here too.
3. Given that the isTimeOut method always returns false for a non-droppable 
message such as a Merkle tree request (verb=REPAIR_MESSAGE), why does 
increasing the request timeout solve the problem for so many people? Is the 
logic broken?

Exception handling must be improved. It is impossible to troubleshoot such an 
issue in production, as no relevant error is logged.

> Repair hangs
> 
>
> Key: CASSANDRA-7904
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
> Project: Cassandra
>  Issue Type: Bug
> Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, 
> java version "1.7.0_45"
>Reporter: Duncan Sands
> Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, 
> ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres.  Not used on the weekend, so 
> repair is run on all nodes (in a staggered fashion) on the weekend.  Nodetool 
> options: -par -pr.  There is usually some overlap in the repairs: repair on 
> one node may well still be running when repair is started on the next node.  
> Repair hangs for some of the nodes almost every weekend.  It hung last 
> weekend, here are the details:
> In the whole cluster, only one node had an exception since C* was last 
> restarted.  This node is 192.168.60.136 and the exception is harmless: a 
> client disconnected abruptly.
> tpstats
>   4 nodes have a non-zero value for "active" or "pending" in 
> AntiEntropySessions.  These nodes all have Active => 1 and Pending => 1.  The 
> nodes are:
>   192.168.21.13 (data centre R)
>   192.168.60.134 (data centre A)
>   192.168.60.136 (data centre A)
>   172.18.68.138 (data centre Z)
> compactionstats:
>   No compactions.  All nodes have:
> pending tasks: 0
> Active compaction remaining time :n/a
> netstats:
>   All except one node have nothing.  One node (192.168.60.131, not one

[jira] [Comment Edited] (CASSANDRA-7904) Repair hangs

2015-11-11 Thread Anuj Wadehra (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001053#comment-15001053
 ] 

Anuj Wadehra edited comment on CASSANDRA-7904 at 11/11/15 8:44 PM:
---

[#Aleksey Yeschenko] I am sorry, but I think this issue must be reopened. We 
are facing it in 2.0.14. You marked it as a duplicate of CASSANDRA-7909, which 
was fixed in 2.0.11, so the issue should not be present in 2.0.14.

We have 2 DCs at remote locations with 10 Gbps connectivity. On only one node 
in DC2 we are unable to complete repair (-par -pr), as it always hangs: the 
node sends Merkle tree requests, but one or more nodes in DC1 (remote) never 
show that they sent the Merkle tree reply to the requesting node. Repair hangs 
indefinitely.

After increasing request_timeout_in_ms on the affected node, we were able to 
run repair successfully on one of two occasions.

I analyzed some code in OutboundTcpConnection.java of 2.0.14 and see multiple 
possible issues there:
1. The scenario where 2 consecutive Merkle tree requests fail is not handled. 
No exception is printed in the logs in such a case, tpstats does not show the 
repair messages as dropped, and repair hangs indefinitely.
2. Only an IOException leads to a retry of a request. If a RuntimeException 
occurs, no retry is done and the exception is logged at DEBUG instead of ERROR. 
Repair should hang here too.
3. Given that the isTimeOut method always returns false for a non-droppable 
message such as a Merkle tree request (verb=REPAIR_MESSAGE), why does 
increasing the request timeout solve the problem for so many people? Is the 
logic broken?

Exception handling must be improved. It is impossible to troubleshoot such an 
issue in production, as no relevant error is logged.


was (Author: eanujwa):
[#Aleksey Yeschenko] I am sorry, but I think this issue must be reopened. We 
are facing it in 2.0.14. You marked it as a duplicate of CASSANDRA-7909, which 
was fixed in 2.0.11, so the issue should not be present in 2.0.14.

We have 2 DCs at remote locations with 10 Gbps connectivity. On only one node 
in DC2 we are unable to complete repair (-par -pr), as it always hangs: the 
node sends Merkle tree requests, but one or more nodes in DC1 (remote) never 
show that they sent the Merkle tree reply to the requesting node. Repair hangs 
indefinitely.

After increasing request_timeout_in_ms on the affected node, we were able to 
run repair successfully on one of two occasions.

I analyzed some code in OutboundTcpConnection.java of 2.0.14 and see multiple 
possible issues there:
1. The scenario where 2 consecutive Merkle tree requests fail is not handled. 
No exception is printed in the logs in such a case, tpstats does not show the 
repair messages as dropped, and repair hangs indefinitely.
2. Only an IOException leads to a retry of a request. If a RuntimeException 
occurs, no retry is done and the exception is logged at DEBUG instead of ERROR. 
Repair should hang here too.
3. Given that the isTimeOut method always returns false for a non-droppable 
message such as a Merkle tree request (verb=REPAIR_MESSAGE), why does 
increasing the request timeout solve the problem for so many people? Is the 
logic broken?

Exception handling must be improved. It is impossible to troubleshoot such an 
issue in production, as no relevant error is logged.

> Repair hangs
> 
>
> Key: CASSANDRA-7904
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
> Project: Cassandra
>  Issue Type: Bug
> Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, 
> java version "1.7.0_45"
>Reporter: Duncan Sands
> Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, 
> ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres.  Not used on the weekend, so 
> repair is run on all nodes (in a staggered fashion) on the weekend.  Nodetool 
> options: -par -pr.  There is usually some overlap in the repairs: repair on 
> one node may well still be running when repair is started on the next node.  
> Repair hangs for some of the nodes almost every weekend.  It hung last 
> weekend, here are the details:
> In the whole cluster, only one node had an exception since C* was last 
> restarted.  This node is 192.168.60.136 and the exception is harmless: a 
> client disconnected abruptly.
> tpstats
>   4 nodes have a non-zero value for "active" or "pending" in 
> AntiEntropySessions.  These nodes all have Active => 1 and Pending => 1.  The 
> nodes are:
>   192.168.21.13 (data centre R)
>   192.168.60.134 (data centre A)
>   192.168.60.136 (data centre A)
>   172.18.68.138 (data centre Z)
> compactionstats:
>   No compactions.  All nodes have:
> pending tasks: 0
> Active compaction remaining time :n/a
> netstats:
>   All except one node have nothing.  One node (192.168.60.131, not one of the 
> nodes listed in the tpstats section above) has (n

[jira] [Comment Edited] (CASSANDRA-7904) Repair hangs

2014-09-15 Thread Razi Khaja (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133990#comment-14133990
 ] 

Razi Khaja edited comment on CASSANDRA-7904 at 9/15/14 3:11 PM:


I increased my request_timeout_in_ms from 2 to 18 and repair has now been 
running for 2 hours without a *Lost notification*. In the comment I made above, 
repair command #10 for my keyspace megalink lost its notification within 4 
minutes, so the fact that the current repair is still running after 2 hours is 
a good sign.


was (Author: razi.kh...@gmail.com):
I increased my request_timeout_in_ms from 2 to 18 and repair is working now, 
so far for 2 hours without a *Lost notification*. In the comment I made above, 
repair command #10 for my keyspace megalink lost its notification within 4 
minutes, so the fact that my current repair is still running for 2 hours is a 
good sign.

> Repair hangs
> 
>
> Key: CASSANDRA-7904
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, 
> java version "1.7.0_45"
>Reporter: Duncan Sands
> Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, 
> ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres.  Not used on the weekend, so 
> repair is run on all nodes (in a staggered fashion) on the weekend.  Nodetool 
> options: -par -pr.  There is usually some overlap in the repairs: repair on 
> one node may well still be running when repair is started on the next node.  
> Repair hangs for some of the nodes almost every weekend.  It hung last 
> weekend, here are the details:
> In the whole cluster, only one node had an exception since C* was last 
> restarted.  This node is 192.168.60.136 and the exception is harmless: a 
> client disconnected abruptly.
> tpstats
>   4 nodes have a non-zero value for "active" or "pending" in 
> AntiEntropySessions.  These nodes all have Active => 1 and Pending => 1.  The 
> nodes are:
>   192.168.21.13 (data centre R)
>   192.168.60.134 (data centre A)
>   192.168.60.136 (data centre A)
>   172.18.68.138 (data centre Z)
> compactionstats:
>   No compactions.  All nodes have:
> pending tasks: 0
> Active compaction remaining time :n/a
> netstats:
>   All except one node have nothing.  One node (192.168.60.131, not one of the 
> nodes listed in the tpstats section above) has (note the Responses Pending 
> value of 1):
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 4233
> Mismatch (Blocking): 0
> Mismatch (Background): 243
> Pool Name    Active   Pending   Completed
> Commands     n/a      0         34785445
> Responses    n/a      1         38567167
> Repair sessions
>   I looked for repair sessions that failed to complete.  On 3 of the 4 nodes 
> mentioned in tpstats above I found that they had sent merkle tree requests 
> and got responses from all but one node.  In the log file for the node that 
> failed to respond there is no sign that it ever received the request.  On 1 
> node (172.18.68.138) it looks like responses were received from every node, 
> some streaming was done, and then... nothing.  Details:
>   Node 192.168.21.13 (data centre R):
> Sent merkle trees to /172.18.33.24, /192.168.60.140, /192.168.60.142, 
> /172.18.68.139, /172.18.68.138, /172.18.33.22, /192.168.21.13 for table 
> brokers, never got a response from /172.18.68.139.  On /172.18.68.139, just 
> before this time it sent a response for the same repair session but a 
> different table, and there is no record of it receiving a request for table 
> brokers.
>   Node 192.168.60.134 (data centre A):
> Sent merkle trees to /172.18.68.139, /172.18.68.138, /192.168.60.132, 
> /192.168.21.14, /192.168.60.134 for table swxess_outbound, never got a 
> response from /172.18.68.138.  On /172.18.68.138, just before this time it 
> sent a response for the same repair session but a different table, and there 
> is no record of it receiving a request for table swxess_outbound.
>   Node 192.168.60.136 (data centre A):
> Sent merkle trees to /192.168.60.142, /172.18.68.139, /192.168.60.136 for 
> table rollups7200, never got a response from /172.18.68.139.  This repair 
> session is never mentioned in the /172.18.68.139 log.
>   Node 172.18.68.138 (data centre Z):
> The issue here seems to be repair session 
> #a55c16e1-35eb-11e4-8e7e-51c077eaf311.  It got responses for all its merkle 
> tree requests, did some streaming, but seems to have stopped after finishing 
> with one table (rollups60).  I found it as follows: it is the only repair for 
> which there is no "se

[jira] [Comment Edited] (CASSANDRA-7904) Repair hangs

2014-09-12 Thread Razi Khaja (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132122#comment-14132122
 ] 

Razi Khaja edited comment on CASSANDRA-7904 at 9/12/14 9:30 PM:


We have 3 data centers, each with 4 nodes (running on physical machines not on 
EC2). We have been running Cassandra 2.0.6 and have not been able to 
successfully run *nodetool repair* on any of our nodes (except when no data or 
almost no data was loaded into our keyspaces). We upgraded to Cassandra 2.0.10 
hoping that this issue of *Lost notification* during the *nodetool repair* 
would be fixed, but as you can see from the log below, we still have not been 
able to successfully run *nodetool repair*.

{code}
[2014-09-12 11:08:02,131] Nothing to repair for keyspace 'system'
[2014-09-12 11:08:02,179] Starting repair command #10, repairing 1389 ranges 
for keyspace megalink
[2014-09-12 11:12:02,196] Lost notification. You should check server log for 
repair status of keyspace megalink
[2014-09-12 11:12:02,258] Starting repair command #11, repairing 1389 ranges 
for keyspace megalink_dev
[2014-09-12 11:12:02,331] Repair command #11 finished
[2014-09-12 11:12:02,346] Starting repair command #12, repairing 512 ranges for 
keyspace system_traces
[2014-09-12 11:13:02,349] Lost notification. You should check server log for 
repair status of keyspace system_traces
{code}

If there are any more details needed to help solve this problem, please let me 
know and I will do my best to provide them. Please let me know how I can help.
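For context on those *Lost notification* lines: nodetool drives repair over JMX 
and prints that warning when a progress notification from the server goes 
missing or does not arrive in time, even though the repair may still be running 
server-side. As a monitoring workaround, one can subscribe to the 
StorageService notifications directly. Below is a minimal, hedged sketch; the 
host, port and fixed listen window are assumptions for the example, while the 
MBean name is the standard Cassandra StorageService object name.

{code}
import javax.management.MBeanServerConnection;
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Tails repair-related JMX notifications straight from the node so that
// progress is visible even if nodetool reports "Lost notification".
public class RepairNotificationTail
{
    public static void main(String[] args) throws Exception
    {
        // Assumed host/port for the example; 7199 is the default JMX port.
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName storageService =
                new ObjectName("org.apache.cassandra.db:type=StorageService");

            NotificationListener listener = new NotificationListener()
            {
                public void handleNotification(Notification n, Object handback)
                {
                    // Print every notification; repair progress messages carry
                    // the same text nodetool would normally display.
                    System.out.println(n.getType() + ": " + n.getMessage());
                }
            };
            mbs.addNotificationListener(storageService, listener, null, null);

            // Keep the JMX connection open while the repair runs (10 minutes
            // here; adjust as needed).
            Thread.sleep(10 * 60 * 1000);
        }
        finally
        {
            connector.close();
        }
    }
}
{code}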


was (Author: razi.kh...@gmail.com):
We have 3 data centers, each with 4 nodes (running on physical machines not on 
EC2). We have been running Cassandra 2.0.6 and have not been able to 
successfully run *nodetool repair* on any of our nodes (except when no data or 
almost no data was loaded into our keyspaces). We upgraded to Cassandra 2.0.10 
hoping that this issue of *Lost notification* during the *nodetool repair* 
would be fixed, but as you can see from the log below, we still have not been 
able to successfully run *nodetool repair*.

{code}
[2014-09-12 11:08:02,131] Nothing to repair for keyspace 'system'
[2014-09-12 11:08:02,179] Starting repair command #10, repairing 1389 ranges 
for keyspace megalink
[2014-09-12 11:12:02,196] Lost notification. You should check server log for 
repair status of keyspace megalink
[2014-09-12 11:12:02,258] Starting repair command #11, repairing 1389 ranges 
for keyspace megalink_dev
[2014-09-12 11:12:02,331] Repair command #11 finished
[2014-09-12 11:12:02,346] Starting repair command #12, repairing 512 ranges for 
keyspace system_traces
[2014-09-12 11:13:02,349] Lost notification. You should check server log for 
repair status of keyspace system_traces
{code}

If there are any more details needed to help solve this problem, please let me 
know and I will do my best to provide them.

> Repair hangs
> 
>
> Key: CASSANDRA-7904
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, 
> java version "1.7.0_45"
>Reporter: Duncan Sands
> Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, 
> ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres.  Not used on the weekend, so 
> repair is run on all nodes (in a staggered fashion) on the weekend.  Nodetool 
> options: -par -pr.  There is usually some overlap in the repairs: repair on 
> one node may well still be running when repair is started on the next node.  
> Repair hangs for some of the nodes almost every weekend.  It hung last 
> weekend, here are the details:
> In the whole cluster, only one node had an exception since C* was last 
> restarted.  This node is 192.168.60.136 and the exception is harmless: a 
> client disconnected abruptly.
> tpstats
>   4 nodes have a non-zero value for "active" or "pending" in 
> AntiEntropySessions.  These nodes all have Active => 1 and Pending => 1.  The 
> nodes are:
>   192.168.21.13 (data centre R)
>   192.168.60.134 (data centre A)
>   192.168.60.136 (data centre A)
>   172.18.68.138 (data centre Z)
> compactionstats:
>   No compactions.  All nodes have:
> pending tasks: 0
> Active compaction remaining time :n/a
> netstats:
>   All except one node have nothing.  One node (192.168.60.131, not one of the 
> nodes listed in the tpstats section above) has (note the Responses Pending 
> value of 1):
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 4233
> Mismatch (Blocking): 0
> Mismatch (Background): 243
> Pool Name    Active   Pending   Completed
> Commands     n/a