[jira] [Commented] (CASSANDRA-7904) Repair hangs

2015-11-28 Thread Anuj Wadehra (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15030622#comment-15030622
 ] 

Anuj Wadehra commented on CASSANDRA-7904:
-

If CASSANDRA-10113 is the issue, why is the repair message mostly expiring on 
only one of the nodes in one DC? And why, most of the time, does only the remote 
DC node fail to get the merkle tree message?

Also, if you look at the attached logs, hinted handoff is timing out for the 
same node whose merkle tree response was missing. Are they related?
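
A toy sketch of the expiry behaviour being asked about (illustrative only, not 
Cassandra's actual code; the class, verbs and numbers below are made up): if 
outbound messages are dropped once they have waited in a peer's send queue 
longer than their timeout, then only the peer whose connection is backlogged 
(typically the one behind the slow cross-DC link) accumulates enough queue delay 
for a merkle tree request to expire, which would match the failure showing up on 
just one remote-DC node:

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model only: a per-peer outbound queue where a message is dropped if it
// has already waited longer than its timeout by the time it can be written.
public class QueueExpiryToy {
    static final long TIMEOUT_MS = 10_000;

    static class Msg {
        final String verb;
        final long enqueuedAtMs;
        Msg(String verb, long enqueuedAtMs) { this.verb = verb; this.enqueuedAtMs = enqueuedAtMs; }
    }

    // Drain a queue, pretending every write takes writeDelayMs (a slow link means a large delay).
    static void drain(String peer, Queue<Msg> q, long writeDelayMs) {
        long now = 0; // simulated clock in ms
        while (!q.isEmpty()) {
            Msg m = q.poll();
            if (now - m.enqueuedAtMs > TIMEOUT_MS)
                System.out.println(peer + ": dropped " + m.verb + " after waiting " + (now - m.enqueuedAtMs) + " ms");
            else
                System.out.println(peer + ": sent " + m.verb);
            now += writeDelayMs;
        }
    }

    public static void main(String[] args) {
        Queue<Msg> fastPeer = new ArrayDeque<>();
        Queue<Msg> slowPeer = new ArrayDeque<>();
        for (int i = 0; i < 20; i++) { // a backlog of earlier traffic queued ahead of the repair message
            fastPeer.add(new Msg("MUTATION", 0));
            slowPeer.add(new Msg("MERKLE_TREE_REQUEST".equals("") ? "" : "MUTATION", 0));
        }
        fastPeer.add(new Msg("MERKLE_TREE_REQUEST", 0));
        slowPeer.add(new Msg("MERKLE_TREE_REQUEST", 0));

        drain("local-DC peer ", fastPeer, 10);    // queue drains fast, nothing expires
        drain("remote-DC peer", slowPeer, 2_000); // backlogged link, the later messages expire
    }
}
{code}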




> Repair hangs
> 
>
> Key: CASSANDRA-7904
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
> Project: Cassandra
>  Issue Type: Bug
> Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, 
> java version "1.7.0_45"
>Reporter: Duncan Sands
> Attachments: Repair_DEBUG_On_OutboundTcpConnection.txt, 
> ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres.  Not used on the weekend, so 
> repair is run on all nodes (in a staggered fashion) on the weekend.  Nodetool 
> options: -par -pr.  There is usually some overlap in the repairs: repair on 
> one node may well still be running when repair is started on the next node.  
> Repair hangs for some of the nodes almost every weekend.  It hung last 
> weekend, here are the details:
> In the whole cluster, only one node had an exception since C* was last 
> restarted.  This node is 192.168.60.136 and the exception is harmless: a 
> client disconnected abruptly.
> tpstats
>   4 nodes have a non-zero value for "active" or "pending" in 
> AntiEntropySessions.  These nodes all have Active => 1 and Pending => 1.  The 
> nodes are:
>   192.168.21.13 (data centre R)
>   192.168.60.134 (data centre A)
>   192.168.60.136 (data centre A)
>   172.18.68.138 (data centre Z)
> compactionstats:
>   No compactions.  All nodes have:
> pending tasks: 0
> Active compaction remaining time :n/a
> netstats:
>   All except one node have nothing.  One node (192.168.60.131, not one of the 
> nodes listed in the tpstats section above) has (note the Responses Pending 
> value of 1):
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 4233
> Mismatch (Blocking): 0
> Mismatch (Background): 243
> Pool Name                    Active   Pending      Completed
> Commands                        n/a         0       34785445
> Responses                       n/a         1       38567167
> Repair sessions
>   I looked for repair sessions that failed to complete.  On 3 of the 4 nodes 
> mentioned in tpstats above I found that they had sent merkle tree requests 
> and got responses from all but one node.  In the log file for the node that 
> failed to respond there is no sign that it ever received the request.  On 1 
> node (172.18.68.138) it looks like responses were received from every node, 
> some streaming was done, and then... nothing.  Details:
>   Node 192.168.21.13 (data centre R):
> Sent merkle trees to /172.18.33.24, /192.168.60.140, /192.168.60.142, 
> /172.18.68.139, /172.18.68.138, /172.18.33.22, /192.168.21.13 for table 
> brokers, never got a response from /172.18.68.139.  On /172.18.68.139, just 
> before this time it sent a response for the same repair session but a 
> different table, and there is no record of it receiving a request for table 
> brokers.
>   Node 192.168.60.134 (data centre A):
> Sent merkle trees to /172.18.68.139, /172.18.68.138, /192.168.60.132, 
> /192.168.21.14, /192.168.60.134 for table swxess_outbound, never got a 
> response from /172.18.68.138.  On /172.18.68.138, just before this time it 
> sent a response for the same repair session but a different table, and there 
> is no record of it receiving a request for table swxess_outbound.
>   Node 192.168.60.136 (data centre A):
> Sent merkle trees to /192.168.60.142, /172.18.68.139, /192.168.60.136 for 
> table rollups7200, never got a response from /172.18.68.139.  This repair 
> session is never mentioned in the /172.18.68.139 log.
>   Node 172.18.68.138 (data centre Z):
> The issue here seems to be repair session 
> #a55c16e1-35eb-11e4-8e7e-51c077eaf311.  It got responses for all its merkle 
> tree requests, did some streaming, but seems to have stopped after finishing 
> with one table (rollups60).  I found it as follows: it is the only repair for 
> which there is no "session completed successfully" message in the log.
> Some log file snippets are attached.





[jira] [Commented] (CASSANDRA-7904) Repair hangs

2015-11-27 Thread Anuj Wadehra (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15030201#comment-15030201
 ] 

Anuj Wadehra commented on CASSANDRA-7904:
-

[#mshuler]
First of all, I would like to express my discomfort with the way a few things 
are happening. In March 2015 the Apache web site said that 2.0.14 was the most 
stable version recommended for production, and in November (just 7 months 
later) people get the message from the community that 2.0.x is effectively EOL. 
It takes months to roll out a Cassandra upgrade to all your clients, and by the 
time all your clients are on the latest Cassandra version, you learn that 
virtually no help is available from the community for that version. I 
understand the fast-paced environment, but can we revisit the strategy to make 
sure that once a version is declared "stable", it is supported by the community 
for at least one year?

Coming back to the issue, I see a similar problem in the 2.2.0 code too. We are 
still facing the issue: merkle tree requests across DCs are getting lost and 
then repair hangs. We have enabled DEBUG. When we started repair on a node, we 
got the error message "error writing to /X.X.X.X
java.io.IOException: Connection timed out" for 2 nodes in the remote DC. Merkle 
trees were received from these 2 nodes. But we did not get any error for the 
3rd node in the remote DC, and strangely that was the node which never got the 
merkle tree request. Moreover, we observed that hinted handoff started for the 
3rd node from the node being repaired, but the hint replay timed out too.

Please find attached the logs of node 10.X.15.115. The merkle tree request 
never reached 10.X.14.115 and thus no response was received. Absolutely no logs 
were printed on 10.X.14.115.

Can we reopen the ticket?






[jira] [Commented] (CASSANDRA-7904) Repair hangs

2015-11-27 Thread Yuki Morishita (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15030227#comment-15030227
 ] 

Yuki Morishita commented on CASSANDRA-7904:
---

CASSANDRA-10113 may be your case.



[jira] [Commented] (CASSANDRA-7904) Repair hangs

2015-11-17 Thread Mark Jumaga (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15009271#comment-15009271
 ] 

Mark Jumaga commented on CASSANDRA-7904:


Thanks for the notification, Michael, regarding leaving old versions behind. 
I think the point of the discussion here is that the bug, which is also causing 
us serious problems, is not fixed in ReleaseVersion 2.0.14.459, contrary to the 
resolved status of CASSANDRA-7909.




[jira] [Commented] (CASSANDRA-7904) Repair hangs

2015-11-11 Thread Duncan Sands (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000700#comment-15000700
 ] 

Duncan Sands commented on CASSANDRA-7904:
-

Repair works much better in 2.1, I haven't had any issues so far.



[jira] [Commented] (CASSANDRA-7904) Repair hangs

2015-11-11 Thread Michael Shuler (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001562#comment-15001562
 ] 

Michael Shuler commented on CASSANDRA-7904:
---

The cassandra-2.0 branch is no longer under development. Does this still occur 
in the latest 2.1.X release (2.1.11, as of today), or 2.2.X (2.2.3, currently)?
(The cassandra-2.1 branch will soon be discontinued for active development, 
since 3.0.0 was released, just so you're aware.)

[~eanujwa], if you're keen on digging around the code in at least the 2.1 
branch (starting at 2.2 would be better), seeing if you can identify your 
numbered scenarios above, and opening a new ticket on the topic with 2.1/2.2 as 
a starting point, that would make sense, given where active development in 
Cassandra is occurring.


[jira] [Commented] (CASSANDRA-7904) Repair hangs

2015-11-11 Thread Anuj Wadehra (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001053#comment-15001053
 ] 

Anuj Wadehra commented on CASSANDRA-7904:
-

[#Aleksey Yeschenko] I am sorry, but I think this issue must be reopened. We 
are facing this issue in 2.0.14. You marked it a duplicate of CASSANDRA-7909, 
which was fixed in 2.0.11, so the issue should not be present in 2.0.14.

We have 2 DCs at remote locations with 10GBps connectivity. On only one node in 
DC2, we are unable to complete repair (-par -pr) as it always hangs. The node 
sends merkle tree requests, but one or more nodes in DC1 (remote) never show 
that they sent the merkle tree reply to the requesting node. Repair hangs 
indefinitely.

After increasing request_timeout_in_ms on the affected node, we were able to 
run repair successfully on one of the two occasions.

I analyzed some code in OutboundTcpConnection.java of 2.0.14 and see multiple 
possible issues there:
1. The scenario where 2 consecutive merkle tree requests fail is not handled. 
No exception is printed in the logs in such a case, tpstats does not show the 
repair messages as dropped, and repair hangs indefinitely.
2. Only an IOException leads to a retry of the request. If some RuntimeException 
occurs, no retry is done and the exception is written at DEBUG instead of 
ERROR. Repair would hang here too.
3. Given that the isTimeOut method always returns false for a non-droppable 
message such as a merkle tree request (verb=REPAIR_MESSAGE), why does 
increasing the request timeout solve the problem for so many people? Is the 
logic broken?

Exception handling must be improved. It is impossible to troubleshoot such an 
issue in production, as no relevant error is logged.
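
For reference, a minimal sketch of the send-loop behaviour described in points 
1 to 3 above (the class and method names are illustrative, not copied from 
OutboundTcpConnection.java): a message that is already timed out is skipped 
silently, only an IOException triggers a single retry, any other exception is 
logged at DEBUG and the message is lost, and the timed-out check applies only 
to droppable verbs, so a REPAIR_MESSAGE should never expire through it:

{code:java}
import java.io.IOException;

// Illustrative sketch only; names and structure are simplified, not copied
// from org.apache.cassandra.net.OutboundTcpConnection.
class SendLoopSketch {
    static class QueuedMessage {
        final String verb;
        final boolean droppable;   // a merkle tree request (REPAIR_MESSAGE) would be non-droppable
        final long enqueuedAtMs;
        final long timeoutMs;

        QueuedMessage(String verb, boolean droppable, long enqueuedAtMs, long timeoutMs) {
            this.verb = verb;
            this.droppable = droppable;
            this.enqueuedAtMs = enqueuedAtMs;
            this.timeoutMs = timeoutMs;
        }

        // Point 3: only droppable verbs can ever be considered timed out.
        boolean isTimedOut(long nowMs) {
            return droppable && nowMs - enqueuedAtMs > timeoutMs;
        }
    }

    interface Wire {
        void write(QueuedMessage m) throws IOException;
    }

    static void send(QueuedMessage m, Wire wire, long nowMs) {
        if (m.isTimedOut(nowMs))
            return; // dropped silently: no log line, no dropped-message counter
        try {
            wire.write(m);
        } catch (IOException e) {
            try {
                wire.write(m); // point 2: an IOException gets one retry...
            } catch (IOException retryFailed) {
                // point 1: if the retry fails as well, the message is simply lost
                System.out.println("DEBUG error writing " + m.verb + ": " + retryFailed);
            }
        } catch (RuntimeException e) {
            // point 2: any other exception means no retry and only a DEBUG line
            System.out.println("DEBUG error writing " + m.verb + ": " + e);
        }
    }

    public static void main(String[] args) {
        Wire failing = m -> { throw new IOException("Connection timed out"); };
        send(new QueuedMessage("REPAIR_MESSAGE", false, 0, 10_000), failing, 5_000);
    }
}
{code}

With DEBUG disabled (the default), neither catch branch leaves anything in 
system.log, which matches the silent hang described above.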

[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-15 Thread Duncan Sands (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133844#comment-14133844
 ] 

Duncan Sands commented on CASSANDRA-7904:
-

With request_timeout_in_ms increased to 10 in cassandra.yaml on all nodes, 
repair completed successfully this weekend.



[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-15 Thread Michael Shuler (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133901#comment-14133901
 ] 

Michael Shuler commented on CASSANDRA-7904:
---

Thanks for the update, [~baldrick]!  Considering that this ticket appears to 
not really be a bug report at this point, can we call this closed?



[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-15 Thread Duncan Sands (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133925#comment-14133925
 ] 

Duncan Sands commented on CASSANDRA-7904:
-

Hi Michael, the workaround for this issue was effective, but I still think 
there is a problem here, in fact two problems:
  (1) repair hangs rather than failing, when too many repair messages time out;
  (2) the hang is silent: there is nothing in the logs saying that there is a 
problem (unless you turn on a special debug option, see previous comment).



[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-15 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133942#comment-14133942
 ] 

Brandon Williams commented on CASSANDRA-7904:
-

Dramatically increasing rpc_timeout isn't the most desirable workaround, either.



[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-15 Thread Razi Khaja (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133990#comment-14133990
 ] 

Razi Khaja commented on CASSANDRA-7904:
---

I increased my request_timeout_in_ms from 2 to 18 and repair has now been 
running for 2 hours without *Lost notification*. In the comment I made above, 
for my keyspace megalink, repair command #10 lost its notification within 4 
minutes, so the fact that my current repair is still running after 2 hours is a 
good sign.



[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-15 Thread Yuki Morishita (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134057#comment-14134057
 ] 

Yuki Morishita commented on CASSANDRA-7904:
---

*Lost notification* just indicates that JMX lost some notifications. It has 
nothing to do with repair hanging (btw, I created CASSANDRA-7909 for nodetool 
not exiting in this situation).
You should check your system.log for repair completion in that case.
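
A small sketch of that check (the log path and the assumption that the session 
id appears on the completion line are mine; the "session completed 
successfully" text is the message referenced in the issue description):

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: scan system.log for repair completion lines, optionally filtered by
// one session id. The log path is the usual package default; adjust it for
// your install.
public class RepairLogCheck {
    public static void main(String[] args) throws IOException {
        String sessionId = args.length > 0 ? args[0] : ""; // e.g. "a55c16e1-35eb-11e4-8e7e-51c077eaf311"
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get("/var/log/cassandra/system.log"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains("session completed successfully") && line.contains(sessionId))
                    System.out.println(line);
            }
        }
    }
}
{code}

A repair session that appears elsewhere in the log but never in this output is 
the hung one, as in the original report.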



[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-15 Thread Duncan Sands (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134065#comment-14134065
 ] 

Duncan Sands commented on CASSANDRA-7904:
-

Razi, please open a different JIRA for your issue.


[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-15 Thread Duncan Sands (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134069#comment-14134069
 ] 

Duncan Sands commented on CASSANDRA-7904:
-

Brandon, it may not have been necessary to increase rpc_timeout so much; I didn't experiment to find out what is really needed.


[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-15 Thread Razi Khaja (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134473#comment-14134473
 ] 

Razi Khaja commented on CASSANDRA-7904:
---

Duncan,
Sorry for my misunderstanding, but the linked ticket 
https://issues.apache.org/jira/browse/CASSANDRA-6651?focusedCommentId=13892345&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13892345
states:

{quote}
Thunder Stumpges added a comment - 05/Feb/14 12:45

FWIW we have this exact same issue. We are running 2.0.3 on a 3-node cluster. It has happened multiple times, and happens more often than not when running nodetool repair. There is nearly always one or more AntiEntropySessions remaining according to tpstats.

One strange thing about the behavior I see is that the output of nodetool compactionstats returns 0 active compactions, yet when restarting we get the exception about "Unfinished compactions reference missing sstables". It does seem like these two issues are related.

Another thing I sometimes see in the output from nodetool repair is the following message:
[2014-02-04 14:07:30,858] Starting repair command #7, repairing 768 ranges for keyspace thunder_test
[2014-02-04 14:08:30,862] Lost notification. You should check server log for repair status of keyspace thunder_test
[2014-02-04 14:08:30,870] Starting repair command #8, repairing 768 ranges for keyspace doan_synset
[2014-02-04 14:09:30,874] Lost notification. You should check server log for repair status of keyspace doan_synset

When this happens, it starts the next repair session immediately rather than waiting for the current one to finish. This doesn't, however, always seem to correlate with a hung session.

My logs don't look much/any different from the OP's, but I'd be glad to provide any more details that might be helpful. We will be upgrading to 2.0.4 in the next couple of days and I will report back if we see any difference in behavior.
{quote}

That is why I believe I had the same issue. I'll move my discussion to the newly created ticket CASSANDRA-7909.




[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-13 Thread Duncan Sands (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132605#comment-14132605
 ] 

Duncan Sands commented on CASSANDRA-7904:
-

Hi Razi, your issue seems to be very different to this one.  I think you should 
open a new ticket for it.


[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-12 Thread Razi Khaja (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132122#comment-14132122
 ] 

Razi Khaja commented on CASSANDRA-7904:
---

We have 3 data centers, each with 4 nodes (running on physical machines, not on EC2). We have been running Cassandra 2.0.6 and have not been able to successfully run *nodetool repair* on any of our nodes (except when no data, or almost no data, was loaded into our keyspaces). We upgraded to Cassandra 2.0.10 hoping that this issue of *Lost notification* during *nodetool repair* would be fixed, but as you can see from the log below, we still have not been able to successfully run *nodetool repair*.

{code}
[2014-09-12 11:08:02,131] Nothing to repair for keyspace 'system'
[2014-09-12 11:08:02,179] Starting repair command #10, repairing 1389 ranges for keyspace megalink
[2014-09-12 11:12:02,196] Lost notification. You should check server log for repair status of keyspace megalink
[2014-09-12 11:12:02,258] Starting repair command #11, repairing 1389 ranges for keyspace megalink_dev
[2014-09-12 11:12:02,331] Repair command #11 finished
[2014-09-12 11:12:02,346] Starting repair command #12, repairing 512 ranges for keyspace system_traces
[2014-09-12 11:13:02,349] Lost notification. You should check server log for repair status of keyspace system_traces
{code}

If there are any more details needed to help solve this problem, please let me know and I will do my best to provide them.


[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-12 Thread Razi Khaja (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132172#comment-14132172
 ] 

Razi Khaja commented on CASSANDRA-7904:
---

I think this might be related since it  mentions *RepairJob* ... I hope it is 
helpful.
{code}
 INFO [AntiEntropyStage:1] 2014-09-12 16:36:29,536 RepairSession.java (line 166) [repair #ec6b4340-3abd-11e4-b32d-db378a0ca7f3] Received merkle tree for genome_protein_v10 from /XXX.XXX.XXX.XXX
ERROR [MiscStage:58] 2014-09-12 16:36:29,537 CassandraDaemon.java (line 199) Exception in thread Thread[MiscStage:58,5,main]
java.lang.IllegalArgumentException: Unknown keyspace/cf pair (megalink.probe_gene_v24)
    at org.apache.cassandra.db.Keyspace.getColumnFamilyStore(Keyspace.java:171)
    at org.apache.cassandra.service.SnapshotVerbHandler.doVerb(SnapshotVerbHandler.java:42)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
ERROR [RepairJobTask:7] 2014-09-12 16:36:29,537 RepairJob.java (line 125) Error occurred during snapshot phase
java.lang.RuntimeException: Could not create snapshot at /XXX.XXX.XXX.XXX
    at org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:81)
    at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:47)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
ERROR [AntiEntropySessions:73] 2014-09-12 16:36:29,540 RepairSession.java (line 288) [repair #ec6b4340-3abd-11e4-b32d-db378a0ca7f3] session completed with the following error
java.io.IOException: Failed during snapshot creation.
    at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:323)
    at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:126)
    at com.google.common.util.concurrent.Futures$4.run(Futures.java:1160)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
ERROR [AntiEntropySessions:73] 2014-09-12 16:36:29,543 CassandraDaemon.java (line 199) Exception in thread Thread[AntiEntropySessions:73,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Failed during snapshot creation.
    at com.google.common.base.Throwables.propagate(Throwables.java:160)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed during snapshot creation.
    at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:323)
    at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:126)
    at com.google.common.util.concurrent.Futures$4.run(Futures.java:1160)
    ... 3 more
{code}


[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-10 Thread Duncan Sands (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128186#comment-14128186
 ] 

Duncan Sands commented on CASSANDRA-7904:
-

No, not running on EC2; it's all physical machines, except for 172.18.68.138 and 172.18.68.139, which are virtual machines (kvm-qemu) running on our own server.

Data centre Z has only a 5 Mbaud network connection to the other data centres, so the connection could easily be saturated by streaming, possibly delaying other messages.

If a repair message was dropped, wouldn't there be an exception message about it in the system logs?
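
If saturation of the DC Z link by streaming is the suspicion, one way to test it (a sketch of a possible experiment, not something prescribed in this ticket) is to temporarily throttle streaming on the sending nodes to well below the link capacity:

{code}
# Cap outbound streaming on this node; the value is in Mbit/s and 4 is only an
# illustrative choice for a ~5 Mbit inter-DC link. This runtime setting is
# per-node and reverts to stream_throughput_outbound_megabits_per_sec from
# cassandra.yaml (default 200) on restart.
nodetool setstreamthroughput 4
{code}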


[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-10 Thread Yuki Morishita (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128637#comment-14128637
 ] 

Yuki Morishita commented on CASSANDRA-7904:
---

bq. If a repair message was dropped, wouldn't there be an exception message about it in the system logs?

It is only logged when DEBUG is on (org.apache.cassandra.net.OutboundTcpConnection=DEBUG), because messages being dropped by timeout can happen often.
You may try temporarily increasing the RPC timeout in cassandra.yaml on the nodes in DC Z.
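
For C* 2.0 that logger would normally be turned on in conf/log4j-server.properties (a sketch, assuming the stock log4j configuration shipped with the 2.0 packages; the node has to pick up the change before the dropped-message lines will appear):

{code}
# conf/log4j-server.properties -- enable DEBUG for the outbound connection
# class only, so dropped/expired message logging shows up without turning on
# DEBUG globally.
log4j.logger.org.apache.cassandra.net.OutboundTcpConnection=DEBUG
{code}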


[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-10 Thread Duncan Sands (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128664#comment-14128664
 ] 

Duncan Sands commented on CASSANDRA-7904:
-

Are you saying that if a node sends a repair message twice and both get lost, then repair will hang and no complaint will be printed to the logs (unless logging at level DEBUG)?  If so, wouldn't it be better to abort the repair and print an exception to the logs?

PS: I forgot to mention that the cluster is using the hsha RPC server type.
PPS: I will try increasing the value of this option
  # The default timeout for other, miscellaneous operations
  request_timeout_in_ms: 10000
in cassandra.yaml.
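
For reference, a sketch of what that change could look like in cassandra.yaml (10000 ms is the shipped default; the raised figure below is purely illustrative, not a value recommended in this ticket):

{code}
# The default timeout for other, miscellaneous operations -- the timeout Yuki
# suggests raising above for the slow DC Z link.
request_timeout_in_ms: 30000
{code}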


[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-09 Thread Robert Coli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127110#comment-14127110
 ] 

Robert Coli commented on CASSANDRA-7904:


Linking for a breadcrumb trail between related repair-hang tickets.


[jira] [Commented] (CASSANDRA-7904) Repair hangs

2014-09-09 Thread Yuki Morishita (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127721#comment-14127721
 ] 

Yuki Morishita commented on CASSANDRA-7904:
---

Are you running on EC2?
CASSANDRA-5393 / CASSANDRA-6980 made sure repair messages are not dropped, but the node still gives up on sending a message after retrying it once.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)