[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030622#comment-15030622 ]

Anuj Wadehra commented on CASSANDRA-7904:
-----------------------------------------

If CASSANDRA-10113 is the issue, why is the repair message mostly expiring on only one of the nodes in one DC? And why, most of the time, is it only the remote-DC node that fails to get the merkle tree message? Also, in the attached logs I see hinted handoff timing out for the same node whose merkle tree response went missing. Are the two related?

> Repair hangs
> ------------
>
>         Key: CASSANDRA-7904
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
>     Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, java version "1.7.0_45"
>    Reporter: Duncan Sands
> Attachments: Repair_DEBUG_On_OutboundTcpConnection.txt, ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, ls-192.168.60.136
>
> Cluster of 22 nodes spread over 4 data centres. Not used on the weekend, so repair is run on all nodes (in a staggered fashion) on the weekend. Nodetool options: -par -pr. There is usually some overlap in the repairs: repair on one node may well still be running when repair is started on the next node.
> Repair hangs for some of the nodes almost every weekend. It hung last weekend; here are the details:
> In the whole cluster, only one node had an exception since C* was last restarted. This node is 192.168.60.136 and the exception is harmless: a client disconnected abruptly.
>
> tpstats:
> 4 nodes have a non-zero value for "active" or "pending" in AntiEntropySessions. These nodes all have Active => 1 and Pending => 1. The nodes are:
>   192.168.21.13 (data centre R)
>   192.168.60.134 (data centre A)
>   192.168.60.136 (data centre A)
>   172.18.68.138 (data centre Z)
>
> compactionstats:
> No compactions. All nodes have:
>   pending tasks: 0
>   Active compaction remaining time: n/a
>
> netstats:
> All except one node have nothing. One node (192.168.60.131, not one of the nodes listed in the tpstats section above) has (note the Responses Pending value of 1):
>   Mode: NORMAL
>   Not sending any streams.
>   Read Repair Statistics:
>   Attempted: 4233
>   Mismatch (Blocking): 0
>   Mismatch (Background): 243
>   Pool Name   Active   Pending   Completed
>   Commands    n/a      0         34785445
>   Responses   n/a      1         38567167
>
> Repair sessions:
> I looked for repair sessions that failed to complete. On 3 of the 4 nodes mentioned in tpstats above I found that they had sent merkle tree requests and got responses from all but one node. In the log file for the node that failed to respond there is no sign that it ever received the request. On 1 node (172.18.68.138) it looks like responses were received from every node, some streaming was done, and then... nothing. Details:
>
> Node 192.168.21.13 (data centre R):
> Sent merkle trees to /172.18.33.24, /192.168.60.140, /192.168.60.142, /172.18.68.139, /172.18.68.138, /172.18.33.22, /192.168.21.13 for table brokers; never got a response from /172.18.68.139. On /172.18.68.139, just before this time it sent a response for the same repair session but a different table, and there is no record of it receiving a request for table brokers.
>
> Node 192.168.60.134 (data centre A):
> Sent merkle trees to /172.18.68.139, /172.18.68.138, /192.168.60.132, /192.168.21.14, /192.168.60.134 for table swxess_outbound; never got a response from /172.18.68.138. On /172.18.68.138, just before this time it sent a response for the same repair session but a different table, and there is no record of it receiving a request for table swxess_outbound.
>
> Node 192.168.60.136 (data centre A):
> Sent merkle trees to /192.168.60.142, /172.18.68.139, /192.168.60.136 for table rollups7200; never got a response from /172.18.68.139. This repair session is never mentioned in the /172.18.68.139 log.
>
> Node 172.18.68.138 (data centre Z):
> The issue here seems to be repair session #a55c16e1-35eb-11e4-8e7e-51c077eaf311. It got responses for all its merkle tree requests, did some streaming, but seems to have stopped after finishing with one table (rollups60). I found it as follows: it is the only repair for which there is no "session completed successfully" message in the log.
>
> Some log file snippets are attached.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030201#comment-15030201 ]

Anuj Wadehra commented on CASSANDRA-7904:
-----------------------------------------

[#mshuler] First of all, I would like to express my discomfort with the way a few things are happening. In March 2015 the Apache web site said that 2.0.14 was the most stable version recommended for production, and in November (just 7 months later) people are told in the community that 2.0.x is effectively EOL. It takes months to roll out a Cassandra upgrade to all your clients, and by the time all your clients are on the latest version you learn that virtually no community help is available for it. I understand the fast-paced environment, but can we revisit the strategy to ensure that once a version is declared "stable", it is supported by the community for at least one year?

Coming back to the issue: I see a similar problem in the 2.2.0 code too, and we are still facing it. Merkle tree requests across DCs are getting lost and then repair hangs. We have enabled DEBUG. When we started repair on a node, we got "error writing to /X.X.X.X java.io.IOException: Connection timed out" for 2 nodes in the remote DC, yet merkle trees were received from those 2 nodes. We got no error for the 3rd node in the remote DC, and strangely that was the node which never got the merkle tree request. Moreover, we observed that hinted handoff started for the 3rd node from the node being repaired, and the hint replay timed out too.

Please find attached the logs of node 10.X.15.115. The merkle tree request never reached 10.X.14.115, so no response was received; absolutely no logs were printed on 10.X.14.115. Can we reopen the ticket?
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030227#comment-15030227 ]

Yuki Morishita commented on CASSANDRA-7904:
-------------------------------------------

CASSANDRA-10113 may be your case.
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009271#comment-15009271 ]

Mark Jumaga commented on CASSANDRA-7904:
----------------------------------------

Thanks for the notification, Michael, regarding leaving old versions behind. The point of the discussion here is that the bug, which is also causing us serious problems, is not fixed in ReleaseVersion: 2.0.14.459, contrary to the resolved status of CASSANDRA-7909.
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000700#comment-15000700 ]

Duncan Sands commented on CASSANDRA-7904:
-----------------------------------------

Repair works much better in 2.1; I haven't had any issues so far.
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001562#comment-15001562 ]

Michael Shuler commented on CASSANDRA-7904:
-------------------------------------------

The cassandra-2.0 branch is no longer under development. Does this still occur in the latest 2.1.x release (2.1.11, as of today), or 2.2.x (2.2.3, currently)? (Just so you're aware, the cassandra-2.1 branch will soon be discontinued for active development, since 3.0.0 has been released.)

[~eanujwa], if you're keen on digging around the code in at least the 2.1 branch (starting at 2.2 would be better), seeing if you can identify your numbered scenarios above, and opening a new ticket with 2.1/2.2 as a starting point, that would make sense, as that is where active Cassandra development is occurring.
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001053#comment-15001053 ]

Anuj Wadehra commented on CASSANDRA-7904:
-----------------------------------------

[#Aleksey Yeschenko] I am sorry, but I think this issue must be reopened. We are facing it in 2.0.14. You marked it a duplicate of CASSANDRA-7909, which was fixed in 2.0.11, so the issue should not be present in 2.0.14.

We have 2 DCs at remote locations with 10 Gbps connectivity. On only one node in DC2 are we unable to complete repair (-par -pr); it always hangs. The node sends merkle tree requests, but one or more nodes in DC1 (remote) never show that they sent a merkle tree reply to the requesting node, and repair hangs indefinitely. After increasing request_timeout_in_ms on the affected node, we were able to run repair successfully on one of two occasions.

I analyzed some code in OutboundTcpConnection.java of 2.0.14 and see multiple possible issues there:

1. The scenario where 2 consecutive merkle tree request sends fail is not handled. No exception is printed in the logs in that case, tpstats does not show the repair messages as dropped, and repair hangs indefinitely.
2. Only an IOException leads to a retry of a request. If a RuntimeException occurs, no retry is done and the exception is logged at DEBUG instead of ERROR. Repair should hang here too.
3. The isTimedOut check always returns false for a non-droppable message such as a merkle tree request (verb=REPAIR_MESSAGE), so why does increasing the request timeout solve the problem for many people? Is the logic broken?

Exception handling must be improved. It is impossible to troubleshoot such an issue in production, as no relevant error is logged.
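The three scenarios above can be sketched in isolation. The following is a simplified model of the behavior the comment describes, not Cassandra's actual OutboundTcpConnection code; all class, method, and constant names here are illustrative:

```java
import java.io.IOException;

// Simplified model of the behavior described in points 1-3 above.
// NOT the real OutboundTcpConnection code; names are illustrative only.
public class ConnectionSketch {
    interface Writer { void write() throws IOException; }

    static final long TIMEOUT_MS = 10_000;

    // Point 3: a non-droppable verb (e.g. REPAIR_MESSAGE) is never
    // considered timed out, so a lost merkle tree request is not
    // reported as dropped in tpstats.
    static boolean isTimedOut(long ageMs, boolean droppable) {
        return droppable && ageMs > TIMEOUT_MS;
    }

    // Points 1 and 2: a send is retried at most once, and only for an
    // IOException; a RuntimeException aborts with no retry and (in the
    // scenario described) only a DEBUG-level log entry.
    static boolean sendWithRetry(Writer writer) {
        for (int attempt = 0; attempt < 2; attempt++) {
            try {
                writer.write();
                return true;
            } catch (IOException e) {
                // e.g. "Connection timed out": fall through, retry once
            } catch (RuntimeException e) {
                return false; // no retry, no ERROR log: message lost
            }
        }
        return false; // point 1: two consecutive failures, dropped silently
    }

    // Demo writers; a real writer would serialize the message to the socket.
    static final Writer OK = () -> {};
    static final Writer ALWAYS_IO_ERROR =
        () -> { throw new IOException("connection reset"); };
    static final Writer ALWAYS_RUNTIME_ERROR =
        () -> { throw new IllegalStateException("serialization bug"); };

    static Writer flakyOnce() {
        int[] calls = {0};
        return () -> { if (calls[0]++ == 0) throw new IOException("transient"); };
    }
}
```

In this model only the flaky-once writer ever recovers; every other failure path returns false without surfacing an error, which matches the "repair hangs with nothing in the logs" symptom described in the thread.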
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133844#comment-14133844 ]

Duncan Sands commented on CASSANDRA-7904:
-----------------------------------------

With request_timeout_in_ms increased to 10 in cassandra.yaml on all nodes, repair completed successfully this weekend.
Read Repair Statistics: Attempted: 4233 Mismatch (Blocking): 0 Mismatch (Background): 243 Pool NameActive Pending Completed Commandsn/a 0 34785445 Responses n/a 1 38567167 Repair sessions I looked for repair sessions that failed to complete. On 3 of the 4 nodes mentioned in tpstats above I found that they had sent merkle tree requests and got responses from all but one node. In the log file for the node that failed to respond there is no sign that it ever received the request. On 1 node (172.18.68.138) it looks like responses were received from every node, some streaming was done, and then... nothing. Details: Node 192.168.21.13 (data centre R): Sent merkle trees to /172.18.33.24, /192.168.60.140, /192.168.60.142, /172.18.68.139, /172.18.68.138, /172.18.33.22, /192.168.21.13 for table brokers, never got a response from /172.18.68.139. On /172.18.68.139, just before this time it sent a response for the same repair session but a different table, and there is no record of it receiving a request for table brokers. Node 192.168.60.134 (data centre A): Sent merkle trees to /172.18.68.139, /172.18.68.138, /192.168.60.132, /192.168.21.14, /192.168.60.134 for table swxess_outbound, never got a response from /172.18.68.138. On /172.18.68.138, just before this time it sent a response for the same repair session but a different table, and there is no record of it receiving a request for table swxess_outbound. Node 192.168.60.136 (data centre A): Sent merkle trees to /192.168.60.142, /172.18.68.139, /192.168.60.136 for table rollups7200, never got a response from /172.18.68.139. This repair session is never mentioned in the /172.18.68.139 log. Node 172.18.68.138 (data centre Z): The issue here seems to be repair session #a55c16e1-35eb-11e4-8e7e-51c077eaf311. It got responses for all its merkle tree requests, did some streaming, but seems to have stopped after finishing with one table (rollups60). 
I found it as follows: it is the only repair for which there is no session completed successfully message in the log. Some log file snippets are attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
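The workaround above is a cassandra.yaml change. A minimal sketch of the relevant fragment, assuming the 2.0-era `request_timeout_in_ms` setting (default 10000 ms); the raised value below is illustrative, not the reporter's exact figure:

```yaml
# cassandra.yaml fragment -- illustrative value, not the reporter's exact one.
# request_timeout_in_ms is the default timeout for miscellaneous inter-node
# messages, which covers the repair (merkle tree) requests discussed here.
# The 2.0 default is 10000 ms; raising it gives slow nodes time to respond.
request_timeout_in_ms: 100000
```

Note that cassandra.yaml changes only take effect after a (rolling) restart of the affected nodes.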
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133901#comment-14133901 ] Michael Shuler commented on CASSANDRA-7904:

Thanks for the update, [~baldrick]! Considering that this ticket appears to not really be a bug report at this point, can we call this closed?
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133925#comment-14133925 ] Duncan Sands commented on CASSANDRA-7904:

Hi Michael, the workaround for this issue was effective, but I still think there is a problem here, in fact two problems: (1) repair hangs rather than failing when too many repair messages time out; (2) the hang is silent: there is nothing in the logs saying that there is a problem (unless you turn on a special debug option, see previous comment).
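Because the hang is silent, the only routine symptom is a stuck AntiEntropySessions pool in `nodetool tpstats` (Active/Pending stay at 1, as in this report). A hypothetical sketch of flagging such pools; the sample output below is inlined and simplified (single-word pool names), and on a live node you would pipe `nodetool tpstats` into the awk filter instead:

```shell
# Print thread pools whose Active or Pending column is non-zero.
# Sample data mimics the tpstats output quoted in this ticket; on a real
# node run:  nodetool tpstats | awk 'NR > 1 && ($2 > 0 || $3 > 0) {print $1}'
cat <<'EOF' | awk 'NR > 1 && ($2 > 0 || $3 > 0) { print $1 }'
PoolName            Active Pending Completed
ReadStage           0      0       34785445
AntiEntropySessions 1      1       7
EOF
```

For the sample above, only AntiEntropySessions is printed, which is exactly the signature described in the tpstats section of this report.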
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133942#comment-14133942 ] Brandon Williams commented on CASSANDRA-7904:

Dramatically increasing rpc_timeout isn't the most desirable workaround, either.
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133990#comment-14133990 ] Razi Khaja commented on CASSANDRA-7904:

I increased my request_timeout_in_ms from 2 to 18, and repair has now been running for 2 hours without *Lost notification*. In the comment I made above, for my keyspace megalink, repair command #10 lost its notification within 4 minutes, so the fact that my current repair is still running after 2 hours is a good sign.
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134057#comment-14134057 ] Yuki Morishita commented on CASSANDRA-7904:

*Lost notification* just indicates that JMX lost some notification; it has nothing to do with repair hanging (btw, I created CASSANDRA-7909 for not exiting in this situation). You should check your system.log for repair completion in that case.
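Checking system.log for repair completion, as suggested above, can be scripted. A sketch under assumed log wording (the "session completed successfully" message referenced in this ticket); the sample log is fabricated, and for real use you would point the greps at your actual system.log and widen the id pattern to full session UUIDs:

```shell
# List repair session ids that appear in the log but never logged
# "session completed successfully". Sample log is fabricated for illustration.
cat > /tmp/system.log.sample <<'EOF'
[repair #a55c16e1] new session: will sync ranges for rollups60
[repair #b66d27f2] new session: will sync ranges for brokers
[repair #b66d27f2] session completed successfully
EOF
grep -o '#[0-9a-f-]*' /tmp/system.log.sample | sort -u > /tmp/all.ids
grep 'session completed successfully' /tmp/system.log.sample \
  | grep -o '#[0-9a-f-]*' | sort -u > /tmp/done.ids
# Sessions that started but never completed:
comm -23 /tmp/all.ids /tmp/done.ids
```

This mirrors the manual procedure the reporter used: the hung session is the one with no completion message.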
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134065#comment-14134065 ] Duncan Sands commented on CASSANDRA-7904:

Razi, please open a different JIRA for your issue.
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134069#comment-14134069 ] Duncan Sands commented on CASSANDRA-7904:

Brandon, it may not have been necessary to increase rpc_timeout so much; I didn't experiment to find out what is really needed.
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134473#comment-14134473 ] Razi Khaja commented on CASSANDRA-7904:

Duncan, sorry for my misunderstanding, but the linked ticket https://issues.apache.org/jira/browse/CASSANDRA-6651?focusedCommentId=13892345&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13892345 states:

{quote}
Thunder Stumpges added a comment - 05/Feb/14 12:45

FWIW we have this exact same issue. We are running 2.0.3 on a 3 node cluster. It has happened multiple times, and happens more times than not when running nodetool repair. There is nearly always one or more AntiEntropySessions remaining according to tpstats. One strange thing about the behavior I see is that the output of nodetool compactionstats returns 0 active compactions, yet when restarting we get the exception about "Unfinished compactions reference missing sstables". It does seem like these two issues are related. Another thing I see sometimes in the output from nodetool repair is the following message:

[2014-02-04 14:07:30,858] Starting repair command #7, repairing 768 ranges for keyspace thunder_test
[2014-02-04 14:08:30,862] Lost notification. You should check server log for repair status of keyspace thunder_test
[2014-02-04 14:08:30,870] Starting repair command #8, repairing 768 ranges for keyspace doan_synset
[2014-02-04 14:09:30,874] Lost notification. You should check server log for repair status of keyspace doan_synset

When this happens, it starts the next repair session immediately rather than waiting for the current one to finish. This doesn't however seem to always correlate to a hung session. My logs don't look much/any different from the OP's, but I'd be glad to provide any more details that might be helpful. We will be upgrading to 2.0.4 in the next couple of days and I will report back if we see any difference in behavior.
{quote}

Which is why I believe I had the same issue. I'll move my discussion to the newly created ticket CASSANDRA-7909.
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132605#comment-14132605 ] Duncan Sands commented on CASSANDRA-7904: -

Hi Razi, your issue seems to be very different to this one. I think you should open a new ticket for it.

Repair hangs

Key: CASSANDRA-7904
URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
Project: Cassandra
Issue Type: Bug
Components: Core
Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, java version "1.7.0_45"
Reporter: Duncan Sands
Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, ls-192.168.60.136

Cluster of 22 nodes spread over 4 data centres. Not used on the weekend, so repair is run on all nodes (in a staggered fashion) on the weekend. Nodetool options: -par -pr. There is usually some overlap in the repairs: repair on one node may well still be running when repair is started on the next node. Repair hangs for some of the nodes almost every weekend. It hung last weekend, here are the details:

In the whole cluster, only one node had an exception since C* was last restarted. This node is 192.168.60.136 and the exception is harmless: a client disconnected abruptly.

tpstats: 4 nodes have a non-zero value for "active" or "pending" in AntiEntropySessions. These nodes all have Active => 1 and Pending => 1. The nodes are:
192.168.21.13 (data centre R)
192.168.60.134 (data centre A)
192.168.60.136 (data centre A)
172.18.68.138 (data centre Z)

compactionstats: no compactions. All nodes have:
pending tasks: 0
Active compaction remaining time: n/a

netstats: all except one node have nothing. One node (192.168.60.131, not one of the nodes listed in the tpstats section above) has (note the Responses Pending value of 1):
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 4233
Mismatch (Blocking): 0
Mismatch (Background): 243
Pool Name   Active  Pending  Completed
Commands    n/a     0        34785445
Responses   n/a     1        38567167

Repair sessions: I looked for repair sessions that failed to complete. On 3 of the 4 nodes mentioned in tpstats above I found that they had sent merkle tree requests and got responses from all but one node. In the log file for the node that failed to respond there is no sign that it ever received the request. On 1 node (172.18.68.138) it looks like responses were received from every node, some streaming was done, and then... nothing. Details:

Node 192.168.21.13 (data centre R): sent merkle trees to /172.18.33.24, /192.168.60.140, /192.168.60.142, /172.18.68.139, /172.18.68.138, /172.18.33.22, /192.168.21.13 for table brokers; never got a response from /172.18.68.139. On /172.18.68.139, just before this time it sent a response for the same repair session but a different table, and there is no record of it receiving a request for table brokers.

Node 192.168.60.134 (data centre A): sent merkle trees to /172.18.68.139, /172.18.68.138, /192.168.60.132, /192.168.21.14, /192.168.60.134 for table swxess_outbound; never got a response from /172.18.68.138. On /172.18.68.138, just before this time it sent a response for the same repair session but a different table, and there is no record of it receiving a request for table swxess_outbound.

Node 192.168.60.136 (data centre A): sent merkle trees to /192.168.60.142, /172.18.68.139, /192.168.60.136 for table rollups7200; never got a response from /172.18.68.139. This repair session is never mentioned in the /172.18.68.139 log.

Node 172.18.68.138 (data centre Z): the issue here seems to be repair session #a55c16e1-35eb-11e4-8e7e-51c077eaf311. It got responses for all its merkle tree requests, did some streaming, but seems to have stopped after finishing with one table (rollups60). I found it as follows: it is the only repair for which there is no "session completed successfully" message in the log.

Some log file snippets are attached.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
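The hunt for the hung session described in the report (the only repair session with no "session completed successfully" line) can be sketched as a small shell script. The log lines below are synthetic stand-ins modeled loosely on the report's wording, not real Cassandra output:

```shell
# Synthetic sample log: session aaaa completes, session bbbb never does.
log=$(mktemp)
cat > "$log" <<'EOF'
INFO [repair #aaaa-1111] new session: will sync range (1,2]
INFO [repair #aaaa-1111] session completed successfully
INFO [repair #bbbb-2222] new session: will sync range (3,4]
EOF

# Session IDs that were started vs. those that logged completion.
grep 'new session' "$log" | grep -o '#[a-z0-9-]*' | sort -u > /tmp/started.txt
grep 'session completed successfully' "$log" | grep -o '#[a-z0-9-]*' | sort -u > /tmp/completed.txt

# IDs present only in started.txt are the hung sessions.
hung=$(comm -23 /tmp/started.txt /tmp/completed.txt)
echo "$hung"    # prints: #bbbb-2222
```

The same started-minus-completed comparison applied to a real system.log would surface session IDs like #a55c16e1-35eb-11e4-8e7e-51c077eaf311, though the exact log phrasing varies by Cassandra version.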
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132122#comment-14132122 ] Razi Khaja commented on CASSANDRA-7904: ---

We have 3 data centers, each with 4 nodes (running on physical machines, not on EC2). We have been running Cassandra 2.0.6 and have not been able to successfully run *nodetool repair* on any of our nodes (except when no data, or almost no data, was loaded into our keyspaces). We upgraded to Cassandra 2.0.10 hoping that this issue of *Lost notification* during *nodetool repair* would be fixed, but as you can see from the log below, we still have not been able to successfully run *nodetool repair*.

{code}
[2014-09-12 11:08:02,131] Nothing to repair for keyspace 'system'
[2014-09-12 11:08:02,179] Starting repair command #10, repairing 1389 ranges for keyspace megalink
[2014-09-12 11:12:02,196] Lost notification. You should check server log for repair status of keyspace megalink
[2014-09-12 11:12:02,258] Starting repair command #11, repairing 1389 ranges for keyspace megalink_dev
[2014-09-12 11:12:02,331] Repair command #11 finished
[2014-09-12 11:12:02,346] Starting repair command #12, repairing 512 ranges for keyspace system_traces
[2014-09-12 11:13:02,349] Lost notification. You should check server log for repair status of keyspace system_traces
{code}

If there are any more details needed to help solve this problem, please let me know and I will do my best to provide them.
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132172#comment-14132172 ] Razi Khaja commented on CASSANDRA-7904: ---

I think this might be related since it mentions *RepairJob* ... I hope it is helpful.

{code}
INFO [AntiEntropyStage:1] 2014-09-12 16:36:29,536 RepairSession.java (line 166) [repair #ec6b4340-3abd-11e4-b32d-db378a0ca7f3] Received merkle tree for genome_protein_v10 from /XXX.XXX.XXX.XXX
ERROR [MiscStage:58] 2014-09-12 16:36:29,537 CassandraDaemon.java (line 199) Exception in thread Thread[MiscStage:58,5,main]
java.lang.IllegalArgumentException: Unknown keyspace/cf pair (megalink.probe_gene_v24)
	at org.apache.cassandra.db.Keyspace.getColumnFamilyStore(Keyspace.java:171)
	at org.apache.cassandra.service.SnapshotVerbHandler.doVerb(SnapshotVerbHandler.java:42)
	at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
ERROR [RepairJobTask:7] 2014-09-12 16:36:29,537 RepairJob.java (line 125) Error occurred during snapshot phase
java.lang.RuntimeException: Could not create snapshot at /XXX.XXX.XXX.XXX
	at org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:81)
	at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:47)
	at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
ERROR [AntiEntropySessions:73] 2014-09-12 16:36:29,540 RepairSession.java (line 288) [repair #ec6b4340-3abd-11e4-b32d-db378a0ca7f3] session completed with the following error
java.io.IOException: Failed during snapshot creation.
	at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:323)
	at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:126)
	at com.google.common.util.concurrent.Futures$4.run(Futures.java:1160)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
ERROR [AntiEntropySessions:73] 2014-09-12 16:36:29,543 CassandraDaemon.java (line 199) Exception in thread Thread[AntiEntropySessions:73,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Failed during snapshot creation.
	at com.google.common.base.Throwables.propagate(Throwables.java:160)
	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed during snapshot creation.
	at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:323)
	at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:126)
	at com.google.common.util.concurrent.Futures$4.run(Futures.java:1160)
	... 3 more
{code}
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128186#comment-14128186 ] Duncan Sands commented on CASSANDRA-7904: -

No, not running on EC2; it's all physical machines, except for 172.18.68.138 and 172.18.68.139, which are virtual machines (kvm-qemu) running on our own server. Data centre Z has only a 5 Mbaud network connection to the other data centres, so the connection could easily be saturated by streaming, possibly delaying other messages. If a repair message was dropped, wouldn't there be an exception message about it in the system logs?
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128637#comment-14128637 ] Yuki Morishita commented on CASSANDRA-7904: ---

bq. If a repair message was dropped, wouldn't there be an exception message about it in the system logs?

It is only logged when DEBUG is on (org.apache.cassandra.net.OutboundTcpConnection=DEBUG), because it can happen often when a message is dropped by timeout. You may try temporarily increasing the RPC timeout in cassandra.yaml on the nodes in DC Z.
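For reference, the DEBUG setting Yuki names would be enabled in the log4j configuration that C* 2.0 ships with; assuming a stock install, the file is conf/log4j-server.properties (path may differ per packaging):

```properties
# Enable DEBUG on OutboundTcpConnection so that outbound messages
# dropped by timeout (e.g. merkle tree requests during repair) are logged.
log4j.logger.org.apache.cassandra.net.OutboundTcpConnection=DEBUG
```

This is noisy on a busy cluster, which is presumably why dropped-message logging is kept at DEBUG by default.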
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128664#comment-14128664 ] Duncan Sands commented on CASSANDRA-7904: -

Are you saying that if a node sends a repair message twice and both get lost, then repair will hang and no complaint will be printed to the logs (unless logging at level DEBUG)? If so, wouldn't it be better to abort the repair and print an exception to the logs?

PS: I forgot to mention that the cluster is using the hsha RPC server type.

PPS: I will try increasing the value of this option in cassandra.yaml:

{code}
# The default timeout for other, miscellaneous operations
request_timeout_in_ms: 10000
{code}
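The increase Duncan describes would look something like the following in cassandra.yaml on the DC Z nodes; 20000 ms is an illustrative value, not one proposed in the thread:

```yaml
# Timeout for miscellaneous operations, including repair messages.
# Raised from the 10000 ms default to ride out saturation of the slow
# inter-DC link to data centre Z (illustrative value).
request_timeout_in_ms: 20000
```

A node-level restart is needed for the change to take effect on C* 2.0.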
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127110#comment-14127110 ] Robert Coli commented on CASSANDRA-7904:
----------------------------------------
Linking for a breadcrumb trail between related repair-hangs tickets.
> One node (192.168.60.131, not one of the nodes listed in the tpstats section above) has (note the Responses Pending value of 1):
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 4233
> Mismatch (Blocking): 0
> Mismatch (Background): 243
> Pool Name                Active   Pending   Completed
> Commands                    n/a         0    34785445
> Responses                   n/a         1    38567167
> Repair sessions
> I looked for repair sessions that failed to complete. On 3 of the 4 nodes mentioned in tpstats above I found that they had sent merkle tree requests and got responses from all but one node. In the log file for the node that failed to respond there is no sign that it ever received the request. On 1 node (172.18.68.138) it looks like responses were received from every node, some streaming was done, and then... nothing. Details:
> Node 192.168.21.13 (data centre R): sent merkle trees to /172.18.33.24, /192.168.60.140, /192.168.60.142, /172.18.68.139, /172.18.68.138, /172.18.33.22, /192.168.21.13 for table brokers; never got a response from /172.18.68.139. On /172.18.68.139, just before this time it sent a response for the same repair session but a different table, and there is no record of it receiving a request for table brokers.
> Node 192.168.60.134 (data centre A): sent merkle trees to /172.18.68.139, /172.18.68.138, /192.168.60.132, /192.168.21.14, /192.168.60.134 for table swxess_outbound; never got a response from /172.18.68.138. On /172.18.68.138, just before this time it sent a response for the same repair session but a different table, and there is no record of it receiving a request for table swxess_outbound.
> Node 192.168.60.136 (data centre A): sent merkle trees to /192.168.60.142, /172.18.68.139, /192.168.60.136 for table rollups7200; never got a response from /172.18.68.139. This repair session is never mentioned in the /172.18.68.139 log.
> Node 172.18.68.138 (data centre Z): the issue here seems to be repair session #a55c16e1-35eb-11e4-8e7e-51c077eaf311. It got responses for all its merkle tree requests, did some streaming, but seems to have stopped after finishing with one table (rollups60). I found it as follows: it is the only repair for which there is no "session completed successfully" message in the log.
> Some log file snippets are attached.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
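The search method described in the report (a repair session that appears in the log but never gets a "session completed successfully" line) can be sketched as a log grep. This is a hedged sketch, not the reporter's actual commands: the exact log-message wording is an assumption based on Cassandra 2.0-era logging, and the sample log (with the illustrative second session ID `#b66d27f2-...`) is fabricated here to make the snippet self-contained; on a real node you would point it at `system.log`.

```shell
#!/usr/bin/env bash
# Sketch: list repair session IDs that were started but never logged
# "session completed successfully". Log wording is an assumption based
# on Cassandra 2.0-era messages; adjust the patterns to your system.log.
log=$(mktemp)
# Fabricated sample log for illustration; the first session hangs, the
# second (an invented ID) completes.
cat > "$log" <<'EOF'
INFO [AntiEntropySessions:1] new session: will sync ... [repair #a55c16e1-35eb-11e4-8e7e-51c077eaf311]
INFO [AntiEntropySessions:2] new session: will sync ... [repair #b66d27f2-35eb-11e4-8e7e-51c077eaf311]
INFO [AntiEntropySessions:2] [repair #b66d27f2-35eb-11e4-8e7e-51c077eaf311] session completed successfully
EOF
# Session IDs that started, and session IDs that completed.
started=$(grep 'new session' "$log" | grep -o '#[0-9a-f-]*' | sort -u)
completed=$(grep 'session completed successfully' "$log" | grep -o '#[0-9a-f-]*' | sort -u)
# IDs present in "started" but absent from "completed" are the hung ones.
hung=$(comm -23 <(printf '%s\n' "$started") <(printf '%s\n' "$completed"))
echo "$hung"   # -> #a55c16e1-35eb-11e4-8e7e-51c077eaf311
rm -f "$log"
```

The `comm -23` set-difference over sorted ID lists keeps the sketch dependency-free; it requires bash for the process substitutions.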
[jira] [Commented] (CASSANDRA-7904) Repair hangs
[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127721#comment-14127721 ] Yuki Morishita commented on CASSANDRA-7904:
-------------------------------------------
Are you running on EC2? CASSANDRA-5393 / CASSANDRA-6980 made sure repair messages are not dropped, but the node still gives up sending messages after it retries once.