[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133990#comment-14133990 ]
Razi Khaja edited comment on CASSANDRA-7904 at 9/15/14 3:11 PM:
----------------------------------------------------------------

I increased my request_time_in_ms from 20000 to 180000, and repair has now been running for 2 hours without *Lost notification*. In the comment I made above, for my keyspace megalink, repair command #10 lost notification within 4 minutes, so the fact that my current repair is still running after 2 hours is a good sign.

> Repair hangs
> ------------
>
> Key: CASSANDRA-7904
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, java version "1.7.0_45"
> Reporter: Duncan Sands
> Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, ls-192.168.60.136
>
> Cluster of 22 nodes spread over 4 data centres. The cluster is not used on the weekend, so repair is run on all nodes (in a staggered fashion) on the weekend. Nodetool options: -par -pr. There is usually some overlap in the repairs: repair on one node may well still be running when repair is started on the next node.
> Repair hangs for some of the nodes almost every weekend. It hung last weekend; here are the details:
> In the whole cluster, only one node had an exception since C* was last restarted. This node is 192.168.60.136 and the exception is harmless: a client disconnected abruptly.
> tpstats
> 4 nodes have a non-zero value for "active" or "pending" in AntiEntropySessions. These nodes all have Active => 1 and Pending => 1.
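The commenter's `request_time_in_ms` is presumably the standard `request_timeout_in_ms` setting in cassandra.yaml (the name as written does not exist in stock C* 2.0). A minimal sketch of the change described, under that assumption:

```yaml
# cassandra.yaml — assumption: "request_time_in_ms" in the comment refers to
# the stock request_timeout_in_ms setting (default 10000 ms in C* 2.0).
# Raising it from 20000 to 180000 gives slow merkle-tree responses more time
# before the coordinator gives up.
request_timeout_in_ms: 180000
```

The change must be applied on each node and requires a node restart to take effect.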
> The nodes are:
> 192.168.21.13 (data centre R)
> 192.168.60.134 (data centre A)
> 192.168.60.136 (data centre A)
> 172.18.68.138 (data centre Z)
> compactionstats:
> No compactions. All nodes have:
> pending tasks: 0
> Active compaction remaining time : n/a
> netstats:
> All except one node have nothing. One node (192.168.60.131, not one of the nodes listed in the tpstats section above) has (note the Responses Pending value of 1):
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 4233
> Mismatch (Blocking): 0
> Mismatch (Background): 243
> Pool Name      Active   Pending   Completed
> Commands          n/a         0    34785445
> Responses         n/a         1    38567167
> Repair sessions
> I looked for repair sessions that failed to complete. On 3 of the 4 nodes mentioned in tpstats above, I found that they had sent merkle tree requests and got responses from all but one node. In the log file for the node that failed to respond there is no sign that it ever received the request. On 1 node (172.18.68.138) it looks like responses were received from every node, some streaming was done, and then... nothing. Details:
> Node 192.168.21.13 (data centre R):
> Sent merkle trees to /172.18.33.24, /192.168.60.140, /192.168.60.142, /172.18.68.139, /172.18.68.138, /172.18.33.22, /192.168.21.13 for table brokers; never got a response from /172.18.68.139. On /172.18.68.139, just before this time it sent a response for the same repair session but a different table, and there is no record of it receiving a request for table brokers.
> Node 192.168.60.134 (data centre A):
> Sent merkle trees to /172.18.68.139, /172.18.68.138, /192.168.60.132, /192.168.21.14, /192.168.60.134 for table swxess_outbound; never got a response from /172.18.68.138. On /172.18.68.138, just before this time it sent a response for the same repair session but a different table, and there is no record of it receiving a request for table swxess_outbound.
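The tpstats triage above (flagging nodes whose AntiEntropySessions pool shows a non-zero Active or Pending count) can be scripted across the cluster. A minimal sketch; the sample output and function name are illustrative, not taken from the ticket:

```python
# Sketch: detect stuck repair sessions from captured `nodetool tpstats` text,
# as in the triage above. A healthy node shows 0/0 for AntiEntropySessions;
# the four hung nodes in this report all showed Active => 1 and Pending => 1.

def stuck_anti_entropy(tpstats_text):
    """Return (active, pending) for AntiEntropySessions, or None if idle."""
    for line in tpstats_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "AntiEntropySessions":
            active, pending = int(parts[1]), int(parts[2])
            if active or pending:
                return (active, pending)
    return None

sample = """Pool Name                    Active   Pending      Completed
ReadStage                         0         0       34785445
AntiEntropySessions               1         1             12
"""

print(stuck_anti_entropy(sample))  # prints (1, 1)
```

Run against one captured tpstats dump per node, any non-None result marks a candidate for the log inspection described below.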
> Node 192.168.60.136 (data centre A):
> Sent merkle trees to /192.168.60.142, /172.18.68.139, /192.168.60.136 for table rollups7200; never got a response from /172.18.68.139. This repair session is never mentioned in the /172.18.68.139 log.
> Node 172.18.68.138 (data centre Z):
> The issue here seems to be repair session #a55c16e1-35eb-11e4-8e7e-51c077eaf311. It got responses for all its merkle tree requests and did some streaming, but seems to have stopped after finishing with one table (rollups60). I found it as follows: it is the only repair for which there is no "session completed successfully" message in the log.
> Some log file snippets are attached.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
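The way the hung session on 172.18.68.138 was found, the only repair session with no "session completed successfully" line, can also be sketched as a small log scan. The log lines below are invented examples; real system.log lines differ in detail, though C* does tag repair messages with a `[repair #<uuid>]` prefix:

```python
import re

# Sketch of the triage described above: collect every repair session id seen
# in the log, subtract those that logged "session completed successfully",
# and whatever remains is a candidate hung session.

SESSION_RE = re.compile(r"\[repair #([0-9a-f-]+)\]")

def unfinished_sessions(log_lines):
    seen, completed = set(), set()
    for line in log_lines:
        m = SESSION_RE.search(line)
        if not m:
            continue
        seen.add(m.group(1))
        if "session completed successfully" in line:
            completed.add(m.group(1))
    return seen - completed

log = [  # invented sample lines
    "[repair #aaaa-1111] new session: will sync 3 ranges",
    "[repair #aaaa-1111] session completed successfully",
    "[repair #a55c16e1-35eb-11e4-8e7e-51c077eaf311] rollups60 is fully synced",
]
print(unfinished_sessions(log))  # only the hung session id remains
```

On a real node this would be fed `grep 'repair #' system.log`, and a session that stalled mid-repair, like the one above, shows up as seen-but-never-completed.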