Duncan Sands created CASSANDRA-7904:
---------------------------------------

             Summary: Repair hangs
                 Key: CASSANDRA-7904
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
             Project: Cassandra
          Issue Type: Bug
          Components: Core
         Environment: C* 2.0.10, Ubuntu 14.04, Java HotSpot(TM) 64-Bit Server VM, 
java version "1.7.0_45"
            Reporter: Duncan Sands
         Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, 
ls-192.168.60.136

Cluster of 22 nodes spread over 4 data centres.  The cluster is not used at the 
weekend, so that is when repair is run on all nodes (in a staggered fashion).  
Nodetool options: -par -pr.
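
For concreteness, a sketch of the per-node invocation (the staggering itself is 
handled by an external scheduler, not shown; the host name is a placeholder):

    nodetool -h <node> repair -par -pr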

There is usually some overlap between repairs: repair on one node may well 
still be running when repair is started on the next node.  Repair hangs on 
some of the nodes almost every weekend.  It hung last weekend; here are the 
details:

In the whole cluster, only one node had an exception since C* was last 
restarted.  This node is 192.168.60.136 and the exception is harmless: a client 
disconnected abruptly.

tpstats:
  4 nodes have a non-zero value for "active" or "pending" in 
AntiEntropySessions.  These nodes all have Active => 1 and Pending => 1 (see 
the sketch after this list).  The nodes are:
  192.168.21.13 (data centre R)
  192.168.60.134 (data centre A)
  192.168.60.136 (data centre A)
  172.18.68.138 (data centre Z)
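
A quick way to spot these across the cluster (a sketch; it assumes a file 
named hosts listing one node address per line):

    # hypothetical helper: report AntiEntropySessions activity on every node
    while read h; do
      echo "== $h =="
      nodetool -h "$h" tpstats | grep AntiEntropySessions
    done < hosts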

compactionstats:
  No compactions.  All nodes have:
    pending tasks: 0
    Active compaction remaining time :        n/a

netstats:
  All nodes except one have nothing to report.  The one exception 
(192.168.60.131, not one of the nodes listed in the tpstats section above) 
shows the following; note the Responses Pending value of 1:
    Mode: NORMAL
    Not sending any streams.
    Read Repair Statistics:
    Attempted: 4233
    Mismatch (Blocking): 0
    Mismatch (Background): 243
    Pool Name                    Active   Pending      Completed
    Commands                        n/a         0       34785445
    Responses                       n/a         1       38567167

Repair sessions:
  I looked for repair sessions that failed to complete.  On 3 of the 4 nodes 
mentioned in tpstats above I found that they had sent Merkle tree requests and 
got responses from all nodes but one.  In the log file of the node that failed 
to respond there is no sign that it ever received the request.
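
Tracing a session through the logs comes down to grepping each node's log for 
the session id (a sketch; the id and log path are placeholders, and the exact 
log wording may differ between 2.0.x releases):

    # every line mentioning a given repair session, including Merkle tree
    # requests/responses and streaming
    grep 'repair #<session-id>' /var/log/cassandra/system.log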

  On the fourth node (172.18.68.138) it looks like responses were received 
from every node and some streaming was done, and then nothing further 
happened.  Details:

  Node 192.168.21.13 (data centre R):
    Sent Merkle tree requests to /172.18.33.24, /192.168.60.140, 
/192.168.60.142, /172.18.68.139, /172.18.68.138, /172.18.33.22, /192.168.21.13 
for table brokers, but never got a response from /172.18.68.139.  On 
/172.18.68.139, just before this time, a response was sent for the same repair 
session but for a different table; there is no record of it ever receiving a 
request for table brokers.

  Node 192.168.60.134 (data centre A):
    Sent Merkle tree requests to /172.18.68.139, /172.18.68.138, 
/192.168.60.132, /192.168.21.14, /192.168.60.134 for table swxess_outbound, 
but never got a response from /172.18.68.138.  On /172.18.68.138, just before 
this time, a response was sent for the same repair session but for a different 
table; there is no record of it ever receiving a request for table 
swxess_outbound.

  Node 192.168.60.136 (data centre A):
    Sent Merkle tree requests to /192.168.60.142, /172.18.68.139, 
/192.168.60.136 for table rollups7200, but never got a response from 
/172.18.68.139.  This repair session is never mentioned in the /172.18.68.139 
log.

  Node 172.18.68.138 (data centre Z):
    The issue here seems to be repair session 
#a55c16e1-35eb-11e4-8e7e-51c077eaf311.  It got responses to all of its Merkle 
tree requests and did some streaming, but seems to have stopped after 
finishing with one table (rollups60).  I found it as follows: it is the only 
repair session for which there is no "session completed successfully" message 
in the log (see the sketch below).
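
One way to isolate such a session (a sketch; it assumes the log messages 
quoted above and uses placeholder file names):

    # session ids seen anywhere in the log
    grep -o 'repair #[0-9a-f-]*' system.log | sort -u > all.txt
    # session ids that logged successful completion
    grep 'session completed successfully' system.log \
      | grep -o 'repair #[0-9a-f-]*' | sort -u > done.txt
    # sessions that never completed
    comm -23 all.txt done.txt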

Some log file snippets are attached.


