[ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132172#comment-14132172 ]
Razi Khaja commented on CASSANDRA-7904:
---------------------------------------

I think this might be related, since it mentions *RepairJob* ... I hope it is helpful.

{code}
 INFO [AntiEntropyStage:1] 2014-09-12 16:36:29,536 RepairSession.java (line 166) [repair #ec6b4340-3abd-11e4-b32d-db378a0ca7f3] Received merkle tree for genome_protein_v10 from /XXX.XXX.XXX.XXX
ERROR [MiscStage:58] 2014-09-12 16:36:29,537 CassandraDaemon.java (line 199) Exception in thread Thread[MiscStage:58,5,main]
java.lang.IllegalArgumentException: Unknown keyspace/cf pair (megalink.probe_gene_v24)
	at org.apache.cassandra.db.Keyspace.getColumnFamilyStore(Keyspace.java:171)
	at org.apache.cassandra.service.SnapshotVerbHandler.doVerb(SnapshotVerbHandler.java:42)
	at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
ERROR [RepairJobTask:7] 2014-09-12 16:36:29,537 RepairJob.java (line 125) Error occurred during snapshot phase
java.lang.RuntimeException: Could not create snapshot at /XXX.XXX.XXX.XXX
	at org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:81)
	at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:47)
	at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
ERROR [AntiEntropySessions:73] 2014-09-12 16:36:29,540 RepairSession.java (line 288) [repair #ec6b4340-3abd-11e4-b32d-db378a0ca7f3] session completed with the following error
java.io.IOException: Failed during snapshot creation.
	at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:323)
	at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:126)
	at com.google.common.util.concurrent.Futures$4.run(Futures.java:1160)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
ERROR [AntiEntropySessions:73] 2014-09-12 16:36:29,543 CassandraDaemon.java (line 199) Exception in thread Thread[AntiEntropySessions:73,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Failed during snapshot creation.
	at com.google.common.base.Throwables.propagate(Throwables.java:160)
	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed during snapshot creation.
	at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:323)
	at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:126)
	at com.google.common.util.concurrent.Futures$4.run(Futures.java:1160)
	... 3 more
{code}
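For context on the control flow in this trace: the replica cannot find the table (the "Unknown keyspace/cf pair" error, e.g. if it was dropped mid-repair), the snapshot request fails, and the coordinator's failure callback fails the whole repair session. Below is a minimal, JDK-only sketch of that propagation pattern, using {{CompletableFuture}} in place of Cassandra's Guava futures; the class and method names are illustrative, not Cassandra's actual code.

{code}
import java.util.concurrent.CompletableFuture;

public class SnapshotFailureDemo {
    // Stand-in for the coordinator's snapshot request to one replica. Here it
    // always fails, mirroring the "Unknown keyspace/cf pair" error above.
    static CompletableFuture<Void> requestSnapshot(String endpoint) {
        CompletableFuture<Void> f = new CompletableFuture<>();
        f.completeExceptionally(
                new RuntimeException("Could not create snapshot at " + endpoint));
        return f;
    }

    public static void main(String[] args) {
        requestSnapshot("/XXX.XXX.XXX.XXX")
                // Plays the role of RepairJob$2.onFailure in the trace: a failed
                // snapshot fails the enclosing repair session.
                .whenComplete((ok, err) -> {
                    if (err != null) {
                        System.err.println(
                                "session completed with the following error: " + err);
                    }
                });
    }
}
{code}

Note the contrast with the hang described in this ticket: here a failure response does arrive and the session fails loudly, whereas in the hung sessions described below no merkle-tree response ever arrives, so no callback fires at all.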
> Repair hangs
> ------------
>
>                 Key: CASSANDRA-7904
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, java version "1.7.0_45"
>            Reporter: Duncan Sands
>         Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, ls-192.168.60.136
>
> Cluster of 22 nodes spread over 4 data centres. Not used on the weekend, so repair is run on all nodes (in a staggered fashion) on the weekend. Nodetool options: -par -pr. There is usually some overlap in the repairs: repair on one node may well still be running when repair is started on the next node.
> Repair hangs for some of the nodes almost every weekend. It hung last weekend, here are the details:
> In the whole cluster, only one node had an exception since C* was last restarted. This node is 192.168.60.136 and the exception is harmless: a client disconnected abruptly.
> tpstats
> 4 nodes have a non-zero value for "active" or "pending" in AntiEntropySessions. These nodes all have Active => 1 and Pending => 1. The nodes are:
> 192.168.21.13 (data centre R)
> 192.168.60.134 (data centre A)
> 192.168.60.136 (data centre A)
> 172.18.68.138 (data centre Z)
> compactionstats:
> No compactions. All nodes have:
> pending tasks: 0
> Active compaction remaining time : n/a
> netstats:
> All except one node have nothing. One node (192.168.60.131, not one of the nodes listed in the tpstats section above) has (note the Responses Pending value of 1):
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 4233
> Mismatch (Blocking): 0
> Mismatch (Background): 243
> Pool Name Active Pending Completed
> Commands n/a 0 34785445
> Responses n/a 1 38567167
> Repair sessions
> I looked for repair sessions that failed to complete. On 3 of the 4 nodes mentioned in tpstats above I found that they had sent merkle tree requests and got responses from all but one node. In the log file for the node that failed to respond there is no sign that it ever received the request. On 1 node (172.18.68.138) it looks like responses were received from every node, some streaming was done, and then... nothing. Details:
> Node 192.168.21.13 (data centre R):
> Sent merkle trees to /172.18.33.24, /192.168.60.140, /192.168.60.142, /172.18.68.139, /172.18.68.138, /172.18.33.22, /192.168.21.13 for table brokers, never got a response from /172.18.68.139. On /172.18.68.139, just before this time it sent a response for the same repair session but a different table, and there is no record of it receiving a request for table brokers.
> Node 192.168.60.134 (data centre A):
> Sent merkle trees to /172.18.68.139, /172.18.68.138, /192.168.60.132, /192.168.21.14, /192.168.60.134 for table swxess_outbound, never got a response from /172.18.68.138. On /172.18.68.138, just before this time it sent a response for the same repair session but a different table, and there is no record of it receiving a request for table swxess_outbound.
> Node 192.168.60.136 (data centre A):
> Sent merkle trees to /192.168.60.142, /172.18.68.139, /192.168.60.136 for table rollups7200, never got a response from /172.18.68.139. This repair session is never mentioned in the /172.18.68.139 log.
> Node 172.18.68.138 (data centre Z):
> The issue here seems to be repair session #a55c16e1-35eb-11e4-8e7e-51c077eaf311. It got responses for all its merkle tree requests, did some streaming, but seems to have stopped after finishing with one table (rollups60). I found it as follows: it is the only repair for which there is no "session completed successfully" message in the log.
> Some log file snippets are attached.
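As a practical aid to the tpstats check described in the quoted report, here is a small, self-contained sketch (a hypothetical helper, not part of Cassandra or nodetool) that scans {{nodetool tpstats}} output for the AntiEntropySessions pool and flags non-zero Active or Pending counts, which is exactly the signal used above to find the four stuck nodes:

{code}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: pipe `nodetool tpstats` into this program to flag a
// node that may be stuck in repair (AntiEntropySessions Active/Pending > 0).
public class TpstatsCheck {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            // tpstats rows are whitespace-separated:
            //   Pool Name  Active  Pending  Completed  Blocked  All time blocked
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 3 || !cols[0].equals("AntiEntropySessions")) continue;
            long active = Long.parseLong(cols[1]);
            long pending = Long.parseLong(cols[2]);
            if (active > 0 || pending > 0) {
                System.out.println("possible hung repair: Active=" + active
                        + " Pending=" + pending);
            }
        }
    }
}
{code}

Run as, for example, {{nodetool -h 192.168.21.13 tpstats | java TpstatsCheck}} against each node in turn; on the cluster described above it would have flagged the four nodes listed in the tpstats section.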