[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030201#comment-15030201
 ] 

Anuj Wadehra edited comment on CASSANDRA-7904 at 11/27/15 7:58 PM:
-------------------------------------------------------------------

[#Michael Shuler] First of all, I would like to express my discomfort with the 
way a few things are happening. In March 2015 the Apache website said that 
2.0.14 was the most stable version recommended for production, and by November 
(just 7 months later) people get little community support for 2.0.x (it is 
effectively EOL). It takes months to roll out a Cassandra upgrade to all of 
your clients, and by the time they are all on the latest version you learn that 
virtually no help is available from the community for it. I understand the fast 
paced environment, but can we revisit the strategy to make sure that once a 
version is declared "stable", it is supported by the community for at least one 
year?

Coming back to the issue, I see a similar problem in the 2.2.0 code too, and we 
are still facing it. Merkle tree requests across DCs are getting lost and then 
repair hangs. We have enabled DEBUG logging. When we started repair on a node, 
we got the error message "error writing to /X.X.X.X
java.io.IOException: Connection timed out" for 2 nodes in the remote DC, yet 
Merkle trees were received from both of them. We got no error for the 3rd node 
in the remote DC, and strangely that was the node which never received the 
Merkle tree request. Moreover, we observed that hinted handoff to the 3rd node 
started from the node being repaired, but the hint replay timed out too.
Many times, restarting the node and then running repair again is successful. 
Maybe there is an issue with the open TCP connections used by repair: 
restarting Cassandra creates new connections and the repair then succeeds.
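
One pattern that helps us catch this condition early is polling nodetool 
tpstats on the nodes involved and flagging a repair as stuck when the 
AntiEntropySessions pool keeps reporting active/pending work while its 
completed count stops moving. Below is only a minimal sketch of that check 
(Python); the hostnames, the poll interval and the assumption that nodetool is 
on the PATH are placeholders, not our actual tooling.

{code:python}
#!/usr/bin/env python
# Sketch: flag nodes whose AntiEntropySessions pool looks stuck.
# Assumptions: nodetool is on the PATH, JMX is reachable on every host,
# and tpstats prints "AntiEntropySessions  <active> <pending> <completed> ...".
import subprocess
import time

HOSTS = ["10.X.15.115", "10.X.14.115"]   # placeholder node addresses
POLL_SECONDS = 300                       # time between the two samples

def anti_entropy_stats(host):
    """Return (active, pending, completed) for AntiEntropySessions, or None."""
    out = subprocess.check_output(["nodetool", "-h", host, "tpstats"],
                                  universal_newlines=True)
    for line in out.splitlines():
        if line.startswith("AntiEntropySessions"):
            fields = line.split()
            return int(fields[1]), int(fields[2]), int(fields[3])
    return None

def main():
    first = dict((h, anti_entropy_stats(h)) for h in HOSTS)
    time.sleep(POLL_SECONDS)
    for host in HOSTS:
        before, after = first.get(host), anti_entropy_stats(host)
        if not before or not after:
            continue
        active, pending, completed = after
        # Active or pending sessions with no new completions looks hung.
        if (active or pending) and completed == before[2]:
            print("%s: repair possibly hung (active=%d pending=%d, "
                  "no completions in %ds)" % (host, active, pending, POLL_SECONDS))

if __name__ == "__main__":
    main()
{code}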

Please find attached the logs of node 10.X.15.115. The Merkle tree request 
never reached 10.X.14.115, so no response was received; absolutely nothing was 
logged on 10.X.14.115.
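
For reference, this is roughly how we confirm that the request never arrived: 
take the repair session id from the coordinator's log and search the logs on 
both sides for that id together with any Merkle-related lines. A rough sketch 
follows (Python); the log paths are placeholders, the session id shown is just 
the example from this ticket, and the match on "merkle" is kept deliberately 
loose because the exact log wording differs between versions.

{code:python}
#!/usr/bin/env python
# Sketch: check whether a repair session id ever shows up in a node's log.
# Assumptions: the logs have been copied locally (paths are placeholders)
# and the session id appears verbatim in the relevant lines.
import re

SESSION_ID = "a55c16e1-35eb-11e4-8e7e-51c077eaf311"  # example id from this ticket
LOGS = {
    "10.X.15.115 (coordinator)": "logs/10.X.15.115/system.log",
    "10.X.14.115 (silent peer)": "logs/10.X.14.115/system.log",
}

def matching_lines(path, session_id):
    """Return lines mentioning the session id or the word 'merkle'."""
    hits = []
    with open(path) as f:
        for line in f:
            if session_id in line or re.search("merkle", line, re.IGNORECASE):
                hits.append(line.rstrip())
    return hits

for label, path in LOGS.items():
    try:
        hits = matching_lines(path, SESSION_ID)
    except IOError as err:
        print("%s: could not read %s (%s)" % (label, path, err))
        continue
    print("%s: %d matching lines" % (label, len(hits)))
    for line in hits[:20]:   # print a sample only, not the whole log
        print("    " + line)
{code}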

Can we reopen the ticket?







> Repair hangs
> ------------
>
>                 Key: CASSANDRA-7904
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, 
> java version "1.7.0_45"
>            Reporter: Duncan Sands
>         Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, 
> ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres.  Not used on the weekend, so 
> repair is run on all nodes (in a staggered fashion) on the weekend.  Nodetool 
> options: -par -pr.  There is usually some overlap in the repairs: repair on 
> one node may well still be running when repair is started on the next node.  
> Repair hangs for some of the nodes almost every weekend.  It hung last 
> weekend, here are the details:
> In the whole cluster, only one node had an exception since C* was last 
> restarted.  This node is 192.168.60.136 and the exception is harmless: a 
> client disconnected abruptly.
> tpstats
>   4 nodes have a non-zero value for "active" or "pending" in 
> AntiEntropySessions.  These nodes all have Active => 1 and Pending => 1.  The 
> nodes are:
>   192.168.21.13 (data centre R)
>   192.168.60.134 (data centre A)
>   192.168.60.136 (data centre A)
>   172.18.68.138 (data centre Z)
> compactionstats:
>   No compactions.  All nodes have:
>     pending tasks: 0
>     Active compaction remaining time :        n/a
> netstats:
>   All except one node have nothing.  One node (192.168.60.131, not one of the 
> nodes listed in the tpstats section above) has (note the Responses Pending 
> value of 1):
>     Mode: NORMAL
>     Not sending any streams.
>     Read Repair Statistics:
>     Attempted: 4233
>     Mismatch (Blocking): 0
>     Mismatch (Background): 243
>     Pool Name                    Active   Pending      Completed
>     Commands                        n/a         0       34785445
>     Responses                       n/a         1       38567167
> Repair sessions
>   I looked for repair sessions that failed to complete.  On 3 of the 4 nodes 
> mentioned in tpstats above I found that they had sent merkle tree requests 
> and got responses from all but one node.  In the log file for the node that 
> failed to respond there is no sign that it ever received the request.  On 1 
> node (172.18.68.138) it looks like responses were received from every node, 
> some streaming was done, and then... nothing.  Details:
>   Node 192.168.21.13 (data centre R):
>     Sent merkle trees to /172.18.33.24, /192.168.60.140, /192.168.60.142, 
> /172.18.68.139, /172.18.68.138, /172.18.33.22, /192.168.21.13 for table 
> brokers, never got a response from /172.18.68.139.  On /172.18.68.139, just 
> before this time it sent a response for the same repair session but a 
> different table, and there is no record of it receiving a request for table 
> brokers.
>   Node 192.168.60.134 (data centre A):
>     Sent merkle trees to /172.18.68.139, /172.18.68.138, /192.168.60.132, 
> /192.168.21.14, /192.168.60.134 for table swxess_outbound, never got a 
> response from /172.18.68.138.  On /172.18.68.138, just before this time it 
> sent a response for the same repair session but a different table, and there 
> is no record of it receiving a request for table swxess_outbound.
>   Node 192.168.60.136 (data centre A):
>     Sent merkle trees to /192.168.60.142, /172.18.68.139, /192.168.60.136 for 
> table rollups7200, never got a response from /172.18.68.139.  This repair 
> session is never mentioned in the /172.18.68.139 log.
>   Node 172.18.68.138 (data centre Z):
>     The issue here seems to be repair session 
> #a55c16e1-35eb-11e4-8e7e-51c077eaf311.  It got responses for all its merkle 
> tree requests, did some streaming, but seems to have stopped after finishing 
> with one table (rollups60).  I found it as follows: it is the only repair for 
> which there is no "session completed successfully" message in the log.
> Some log file snippets are attached.
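
The log check described in the report above (finding the one repair session 
with no "session completed successfully" line) is easy to automate. A small 
sketch, assuming session ids appear in the log prefixed with "#" (as in 
"repair session #a55c16e1-...") and that finished sessions log the quoted 
phrase, both as shown above:

{code:python}
#!/usr/bin/env python
# Sketch: list repair sessions that never logged a completion message.
# Assumptions: session ids appear in the log as "#<uuid>" (e.g. "repair
# session #a55c16e1-...") and a finished session logs a line containing
# "session completed successfully", as described in the report above.
import re
import sys

SESSION_RE = re.compile(
    r"#([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})")
COMPLETED = "session completed successfully"

def unfinished_sessions(log_path):
    """Return session ids seen in the log that never reported completion."""
    seen, finished = set(), set()
    with open(log_path) as f:
        for line in f:
            match = SESSION_RE.search(line)
            if match:
                seen.add(match.group(1))
                if COMPLETED in line:
                    finished.add(match.group(1))
    return seen - finished

if __name__ == "__main__":
    for path in sys.argv[1:]:   # e.g. python find_hung_repairs.py system.log
        for session in sorted(unfinished_sessions(path)):
            print("%s: repair #%s has no completion message" % (path, session))
{code}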



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
