Please find attached the netstat -t -as output for the node on which repair hung and for the node which never got the Merkle Tree Request.

Thanks
Anuj
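For sampling over time, as suggested earlier in the thread (roughly every 60 seconds for 15 minutes), a loop along the lines below can be run on each node; the file name, interval and iteration count are only illustrative:

for i in $(seq 1 15); do
    date >> /tmp/netstat_$(hostname).log        # timestamp each sample
    netstat -t -as >> /tmp/netstat_$(hostname).log
    sleep 60
done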
On Sunday, 29 November 2015 11:13 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Hi All,

I am summarizing the setup, problem & key observations till now:

Setup: Cassandra 2.0.14. 2 DCs with 3 nodes each, connected via a 10 Gbps VPN. We run repair with the -par and -pr options.

Problem: Repair hangs. Merkle Tree Responses are not received from one or more nodes in the remote DC.

Observations till now:
1. Repair hangs intermittently on one node of DC2. Only on one occasion did repair hang on another node in DC2 too.
2. Mostly, the node from which the Merkle tree was not received does NOT have any "Sending completed merkle tree .." message in its logs.
3. Often Hinted Handoffs get triggered across DCs and hint replays time out.
4. Many times, when repair is run after a long time it FAILS initially. But if we restart Cassandra and re-run repair, it SUCCEEDS.

Logs: DEBUG logs attached.

Observations from the logs:
1. When we started repair on 10.X.15.115, we got error messages "error writing to /X.X.X.X java.io.IOException: Connection timed out" for 2 nodes in the remote DC: 10.X.14.113 and 10.X.14.111. Merkle trees were received from these 2 nodes.
2. The Merkle Tree response was not received from the 3rd node in the remote DC: 10.X.14.115 (for which no error occurred).
3. Hinted handoff started for the 3rd node (10.X.14.115) but the hint replay timed out.

If it's a network issue, then why is the issue only in DC2 and mostly observed on one node?

Thanks
Anuj

On Sunday, 29 November 2015 10:44 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Yes, I think you are correct; the problem might have been resolved by the Cassandra restart rather than by increasing the request timeout.

We are NOT on EC2. We have 2 interfaces on each node: one private and one public. We have a strange configuration and we need to correct it as per the recommendation at https://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configMultiNetworks.html .

AS-IS config: We use broadcast address = listen address = PUBLIC IP address. In seeds, we put the PUBLIC IPs of the other nodes but the private IP for the local node. There were some issues when we tried to access the local node via its public IP.

Thanks
Anuj

--------------------------------------------
On Tue, 24/11/15, Paulo Motta <pauloricard...@gmail.com> wrote:

Subject: Re: Repair Hangs while requesting Merkle Trees
To: "user@cassandra.apache.org" <user@cassandra.apache.org>, "Anuj Wadehra" <anujw_2...@yahoo.co.in>
Date: Tuesday, 24 November, 2015, 12:38 AM

The issue might be related to the ESTABLISHED connections present on just one end. I don't think it is related to the inter_dc_tcp_nodelay or request_timeout_in_ms options. Did you restart the process when you changed the request_timeout_in_ms option? This might be why the problem got fixed, and not the option change.

This seems like a network issue or a misconfiguration of this specific node. Are you using EC2? Is listen_address == broadcast_address? Are all nodes using the same configuration? What Java are you using?

You may want to enable TRACE on OutboundTcpConnection and IncomingTcpConnection and compare the output of healthy nodes with the faulty node.

2015-11-23 10:04 GMT-08:00 Anuj Wadehra <anujw_2...@yahoo.co.in>:

Any comments on the ESTABLISHED connections at one end? Moreover, inter_dc_tcp_nodelay is false. Can this be the reason that latency between the two DCs is higher and repair messages are getting dropped? Can increasing request_timeout_in_ms deal with the latency issue?
I see some hinted handoffs being triggered for cross-DC nodes, and hint replays being timed out. Is that an indication of a network issue? I am getting in touch with the network team to capture netstat output and tcpdump too.

Thanks
Anuj

--------------------------------------------
On Wed, 18/11/15, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Subject: Re: Repair Hangs while requesting Merkle Trees
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, 18 November, 2015, 7:57 AM

Thanks Bryan !!

The connection is in the ESTABLISHED state on one end and completely missing at the other end (in the other DC). Yes, we can revisit TCP tuning, but the problem is node specific, so I am not sure whether tuning is the culprit.

Thanks
Anuj

Sent from Yahoo Mail on Android

From: "Bryan Cheng" <br...@blockcypher.com>
Date: Wed, 18 Nov, 2015 at 2:04 am
Subject: Re: Repair Hangs while requesting Merkle Trees

Ah OK, might have misunderstood you. The streaming socket should not be in play during merkle tree generation (validation compaction). It may come into play during the merkle tree exchange; that I'm not sure about. You can read a bit more here: https://issues.apache.org/jira/browse/CASSANDRA-8611. Regardless, you should have it set; 1 hr is usually a good conservative estimate, but you can go much lower safely.

What state is the connection in that only shows on one side? Is it ESTABLISHED, or something like CLOSE_WAIT?

Here's a good place to start for tuning, though it doesn't have as much about network tuning: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html. More generally, TCP tuning usually revolves around a balance between latency and bandwidth. Over long connections (we're talking tens of ms, instead of the sub-1 ms you usually see in a good DC network), your expectations will shift greatly. Stuff like NODELAY on TCP is very nice for cutting your latencies when you're inside a DC, but it will generate lots of small packets that will hurt your bandwidth over longer connections due to the need to wait for acks. otc_coalescing_strategy is in a similar vein, bundling together nearby messages to trade latency for throughput. You'll also probably want to tune your TCP buffers and window sizes, since they determine how much data can be in flight between acknowledgements, and the default sizes are pitiful for any decent network. Google around for TCP tuning/buffer tuning and you should find some good resources.

On Mon, Nov 16, 2015 at 5:23 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Hi Bryan,

Thanks for the reply !! I didn't mean streaming_socket_timeout_in_ms. I meant that when you run netstat (the Linux command) on node A in DC1, you will notice a connection in the ESTABLISHED state with node B in DC2, but when you run netstat on node B, you won't find any connection with node A. Such connections exist across the DCs. Is that a problem?

We haven't set streaming_socket_timeout_in_ms, which I know must be set. But I am not sure whether setting this property has any impact on merkle tree requests. I thought it applies to data streaming when some mismatch is found and data needs to be streamed. Please confirm. What value do you use for the streaming socket timeout?

Moreover, if a socket timeout were the issue, it should happen on other nodes too; repair is not run on just one node, yet the merkle tree request is getting lost and not transmitted to one or more nodes in the remote DC.

I am not sure about the exact distance, but they are connected with a very high speed 10 Gbps link.
When you say different TCP stack tuning, do you have any document/blog/link describing recommendations for a multi-DC Cassandra setup? Can you elaborate on which settings need to be different?

Thanks
Anuj

Sent from Yahoo Mail on Android

From: "Bryan Cheng" <br...@blockcypher.com>
Date: Tue, 17 Nov, 2015 at 5:54 am
Subject: Re: Repair Hangs while requesting Merkle Trees

Hi Anuj,

Did you mean streaming_socket_timeout_in_ms? If not, then you definitely want that set. Even the best network connections will break occasionally, and in Cassandra < 2.1.10 (I believe) this would leave those connections hanging indefinitely on one end.

How far away are your two DCs from a network perspective, out of curiosity? You'll almost certainly be doing different TCP stack tuning for cross-DC, notably your buffer sizes, window params, and Cassandra-specific stuff like otc_coalescing_strategy, inter_dc_tcp_nodelay, etc.

On Sat, Nov 14, 2015 at 10:35 AM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

One more observation: we observed that there are a few TCP connections which a node shows as ESTABLISHED, but when we go to the node at the other end, the connection is not there. They are called "phantom" connections, I guess. Can this be a possible cause?

Thanks
Anuj

Sent from Yahoo Mail on Android

From: "Anuj Wadehra" <anujw_2...@yahoo.co.in>
Date: Sat, 14 Nov, 2015 at 11:59 pm
Subject: Re: Repair Hangs while requesting Merkle Trees

Thanks Daemeon !!

I will capture the output of netstat and share it in the next few days. We were thinking of taking tcpdumps also. If it's a network issue and increasing the request timeout worked, I'm not sure how Cassandra is dropping messages based on a timeout; repair messages are non-droppable and not supposed to be timed out. 2 of the 3 nodes in the DC are able to complete repair without any issue. Just one node is problematic.

I also observed frequent messages in the logs of other nodes saying that hint replays timed out, and the node where hints were being replayed is always a remote DC node. Is that related somehow?

Thanks
Anuj

Sent from Yahoo Mail on Android

From: "daemeon reiydelle" <daeme...@gmail.com>
Date: Thu, 12 Nov, 2015 at 10:34 am
Subject: Re: Repair Hangs while requesting Merkle Trees

Have you checked the network statistics on that machine (netstat -tas) while attempting to repair? If netstat shows ANY issues you have a problem. Can you put the command in a loop running every 60 seconds for maybe 15 minutes and post back?

Out of curiosity, how many remote DC nodes are getting successfully repaired?

.......
“Life should not be a journey to the grave with the intention of arriving safely in a pretty and well preserved body, but rather to skid in broadside in a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming ‘Wow! What a Ride!’” - Hunter Thompson

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Wed, Nov 11, 2015 at 1:06 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Hi,

We are using 2.0.14. We have 2 DCs at remote locations with 10 Gbps connectivity. We are able to complete repair (-par -pr) on 5 nodes. On only one node in DC2 are we unable to complete repair, as it always hangs. The node sends Merkle Tree requests, but one or more nodes in DC1 (remote) never show that they sent the merkle tree reply to the requesting node. Repair hangs indefinitely. After increasing request_timeout_in_ms on the affected node, we were able to successfully run repair on one of the two occasions.

Any comments on why this is happening on just one node?
In OutboundTcpConnection.java, the isTimeOut method always returns false for a non-droppable verb such as the Merkle Tree Request (verb=REPAIR_MESSAGE), so why did increasing the request timeout solve the problem on one occasion?

Thanks
Anuj Wadehra
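For reference, the configuration recommended by the DataStax multi-network page linked earlier in the thread keeps the two interfaces separated roughly as in the sketch below; the addresses are placeholders, and the exact snitch and seed layout should be taken from that page rather than from this sketch:

# cassandra.yaml (illustrative excerpt, not the cluster's real values)
listen_address: <private IP of this node>      # used for traffic on the local network
broadcast_address: <public IP of this node>    # advertised to nodes in the other DC
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "<public IP of seed 1>,<public IP of seed 2>"   # same public-IP list on every node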
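Similarly, the half-open ("phantom") connections discussed in the thread can be confirmed by listing the inter-node sockets on both ends and comparing; this assumes the default storage_port 7000 and uses the thread's masked addresses as placeholders:

# on the node running repair (10.X.15.115): sockets towards the remote node
netstat -tn | grep ':7000' | grep '10.X.14.115'
# on the remote node (10.X.14.115): sockets back towards the repairing node
netstat -tn | grep ':7000' | grep '10.X.15.115'
# a connection shown as ESTABLISHED on one side but absent on the other is half-open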
netstat of the node where repair was run (10.X.15.115)
-----------------------------------
[root@10.X.15.115 ~]# /usr/share/cassandra/bin/nodetool -h localhost -u vs -pw acce55mtnSik@ -p 7199 compactionstats
pending tasks: 0
Active compaction remaining time : n/a
[root@10.X.15.115 ~]# netstat -as -t
IcmpMsg:
    InType0: 6
    InType3: 921142
    InType8: 2548
    InType11: 4
    InType13: 2
    InType17: 3
    OutType0: 2547
    OutType3: 930601
    OutType8: 89
    OutType14: 1
Tcp:
    10277983 active connections openings
    314928 passive connection openings
    9964031 failed connection attempts
    352 connection resets received
    145 connections established
    805764248 segments received
    795374523 segments send out
    7801 segments retransmited
    10 bad segments received.
    9979214 resets sent
UdpLite:
TcpExt:
    42 invalid SYN cookies received
    150 resets received for embryonic SYN_RECV sockets
    71 packets pruned from receive queue because of socket buffer overrun
    12 ICMP packets dropped because they were out-of-window
    74 ICMP packets dropped because socket was locked
    314552 TCP sockets finished time wait in fast timer
    15 packets rejects in established connections because of timestamp
    579239 delayed acks sent
    317 delayed acks further delayed because of locked socket
    Quick ack mode was activated 330 times
    269778292 packets directly queued to recvmsg prequeue.
    586790295 packets directly received from backlog
    6119347587 packets directly received from prequeue
    326548987 packets header predicted
    2937344 packets header predicted and directly queued to user
    4598778 acknowledgments not containing data received
    416769415 predicted acknowledgments
    130 times recovered from packet loss due to SACK data
    Detected reordering 24 times using SACK
    Detected reordering 24 times using time stamp
    74 congestion windows fully recovered
    286 congestion windows partially recovered using Hoe heuristic
    270 congestion windows recovered after partial ack
    41 TCP data loss events
    58 timeouts after SACK recovery
    209 fast retransmits
    58 forward retransmits
    2 retransmits in slow start
    1971 other TCP timeouts
    1 sack retransmits failed
    24 times receiver scheduled too late for direct processing
    9377 packets collapsed in receive queue due to low socket buffer
    387 DSACKs sent for old packets
    444 DSACKs received
    1 DSACKs for out of order packets received
    4 connections reset due to unexpected data
    160 connections reset due to early user close
    1267 connections aborted due to timeout
    TCPDSACKIgnoredNoUndo: 147
    TCPSackShifted: 2149
    TCPSackMerged: 4165
    TCPSackShiftFallback: 33581
    TCPChallengeACK: 2
    TCPSYNChallenge: 2
IpExt:
    InMcastPkts: 3249403
    OutMcastPkts: 656933
    InBcastPkts: 19
    InOctets: 223747810430
    OutOctets: 178956706487
    InMcastOctets: 310302917
    OutMcastOctets: 75420949
    InBcastOctets: 1334

netstat of the remote node 10.X.14.115 which didn't get the Merkle Tree Request
-------------------------------------------------------------------
# netstat -as -t
IcmpMsg:
    InType0: 84
    InType3: 482823
    InType8: 41478
    InType13: 1
    InType17: 3
    OutType0: 978
    OutType3: 489300
    OutType8: 154
    OutType14: 1
Tcp:
    5051110 active connections openings
    1767157 passive connection openings
    4889003 failed connection attempts
    1613465 connection resets received
    173 connections established
    722191123 segments received
    701374596 segments send out
    76959 segments retransmited
    0 bad segments received.
    4889033 resets sent
UdpLite:
TcpExt:
    231 invalid SYN cookies received
    158 resets received for embryonic SYN_RECV sockets
    28 packets pruned from receive queue because of socket buffer overrun
    12 ICMP packets dropped because they were out-of-window
    152472 TCP sockets finished time wait in fast timer
    1663029 delayed acks sent
    1012 delayed acks further delayed because of locked socket
    Quick ack mode was activated 369 times
    318763619 packets directly queued to recvmsg prequeue.
    1208873334 packets directly received from backlog
    24533188191 packets directly received from prequeue
    354059635 packets header predicted
    42895337 packets header predicted and directly queued to user
    9163142 acknowledgments not containing data received
    318352481 predicted acknowledgments
    2367 times recovered from packet loss due to SACK data
    TCPDSACKUndo: 56
    3 congestion windows recovered after partial ack
    46372 TCP data loss events
    TCPLostRetransmit: 494
    2 timeouts after SACK recovery
    24614 fast retransmits
    3367 forward retransmits
    552 retransmits in slow start
    22968 other TCP timeouts
    70 sack retransmits failed
    4 times receiver scheduled too late for direct processing
    2203 packets collapsed in receive queue due to low socket buffer
    371 DSACKs sent for old packets
    1303 DSACKs received
    3 connections reset due to unexpected data
    10221 connections reset due to early user close
    1281 connections aborted due to timeout
    TCPSACKDiscard: 418
    TCPDSACKIgnoredOld: 446
    TCPDSACKIgnoredNoUndo: 453
    TCPSpuriousRTOs: 7
    TCPSackShifted: 81212
    TCPSackMerged: 67868
    TCPSackShiftFallback: 15704
    TCPChallengeACK: 1
IpExt:
    InMcastPkts: 2117185
    OutMcastPkts: 808732
    InBcastPkts: 40500
    InOctets: 146930455537
    OutOctets: 219454781573
    InMcastOctets: 215958192
    OutMcastOctets: 91377640
    InBcastOctets: 3402000
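Finally, regarding Paulo's suggestion to enable TRACE on the TCP connection classes: assuming the stock log4j setup that ships with Cassandra 2.0, entries along these lines in conf/log4j-server.properties (on both the repairing node and the remote node) should produce the per-connection logging needed to compare healthy nodes with the faulty one:

# conf/log4j-server.properties (illustrative; add alongside the existing logger entries)
log4j.logger.org.apache.cassandra.net.OutboundTcpConnection=TRACE
log4j.logger.org.apache.cassandra.net.IncomingTcpConnection=TRACE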