Please find attached the netstat -t -as output for the node on which repair hung and for the node which never got the Merkle Tree Request.

Thanks
Anuj
 


    On Sunday, 29 November 2015 11:13 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:
 

 Hi All,

I am summarizing the setup, the problem, and the key observations so far:

Setup: Cassandra 2.0.14. Two DCs with 3 nodes each, connected via a 10 Gbps VPN. We run repair with the -par and -pr options.
Problem: Repair hangs. Merkle Tree responses are not received from one or more nodes in the remote DC.
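
For reference, the repair invocation on each node is of the following form (the keyspace name is a placeholder and JMX options such as -u/-pw are omitted; -par requests parallel repair and -pr restricts it to the node's primary ranges):

    nodetool -h localhost -p 7199 repair -par -pr <my_keyspace>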

Observations so far:
1. Repair hangs intermittently on one node of DC2. On only one occasion, repair hung on one other node in DC2 too.
2. Mostly, the node from which the Merkle tree was not received does NOT have any "Sending completed merkle tree ..." message in its logs.
3. Often, Hinted Handoffs get triggered across DCs and the hint replays time out.
4. Many times, when repair is run after a long time it FAILS initially. But if we restart Cassandra and re-run repair, it SUCCEEDS.

Logs: DEBUG logs attached.

Observations from the logs:
1. When we started repair on 10.X.15.115, we got "error writing to /X.X.X.X java.io.IOException: Connection timed out" messages for 2 nodes in the remote DC: 10.X.14.113 and 10.X.14.111. Merkle trees were nevertheless received from these 2 nodes.

2. The Merkle Tree response was not received from the 3rd node in the remote DC, 10.X.14.115 (for which no error occurred).

3. Hinted handoff started for the 3rd node (10.X.14.115) but the hint replay timed out.

If it is a network issue, why does it show up only in DC2, and mostly on one node?

Thanks
Anuj


    On Sunday, 29 November 2015 10:44 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:
 

Yes, I think you are correct; the problem might have been resolved by the Cassandra restart rather than by increasing the request timeout.

We are NOT on EC2. We have 2 interfaces on each node: one private and one public.
We have a strange configuration and we need to correct it as per the recommendation at
https://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configMultiNetworks.html

AS-IS config:
We use broadcast_address = listen_address = PUBLIC IP address.
In seeds, we put the PUBLIC IPs of the other nodes but the PRIVATE IP for the local node. There were some issues when we tried to access the local node via its public IP.
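
For comparison, a sketch of the intended settings per that guide (the values are placeholders, the cassandra.yaml path is assumed for a typical package install, and each node needs a restart after the change):

    # intended per-node settings, roughly:
    #   listen_address: <this node's PRIVATE IP>
    #   broadcast_address: <this node's PUBLIC IP>
    #   seeds: public IPs, same list on every node
    grep -nE 'listen_address|broadcast_address|seeds' /etc/cassandra/conf/cassandra.yaml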


Thanks
Anuj
 
--------------------------------------------
On Tue, 24/11/15, Paulo Motta <pauloricard...@gmail.com> wrote:

 Subject: Re: Repair Hangs while requesting Merkle Trees
 To: "user@cassandra.apache.org" <user@cassandra.apache.org>, "Anuj Wadehra" <anujw_2...@yahoo.co.in>
 Date: Tuesday, 24 November, 2015, 12:38 AM
 
 The issue might be related to the ESTABLISHED connections present at just one end. I don't think it is related to the inter_dc_tcp_nodelay or request_timeout_in_ms options. Did you restart the process when you changed the request_timeout_in_ms option? That might be why the problem got fixed, rather than the option change.

 This seems like a network issue or a misconfiguration of this specific node. Are you using EC2? Is listen_address == broadcast_address? Are all nodes using the same configuration? What Java are you using?

 You may want to enable TRACE on OutboundTcpConnection and IncomingTcpConnection and compare the output of healthy nodes with that of the faulty node.
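
 A minimal sketch of enabling those TRACE loggers on 2.0 (the log4j and log file paths are assumptions for a typical package install; restart the node if the change is not picked up automatically):

    # append TRACE loggers for the inter-node connection classes
    echo 'log4j.logger.org.apache.cassandra.net.OutboundTcpConnection=TRACE' >> /etc/cassandra/conf/log4j-server.properties
    echo 'log4j.logger.org.apache.cassandra.net.IncomingTcpConnection=TRACE' >> /etc/cassandra/conf/log4j-server.properties
    # then compare system.log on a healthy node against the faulty node
    tail -f /var/log/cassandra/system.log | grep -E 'OutboundTcpConnection|IncomingTcpConnection'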
 
 2015-11-23 10:04 GMT-08:00 Anuj Wadehra <anujw_2...@yahoo.co.in>:

 Any comments on the ESTABLISHED connections at one end?

 Moreover, inter_dc_tcp_nodelay is false. Can this be the reason that latency between the two DCs is higher and repair messages are getting dropped?

 Can increasing request_timeout_in_ms deal with the latency issue?

 I see some hinted handoffs being triggered for cross-DC nodes, and the hint replays are timing out. Is that an indication of a network issue?

 I am getting in touch with the network team to capture netstat output and a tcpdump too.
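
 For example, a minimal capture on both ends could look like the sketch below, assuming inter-node traffic is on the default storage port 7000 and the interface is eth0 (both assumptions to adjust):

    # capture inter-node (storage protocol) traffic while repair runs; stop with Ctrl-C
    tcpdump -i eth0 -s 0 -w repair_port7000.pcap 'tcp port 7000'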
 
 
 
 Thanks
 
 Anuj
 
 
 
 
 
 --------------------------------------------
 
 On Wed, 18/11/15, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

  Subject: Re: Repair Hangs while requesting Merkle Trees
  To: "user@cassandra.apache.org" <user@cassandra.apache.org>
  Date: Wednesday, 18 November, 2015, 7:57 AM
 
 
 
 Thanks Bryan !!

 The connection is in ESTABLISHED state on one end and completely missing at the other end (in the other DC).

 Yes, we can revisit the TCP tuning. But the problem is node specific, so I am not sure whether tuning is the culprit.

 Thanks
 Anuj
 
 Sent from Yahoo Mail on Android

 From: "Bryan Cheng" <br...@blockcypher.com>
 Date: Wed, 18 Nov, 2015 at 2:04 am
 Subject: Re: Repair Hangs while requesting Merkle Trees
 
 
 
 Ah OK, might have misunderstood you. The streaming socket should not be in play during merkle tree generation (validation compaction). It may come into play during the merkle tree exchange - that I'm not sure about. You can read a bit more here: https://issues.apache.org/jira/browse/CASSANDRA-8611. Regardless, you should have it set; 1 hr is usually a good conservative estimate, but you can go much lower safely.
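
 A sketch of setting it to 1 hour (3600000 ms); the cassandra.yaml path is assumed for a typical package install and the node needs a restart afterwards:

    # uncomment/set streaming_socket_timeout_in_ms; if the key is absent, just add the line
    sudo sed -i 's/^#\{0,1\} *streaming_socket_timeout_in_ms:.*/streaming_socket_timeout_in_ms: 3600000/' /etc/cassandra/conf/cassandra.yaml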
 
 What state is the connection in that only shows on one side? Is it ESTABLISHED, or something like CLOSE_WAIT?

 Here's a good place to start for tuning, though it doesn't have as much about network tuning: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html.

 More generally, TCP tuning usually revolves around a balance between latency and bandwidth. Over long connections (we're talking 10s of ms, instead of the sub-1ms you usually see in a good DC network), your expectations will shift greatly. Stuff like NODELAY on TCP is very nice for cutting your latencies when you're inside a DC, but will generate lots of small packets that will hurt your bandwidth over longer connections due to the need to wait for acks. otc_coalescing_strategy is in a similar vein, bundling together nearby messages to trade latency for throughput. You'll also probably want to tune your TCP buffers and window sizes, since that determines how much data can be in flight between acknowledgements, and the default size is pitiful for any decent network. Google around for TCP tuning/buffer tuning and you should find some good resources.
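
 As an illustrative starting point only (the values are placeholders to be sized against your bandwidth-delay product, not recommendations), raising the buffer ceilings looks roughly like:

    # allow larger TCP send/receive buffers (bytes); persist in /etc/sysctl.conf once validated
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.ipv4.tcp_rmem='4096 87380 16777216'
    sysctl -w net.ipv4.tcp_wmem='4096 65536 16777216'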
 
 On Mon, Nov 16, 2015 at 5:23 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

 Hi Bryan,

 Thanks for the reply !! I didn't mean streaming_socket_timeout_in_ms. I meant that when you run netstat (the Linux command) on node A in DC1, you will notice that there is a connection in the ESTABLISHED state with node B in DC2. But when you run netstat on node B, you won't find any connection with node A. Such connections are there across DCs. Is that a problem?
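
 To make that concrete, one way to compare the two ends (assuming the default inter-node port 7000; the peer IPs are placeholders):

    # on node A in DC1: list inter-node connections involving node B
    netstat -tan | grep ':7000' | grep <node-B-IP>
    # on node B in DC2: the same connection should show up here; if it does not,
    # node A is holding a half-open ("phantom") connection
    netstat -tan | grep ':7000' | grep <node-A-IP>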
 
 We haven't set streaming_socket_timeout_in_ms, which I know must be set. But I am not sure whether setting this property has any impact on merkle tree requests. I thought it is only valid for data streaming, if some mismatch is found and data needs to be streamed. Please confirm. What value do you use for the streaming socket timeout?

 Moreover, if a socket timeout were the issue, that should happen on other nodes too. Repair is not running on just one node, as the merkle tree request is getting lost and not transmitted to one or more nodes in the remote DC.

 I am not sure about the exact distance, but the DCs are connected with a very high speed 10 Gbps link.

 When you say different TCP stack tuning, do you have any document/blog/link describing recommendations for a multi-DC Cassandra setup? Can you elaborate on which settings need to be different?

 Thanks
 Anuj
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 Sent from Yahoo Mail on Android

 From: "Bryan Cheng" <br...@blockcypher.com>
 Date: Tue, 17 Nov, 2015 at 5:54 am
 Subject: Re: Repair Hangs while requesting Merkle Trees
 
 
 
 Hi Anuj,

 Did you mean streaming_socket_timeout_in_ms? If not, then you definitely want that set. Even the best network connections will break occasionally, and in Cassandra < 2.1.10 (I believe) this would leave those connections hanging indefinitely on one end.

 How far away are your two DCs from a network perspective, out of curiosity? You'll almost certainly be doing different TCP stack tuning for cross-DC, notably your buffer sizes, window params, and Cassandra-specific stuff like otc_coalescing_strategy, inter_dc_tcp_nodelay, etc.
 
 On Sat, Nov 14, 2015 at 10:35 AM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

 One more observation: we observed that there are a few TCP connections which a node shows as ESTABLISHED, but when we go to the node at the other end, the connection is not there. They are called "phantom" connections, I guess. Can this be a possible cause?

 Thanks
 Anuj
 
 
 
 Sent from Yahoo Mail on Android

 From: "Anuj Wadehra" <anujw_2...@yahoo.co.in>
 Date: Sat, 14 Nov, 2015 at 11:59 pm
 Subject: Re: Repair Hangs while requesting Merkle Trees
 
 
 
 Thanks Daemeon !!

 I will capture the output of netstat and share it in the next few days. We were thinking of taking tcpdumps also. If it is a network issue and increasing the request timeout worked, I am not sure how Cassandra could be dropping messages based on the timeout: repair messages are non-droppable and not supposed to be timed out.

 2 of the 3 nodes in the DC are able to complete repair without any issue. Just one node is problematic.

 I also observed frequent messages in the logs of other nodes which say that hint replays timed out, and the node where hints were being replayed is always a remote DC node. Is it related somehow?

 Thanks
 Anuj
 
 Sent from Yahoo Mail on Android

 From: "daemeon reiydelle" <daeme...@gmail.com>
 Date: Thu, 12 Nov, 2015 at 10:34 am
 Subject: Re: Repair Hangs while requesting Merkle Trees
 
 
 
 
 
 Have you checked the network statistics on that machine (netstat -tas) while attempting to repair? If netstat shows ANY issues, you have a problem. If you can, put the command in a loop running every 60 seconds for maybe 15 minutes and post back?
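
 For instance, a sketch of such a loop (15 samples, one per minute, collected to a file that can be posted back):

    # sample TCP statistics every 60 seconds for ~15 minutes while repair runs
    for i in $(seq 1 15); do date; netstat -tas; sleep 60; done > netstat_samples.txt 2>&1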
 
 
 
 Out of curiosity, how many remote DC nodes are getting successfully repaired?
 
 
 
 
 
 .......
 “Life should not be a journey to the grave with the intention of arriving safely in a pretty and well preserved body, but rather to skid in broadside in a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming “Wow! What a Ride!” - Hunter Thompson

 Daemeon C.M. Reiydelle
 USA (+1) 415.501.0198
 London (+44) (0) 20 8144 9872
 
 
 
 
 
 On Wed, Nov 11, 2015 at 1:06 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:
 
 Hi,

 We are using 2.0.14. We have 2 DCs at remote locations with 10 Gbps connectivity. We are able to complete repair (-par -pr) on 5 nodes. On only one node in DC2, we are unable to complete repair as it always hangs. The node sends Merkle Tree requests, but one or more nodes in DC1 (remote) never show that they sent the merkle tree reply to the requesting node. Repair hangs indefinitely.

 After increasing request_timeout_in_ms on the affected node, we were able to successfully run repair on one of two occasions.

 Any comments on why this is happening on just one node? In OutboundTcpConnection.java, the isTimeOut method always returns false for a non-droppable verb such as the Merkle Tree Request (verb=REPAIR_MESSAGE), so why did increasing the request timeout solve the problem on one occasion?

 Thanks
 Anuj Wadehra
 
 
 
 
 
 
 
netstat where repair was run
-----------------------------------

[root@10.X.15.115 ~]# /usr/share/cassandra/bin/nodetool -h localhost -u vs -pw acce55mtnSik@ -p 7199 compactionstats
pending tasks: 0
Active compaction remaining time :        n/a
[root@10.X.15.115 ~]# netstat -as -t
IcmpMsg:
    InType0: 6
    InType3: 921142
    InType8: 2548
    InType11: 4
    InType13: 2
    InType17: 3
    OutType0: 2547
    OutType3: 930601
    OutType8: 89
    OutType14: 1
Tcp:
    10277983 active connections openings
    314928 passive connection openings
    9964031 failed connection attempts
    352 connection resets received
    145 connections established
    805764248 segments received
    795374523 segments send out
    7801 segments retransmited
    10 bad segments received.
    9979214 resets sent
UdpLite:
TcpExt:
    42 invalid SYN cookies received
    150 resets received for embryonic SYN_RECV sockets
    71 packets pruned from receive queue because of socket buffer overrun
    12 ICMP packets dropped because they were out-of-window
    74 ICMP packets dropped because socket was locked
    314552 TCP sockets finished time wait in fast timer
    15 packets rejects in established connections because of timestamp
    579239 delayed acks sent
    317 delayed acks further delayed because of locked socket
    Quick ack mode was activated 330 times
    269778292 packets directly queued to recvmsg prequeue.
    586790295 packets directly received from backlog
    6119347587 packets directly received from prequeue
    326548987 packets header predicted
    2937344 packets header predicted and directly queued to user
    4598778 acknowledgments not containing data received
    416769415 predicted acknowledgments
    130 times recovered from packet loss due to SACK data
    Detected reordering 24 times using SACK
    Detected reordering 24 times using time stamp
    74 congestion windows fully recovered
    286 congestion windows partially recovered using Hoe heuristic
    270 congestion windows recovered after partial ack
    41 TCP data loss events
    58 timeouts after SACK recovery
    209 fast retransmits
    58 forward retransmits
    2 retransmits in slow start
    1971 other TCP timeouts
    1 sack retransmits failed
    24 times receiver scheduled too late for direct processing
    9377 packets collapsed in receive queue due to low socket buffer
    387 DSACKs sent for old packets
    444 DSACKs received
    1 DSACKs for out of order packets received
    4 connections reset due to unexpected data
    160 connections reset due to early user close
    1267 connections aborted due to timeout
    TCPDSACKIgnoredNoUndo: 147
    TCPSackShifted: 2149
    TCPSackMerged: 4165
    TCPSackShiftFallback: 33581
    TCPChallengeACK: 2
    TCPSYNChallenge: 2
IpExt:
    InMcastPkts: 3249403
    OutMcastPkts: 656933
    InBcastPkts: 19
    InOctets: 223747810430
    OutOctets: 178956706487
    InMcastOctets: 310302917
    OutMcastOctets: 75420949
    InBcastOctets: 1334
        
netstat of the remote node 10.X.14.115 which didn't get the Merkle Tree Request
-------------------------------------------------------------------
# netstat -as -t
IcmpMsg:
    InType0: 84
    InType3: 482823
    InType8: 41478
    InType13: 1
    InType17: 3
    OutType0: 978
    OutType3: 489300
    OutType8: 154
    OutType14: 1
Tcp:
    5051110 active connections openings
    1767157 passive connection openings
    4889003 failed connection attempts
    1613465 connection resets received
    173 connections established
    722191123 segments received
    701374596 segments send out
    76959 segments retransmited
    0 bad segments received.
    4889033 resets sent
UdpLite:
TcpExt:
    231 invalid SYN cookies received
    158 resets received for embryonic SYN_RECV sockets
    28 packets pruned from receive queue because of socket buffer overrun
    12 ICMP packets dropped because they were out-of-window
    152472 TCP sockets finished time wait in fast timer
    1663029 delayed acks sent
    1012 delayed acks further delayed because of locked socket
    Quick ack mode was activated 369 times
    318763619 packets directly queued to recvmsg prequeue.
    1208873334 packets directly received from backlog
    24533188191 packets directly received from prequeue
    354059635 packets header predicted
    42895337 packets header predicted and directly queued to user
    9163142 acknowledgments not containing data received
    318352481 predicted acknowledgments
    2367 times recovered from packet loss due to SACK data
    TCPDSACKUndo: 56
    3 congestion windows recovered after partial ack
    46372 TCP data loss events
    TCPLostRetransmit: 494
    2 timeouts after SACK recovery
    24614 fast retransmits
    3367 forward retransmits
    552 retransmits in slow start
    22968 other TCP timeouts
    70 sack retransmits failed
    4 times receiver scheduled too late for direct processing
    2203 packets collapsed in receive queue due to low socket buffer
    371 DSACKs sent for old packets
    1303 DSACKs received
    3 connections reset due to unexpected data
    10221 connections reset due to early user close
    1281 connections aborted due to timeout
    TCPSACKDiscard: 418
    TCPDSACKIgnoredOld: 446
    TCPDSACKIgnoredNoUndo: 453
    TCPSpuriousRTOs: 7
    TCPSackShifted: 81212
    TCPSackMerged: 67868
    TCPSackShiftFallback: 15704
    TCPChallengeACK: 1
IpExt:
    InMcastPkts: 2117185
    OutMcastPkts: 808732
    InBcastPkts: 40500
    InOctets: 146930455537
    OutOctets: 219454781573
    InMcastOctets: 215958192
    OutMcastOctets: 91377640
    InBcastOctets: 3402000
