Please find attached the netstat -t -as output for the node on which repair hung and for the node which never got the Merkle Tree Request.

Thanks
Anuj
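For sampling over time, as suggested earlier in the thread (roughly every 60 seconds for 15 minutes), a loop along the lines below can be run on each node; the file name, interval and iteration count are only illustrative:

for i in $(seq 1 15); do
    date >> /tmp/netstat_$(hostname).log        # timestamp each sample
    netstat -t -as >> /tmp/netstat_$(hostname).log
    sleep 60
done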
On Sunday, 29 November 2015 11:13 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Hi All,

I am summarizing the setup, problem & key observations till now:

Setup: Cassandra 2.0.14. 2 DCs with 3 nodes each, connected via a 10 Gbps VPN. We run repair with the -par and -pr options.

Problem: Repair hangs. Merkle Tree Responses are not received from one or more nodes in the remote DC.

Observations till now:
1. Repair hangs intermittently on one node of DC2. Only on one occasion did repair hang on another node in DC2 too.
2. Mostly, the node from which the Merkle tree was not received does NOT have any "Sending completed merkle tree .." message in its logs.
3. Often Hinted Handoffs get triggered across DCs and hint replays time out.
4. Many times, when repair is run after a long time it FAILS initially. But if we restart Cassandra and re-run repair, it SUCCEEDS.

Logs: DEBUG logs attached.

Observations from the logs:
1. When we started repair on 10.X.15.115, we got error messages "error writing to /X.X.X.X java.io.IOException: Connection timed out" for 2 nodes in the remote DC: 10.X.14.113 and 10.X.14.111. Merkle trees were received from these 2 nodes.
2. The Merkle Tree response was not received from the 3rd node in the remote DC: 10.X.14.115 (for which no error occurred).
3. Hinted handoff started for the 3rd node (10.X.14.115) but the hint replay timed out.

If it's a network issue, then why is the issue only in DC2 and mostly observed on one node?

Thanks
Anuj

On Sunday, 29 November 2015 10:44 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Yes, I think you are correct; the problem might have been resolved by the Cassandra restart rather than by increasing the request timeout.

We are NOT on EC2. We have 2 interfaces on each node: one private and one public. We have a strange configuration and we need to correct it as per the recommendation at https://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configMultiNetworks.html .

AS-IS config: We use broadcast address = listen address = PUBLIC IP address. In seeds, we put the PUBLIC IPs of the other nodes but the private IP for the local node. There were some issues when we tried to access the local node via its public IP.

Thanks
Anuj

--------------------------------------------
On Tue, 24/11/15, Paulo Motta <pauloricard...@gmail.com> wrote:

Subject: Re: Repair Hangs while requesting Merkle Trees
To: "user@cassandra.apache.org" <user@cassandra.apache.org>, "Anuj Wadehra" <anujw_2...@yahoo.co.in>
Date: Tuesday, 24 November, 2015, 12:38 AM

The issue might be related to the ESTABLISHED connections present on just one end. I don't think it is related to the inter_dc_tcp_nodelay or request_timeout_in_ms options. Did you restart the process when you changed the request_timeout_in_ms option? This might be why the problem got fixed, and not the option change.

This seems like a network issue or a misconfiguration of this specific node. Are you using EC2? Is listen_address == broadcast_address? Are all nodes using the same configuration? What Java are you using?

You may want to enable TRACE on OutboundTcpConnection and IncomingTcpConnection and compare the output of healthy nodes with the faulty node.

2015-11-23 10:04 GMT-08:00 Anuj Wadehra <anujw_2...@yahoo.co.in>:

Any comments on the ESTABLISHED connections at one end? Moreover, inter_dc_tcp_nodelay is false. Can this be the reason that latency between the two DCs is higher and repair messages are getting dropped? Can increasing request_timeout_in_ms deal with the latency issue?
I see some hinted handoffs being triggered for cross-DC nodes, and hint replays being timed out. Is that an indication of a network issue? I am getting in touch with the network team to capture netstat output and tcpdump too.

Thanks
Anuj

--------------------------------------------
On Wed, 18/11/15, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Subject: Re: Repair Hangs while requesting Merkle Trees
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, 18 November, 2015, 7:57 AM

Thanks Bryan !!

The connection is in the ESTABLISHED state on one end and completely missing at the other end (in the other DC). Yes, we can revisit TCP tuning, but the problem is node specific, so I am not sure whether tuning is the culprit.

Thanks
Anuj

Sent from Yahoo Mail on Android

From: "Bryan Cheng" <br...@blockcypher.com>
Date: Wed, 18 Nov, 2015 at 2:04 am
Subject: Re: Repair Hangs while requesting Merkle Trees

Ah OK, might have misunderstood you. The streaming socket should not be in play during merkle tree generation (validation compaction). It may come into play during the merkle tree exchange; that I'm not sure about. You can read a bit more here: https://issues.apache.org/jira/browse/CASSANDRA-8611. Regardless, you should have it set; 1 hr is usually a good conservative estimate, but you can go much lower safely.

What state is the connection in that only shows on one side? Is it ESTABLISHED, or something like CLOSE_WAIT?

Here's a good place to start for tuning, though it doesn't have as much about network tuning: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html. More generally, TCP tuning usually revolves around a balance between latency and bandwidth. Over long connections (we're talking tens of ms, instead of the sub-1 ms you usually see in a good DC network), your expectations will shift greatly. Stuff like NODELAY on TCP is very nice for cutting your latencies when you're inside a DC, but it will generate lots of small packets that will hurt your bandwidth over longer connections due to the need to wait for acks. otc_coalescing_strategy is in a similar vein, bundling together nearby messages to trade latency for throughput. You'll also probably want to tune your TCP buffers and window sizes, since they determine how much data can be in flight between acknowledgements, and the default sizes are pitiful for any decent network. Google around for TCP tuning/buffer tuning and you should find some good resources.

On Mon, Nov 16, 2015 at 5:23 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Hi Bryan,

Thanks for the reply !! I didn't mean streaming_socket_timeout_in_ms. I meant that when you run netstat (the Linux command) on node A in DC1, you will notice a connection in the ESTABLISHED state with node B in DC2, but when you run netstat on node B, you won't find any connection with node A. Such connections exist across the DCs. Is that a problem?

We haven't set streaming_socket_timeout_in_ms, which I know must be set. But I am not sure whether setting this property has any impact on merkle tree requests. I thought it applies to data streaming when some mismatch is found and data needs to be streamed. Please confirm. What value do you use for the streaming socket timeout?

Moreover, if a socket timeout were the issue, it should happen on other nodes too; repair is not run on just one node, yet the merkle tree request is getting lost and not transmitted to one or more nodes in the remote DC.

I am not sure about the exact distance, but they are connected with a very high speed 10 Gbps link.
When you say different TCP stack tuning, do you have any document/blog/link describing recommendations for a multi-DC Cassandra setup? Can you elaborate on which settings need to be different?

Thanks
Anuj

Sent from Yahoo Mail on Android

From: "Bryan Cheng" <br...@blockcypher.com>
Date: Tue, 17 Nov, 2015 at 5:54 am
Subject: Re: Repair Hangs while requesting Merkle Trees

Hi Anuj,

Did you mean streaming_socket_timeout_in_ms? If not, then you definitely want that set. Even the best network connections will break occasionally, and in Cassandra < 2.1.10 (I believe) this would leave those connections hanging indefinitely on one end.

How far away are your two DCs from a network perspective, out of curiosity? You'll almost certainly be doing different TCP stack tuning for cross-DC, notably your buffer sizes, window params, and Cassandra-specific stuff like otc_coalescing_strategy, inter_dc_tcp_nodelay, etc.

On Sat, Nov 14, 2015 at 10:35 AM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

One more observation: we observed that there are a few TCP connections which a node shows as ESTABLISHED, but when we go to the node at the other end, the connection is not there. They are called "phantom" connections, I guess. Can this be a possible cause?

Thanks
Anuj

Sent from Yahoo Mail on Android

From: "Anuj Wadehra" <anujw_2...@yahoo.co.in>
Date: Sat, 14 Nov, 2015 at 11:59 pm
Subject: Re: Repair Hangs while requesting Merkle Trees

Thanks Daemeon !!

I will capture the output of netstat and share it in the next few days. We were thinking of taking tcpdumps also. If it's a network issue and increasing the request timeout worked, I'm not sure how Cassandra is dropping messages based on a timeout; repair messages are non-droppable and not supposed to be timed out. 2 of the 3 nodes in the DC are able to complete repair without any issue. Just one node is problematic.

I also observed frequent messages in the logs of other nodes saying that hint replays timed out, and the node where hints were being replayed is always a remote DC node. Is that related somehow?

Thanks
Anuj

Sent from Yahoo Mail on Android

From: "daemeon reiydelle" <daeme...@gmail.com>
Date: Thu, 12 Nov, 2015 at 10:34 am
Subject: Re: Repair Hangs while requesting Merkle Trees

Have you checked the network statistics on that machine (netstat -tas) while attempting to repair? If netstat shows ANY issues you have a problem. Can you put the command in a loop running every 60 seconds for maybe 15 minutes and post back?

Out of curiosity, how many remote DC nodes are getting successfully repaired?

.......
“Life should not be a journey to the grave with the intention of arriving safely in a pretty and well preserved body, but rather to skid in broadside in a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming ‘Wow! What a Ride!’” - Hunter Thompson

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Wed, Nov 11, 2015 at 1:06 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Hi,

We are using 2.0.14. We have 2 DCs at remote locations with 10 Gbps connectivity. We are able to complete repair (-par -pr) on 5 nodes. On only one node in DC2 are we unable to complete repair, as it always hangs. The node sends Merkle Tree requests, but one or more nodes in DC1 (remote) never show that they sent the merkle tree reply to the requesting node. Repair hangs indefinitely. After increasing request_timeout_in_ms on the affected node, we were able to successfully run repair on one of the two occasions.

Any comments on why this is happening on just one node?
In OutboundTcpConnection.java, the isTimeOut method always returns false for a non-droppable verb such as the Merkle Tree Request (verb=REPAIR_MESSAGE), so why did increasing the request timeout solve the problem on one occasion?

Thanks
Anuj Wadehra
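For reference, the configuration recommended by the DataStax multi-network page linked earlier in the thread keeps the two interfaces separated roughly as in the sketch below; the addresses are placeholders, and the exact snitch and seed layout should be taken from that page rather than from this sketch:

# cassandra.yaml (illustrative excerpt, not the cluster's real values)
listen_address: <private IP of this node>      # used for traffic on the local network
broadcast_address: <public IP of this node>    # advertised to nodes in the other DC
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "<public IP of seed 1>,<public IP of seed 2>"   # same public-IP list on every node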
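Similarly, the half-open ("phantom") connections discussed in the thread can be confirmed by listing the inter-node sockets on both ends and comparing; this assumes the default storage_port 7000 and uses the thread's masked addresses as placeholders:

# on the node running repair (10.X.15.115): sockets towards the remote node
netstat -tn | grep ':7000' | grep '10.X.14.115'
# on the remote node (10.X.14.115): sockets back towards the repairing node
netstat -tn | grep ':7000' | grep '10.X.15.115'
# a connection shown as ESTABLISHED on one side but absent on the other is half-open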
netstat of the node where repair was run (10.X.15.115)
-----------------------------------
[root@10.X.15.115 ~]# /usr/share/cassandra/bin/nodetool -h localhost -u vs -pw acce55mtnSik@ -p 7199 compactionstats
pending tasks: 0
Active compaction remaining time : n/a
[root@10.X.15.115 ~]# netstat -as -t
IcmpMsg:
    InType0: 6
    InType3: 921142
    InType8: 2548
    InType11: 4
    InType13: 2
    InType17: 3
    OutType0: 2547
    OutType3: 930601
    OutType8: 89
    OutType14: 1
Tcp:
    10277983 active connections openings
    314928 passive connection openings
    9964031 failed connection attempts
    352 connection resets received
    145 connections established
    805764248 segments received
    795374523 segments send out
    7801 segments retransmited
    10 bad segments received.
    9979214 resets sent
UdpLite:
TcpExt:
    42 invalid SYN cookies received
    150 resets received for embryonic SYN_RECV sockets
    71 packets pruned from receive queue because of socket buffer overrun
    12 ICMP packets dropped because they were out-of-window
    74 ICMP packets dropped because socket was locked
    314552 TCP sockets finished time wait in fast timer
    15 packets rejects in established connections because of timestamp
    579239 delayed acks sent
    317 delayed acks further delayed because of locked socket
    Quick ack mode was activated 330 times
    269778292 packets directly queued to recvmsg prequeue.
    586790295 packets directly received from backlog
    6119347587 packets directly received from prequeue
    326548987 packets header predicted
    2937344 packets header predicted and directly queued to user
    4598778 acknowledgments not containing data received
    416769415 predicted acknowledgments
    130 times recovered from packet loss due to SACK data
    Detected reordering 24 times using SACK
    Detected reordering 24 times using time stamp
    74 congestion windows fully recovered
    286 congestion windows partially recovered using Hoe heuristic
    270 congestion windows recovered after partial ack
    41 TCP data loss events
    58 timeouts after SACK recovery
    209 fast retransmits
    58 forward retransmits
    2 retransmits in slow start
    1971 other TCP timeouts
    1 sack retransmits failed
    24 times receiver scheduled too late for direct processing
    9377 packets collapsed in receive queue due to low socket buffer
    387 DSACKs sent for old packets
    444 DSACKs received
    1 DSACKs for out of order packets received
    4 connections reset due to unexpected data
    160 connections reset due to early user close
    1267 connections aborted due to timeout
    TCPDSACKIgnoredNoUndo: 147
    TCPSackShifted: 2149
    TCPSackMerged: 4165
    TCPSackShiftFallback: 33581
    TCPChallengeACK: 2
    TCPSYNChallenge: 2
IpExt:
    InMcastPkts: 3249403
    OutMcastPkts: 656933
    InBcastPkts: 19
    InOctets: 223747810430
    OutOctets: 178956706487
    InMcastOctets: 310302917
    OutMcastOctets: 75420949
    InBcastOctets: 1334

netstat of the remote node 10.X.14.115 which didn't get the Merkle Tree Request
-------------------------------------------------------------------
# netstat -as -t
IcmpMsg:
    InType0: 84
    InType3: 482823
    InType8: 41478
    InType13: 1
    InType17: 3
    OutType0: 978
    OutType3: 489300
    OutType8: 154
    OutType14: 1
Tcp:
    5051110 active connections openings
    1767157 passive connection openings
    4889003 failed connection attempts
    1613465 connection resets received
    173 connections established
    722191123 segments received
    701374596 segments send out
    76959 segments retransmited
    0 bad segments received.
    4889033 resets sent
UdpLite:
TcpExt:
    231 invalid SYN cookies received
    158 resets received for embryonic SYN_RECV sockets
    28 packets pruned from receive queue because of socket buffer overrun
    12 ICMP packets dropped because they were out-of-window
    152472 TCP sockets finished time wait in fast timer
    1663029 delayed acks sent
    1012 delayed acks further delayed because of locked socket
    Quick ack mode was activated 369 times
    318763619 packets directly queued to recvmsg prequeue.
    1208873334 packets directly received from backlog
    24533188191 packets directly received from prequeue
    354059635 packets header predicted
    42895337 packets header predicted and directly queued to user
    9163142 acknowledgments not containing data received
    318352481 predicted acknowledgments
    2367 times recovered from packet loss due to SACK data
    TCPDSACKUndo: 56
    3 congestion windows recovered after partial ack
    46372 TCP data loss events
    TCPLostRetransmit: 494
    2 timeouts after SACK recovery
    24614 fast retransmits
    3367 forward retransmits
    552 retransmits in slow start
    22968 other TCP timeouts
    70 sack retransmits failed
    4 times receiver scheduled too late for direct processing
    2203 packets collapsed in receive queue due to low socket buffer
    371 DSACKs sent for old packets
    1303 DSACKs received
    3 connections reset due to unexpected data
    10221 connections reset due to early user close
    1281 connections aborted due to timeout
    TCPSACKDiscard: 418
    TCPDSACKIgnoredOld: 446
    TCPDSACKIgnoredNoUndo: 453
    TCPSpuriousRTOs: 7
    TCPSackShifted: 81212
    TCPSackMerged: 67868
    TCPSackShiftFallback: 15704
    TCPChallengeACK: 1
IpExt:
    InMcastPkts: 2117185
    OutMcastPkts: 808732
    InBcastPkts: 40500
    InOctets: 146930455537
    OutOctets: 219454781573
    InMcastOctets: 215958192
    OutMcastOctets: 91377640
    InBcastOctets: 3402000
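Finally, regarding Paulo's suggestion to enable TRACE on the TCP connection classes: assuming the stock log4j setup that ships with Cassandra 2.0, entries along these lines in conf/log4j-server.properties (on both the repairing node and the remote node) should produce the per-connection logging needed to compare healthy nodes with the faulty one:

# conf/log4j-server.properties (illustrative; add alongside the existing logger entries)
log4j.logger.org.apache.cassandra.net.OutboundTcpConnection=TRACE
log4j.logger.org.apache.cassandra.net.IncomingTcpConnection=TRACE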