Almost 15 minutes, that sounds suspiciously like blocking on a default TCP 
socket timeout.
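On Linux this is usually the kernel's retransmission budget rather than anything in cassandra.yaml: net.ipv4.tcp_retries2 defaults to 15, which works out to very roughly 15-30 minutes of backoff before a connection to a peer that silently went away is reported as timed out. A quick check you could run on the affected nodes (assuming they run Linux; the tuned value at the end is only an illustration, not a recommendation):

  sysctl net.ipv4.tcp_retries2        # default 15 -> roughly 15-30 min until "Connection timed out"
  sysctl net.ipv4.tcp_keepalive_time  # idle time before keepalive probes start (default 7200s)
  # to shorten the retry budget you could experiment with something like:
  # sudo sysctl -w net.ipv4.tcp_retries2=5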

From: Rahul Reddy <rahulreddy1...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, November 6, 2019 at 12:12 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Aws instance stop and star with ebs

Thank you.
I have stopped the instance in east. I see that all other instances can gossip to that instance, and only one instance in west is having issues gossiping to that node. When I enable debug mode I see the below on the west node.

I see the below messages from 16:32 to 16:47:

DEBUG [RMI TCP Connection(272)-127.0.0.1] 2019-11-06 16:44:50,417 StorageProxy.java:2361 - Hosts not in agreement. Didn't get a response from everybody:
424 StorageProxy.java:2361 - Hosts not in agreement. Didn't get a response from everybody:

Later I see a timeout:

DEBUG [MessagingService-Outgoing-/eastip-Gossip] 2019-11-06 16:47:04,831 OutboundTcpConnection.java:350 - Error writing to /eastip
java.io.IOException: Connection timed out

Then:

INFO  [GossipStage:1] 2019-11-06 16:47:05,792 StorageService.java:2289 - Node /eastip state jump to NORMAL

DEBUG [GossipStage:1] 2019-11-06 16:47:06,244 MigrationManager.java:99 - Not pulling schema from /eastip, because schema versions match: local/real=cdbb639b-1675-31b3-8a0d-84aca18e86bf, local/compatible=49bf1daa-d585-38e0-a72b-b36ce82da9cb, remote=cdbb639b-1675-31b3-8a0d-84aca18e86bf

I tried running some tcpdump during that time and I don't see any packet loss. I'm still unsure why the east instance that was stopped and started was unreachable from the west node for almost 15 minutes.
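Something I can try the next time I reproduce it is to look on the west node for a stale established connection to the east IP and watch its retransmission timer, e.g. with ss (assuming 7000 is our storage/gossip port):

  ss -tno state established '( dport = :7000 or sport = :7000 )'
  # the timer:(on,...) field shows a stuck connection backing off its retransmits
  # until the kernel finally reports "Connection timed out"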


On Tue, Nov 5, 2019 at 10:14 PM daemeon reiydelle <daeme...@gmail.com> wrote:
10 minutes is 600 seconds, and there are several timeouts that are set to that, 
including the data center timeout as I recall.

You may be forced to tcpdump the interface(s) to see where the chatter is. Out 
of curiosity, when you restart the node, have you snapped the jvm's memory to 
see if e.g. heap is even in use?
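If you do capture, filtering to the internode ports keeps the dump manageable; a rough sketch, assuming the default storage ports 7000/7001, interface eth0, and the peer's address filled in for <peer-ip>:

  sudo tcpdump -i eth0 -nn 'host <peer-ip> and (port 7000 or port 7001)' -w gossip.pcap
  # for the JVM question, something like "jstat -gcutil <cassandra-pid> 1000"
  # would show whether heap/GC are doing anything after the restart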


On Tue, Nov 5, 2019 at 7:03 PM Rahul Reddy <rahulreddy1...@gmail.com> wrote:
Thanks Ben,
Before stopping the EC2 instance I did run nodetool drain, so I ruled that out, and system.log also doesn't show commit logs being applied.




On Tue, Nov 5, 2019, 7:51 PM Ben Slater <ben.sla...@instaclustr.com> wrote:
The logs between first start and handshaking should give you a clue but my 
first guess would be replaying commit logs.
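A quick way to confirm or rule that out would be to grep system.log for replay activity between startup and the node going NORMAL; a sketch, assuming the default log location (exact message wording varies by version):

  grep -iE 'commitlog|replay' /var/log/cassandra/system.log
  # commit log replay shows up as "Replaying ..." / "Log replay complete" style messages;
  # if there are none, replay probably isn't what is eating the time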

Cheers
Ben

---

Ben Slater
Chief Product Officer

Read our latest technical blog posts here: https://www.instaclustr.com/blog/



On Wed, 6 Nov 2019 at 04:36, Rahul Reddy <rahulreddy1...@gmail.com> wrote:
I can reproduce the issue.

I did drain the Cassandra node, then stopped and started the instance. The Cassandra instance comes up, but other nodes stay in DN state for around 10 minutes.

I don't see any errors in the system.log.

DN  xx.xx.xx.59   420.85 MiB  256          48.2%             id  2
UN  xx.xx.xx.30   432.14 MiB  256          50.0%             id  0
UN  xx.xx.xx.79   447.33 MiB  256          51.1%             id  4
DN  xx.xx.xx.144  452.59 MiB  256          51.6%             id  1
DN  xx.xx.xx.19   431.7 MiB  256          50.1%             id  5
UN  xx.xx.xx.6    421.79 MiB  256          48.9%

When I do nodetool status, 3 nodes are still showing down, and I don't see errors in system.log.

And after 10 minutes it shows the other node is up as well.


INFO  [HANDSHAKE-/10.72.100.156] 2019-11-05 15:05:09,133 OutboundTcpConnection.java:561 - Handshaking version with /stopandstarted node
INFO  [RequestResponseStage-7] 2019-11-05 15:16:27,166 Gossiper.java:1019 - InetAddress /nodewhichitwasshowing down is now UP

What is causing the 10-minute delay before it can see that the node is reachable?
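In case it helps narrow it down, while the restarted node is still showing DN I can collect the gossip view from one of the peers, e.g. (both are standard nodetool subcommands; <restarted-ip> is a placeholder):

  nodetool gossipinfo | grep -A5 '<restarted-ip>'   # generation/heartbeat the peer currently has for it
  nodetool describecluster                          # schema versions and any unreachable endpoints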

On Wed, Oct 30, 2019, 8:37 AM Rahul Reddy <rahulreddy1...@gmail.com> wrote:
Also, an AWS EC2 stop and start comes back as a new instance with the same IP, and all our file systems are on EBS and mounted fine. Does coming back as a new instance with the same IP cause any gossip issues?

On Tue, Oct 29, 2019, 6:16 PM Rahul Reddy <rahulreddy1...@gmail.com> wrote:
Thanks Alex. We have 6 nodes in each DC with RF=3 and CL LOCAL_QUORUM, and we stopped and started only one instance at a time. Though nodetool status says all nodes are UN and system.log says Cassandra started and is listening, the JMX exporter shows the instance stayed down longer. How do we determine what caused Cassandra to be unavailable even though the log says it is started and listening?
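Would checking the transports directly on the restarted node be the right way to separate "process started" from "actually serving clients"? For example (standard nodetool subcommands):

  nodetool info | grep -E 'Gossip active|Native Transport active'
  nodetool statusbinary   # whether the CQL native transport is accepting clients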

On Tue, Oct 29, 2019, 4:44 PM Oleksandr Shulgin <oleksandr.shul...@zalando.de> wrote:
On Tue, Oct 29, 2019 at 9:34 PM Rahul Reddy <rahulreddy1...@gmail.com> wrote:

We have our infrastructure on AWS and we use EBS storage, and AWS was retiring one of the nodes. Since our storage was persistent, we did nodetool drain and then stopped and started the instance. This caused 500 errors in the service. We have LOCAL_QUORUM and RF=3; why does stopping one instance cause the application to have issues?

Can you still look up what the underlying error from the Cassandra driver was in the application logs? Was it a request timeout or not enough replicas?

For example, if you only had 3 Cassandra nodes, restarting one of them reduces 
your cluster capacity by 33% temporarily.

Cheers,
--
Alex
