Re: Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

2019-11-01 Thread Sergio
Hi Reid,

Thank you for your extensive response. I don't think we have such a person,
and in any case, even though I am a software engineer I would be curious to
dig into the problem and understand the root cause. The only observation I
have right now is that the same cluster has 2 keyspaces and 3 datacenters.
Only the Cassandra nodes serving one particular datacenter and keyspace have
thousands of TCP connections established, and I see these connections coming
only from some clients.
We have 2 kinds of clients, built with 2 different approaches: one uses Spring
Cassandra Reactive and the other uses the Java Cassandra driver directly,
without any wrapper.
I don't know much about the latter one since I didn't write that code.
One note worth sharing: I asked to add LatencyAwarePolicy to the Java
Cassandra driver configuration, and this tremendously decreased the CPU load
for any new Cassandra node joining the cluster. So I suspect there may be some
driver configuration that is not correct.
I will verify my theory and share the results later, for the interested reader
or to help anyone who runs into the same bizarre behavior.
However, even with thousands of connections open, the load is below 3 on a 4
CPU machine and the latency is good.


Thanks and have a great weekend
Sergio





Re: Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

2019-11-01 Thread Reid Pinchback
Hi Sergio,

I’m definitely not enough of a network wonk to make definitive statements on 
network configuration; finding your in-company network expert is going to be a 
lot more productive.  I’ve forgotten whether you are on-prem or in AWS, so if in 
AWS replace “your network wonk” with “your AWS support contact” if you’re paying 
for support.  I will make two more concrete observations though, and you can run 
these notions down as appropriate.

When C* starts up, see if the logs contain a warning about jemalloc not being 
detected.  That’s something we missed in our 3.11.4 setup and is on my todo 
list to circle back around to evaluate later.  JVMs have some rather 
complicated memory management that relates to efficient allocation of memory to 
threads (this isn’t strictly a JVM thing, but JVMs definitely care).  If you 
have high connection counts, I can see that likely mattering to you.  Also, as 
part of that, the memory arena setting of 4 that is Cassandra’s default may not 
be the right one for you.  The more concurrency you have, the more that number 
may need to bump up to avoid contention on memory allocations.  We haven’t 
played with it because our simultaneous connection counts are modest.  Note 
that Cassandra can create a lot of threads but many of them have low activity 
so I think it’s more about how many are actually active.  Large connection 
counts will move the needle up on you and may motivate tuning the arena count.
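
One quick way to confirm whether jemalloc was actually preloaded, beyond watching
for that startup warning, is to look for it in the process's memory maps on Linux.
A minimal sketch in Java, assuming a Linux host and that the Cassandra PID is
passed as the argument (otherwise it inspects the checker JVM itself):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class JemallocCheck {
    public static void main(String[] args) throws Exception {
        // Pass the Cassandra PID (e.g. from `ps`); defaults to this JVM's own maps.
        String pid = args.length > 0 ? args[0] : "self";
        try (Stream<String> maps = Files.lines(Paths.get("/proc/" + pid + "/maps"))) {
            boolean loaded = maps.anyMatch(line -> line.contains("libjemalloc"));
            System.out.println(loaded ? "jemalloc is mapped into the process"
                                      : "jemalloc not found in the process maps");
        }
    }
}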

When talking to your network person, I’d see what they think about C*’s 
defaults on TCP_NODELAY vs delayed ACKs.  The Datastax docs say that the 
TCP_NODELAY default setting is false in C*, but I looked in the 3.11.4 source 
and the default is coded as true.  It’s only via the config file samples that 
bounce around that it typically gets set to false.  There are times where Nagle 
and delayed ACKs don’t play well together and induce stalls.  I’m not the 
person to help you investigate that because it gets a bit gnarly on the details 
(for example, a refinement to the Nagle algorithm was proposed in the 1990s 
that exists in some OSes and can make my comments here moot).  Somebody who 
lives this stuff will be a more definitive source, but you are welcome to 
copy-paste my thoughts to them for context.
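
On the client side, the corresponding socket knobs are exposed through the
driver's SocketOptions. A minimal sketch, assuming the DataStax Java driver 3.x;
the contact point is a placeholder, and this only affects the driver's own
connections, not the server-side defaults discussed above:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;

public class SocketOptionsExample {
    public static void main(String[] args) {
        SocketOptions socket = new SocketOptions()
                .setTcpNoDelay(true)   // disable Nagle on the driver's connections
                .setKeepAlive(true);   // rely on the OS keepalive settings

        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")   // placeholder contact point
                .withSocketOptions(socket)
                .build();
        // ... build a Session from the cluster, and call cluster.close() on shutdown
    }
}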

R


RE: [EXTERNAL] Re: Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

2019-10-31 Thread Durity, Sean R
There is definitely a resource risk to having thousands of open connections to 
each node. Some of the drivers have (had?) less than optimal default settings, 
like acquiring 50 connections per Cassandra node. This is usually overkill. I 
think 5-10/node is much more reasonable. It depends on your app architecture 
and cluster node count. If there are lots of small micro-services, maybe they 
only need 2 connections per node.
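
For reference, capping the pool is only a few lines in the driver. A minimal
sketch, assuming the DataStax Java driver 3.x; the contact point and the 2/8
connection counts are placeholders rather than a recommendation for any
particular cluster:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

public class PoolingExample {
    public static void main(String[] args) {
        PoolingOptions pooling = new PoolingOptions()
                // core = 2, max = 8 connections per node in the local datacenter
                .setConnectionsPerHost(HostDistance.LOCAL, 2, 8)
                // each connection can multiplex many in-flight requests
                .setMaxRequestsPerConnection(HostDistance.LOCAL, 1024);

        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")   // placeholder contact point
                .withPoolingOptions(pooling)
                .build();
        // ... build a Session from the cluster, and call cluster.close() on shutdown
    }
}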


Sean Durity – Staff Systems Engineer, Cassandra







Re: Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

2019-10-30 Thread Sergio
Hi Reid,

I don't have this load problem anymore.
I solved it by changing the Cassandra driver configuration.
Now my cluster is pretty stable and I don't have machines with crazy CPU load.
The only thing left to investigate, although not urgent, is the number of
ESTABLISHED TCP connections. I see just one node with 7K ESTABLISHED TCP
connections while the others have around 4-6K connections open, so the newest
nodes added to the cluster have a higher number of ESTABLISHED TCP connections.

default['cassandra']['sysctl'] = {
  'net.ipv4.tcp_keepalive_time' => 60,
  'net.ipv4.tcp_keepalive_probes' => 3,
  'net.ipv4.tcp_keepalive_intvl' => 10,
  'net.core.rmem_max' => 16777216,
  'net.core.wmem_max' => 16777216,
  'net.core.rmem_default' => 16777216,
  'net.core.wmem_default' => 16777216,
  'net.core.optmem_max' => 40960,
  'net.ipv4.tcp_rmem' => '4096 87380 16777216',
  'net.ipv4.tcp_wmem' => '4096 65536 16777216',
  'net.ipv4.ip_local_port_range' => '1 65535',
  'net.ipv4.tcp_window_scaling' => 1,
  'net.core.netdev_max_backlog' => 2500,
  'net.core.somaxconn' => 65000,
  'vm.max_map_count' => 1048575,
  'vm.swappiness' => 0
}

These are my tweaked values, based on the settings recommended by Datastax.

Do you have something different?
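
A quick way to double-check that the values above actually took effect on a node
is to read them back from /proc/sys, where each sysctl key maps to a path with
the dots replaced by slashes. A minimal sketch in Java, trimmed to a few of the
keys:

import java.nio.file.Files;
import java.nio.file.Paths;

public class SysctlCheck {
    public static void main(String[] args) throws Exception {
        // A few of the keys from the hash above; extend the list as needed.
        String[] keys = { "net.core.somaxconn", "net.ipv4.tcp_keepalive_time", "vm.swappiness" };
        for (String key : keys) {
            String path = "/proc/sys/" + key.replace('.', '/');
            String value = new String(Files.readAllBytes(Paths.get(path))).trim();
            System.out.println(key + " = " + value);
        }
    }
}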

Best,
Sergio



Re: Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

2019-10-30 Thread Reid Pinchback
Oh nvm, didn't see the later msg about just posting what your fix was.

R






Re: Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

2019-10-30 Thread Reid Pinchback
Hi Sergio,

Assuming nobody is actually mounting a SYN flood attack, this sounds like 
you're either being hammered with connection requests in very short periods of 
time, or your TCP backlog tuning is off.  At least, that's where I'd start 
looking.  If you take that log message and google it ("Possible SYN flooding... 
Sending cookies") you'll find explanations.  Or just google "TCP backlog 
tuning".

R


On 10/30/19, 3:29 PM, "Sergio Bilello"  wrote:

>
>Oct 17 00:23:03 prod-personalization-live-data-cassandra-08 kernel: TCP: 
request_sock_TCP: Possible SYN flooding on port 9042. Sending cookies. Check 
SNMP counters.






Re: Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

2019-10-30 Thread Sergio Bilello
https://docs.datastax.com/en/drivers/java/2.2/com/datastax/driver/core/policies/LatencyAwarePolicy.html
I had to change the policy in the Cassandra driver. I solved this problem a few
weeks ago; I am just posting the solution for anyone who might hit the same
issue.
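
For reference, this is roughly how the policy is wired up with the DataStax Java
driver 3.x builder API. A minimal sketch; the contact point, local DC name, and
exclusion threshold are placeholders, not the values used on this cluster:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.LatencyAwarePolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class LatencyAwareExample {
    public static void main(String[] args) {
        LatencyAwarePolicy policy = LatencyAwarePolicy.builder(
                        new TokenAwarePolicy(
                                DCAwareRoundRobinPolicy.builder()
                                        .withLocalDc("dc1")   // placeholder DC name
                                        .build()))
                // ignore hosts whose latency is more than 2x the fastest host
                .withExclusionThreshold(2.0)
                .build();

        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")   // placeholder contact point
                .withLoadBalancingPolicy(policy)
                .build();
        // ... build a Session from the cluster, and call cluster.close() on shutdown
    }
}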
Best,
Sergio


Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

2019-10-16 Thread Sergio Bilello
Hello guys!

I performed a thread dump 
https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMTAvMTcvLS1kdW1wLnR4dC0tMC0zMC00MA==;
while trying to join the node with

-Dcassandra.join_ring=false

OR
-Dcassandra.join.ring=false

OR

-Djoin.ring=false

because the node spiked in load and latency was affecting the clients.

With or without that flag the node has high latency, and I see the load
skyrocketing when the number of established TCP connections increases.

Analyzing /var/log/messages, I can see:

Oct 17 00:23:39 prod-personalization-live-data-cassandra-08 cassandra: INFO 
[Service Thread] 2019-10-17 00:23:39,030 GCInspector.java:284 - G1 Young 
Generation GC in 255ms. G1 Eden Space: 361758720 -> 0; G1 Old Gen: 1855455944 
-> 1781007048; G1 Survivor Space: 39845888 -> 32505856;

Oct 17 00:23:40 prod-personalization-live-data-cassandra-08 cassandra: INFO 
[ScheduledTasks:1] 2019-10-17 00:23:40,352 NoSpamLogger.java:91 - Some 
operations were slow, details available at debug level (debug.log)


Oct 17 00:23:03 prod-personalization-live-data-cassandra-08 kernel: TCP: 
request_sock_TCP: Possible SYN flooding on port 9042. Sending cookies. Check 
SNMP counters.

I don't see anything in debug.log that looks relevant.

The machine is an i3.xlarge on AWS with 4 CPUs, 32GB RAM, and a 1TB SSD.





[sergio.bilello@prod-personalization-live-data-cassandra-08 ~]$ nodetool tpstats

Pool Name                     Active  Pending  Completed  Blocked  All time blocked
ReadStage                         32       53     559304        0                 0
MiscStage                          0        0          0        0                 0
CompactionExecutor                 1      107        118        0                 0
MutationStage                      0        0       2695        0                 0
MemtableReclaimMemory              0        0         11        0                 0
PendingRangeCalculator             0        0         33        0                 0
GossipStage                        0        0       4314        0                 0
SecondaryIndexManagement           0        0          0        0                 0
HintsDispatcher                    0        0          0        0                 0
RequestResponseStage               0        0     421865        0                 0
Native-Transport-Requests         22        0    1903400        0                 0
ReadRepairStage                    0        0      59078        0                 0
CounterMutationStage               0        0          0        0                 0
MigrationStage                     0        0          0        0                 0
MemtablePostFlush                  0        0         32        0                 0
PerDiskMemtableFlushWriter_0       0        0         11        0                 0
ValidationExecutor                 0        0          0        0                 0
Sampler                            0        0          0        0                 0
MemtableFlushWriter                0        0         11        0                 0
InternalResponseStage              0        0          0        0                 0
ViewMutationStage                  0        0          0        0                 0
AntiEntropyStage                   0        0          0        0                 0
CacheCleanupExecutor               0        0          0        0                 0

Message type      Dropped
READ                    0
RANGE_SLICE             0
_TRACE                  0
HINT                    0
MUTATION                0
COUNTER_MUTATION        0
BATCH_STORE             0
BATCH_REMOVE            0
REQUEST_RESPONSE        0
PAGED_RANGE             0
READ_REPAIR             0

[sergio.bilello@prod-personalization-live-data-cassandra-08 ~]$





top - 01:44:15 up 2 days, 1:45, 4 users, load average: 34.45, 27.71, 15.37
Tasks: 140 total, 1 running, 74 sleeping, 0 stopped, 0 zombie
%Cpu(s): 90.0 us, 4.5 sy, 3.0 ni, 1.1 id, 0.0 wa, 0.0 hi, 1.4 si, 0.0 st
KiB Mem : 31391772 total, 250504 free, 10880364 used, 20260904 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 19341960 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
20712 cassand+  20   0  194.1g  14.4g   4.6g S 392.0 48.2  74:50.48 java
20823 sergio.+  20   0  124856   6304   3136 S   1.7  0.0   0:13.51 htop
 7865 root      20   0 1062684  39880  11428 S   0.7  0.1   4:06.02 ir_agent
 3557 consul    20   0   41568  30192  18832 S   0.3  0.1  13:16.37 consul
 7600 root      20   0 2082700  46624  11880 S   0.3  0.1   4:14.60 ir_agent
    1 root      20   0  193660   7740   5220 S   0.0  0.0   0:56.36 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.08 kthreadd
    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H
    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq
    7 root      20   0       0      0      0 S   0.0  0.0   0:06.04 ksoftirqd/0





[sergio.bilello@prod-personalization-live-data-cassandra-08 ~]$ free
              total     used     free   shared  buff/cache  available
Mem:       31391772 10880916   256732   426552    20254124   19341768
Swap:             0        0        0
[sergio.bilello@prod-personalization-live-data-cassandra-08 ~]$







bash-4.2$ java -jar sjk.jar ttop -p 20712
Monitoring threads ...

2019-10-17T01:45:33.352+ Process summary
process cpu=363.58%
application cpu=261.91% (user=248.65% sys=13.26%)
other: cpu=101.67%
thread count: 474
heap allocation rate 583mb/s
[39] user=13.56% sys=-0.59% alloc= 11mb/s - OptionalTasks:1
[000379] user= 8.57% sys=-0.27% alloc= 18mb/s - ReadStage-19
[000380] user= 7.85% sys= 0.22% alloc= 19mb/s - Native-Transport-Requests-21
[000295] user= 7.14% sys= 0.23% alloc= 14mb/s - Native-Transport-Requests-5
[000378] user= 7.14% sys=-0.03% alloc= 22mb/s - Native-Transport-Requests-17
[000514] user= 6.42% sys= 0.12% alloc= 20mb/s - Native-Transport-Requests-85
[000293] user= 6.66% sys=-0.32% alloc= 12mb/s - Native-Transport-Requests-2
[000392] user= 6.19% sys= 0.14% alloc= 9545kb/s - Native-Transport-Requests-12
[000492] user= 5.71% sys=-0.24% alloc= 15mb/s - Native-Transport-Requests-24
[000294] user= 5.23% sys=-0.25% alloc= 14mb/s - Native-Transport-Requests-3
[000381] user= 5.47% sys=-0.52% alloc= 7430kb/s - Native-Transport-Requests-23
[000672] user= 4.52% sys= 0.25% alloc= 14mb/s - Native-Transport-Requests-270
[000296] user= 5.23% sys=-0.47% alloc= 13mb/s - ReadStage-7
[000673] user= 4.52% sys= 0.05% alloc= 13mb/s - Native-Transport-Requests-269
[000118] user= 4.28% sys=