Re: Soon After Starting c* Process: CPU 100% for java Process

2021-07-02 Thread Bowen Song



On 01/07/2021 23:41, Elliott Sims wrote:
Also for narrowing down performance issues, I've had good luck with 
the "ttop" module of Swiss Java Knife and with the async-profiler 
tool: https://github.com/jvm-profiling-tools/async-profiler 




I like SJK's ttop too; it's a very useful tool indeed. However, I have
to admit that very often I find the "SharedPool-Worker-##" threads use
the most CPU or allocate the most heap memory, and that makes the
troubleshooting more complicated, because those workers do a wide range
of things. Sometimes I really wish Cassandra used separate thread pools
for those.
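
When ttop does point at a hot shared-pool worker, the quickest next step is
usually a thread dump, so you can at least see which code path that worker is
stuck in. A plain jstack <pid> filtered for "SharedPool-Worker" does the job;
the sketch below does the same thing over JMX (a minimal sketch, assuming
Cassandra's default localhost:7199 JMX endpoint with no authentication;
adjust host, port and credentials for your setup):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class SharedPoolStacks {
    public static void main(String[] args) throws Exception {
        // Cassandra exposes JMX on localhost:7199 by default; adjust if yours differs.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            // Dump every thread's stack and keep only the shared-pool workers,
            // so we can see what work each of them is doing right now.
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                if (info.getThreadName().startsWith("SharedPool-Worker")) {
                    System.out.println(info.getThreadName() + " (" + info.getThreadState() + ")");
                    for (StackTraceElement frame : info.getStackTrace()) {
                        System.out.println("    at " + frame);
                    }
                }
            }
        }
    }
}

Taking a few of those dumps a couple of seconds apart while the CPU is pegged
usually makes it obvious whether the workers are busy with reads, mutations or
something else entirely.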



Anyway, the general advice for anyone running a Cassandra 2.x cluster
and having issues with it is to upgrade. There are many improvements and
bugfixes in the newer versions, and even if the upgrade itself doesn't fix
the issues you have, it will at least make troubleshooting easier.




Re: Soon After Starting c* Process: CPU 100% for java Process

2021-07-01 Thread Kane Wilson
>
> Eventually, the shared pool worker crashes
> --
> b-3223-big-Data.db
> WARN  [SharedPool-Worker-55] 2021-06-30 19:55:41,677
> AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread
> Thread[SharedPool-Worker-55,5,main]: {}
> java.lang.IllegalArgumentException: Not enough bytes. Offset: 39289.
> Length: 20585. Buffer size: 57984
> at
> org.apache.cassandra.db.composites.AbstractCType.checkRemaining(AbstractCType.java:362)
> ~[apache-cassandra-2.2.8.jar:2.2.8]
>

That looks like you're hitting a bug. I feel like maybe it's one I've seen
mentioned before. You could probably find something if you searched the C*
JIRA for the error; however, the advice will undoubtedly be that you need to
upgrade. I'd consider taking Elliott's advice and upgrading to the latest
3.11.x.
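
For what it's worth, the exception itself is a simple bounds check failing
while a serialized cell name is read back: 39289 + 20585 = 59874, which runs
past the end of the 57984-byte buffer, so the reader believes a cell is longer
than the data it actually has. A rough sketch of the shape of that check, using
the numbers from your log (illustrative only, not the exact Cassandra source):

public class NotEnoughBytesDemo {
    // Same shape as the failing check: refuse to read past the end of the
    // buffer. The real AbstractCType.checkRemaining operates on a ByteBuffer;
    // this version takes the raw sizes so the arithmetic is easy to see.
    static void checkRemaining(int bufferSize, int offset, int length) {
        if (offset + length > bufferSize) {
            throw new IllegalArgumentException(String.format(
                    "Not enough bytes. Offset: %d. Length: %d. Buffer size: %d",
                    offset, length, bufferSize));
        }
    }

    public static void main(String[] args) {
        // Values from the warning: 39289 + 20585 = 59874 > 57984, so it throws.
        checkRemaining(57984, 39289, 20585);
    }
}

A length/offset mismatch like that usually means a deserialization bug or a
corrupt/mis-read sstable component rather than anything wrong with the query,
which is another reason the JIRA trail will almost certainly end at "upgrade".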

-- 
raft.so - Cassandra consulting, support, and managed services


Re: Soon After Starting c* Process: CPU 100% for java Process

2021-07-01 Thread Elliott Sims
As more general advice, I'd strongly encourage you to update to 3.11.x from
2.2.8.  My personal experience is that it's significantly faster and more
space-efficient, and the garbage collection behavior under pressure is
drastically better.  There's also improved tooling for diagnosing
performance issues.

Also for narrowing down performance issues, I've had good luck with the
"ttop" module of Swiss Java Knife and with the async-profiler tool:
https://github.com/jvm-profiling-tools/async-profiler

On Thu, Jul 1, 2021 at 5:42 AM Fred Habash  wrote:

> Great. Thanks.
>
> I examined the debug logs, and from the time c* starts until it crashes
> (roughly a 30-minute window) the log keeps spitting out the exact same messages.
>
> Google research leads to no clear answers as to ...
>
> - What does 'evicting cold readers' mean?
> - Why is it evicting the same sstable throughout, i.e. from start to crash?
> - What does it mean for the AbstractQueryPager to say 'remaining rows to
> page: 2147483646'?
> - How is all the above related?
>
> My Google research revealed a similar inquiry where the response alluded to
> a race condition and a recommendation to upgrade to c* 3:
>
> Depending on the actual version, you may be running into a race condition
> or NPE that's not allowing the files to close properly. Try upgrading to
> the latest version of 3.x.
>
> In another hit it referenced a page callback.
>
> *page callback that does not have an executor assigned to it*
>
>
> This message repeats from start to crash
> ---
> DEBUG [SharedPool-Worker-21] 2021-06-30 12:34:51,542
> AbstractQueryPager.java:95 - Fetched 1 live rows
> DEBUG [SharedPool-Worker-21] 2021-06-30 12:34:51,542
> AbstractQueryPager.java:112 - Got result (1) smaller than page size (5000),
> considering pager exhausted
> DEBUG [SharedPool-Worker-21] 2021-06-30 12:34:51,543
> AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
> DEBUG [SharedPool-Worker-8] 2021-06-30 12:34:51,543
> AbstractQueryPager.java:95 - Fetched 1 live rows
> DEBUG [SharedPool-Worker-8] 2021-06-30 12:34:51,543
> AbstractQueryPager.java:112 - Got result (1) smaller than page size (5000),
> considering pager exhausted
> DEBUG [SharedPool-Worker-8] 2021-06-30 12:34:51,543
> AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
> DEBUG [SharedPool-Worker-2] 2021-06-30 12:34:51,543
> FileCacheService.java:102 - Evicting cold readers for
> /data/cassandra//Y-cf0c43b028e811e68f2b1b695a8d5b2c/
>
> Eventually, the shared pool worker crashes
> --
> b-3223-big-Data.db
> WARN  [SharedPool-Worker-55] 2021-06-30 19:55:41,677
> AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread
> Thread[SharedPool-Worker-55,5,main]: {}
> java.lang.IllegalArgumentException: Not enough bytes. Offset: 39289.
> Length: 20585. Buffer size: 57984
> at
> org.apache.cassandra.db.composites.AbstractCType.checkRemaining(AbstractCType.java:362)
> ~[apache-cassandra-2.2.8.jar:2.2.8]
>
> On Wed, Jun 30, 2021 at 7:53 PM Kane Wilson  wrote:
>
>> Looks like it's doing a lot of reads immediately on startup
>> (AbstractQueryPager) which is potentially causing a lot of GC (guessing
>> that's what caused the StatusLogger).
>>
>> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
>> AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
>>
>> is quite suspicious. You'll want to find out what query is causing a
>> massive scan at startup; you'll probably need to have a look through the
>> start of the logs to get a better idea of what's happening.
>>
>> On Thu, Jul 1, 2021 at 5:14 AM Fred Habash  wrote:
>>
>>> I have a node in the cluster where, when I start c*, the CPU reaches 100%
>>> with the java process on top. Within a few minutes, JVM instability
>>> messages appear in system.log and c* crashes.
>>>
>>> Once c* is up, cluster average read latency reaches multi-seconds and
>>> client apps are unhappy. For now, the only way out is to drain the node and
>>> let the cluster latency settle.
>>>
>>> None of these measures helped ...
>>> 1. Rebooting the ec2
>>> 2. Replacing the ec2 altogether (new ec2/ new c* install/ etc).
>>> 3. Stopping compactions (as a diagnostic measure)
>>> I'm trying to understand why the java process is chewing up so much CPU,
>>> i.e. what is actually happening ...
>>>
>>> I see these error messages in the debug.log. What functional task do
>>> these messages relate to e.g. compactions?
>>>
>>>
>>> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
>>> AbstractQueryPager.java:95 - Fetched 1 live rows
>>> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
>>> AbstractQueryPager.java:112 - Got result (1) smaller than page size (5000),
>>> considering pager exhausted
>>> INFO  [Service Thread] 2021-06-30 13:39:04,766 StatusLogger.java:56 -
>>> MemtablePostFlush 0 0 29 0
>>> 0
>>>
>>> DEBUG [SharedPool-Work

Re: Soon After Starting c* Process: CPU 100% for java Process

2021-07-01 Thread Fred Habash
Great. Thanks.

I examined the debug logs, and from the time c* starts until it crashes
(roughly a 30-minute window) the log keeps spitting out the exact same messages.

Google research leads to no clear answers as to ...

- What does 'evicting cold readers' mean?
- Why is it evicting the same sstable throughout, i.e. from start to crash?
- What does it mean for the AbstractQueryPager to say 'remaining rows to
page: 2147483646'?
- How is all the above related?

My Google research revealed a similar inquiry where the response alluded to
a race condition and a recommendation to upgrade to c* 3:

Depending on the actual version, you may be running into a race condition
or NPE that's not allowing the files to close properly. Try upgrading to
the latest version of 3.x.

In another hit it referenced a page callback.

*page callback that does not have an executor assigned to it*


This message repeats from start to crash
---
DEBUG [SharedPool-Worker-21] 2021-06-30 12:34:51,542
AbstractQueryPager.java:95 - Fetched 1 live rows
DEBUG [SharedPool-Worker-21] 2021-06-30 12:34:51,542
AbstractQueryPager.java:112 - Got result (1) smaller than page size (5000),
considering pager exhausted
DEBUG [SharedPool-Worker-21] 2021-06-30 12:34:51,543
AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
DEBUG [SharedPool-Worker-8] 2021-06-30 12:34:51,543
AbstractQueryPager.java:95 - Fetched 1 live rows
DEBUG [SharedPool-Worker-8] 2021-06-30 12:34:51,543
AbstractQueryPager.java:112 - Got result (1) smaller than page size (5000),
considering pager exhausted
DEBUG [SharedPool-Worker-8] 2021-06-30 12:34:51,543
AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
DEBUG [SharedPool-Worker-2] 2021-06-30 12:34:51,543
FileCacheService.java:102 - Evicting cold readers for
/data/cassandra//Y-cf0c43b028e811e68f2b1b695a8d5b2c/

Eventually, the shared pool worker crashes
--
b-3223-big-Data.db
WARN  [SharedPool-Worker-55] 2021-06-30 19:55:41,677
AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread
Thread[SharedPool-Worker-55,5,main]: {}
java.lang.IllegalArgumentException: Not enough bytes. Offset: 39289.
Length: 20585. Buffer size: 57984
at
org.apache.cassandra.db.composites.AbstractCType.checkRemaining(AbstractCType.java:362)
~[apache-cassandra-2.2.8.jar:2.2.8]

On Wed, Jun 30, 2021 at 7:53 PM Kane Wilson  wrote:

> Looks like it's doing a lot of reads immediately on startup
> (AbstractQueryPager) which is potentially causing a lot of GC (guessing
> that's what caused the StatusLogger).
>
> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
> AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
>
> is quite suspicious. You'll want to find out what query is causing a
> massive scan at startup; you'll probably need to have a look through the
> start of the logs to get a better idea of what's happening.
>
> On Thu, Jul 1, 2021 at 5:14 AM Fred Habash  wrote:
>
>> I have a node in the cluster where, when I start c*, the CPU reaches 100%
>> with the java process on top. Within a few minutes, JVM instability
>> messages appear in system.log and c* crashes.
>>
>> Once c* is up, cluster average read latency reaches multi-seconds and
>> client apps are unhappy. For now, the only way out is to drain the node and
>> let the cluster latency settle.
>>
>> None of these measures helped ...
>> 1. Rebooting the ec2
>> 2. Replacing the ec2 altogether (new ec2/ new c* install/ etc).
>> 3. Stopping compactions (as a diagnostic measure)
>> I'm trying to understand why the java process is chewing up so much CPU,
>> i.e. what is actually happening ...
>>
>> I see these error messages in the debug.log. What functional task do
>> these messages relate to e.g. compactions?
>>
>>
>> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
>> AbstractQueryPager.java:95 - Fetched 1 live rows
>> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
>> AbstractQueryPager.java:112 - Got result (1) smaller than page size (5000),
>> considering pager exhausted
>> INFO  [Service Thread] 2021-06-30 13:39:04,766 StatusLogger.java:56 -
>> MemtablePostFlush 0 0 29 0
>> 0
>>
>> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
>> AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
>> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
>> SliceQueryPager.java:92 - Querying next page of slice query; new filter:
>> SliceQueryFilter [reversed=false, slices=[[, ]], count=5000, toGroup = 0]
>> INFO  [Service Thread] 2021-06-30 13:39:04,766 StatusLogger.java:56 -
>> ValidationExecutor0 0  0 0
>> 0
>> INFO  [Service Thread] 2021-06-30 13:39:04,766 StatusLogger.java:56 -
>> Sampler   0 0  0 0
>> 0
>> INFO  [Service Thread] 2021-06-30 13:39:04,767 StatusLogger.ja

Re: Soon After Starting c* Process: CPU 100% for java Process

2021-06-30 Thread Kane Wilson
Looks like it's doing a lot of reads immediately on startup
(AbstractQueryPager) which is potentially causing a lot of GC (guessing
that's what caused the StatusLogger).

DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
AbstractQueryPager.java:133 - Remaining rows to page: 2147483646

is quite suspicious. You'll want to find out what query is causing a
massive scan at startup; you'll probably need to have a look through the
start of the logs to get a better idea of what's happening.
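
One more data point: 2147483646 is Integer.MAX_VALUE - 1, so (if I'm reading
the pager right) it looks like the query was built with an effectively
unlimited row limit and has only counted down by the single live row it
fetched, which would fit something issuing an unbounded range/slice read at
startup. A trivial check of the arithmetic:

public class PagerLimitCheck {
    public static void main(String[] args) {
        // The "remaining rows to page" value from the debug log above.
        int remaining = 2147483646;
        System.out.println(Integer.MAX_VALUE);                  // 2147483647
        System.out.println(remaining == Integer.MAX_VALUE - 1); // true
    }
}

If it's hard to spot in the logs, sampling query traces (nodetool
settraceprobability with a small value, then reading system_traces) might help
pin down which client or table is issuing that read.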

On Thu, Jul 1, 2021 at 5:14 AM Fred Habash  wrote:

> I have a node in the cluster where, when I start c*, the CPU reaches 100%
> with the java process on top. Within a few minutes, JVM instability
> messages appear in system.log and c* crashes.
>
> Once c* is up, cluster average read latency reaches multi-seconds and
> client apps are unhappy. For now, the only way out is to drain the node and
> let the cluster latency settle.
>
> None of these measures helped ...
> 1. Rebooting the ec2
> 2. Replacing the ec2 altogether (new ec2/ new c* install/ etc).
> 3. Stopping compactions (as a diagnostic measure)
> I'm trying to understand why the java process is chewing up so much CPU, i.e.
> what is actually happening ...
>
> I see these error messages in the debug.log. What functional task do these
> messages relate to e.g. compactions?
>
>
> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
> AbstractQueryPager.java:95 - Fetched 1 live rows
> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
> AbstractQueryPager.java:112 - Got result (1) smaller than page size (5000),
> considering pager exhausted
> INFO  [Service Thread] 2021-06-30 13:39:04,766 StatusLogger.java:56 -
> MemtablePostFlush 0 0 29 0
> 0
>
> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
> AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
> SliceQueryPager.java:92 - Querying next page of slice query; new filter:
> SliceQueryFilter [reversed=false, slices=[[, ]], count=5000, toGroup = 0]
> INFO  [Service Thread] 2021-06-30 13:39:04,766 StatusLogger.java:56 -
> ValidationExecutor0 0  0 0
> 0
> INFO  [Service Thread] 2021-06-30 13:39:04,766 StatusLogger.java:56 -
> Sampler   0 0  0 0
> 0
> INFO  [Service Thread] 2021-06-30 13:39:04,767 StatusLogger.java:56 -
> MemtableFlushWriter   0 0  6 0
> 0
> INFO  [Service Thread] 2021-06-30 13:39:04,767 StatusLogger.java:56 -
> InternalResponseStage 0 0  4 0
> 0
> DEBUG [SharedPool-Worker-131] 2021-06-30 13:39:05,078
> StorageProxy.java:1467 - Read timeout; received 1 of 2 responses (only
> digests)
> DEBUG [SharedPool-Worker-131] 2021-06-30 13:39:05,079
> SliceQueryPager.java:92 - Querying next page of slice query; new filter:
> SliceQueryFilter [reversed=false, slices=[[, ]], count=5000, toGroup = 0]
> DEBUG [SharedPool-Worker-158] 2021-06-30 13:39:05,079
> StorageProxy.java:1467 - Read timeout; received 1 of 2 responses (only
> digests)
> DEBUG [SharedPool-Worker-158] 2021-06-30 13:39:05,079
> SliceQueryPager.java:92 - Querying next page of slice query; new filter:
> SliceQueryFilter [reversed=false, slices=[[, ]], count=5000, toGroup = 0]
> DEBUG [SharedPool-Worker-90] 2021-06-30 13:39:05,080 StorageProxy.jav
> 
> DEBUG [SharedPool-Worker-26] 2021-06-30 13:39:01,842
> FileCacheService.java:102 - Evicting cold readers for
> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-5069-big-Data.db
> DEBUG [SharedPool-Worker-12] 2021-06-30 13:39:01,847
> FileCacheService.java:102 - Evicting cold readers for
> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-5592-big-Data.db
> DEBUG [SharedPool-Worker-5] 2021-06-30 13:39:01,849
> FileCacheService.java:102 - Evicting cold readers for
> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-3993-big-Data.db
> DEBUG [SharedPool-Worker-5] 2021-06-30 13:39:01,849
> FileCacheService.java:102 - Evicting cold readers for
> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-5927-big-Data.db
> DEBUG [SharedPool-Worker-5] 2021-06-30 13:39:01,849
> FileCacheService.java:102 - Evicting cold readers for
> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-1276-big-Data.db
> DEBUG [SharedPool-Worker-12] 2021-06-30 13:39:01,854
> FileCacheService.java:102 - Evicting cold readers for
> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-5949-big-Data.db
> DEBUG [SharedPool-Worker-12] 2021-06-30 13:39:01,854
> FileCacheService.java:102 - Evicting cold readers for
> /data/cassandra/mykeyspace/mytable-cf0c43b028e811e68f2b1b695a8d5b2c/lb-865-big-Data.db
> DEBUG [SharedPool-Worker-12] 2021-06-30 13:39:01,854
> FileCacheService.java:102 - Evict