Re: Cassandra commitlog corruption on hard shutdown

2021-08-03 Thread Leon Zaruvinsky
Following up, I've found that we tend to encounter one of three types of
exceptions/commitlog corruptions:

1. org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
   Mutation checksum failure at ... in CommitLog-5-1531150627243.log
   at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:638)

2. org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
   Could not read commit log descriptor in file CommitLog-5-1550003067433.log
   at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:638)

3. org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
   Encountered bad header at position ... of commit log CommitLog-5-1603991140803.log,
   with invalid CRC. The end of segment marker should be zero.
   at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:647)

I believe exception (2) is mitigated by
https://issues.apache.org/jira/browse/CASSANDRA-11995 and
https://issues.apache.org/jira/browse/CASSANDRA-13918

But it's not clear to me how (1) and (3) can be mitigated.
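
In the meantime, the only mitigation I know of for (1) and (3) is to let
the node start past the bad section and lean on repair to recover whatever
was skipped. Roughly the following, though I'm hedging here: I still need
to verify this flag against the 2.2 code we actually run:

# cassandra-env.sh: skip unreadable commitlog sections during replay
JVM_OPTS="$JVM_OPTS -Dcassandra.commitlog.ignorereplayerrors=true"

The obvious caveat is that any mutations in the skipped sections are lost
until a repair runs, so this is a workaround rather than a fix.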

On Mon, Jul 26, 2021 at 6:40 PM Leon Zaruvinsky wrote:

> Thanks for the links/comments Jeff and Bowen.
>
> We run XFS. I'm not sure we can switch to ZFS, so a different solution
> would be preferred.
>
> I’ll take a look through that patch, and maybe I’ll try to backport and
> replicate.  We’ve seen both cases: commitlogs that are just 0s (empty) and
> commitlogs that had real data in them.
>
> Leon
>
> On Mon, Jul 26, 2021 at 6:38 PM Jeff Jirsa  wrote:
>
>> The commitlog code has changed DRASTICALLY between 2.x and trunk.
>>
>> If it's really a bunch of trailing 0s as was suggested later, then
>> https://issues.apache.org/jira/browse/CASSANDRA-11995 addresses at least
>> one cause/case of that particular bug.
>>
>>
>>
>> On Mon, Jul 26, 2021 at 3:11 PM Leon Zaruvinsky 
>> wrote:
>>
>>> And for completeness, a sample stack trace:
>>>
>>> ERROR [2021-07-21T02:11:01.994Z] 
>>> org.apache.cassandra.db.commitlog.CommitLog: Failed commit log replay. 
>>> Commit disk failure policy is stop_on_startup; terminating thread 
>>> (throwable0_message: Mutation checksum failure at 15167277 in 
>>> CommitLog-5-1626828286977.log)
>>> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
>>>  Mutation checksum failure at 15167277 in CommitLog-5-1626828286977.log
>>> at 
>>> org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:647)
>>> at 
>>> org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:519)
>>> at 
>>> org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:401)
>>> at 
>>> org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:143)
>>> at 
>>> org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:175)
>>> at 
>>> org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:155)
>>> at 
>>> org.apache.cassandra.service.CassandraDaemon.recoverCommitlogAndCompleteSetup(CassandraDaemon.java:296)
>>> at 
>>> org.apache.cassandra.service.CassandraDaemon.completeSetupMayThrowSstableException(CassandraDaemon.java:289)
>>> at 
>>> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:222)
>>> at 
>>> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
>>> at 
>>> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:741)
>>>
>>>
>>> On Mon, Jul 26, 2021 at 6:08 PM Leon Zaruvinsky <
>>> leonzaruvin...@gmail.com> wrote:
>>>
 Currently we're using batch commitlog sync:

 commitlog_sync: batch
 commitlog_sync_batch_window_in_ms: 2
 commitlog_segment_size_in_mb: 32

 durable_writes is also true.

 Unfortunately we are still using Cassandra 2.2.x :( Though I'd be
 curious if much in this space has changed since then (I've looked through
 the changelogs and nothing stood out).

 On Mon, Jul 26, 2021 at 5:20 PM Jeff Jirsa  wrote:

> What commitlog settings are you using?
>
> Default is periodic with 10s sync. That leaves you a 10s window on
> hard poweroff/crash.
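>
> For reference, the relevant yaml defaults (from memory, so worth
> double-checking against your version):
>
> commitlog_sync: periodic
> commitlog_sync_period_in_ms: 10000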
>
> I would also expect Cassandra to clean up and start cleanly. Which
> version are you running?
>
>
>
> On Mon, Jul 26, 2021 at 1:00 PM Leon Zaruvinsky <
> leonzaruvin...@gmail.com> wrote:
>
>> Hi Cassandra community,
>>
>> We (and others) regularly run into commit log corruptions that are
>> caused by Cassandra, or the underlying infrastructure, being hard
>> restarted.  I suspect that this is because the restart lands in the
>> middle of a commitlog file write to disk.
>>
>> Could anyone point me at resources / code to understand why this is
>> happening?  Shouldn't Cassandra hold off on acking writes until the
>> commitlog has been synced to disk?

Re: High memory usage during nodetool repair

2021-08-03 Thread Amandeep Srivastava
Thanks. I guess some earlier thread got truncated.

I already applied Erick's recommendations, and that seems to have worked,
reducing RAM consumption by around 50%.

Regarding cheap memory and hardware, we are already running 96GB boxes, and
getting multiple larger ones might be a little difficult at this point.
Hence I wanted to understand the cons of disabling mmap for data.
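
For context, the change we applied was the disk_access_mode knob, set by
hand since it isn't in the stock yaml. A sketch, hedging on the comments
since this setting is undocumented:

# cassandra.yaml: the default ("auto") mmaps both data and index files
# on 64-bit hosts; valid values: auto, mmap, mmap_index_only, standard
disk_access_mode: mmap_index_only

With mmap_index_only, data files go through standard buffered reads and
only the index files stay memory-mapped.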

Besides degraded read performance, wouldn't disabling mmap also put more
pressure on heap memory, which might cause frequent GCs and OOM errors at
some point? Whatever is currently served via mmap would instead have to be
loaded onto the heap and processed/stored there.

Also, we've disabled swap on the hosts, as recommended for performance, so
Cassandra won't be able to fall back to it if memory starts to fill up.
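
For what it's worth, this is how I've been watching the usage, assuming
these are the right counters to trust:

# off-heap as Cassandra accounts for it
nodetool info | grep -i 'off heap'
# resident set size of the process, which is what actually climbs to 95%
ps -o rss= -p "$(pgrep -f CassandraDaemon)"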

On Tue, 3 Aug, 2021, 6:33 pm Jim Shaw,  wrote:

> I think Erick posted https://community.datastax.com/questions/6947/,
> which explains it very clearly.
>
> We hit the same issue, only on a huge table during an upgrade, and we
> changed the setting back once we were done.
> My understanding is that which option to choose depends on your use case.
> If you're chasing high performance on a big table, go with the default and
> increase memory capacity; hardware is cheaper nowadays.
>
> Thanks,
> Jim
>
> On Mon, Aug 2, 2021 at 7:12 PM Amandeep Srivastava <
> amandeep.srivastava1...@gmail.com> wrote:
>
>> Can anyone please help with the above questions? To summarise:
>>
>> 1) What is the impact of using mmap only for indices besides a
>> degradation in read performance?
>> 2) Why does the off-heap memory consumed during a Cassandra full repair
>> remain occupied for 12+ hours after the repair completes, and is there a
>> manual/configuration-driven way to clear it earlier?
>>
>> Thanks,
>> Aman
>>
>> On Thu, 29 Jul, 2021, 6:47 pm Amandeep Srivastava, <
>> amandeep.srivastava1...@gmail.com> wrote:
>>
>>> Hi Erick,
>>>
>>> Limiting mmap to indices only seems to have resolved the issue. The max
>>> RAM usage remained at 60% this time. Could you please point me to the
>>> limitations of setting this param? For starters, I can see read
>>> performance getting reduced by up to 30% (CASSANDRA-8464).
>>>
>>> Also, could you please shed light on the extended questions in my
>>> earlier email?
>>>
>>> Thanks a lot.
>>>
>>> Regards,
>>> Aman
>>>
>>> On Thu, Jul 29, 2021 at 12:52 PM Amandeep Srivastava <
>>> amandeep.srivastava1...@gmail.com> wrote:
>>>
 Thanks, Bowen, I don't think that's an issue - but yes, I can try
 upgrading to 3.11.5 and limiting the merkle tree size to bring down the
 memory utilization.

 Thanks, Erick, let me try that.

 Can someone please share documentation on the internal functioning of
 full repairs, if any exists? I wanted to understand the roles of heap and
 off-heap memory separately during the process.

 Also, in my case, once the nodes reach 95% memory usage, they stay
 there for almost 10-12 hours after the repair is complete, before falling
 back to 65%. Any pointers on what might be consuming off-heap memory for
 so long, and can anything be done to clear it earlier?

 Thanks,
 Aman



>>>
>>> --
>>> Regards,
>>> Aman
>>>
>>


Re: Long GC pauses during repair

2021-08-03 Thread Jim Shaw
A CMS heap that is too large will have long GC pauses. You could try
reducing the heap on one node to see. Or move to G1 if that is an easy
option.
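
For example, in jvm.options (3.11 ships these lines commented out; a
sketch, worth double-checking against your JDK), comment out the CMS
block and enable the G1 section:

-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500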

Thanks,
Jim

On Tue, Aug 3, 2021 at 3:33 AM manish khandelwal <
manishkhandelwa...@gmail.com> wrote:

> Long GC pauses (1-2 seconds) are seen during repair on the coordinator.
> Running full repair with the partition range option. GC collector is CMS
> and heap is 14G. Cluster is 7+7. Cassandra version is 3.11.2. Not much
> traffic when repair is running. What could be the probable cause of the
> long GC pauses? What should I look into?
>
> Regards
> Manish
>

