Hi,

To be honest I've never seen the OOM killer in action on those instances. My
Xmx was 8GB just like yours, which leads me to think you have some other
process competing for memory, do you? Do you have any cron job, any backup,
anything else that could trigger the OOM killer?
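If it helps, this is roughly how I would check for OOM killer activity and
for other memory-hungry processes (assuming a systemd-based Ubuntu, so
journalctl is available on your boxes):

    dmesg -T | grep -iE 'out of memory|oom|killed process'
    journalctl -k | grep -iE 'out of memory|oom'
    free -m                          # overall memory / swap usage
    ps aux --sort=-rss | head -15    # biggest resident processes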

In my case the unresponsiveness lasted a few seconds. This is/was bad because
the gossip protocol went crazy marking nodes down, with all the consequences
that can have in a distributed system: think about hints, the dynamic snitch,
and whatever else depends on node availability ...
Can you share some numbers from your `nodetool tpstats` or about system load
in general?
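For reference, these are the kinds of numbers I have in mind; nothing exotic,
just the usual tools (dstat/iostat availability depends on what is installed
on your instances):

    nodetool tpstats     # Pending/Blocked pools and dropped message counts
    nodetool status      # Load and ownership per node
    dstat -lvrn 10       # load, memory, disk and network over time
    iostat -x 5          # per-device utilization and await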

On the tuning side I just went through the following article:
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html
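In case that page moves, these are the settings I remember touching the most;
please double-check against the article itself since I am quoting them from
memory:

    # disable swap completely (Cassandra behaves badly when it swaps)
    sudo swapoff --all             # and drop any swap entry from /etc/fstab

    # enough memory map areas for all the mmap'ed SSTables
    sudo sysctl -w vm.max_map_count=1048575

    # avoid NUMA zone reclaim stalls
    sudo sysctl -w vm.zone_reclaim_mode=0

    # raise nofile/nproc/memlock limits for the cassandra user
    # in /etc/security/limits.d/ (see the article for the exact values)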

No rollbacks, just moving forward! Right now we are upgrading the instance
size to something more recent than m1.xlarge (for many different reasons,
including security, ECU and network). Nevertheless it might be a good idea
to upgrade to the 3.x branch to leverage its better off-heap memory
management.
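As a quick sanity check on how much memory already lives off-heap on your
nodes, something like this should give an idea (the exact labels can differ
a bit between versions):

    nodetool info                              # includes "Off Heap Memory (MB)"
    nodetool tablestats | grep -i 'off heap'   # per-table off-heap usage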

Best,


On Thu, Dec 6, 2018 at 2:33 PM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Thu, Dec 6, 2018 at 11:14 AM Riccardo Ferrari <ferra...@gmail.com>
> wrote:
>
>>
>> I had a few instances in the past that were showing that unresponsiveness
>> behaviour. Back then I saw with iotop/htop/dstat ... that the system was
>> stuck on a single thread processing (full throttle) for seconds. According
>> to iotop that was the kswapd0 process. That system was an Ubuntu 16.04,
>> actually "Ubuntu 16.04.4 LTS".
>>
>
> Riccardo,
>
> Did you by chance also observe Linux OOM?  How long did the
> unresponsiveness last in your case?
>
>> From there I started to dig into why a kswapd process was involved on a
>> system with no swap, and found that it is also used for mmapping. This
>> erratic (allow me to say erratic) behaviour was not showing up when I was
>> on 3.0.6 but started right after upgrading to 3.0.17.
>>
>> By "load" I refer to the load as reported by `nodetool status`. On my
>> systems, when disk_access_mode is auto (read: mmap), it is the sum of the
>> node's data load plus the JVM heap size. Of course this is just what I
>> noted on my systems, not really sure if that should be the case on yours
>> too.
>>
>
> I've checked and indeed we are using disk_access_mode=auto (well,
> implicitly, because it's not even part of the config file anymore):
> DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap.
>
>> I hope someone with more experience than me will add a comment about your
>> settings. Reading the configuration file, writers and compactors should be
>> 2 at minimum. I can confirm that when I tried in the past to set
>> concurrent_compactors to 1, really bad things happened (high system load,
>> high message drop rate, ...)
>>
>
> As I've mentioned, we did not observe any other issues with the current
> setup: system load is reasonable, no dropped messages, no large number of
> hints, request latencies are OK, no large number of pending compactions.
> Also during repair everything looks fine.
>
>> I have the "feeling" that when running on constrained hardware, tuning the
>> underlying kernel is a must. I agree with Jonathan H. that you should
>> think about increasing the instance size; CPU and memory matter a lot.
>>
>
> How did you solve your issue in the end?  You didn't roll back to 3.0.6?
> Did you tune kernel parameters?  Which ones?
>
> Thank you!
> --
> Alex
>
>
