2012/6/14 Gurpreet Singh <gurpreet.si...@gmail.com>:
> JNA is installed. swappiness was 0. vfs_cache_pressure was 100. Two questions
> on this:
> 1. Is there a way to find out if mlockall really worked, other than just the
> mlockall successful log message?
Yes, you should see something like this (from our test server):

 INFO [main] 2012-06-14 02:03:14,745 DatabaseDescriptor.java (line
233) Global memtable threshold is enabled at 512MB
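
One more way to check (assuming a Linux box; the pgrep pattern below may need
adjusting for your install) is the VmLck field in /proc/<pid>/status, which
shows how much of the process address space is locked with mlock:

  # find the Cassandra JVM pid (jsvc or java; adjust the pattern if needed)
  CASS_PID=$(pgrep -f CassandraDaemon | head -1)
  # VmLck should be roughly the size of the locked heap, not "0 kB"
  grep VmLck /proc/$CASS_PID/status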


> 2. Does Cassandra only mlock the JVM heap, or also the mmapped memory?

Cassandra only mlocks the heap; it does not mlock the mmapped SSTables.
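
If you want to see which files are actually mapped, something like this lists
the mmapped SSTable components (the grep pattern is just illustrative):

  # list memory mappings that correspond to SSTable data/index files
  pmap -x "$(pgrep -f CassandraDaemon | head -1)" | grep -E 'Data.db|Index.db'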


>
> I disabled mmap completely, and things look much better.
> Latency is surprisingly half of what I see when mmap is enabled.
> It's funny that I keep reading tall claims about mmap, but in practice a lot
> of people have problems with it, especially when it uses up all the memory.
> We have tried mmap for different purposes in our company before, and finally
> ended up disabling it, because it just doesn't handle things well when memory
> is low. Maybe /proc/sys/vm needs to be configured right, but that's not the
> easiest of configurations to get right.
>
> Right now, I am handling only 80 GB of data. The kernel version is 2.6.26;
> the Java version is 1.6.0_21.
> /G
>
>
> On Wed, Jun 13, 2012 at 8:42 PM, Al Tobey <a...@ooyala.com> wrote:
>>
>> I would check /etc/sysctl.conf and get the values of
>> /proc/sys/vm/swappiness and /proc/sys/vm/vfs_cache_pressure.
>>
>> If you don't have JNA enabled (which Cassandra uses to fadvise) and
>> swappiness is at its default of 60, the Linux kernel will happily swap out
>> your heap for cache space. Set swappiness to 1 or run 'swapoff -a', and
>> kswapd shouldn't be doing much unless you have an oversized heap or some
>> other app using up memory on the system.
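>>
>> For example (standard paths; the sysctl.conf line is only needed if you want
>> the setting to survive a reboot):
>>
>>   # check current values
>>   cat /proc/sys/vm/swappiness /proc/sys/vm/vfs_cache_pressure
>>   # lower swappiness on the running system
>>   sudo sysctl -w vm.swappiness=1
>>   # persist it across reboots
>>   echo 'vm.swappiness = 1' | sudo tee -a /etc/sysctl.conf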
>>
>>
>> On Wed, Jun 13, 2012 at 11:30 AM, ruslan usifov <ruslan.usi...@gmail.com>
>> wrote:
>>>
>>> Hmm, that's very strange. What is your data size? Your Linux kernel
>>> version? Java version?
>>>
>>> PS: I suggest switching disk_access_mode to standard in your case.
>>> PPS: Also upgrade your Linux kernel to the latest version, and Java HotSpot
>>> to 1.6.0_32 (from the Oracle site).
>>>
>>> 2012/6/13 Gurpreet Singh <gurpreet.si...@gmail.com>:
>>> > Alright, here it goes again...
>>> > Even with mmap_index_only, once the RES memory hit 15 GB, the read latency
>>> > went berserk. This happens in 12 hours if disk_access_mode is mmap, and in
>>> > about 48 hours if it's mmap_index_only.
>>> >
>>> > Only reads happening, at 50 reads/second.
>>> > row cache size: 730 MB, row cache hit ratio: 0.75
>>> > key cache size: 400 MB, key cache hit ratio: 0.4
>>> > heap usage (max 8 GB): 6.1-6.9 GB used
>>> >
>>> > No messages about reducing cache sizes in the logs
>>> >
>>> > Stats:
>>> > vmstat 1: no swapping here, but high sys CPU utilization
>>> > iostat (looks great): avgqu-sz = 8, avg await = 7 ms, svctm = 0.6,
>>> > util = 15-30%
>>> > top: VIRT 19.8g, SHR 6.1g, RES 15g, high CPU, buffers 2 MB
>>> > cfstats: read latency 70-100 ms. This number used to be 20-30 ms.
>>> >
>>> > The value of SHR keeps increasing (owing to mmap, I guess), while at the
>>> > same time buffers keep decreasing. Buffers start as high as 50 MB and go
>>> > down to 2 MB.
>>> >
>>> >
>>> > This is very easily reproducible for me. Every time the RES memory hits
>>> > about 15 GB, the client starts getting timeouts from Cassandra and the sys
>>> > CPU jumps a lot. All this even though my row cache hit ratio is almost
>>> > 0.75.
>>> >
>>> > Other than just turning off mmap completely, is there any other solution
>>> > or setting to avoid a Cassandra restart every couple of days? Something to
>>> > keep the RES memory from hitting such a high number. I have been
>>> > constantly monitoring RES, and was not seeing issues when RES was at 14 GB.
>>> > /G
>>> >
>>> > On Fri, Jun 8, 2012 at 10:02 PM, Gurpreet Singh
>>> > <gurpreet.si...@gmail.com>
>>> > wrote:
>>> >>
>>> >> Aaron, Ruslan,
>>> >> I changed the disk access mode to mmap_index_only, and it has been stable
>>> >> ever since, at least for the past 20 hours. Previously, within about 10-12
>>> >> hours, as soon as the resident memory was full, the client would start
>>> >> timing out on all its reads. It looks fine for now; I am going to let it
>>> >> continue to see how long it lasts and whether the problem comes back.
>>> >>
>>> >> Aaron,
>>> >> Yes, I had turned swap off.
>>> >>
>>> >> The total CPU utilization was roughly 700%. It looked like kswapd0 was
>>> >> using just 1 CPU, but Cassandra's (jsvc) CPU utilization increased quite a
>>> >> bit. top was reporting high system CPU and low user CPU.
>>> >> vmstat was not showing swapping. The max Java heap size is 8 GB, while
>>> >> only 4 GB was in use, so the Java heap was doing fine. No GC in the logs.
>>> >> iostat was doing OK from what I remember; I will have to reproduce the
>>> >> issue for the exact numbers.
>>> >>
>>> >> cfstats latency had gone very high, but that is partly due to the high
>>> >> CPU usage.
>>> >>
>>> >> One thing was clear: SHR was inching higher (due to the mmap), while the
>>> >> buffer cache, which started at about 20-25 MB, dropped to 2 MB by the end,
>>> >> which probably means the page cache was being evicted by kswapd0. Is there
>>> >> a way to fix the size of the buffer cache and not let the system evict it
>>> >> in favour of mmap?
>>> >>
>>> >> Also, mmapping the data files would cause not only the data asked for to
>>> >> be read into main memory, but also a bunch of extra pages (readahead),
>>> >> which would not be very useful, right? The same thing for the index would
>>> >> actually be more useful, as there would be more index entries in the
>>> >> readahead part, and the index files, being small, wouldn't create enough
>>> >> memory pressure for the page cache to be evicted. mmapping the data files
>>> >> would make sense if the data size, or at least the hot data set, is
>>> >> smaller than the RAM; otherwise just the index would probably be a better
>>> >> thing to mmap, no? In my case the data size is 85 GB, while available RAM
>>> >> is 16 GB (only 8 GB after the heap).
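>>> >>
>>> >> (For what it's worth, the readahead pulled in on each fault can be checked
>>> >> and lowered on the data device; the device name below is only an example:)
>>> >>
>>> >>   # show current readahead in 512-byte sectors
>>> >>   blockdev --getra /dev/md0
>>> >>   # lower it, e.g. to 64 sectors (32 KB)
>>> >>   sudo blockdev --setra 64 /dev/md0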
>>> >>
>>> >> /G
>>> >>
>>> >>
>>> >> On Fri, Jun 8, 2012 at 11:44 AM, aaron morton
>>> >> <aa...@thelastpickle.com>
>>> >> wrote:
>>> >>>
>>> >>> Ruslan,
>>> >>> Why did you suggest changing the disk_access_mode?
>>> >>>
>>> >>> Gurpreet,
>>> >>> I would leave the disk_access_mode with the default until you have a
>>> >>> reason to change it.
>>> >>>
>>> >>>> > 8 core, 16 gb ram, 6 data disks raid0, no swap configured
>>> >>>
>>> >>> Is swap disabled?
>>> >>>
>>> >>>> Gradually,
>>> >>>> > the system cpu becomes high almost 70%, and the client starts
>>> >>>> > getting
>>> >>>> > continuous timeouts
>>> >>>
>>> >>> 70% of one core or 70% of all cores?
>>> >>> Check the server logs: is there GC activity?
>>> >>> Check nodetool cfstats to see the read latency for the CF.
>>> >>>
>>> >>> Take a look at vmstat to see if you are swapping, and look at iostat to
>>> >>> see if IO is the problem:
>>> >>> http://spyced.blogspot.co.nz/2010/01/linux-performance-basics.html
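>>> >>>
>>> >>> For example (the intervals and flags here are just one reasonable choice):
>>> >>>
>>> >>>   # 1-second samples: watch si/so for swapping and sy for system CPU
>>> >>>   vmstat 1 10
>>> >>>   # extended per-device stats every 5 seconds: watch await, avgqu-sz, %util
>>> >>>   iostat -x 5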
>>> >>>
>>> >>> Cheers
>>> >>>
>>> >>> -----------------
>>> >>> Aaron Morton
>>> >>> Freelance Developer
>>> >>> @aaronmorton
>>> >>> http://www.thelastpickle.com
>>> >>>
>>> >>> On 8/06/2012, at 9:00 PM, Gurpreet Singh wrote:
>>> >>>
>>> >>> Thanks Ruslan.
>>> >>> I will try the mmap_index_only.
>>> >>> Is there any guideline as to when to leave it at auto and when to use
>>> >>> mmap_index_only?
>>> >>>
>>> >>> /G
>>> >>>
>>> >>> On Fri, Jun 8, 2012 at 1:21 AM, ruslan usifov
>>> >>> <ruslan.usi...@gmail.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> Is disk_access_mode set to mmap?
>>> >>>>
>>> >>>> Set disk_access_mode: mmap_index_only in cassandra.yaml.
>>> >>>>
>>> >>>> 2012/6/8 Gurpreet Singh <gurpreet.si...@gmail.com>:
>>> >>>> > Hi,
>>> >>>> > I am testing Cassandra 1.1 on a 1-node cluster:
>>> >>>> > 8 cores, 16 GB RAM, 6 data disks in RAID 0, no swap configured
>>> >>>> >
>>> >>>> > Cassandra 1.1.1
>>> >>>> > heap size: 8 GB
>>> >>>> > key cache size in MB: 800 (only 200 MB used so far)
>>> >>>> > memtable_total_space_in_mb: 2048
>>> >>>> >
>>> >>>> > I am running a read workload, about 30 reads/second, no writes at all.
>>> >>>> > The system runs fine for roughly 12 hours.
>>> >>>> >
>>> >>>> > jconsole shows that my heap usage has hardly touched 4 GB.
>>> >>>> > top shows:
>>> >>>> >   SHR increasing slowly from 100 MB to 6.6 GB over these 12 hours
>>> >>>> >   RES increasing slowly from 6 GB all the way to 15 GB
>>> >>>> >   buffers at a healthy 25 MB at some point, going down to 2 MB over
>>> >>>> > these 12 hours
>>> >>>> >   VIRT staying at 85 GB
>>> >>>> >
>>> >>>> > I understand that SHR goes up because of mmap, and RES goes up because
>>> >>>> > it includes the SHR value as well.
>>> >>>> >
>>> >>>> > After around 10-12 hours, the CPU utilization of the system starts
>>> >>>> > increasing, and I notice that the kswapd0 process becomes more active.
>>> >>>> > Gradually, the system CPU becomes high, almost 70%, and the client
>>> >>>> > starts getting continuous timeouts. The fact that the buffers went
>>> >>>> > down from 20 MB to 2 MB suggests that kswapd0 is probably evicting the
>>> >>>> > page cache.
>>> >>>> >
>>> >>>> > Is there a way to keep kswapd0 from doing this even when there is no
>>> >>>> > swap configured?
>>> >>>> > This is very easily reproducible for me, and I would like a way out of
>>> >>>> > this situation. Do I need to adjust VM memory-management settings like
>>> >>>> > the page cache, vfs_cache_pressure, things like that?
>>> >>>> >
>>> >>>> > Just some extra information: JNA is installed, mlockall is successful,
>>> >>>> > and there is no compaction running.
>>> >>>> > I would appreciate any help on this.
>>> >>>> > Thanks
>>> >>>> > Gurpreet
>>> >>>> >
>>> >>>> >
>>> >>>
>>> >>>
>>> >>>
>>> >>
>>> >
>>
>>
>
