For those following along at home, I thought I'd provide an update.

Jörg provided a big hint and all his advice was useful. Based on what he 
said, we discovered that the VM host was performing memory ballooning on the 
guest. Briefly, this is a process where the host reclaims memory from the 
guest by inflating a memory balloon that grabs memory from the guest OS and 
gives it back to the host. It transfers any memory pressure the host is 
under to the guest. (Google it for more.)
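
For anyone wanting to check for ballooning from inside the guest, here is a minimal Python sketch. It assumes VMware Tools is installed and uses its `vmware-toolbox-cmd stat balloon` command. The helper names are hypothetical, and the "<number> MB" output format is an assumption, so check it against your own Tools version.

```python
import re
import subprocess


def parse_balloon_mb(output):
    """Extract the first integer from output assumed to look like '512 MB'."""
    match = re.search(r"(\d+)", output)
    if match is None:
        raise ValueError("no number found in: %r" % output)
    return int(match.group(1))


def current_balloon_mb():
    """Ask VMware Tools how much guest memory the balloon currently holds.

    Assumption: vmware-toolbox-cmd is on the PATH inside the guest.
    """
    out = subprocess.check_output(["vmware-toolbox-cmd", "stat", "balloon"])
    return parse_balloon_mb(out.decode("ascii", "replace"))
```

Calling `current_balloon_mb()` on a schedule and logging the result alongside the failure times is one way to see whether the two line up.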

We showed that our failures were happening when ballooning occurred. By 
default, the balloon is designed to inflate to 60% of configured memory (I'm 
quoting that figure from memory). Given that 50% of configured memory is 
mlocked, and locked pages cannot be swapped out to make room for the 
balloon, these two settings are incompatible.
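
The arithmetic behind that incompatibility can be sketched as below. The 16 GB guest size and the function name are made up for illustration, and the 60% and 50% fractions are just the figures quoted above, not measured values.

```python
def memory_overcommit(configured_mb, balloon_fraction, mlocked_fraction):
    """Return how many MB the balloon target exceeds the unlocked memory by.

    A positive result means the balloon cannot reach its target using only
    memory that the guest OS is able to give up.
    """
    balloon_target = configured_mb * balloon_fraction
    unlocked = configured_mb * (1.0 - mlocked_fraction)
    return balloon_target - unlocked


# Hypothetical 16 GB guest with the fractions quoted above.
shortfall = memory_overcommit(16384, 0.60, 0.50)
print("shortfall: %.0f MB" % shortfall)
```

A positive shortfall means the balloon wants more than the unlocked half of the guest can supply, and that is before counting what the kernel and other processes need for themselves.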

Our fix was to configure VMware to reserve the entire configured memory. 
This means that the host doesn't try to take the memory back. It seemed 
sensible to reserve all of the configured memory as we want elasticsearch 
to keep its buffers and memory maps in place just as it would on a 
hardware instance in production. If placed under memory pressure, the OS 
would start to reclaim these things.
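
To watch for the guest OS coming under that kind of pressure, a small sketch like the following can sample /proc/meminfo. The function names are hypothetical. It assumes the usual "Name:   value kB" layout, and fields such as MemFree, Cached and Mlocked are the ones I'd expect to be worth logging.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of field name -> integer.

    Most values are in kB; a few (such as HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file on a Linux guest."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

Logging `read_meminfo()` every minute or so gives a record of free, cached and locked memory to compare against the time of any failure.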

After making the change, we've been running for a few weeks with no further 
failures.

Cheers,
Edward

On Wednesday, May 7, 2014 4:09:16 PM UTC-7, Edward Sargisson wrote:
>
> Hi Jörg,
> Thanks for your reply - that's given me a number of leads to follow up on.
>
> > Errors in allocating direct buffers will result in Java errors. You 
> mention Linux memory errors but unfortunately you do not quote it, so I 
> have to guess.
> We see nothing useful in elasticsearch logs. What we do see is either the 
> console saying, "Out of memory: Kill process ... score 1 or sacrifice 
> child" or, once, we saw, "Loading dm-mirror.ko module, Waiting for required 
> block device discovery, Waiting for 2 sda-like device(s)...Kernel panic - 
> not syncing: Out of memory and no killable processes".
> The first message I understand as the OOM-Killer coming out to whack a 
> process on the head. I don't understand the last one. I have screenshots of 
> these if required.
>
> > You should have enabled memory mapped files by index store mmapfs 
> (default on RHEL)
> We haven't changed this setting so I expect it is the default. I looked 
> for a way to verify this but the es api appears not to return it.
>
> > bootstrap.mlockall = true...set memlock to unlimited
> Yes - both done.
>
> > If you still encounter issues from Linux OS errors it is most probably 
> because of VMware limitations
> Is there a way to get evidence to show this? I reviewed the VMWare event 
> log and there was no ballooning in there (assuming we were looking at the 
> right spot).
>
> >  If you run a VM, you should assign at most 50% of the configured guest 
> OS memory to ES.
> We use the elasticsearch Puppet module but I modified it with a version of 
> the code in the elasticsearch Chef cookbook to automatically assign this - 
> where it appears to be assigning 60%. I was surprised by this too but I 
> copied it on the assumption that the cookbook writer knew what they were 
> doing. I've raised an issue to ask the question: 
> https://github.com/elasticsearch/cookbook-elasticsearch/issues/209
>
> For the curious: I've set up some monitoring to capture /proc/meminfo, the 
> count of the /proc/<pid>/maps for elasticsearch and Flume as well as the 
> top few entries in top by memory usage. Now I'm just waiting for the next 
> failure.
>
> Thanks for any help provided.
>
> Cheers,
> Edward
>
> On Tuesday, May 6, 2014 3:23:10 PM UTC-7, Jörg Prante wrote:
>>
>> Yes, of course Elasticsearch is using off-heap memory. All the Lucene 
>> index I/O is using direct buffers in native OS memory.
>>
>> Errors in allocating direct buffers will result in Java errors. You 
>> mention Linux memory errors but unfortunately you do not quote it, so I 
>> have to guess.
>>
>> You should have enabled memory mapped files by index store mmapfs 
>> (default on RHEL) so all files that are read by ES are mapped into virtual 
>> address space of the OS VM management.
>>
>> And also bootstrap.mlockall = true, so you also need to set memlock to 
>> unlimited in /etc/security/limits.conf, because RHEL/Centos memlockable 
>> memory is limited to 25% of RAM by default. In that case, Java should throw 
>> an IOException "Map failed".
>>
>> Note, because of the memory page lock support of the host OS, you should 
>> also check what kind of virtualization you have enabled for the guest, it 
>> should be HW (full) virtualization, not paravirtualization.
>>
>> If you still encounter issues from Linux OS errors it is most probably 
>> because of VMware limitations, so you should disable the bootstrap.mlockall 
>> setting.
>>
>> As a side note, the recommended heap size is 50% of the RAM that is 
>> available to the ES process. If you run a VM, you should assign at most 50% 
>> of the configured guest OS memory to ES.
>>
>> Jörg
>>
>>
>> On Tue, May 6, 2014 at 10:35 PM, Edward Sargisson <ejs...@gmail.com> 
>> wrote:
>>
>>> Hi all,
>>> We have a problem where our es nodes will fail with an out of memory 
>>> error from Linux (note, not Java). Our es processes are configured with a 
>>> fixed amount of heap (60% of total RAM - just as in the elasticsearch 
>>> chef cookbook).
>>>
>>> So, something is consuming all of the memory available to Linux.
>>>
>>> Is there any other memory that ES can use? Does it lock OS cache or 
>>> buffer memory so that it can't be released? If it opens lots of files does 
>>> it use up too much RAM? Is it doing off-heap allocation? (I'm pretty sure 
>>> the answer is no to the last).
>>>
>>> We're struggling to find the exact memory resource being used up.
>>>
>>> For the record, this is ES 1.1.0 on CentOS 6.4 running in VMWare.
>>>
>>> Thanks!
>>> Edward
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to elasticsearc...@googlegroups.com.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/elasticsearch/ab6421e3-89a1-409f-b89b-f09ca5bc9551%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>

