Jonathan, did you find a solution to this? I've been facing pretty much the same issue since I added nested documents to my index - the deleted percentage goes really high, and an explicit optimize leads to an OOM. Thanks.
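For reference, the "explicit optimize" in question is the 1.x _optimize API. A gentler variant that only rewrites segments containing deletes, rather than forcing the whole index down to a few huge segments, is the only_expunge_deletes flag. A minimal sketch, with "myindex" as a placeholder index name:

  # Rewrite only segments that contain deleted documents, leaving the
  # rest untouched; this needs less I/O and transient heap/disk than a
  # full optimize down to max_num_segments=1.
  curl -XPOST 'http://localhost:9200/myindex/_optimize?only_expunge_deletes=true'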
On Saturday, August 23, 2014 8:08:32 AM UTC-7, Jonathan Foy wrote:
>
> Hello
>
> I was a bit surprised to see the number of deleted docs grow so large, but
> I won't rule out my having something set up wrong. Non-default merge
> settings are below; by all means let me know if I've done something stupid:
>
> indices.store.throttle.type: none
> index.merge.policy.reclaim_deletes_weight: 6.0
> index.merge.policy.max_merge_at_once: 5
> index.merge.policy.segments_per_tier: 5
> index.merge.policy.max_merged_segment: 2gb
> index.merge.scheduler.max_thread_count: 3
>
> I make extensive use of nested documents, and to a smaller degree child
> docs. Right now things are hovering around 15% deleted after a cleanup on
> Wednesday. I've also cleaned up my mappings a lot since I saw the 45%
> deleted number (less redundant data, broke some things off into child docs
> to maintain separately), but it was up to 30% this last weekend. When I
> looked in the past, at the times I saw the 40+% numbers, the segments in
> the largest tier (2 GB) would sometimes have 50+% deleted docs in them;
> the smaller segments all seemed pretty contained, which I guess makes
> sense, as they didn't stick around for nearly as long.
>
> As for where the memory is spent, according to ElasticHQ, right now on one
> server I have a 20 GB heap (out of 30.5 GB, which I know is above the 50%
> suggested; I'm just trying to get things to work), and I'm using 90% of it
> as follows:
>
> Field cache: 5.9 GB
> Filter cache: 4.0 GB (I had reduced this before the last restart, but
> forgot to make the changes permanent. I do use a lot of filters, though,
> so I would like to be able to use the cache.)
> ID cache: 3.5 GB
>
> Node stats "segments: memory_in_bytes": 6.65 GB (I'm not exactly sure how
> this one contributes to the total heap number).
>
> As for the disk-based "doc values", I don't know how I have not come
> across them thus far, but that sounds quite promising. I'm a little late
> in the game to be changing everything yet again, but it may be a good idea
> regardless, and it's definitely something I'll read more about and
> consider going forward. Thank you for bringing it to my attention.
>
> Anyway, my current plan, since I'm running in AWS and have the
> flexibility, is just to add another r3.xlarge node to the cluster over the
> weekend, try the deleted-doc purge, and then pull the node back out after
> moving all shards off of it. I'm hoping this will allow me to clean things
> up with the extra horsepower, but not increase costs too much throughout
> the week.
>
> Thanks for your input; it's very much appreciated.
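On the doc values point raised above: in the 1.x mapping API, field data can be kept on disk per field via the fielddata format setting (doc values only apply to not_analyzed string and numeric/date fields). A sketch of what that might look like; the index, type, and field names here are made up for illustration:

  # Hypothetical mapping change: store field data for a numeric "price"
  # field on disk as doc values instead of on the heap (1.x syntax).
  curl -XPUT 'http://localhost:9200/myindex/_mapping/mytype' -d '{
    "properties": {
      "price": {
        "type": "long",
        "fielddata": { "format": "doc_values" }
      }
    }
  }'

The trade-off is somewhat slower access in exchange for moving that memory out of the field cache, which is where a large chunk of the heap above is going.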
>
> On Friday, August 22, 2014 7:14:18 PM UTC-4, Adrien Grand wrote:
>>
>> Hi Jonathan,
>>
>> The default merge policy is already supposed to merge quite aggressively
>> segments that contain lots of deleted documents, so it is a bit
>> surprising that you see that many deleted documents, even with merge
>> throttling disabled.
>>
>> You mention having memory pressure because of the number of documents in
>> your index; do you know what causes this memory pressure? If it is due
>> to field data, maybe you could consider storing field data on disk (what
>> we call "doc values")?
>>
>> On Fri, Aug 22, 2014 at 5:27 AM, Jonathan Foy <the...@gmail.com> wrote:
>>
>>> Hello
>>>
>>> I'm in the process of putting a two-node Elasticsearch cluster (1.1.2)
>>> into production, but I'm having a bit of trouble keeping it stable
>>> enough for comfort. Specifically, I'm trying to figure out the best way
>>> to keep the number of deleted documents under control.
>>>
>>> Both nodes are r3.xlarge EC2 instances (4 cores, 30.5 GB RAM). The ES
>>> cluster mirrors the primary data store, a MySQL database. Relevant
>>> updates to the database are caught via triggers, which populate a table
>>> that's monitored by an indexing process. This results in what I'd
>>> consider a lot of reindexing any time the primary data is updated.
>>> Search and indexing performance thus far has been in line with
>>> expectations when the number of deleted documents is small, but as it
>>> grows (up to 30-40%), the amount of available RAM becomes limited,
>>> ultimately causing memory problems. If I optimize/purge deletes, then
>>> things return to normal, though I usually end up having to restart at
>>> least one server, if not both, due to OOM problems and shard failures
>>> during optimization. Once ES becomes the source of all searches for the
>>> application, I can't really afford this downtime.
>>>
>>> What would be the preferred course of action here? I do have a window
>>> over the weekend where I could work with somewhat reduced capacity; I
>>> was thinking perhaps I could pull one node out of search rotation,
>>> optimize it, swap it with the other, optimize that one, and then go on
>>> my way. However, I don't know that I CAN pull one node out of rotation
>>> (it seems like the search API lets me specify a node, but nothing to
>>> say "Node X doesn't need any searches"), nor does it appear that I can
>>> optimize an index on one node without doing the same to the other.
>>>
>>> I've tried tweaking the merge settings to favour segments containing
>>> large numbers of deletions, but it doesn't seem to make enough of a
>>> difference. I've also disabled merge throttling (I do have SSD-backed
>>> storage). Is there any safe way to perform regular maintenance on the
>>> cluster, preferably one node at a time, without causing TOO many
>>> problems? Am I just trying to do too much with the hardware I have?
>>>
>>> Any advice is appreciated. Let me know what info I left out that would
>>> help.
>>
>> --
>> Adrien Grand
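On the "can I pull one node out of rotation" question: 1.x has no per-node search-rotation toggle, but shards can be drained off a node with cluster-level allocation filtering, which also fits the add-a-node-then-remove-it plan described above. A sketch, with the IP address as a placeholder:

  # Hypothetical drain: ask the cluster to move all shards off the node
  # at 10.0.0.1; once it holds no shards, it can be stopped safely.
  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "transient": {
      "cluster.routing.allocation.exclude._ip": "10.0.0.1"
    }
  }'

Setting the exclusion back to an empty string afterwards lets shards rebalance onto the node again. Note this moves data rather than just search traffic, so it is a shard-relocation cost, not a free toggle.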