Re: Better understanding Lucene/Shard overheads

Drew Kutcharian Fri, 23 Jan 2015 17:43:17 -0800

Thanks Mike. I’m still a bit unclear on these comments:

> IndexReader requires some RAM for each segment to hold structures like live 
> docs, terms index, index data structures for doc values fields, and holds 
> open a number of file descriptors in proportion to how many segments are in 
> the index.
> There is also a per-indexed-field cost in Lucene; if you have a great many 
> unique indexed fields that may matter.



Aren’t these structures dependent on the size of the “Lucene index”? Say I 
have 1 large Lucene index vs. 10 small Lucene indices (with not much 
duplicated data across indices): wouldn’t the total memory used be the same? I 
understand that there will be more file descriptors because there will be more 
segments.

> IndexWriter has a RAM buffer (indices.memory.index_buffer_size in ES) to hold 
> recently indexed/deleted documents, and periodically opens readers (10 at a 
> time by default) to do merging, which bumps up RAM usage and file descriptors 
> while the merge runs.


According to the docs at 
https://github.com/elasticsearch/elasticsearch/blob/master/docs/reference/modules/indices.asciidoc
it seems like indices.memory.index_buffer_size is the “total” size of the buffer 
for all the shards on a node, so I’m not sure how this would matter when 
there are too many shards. I understand that there will be more file descriptors 
and a lot more “smaller” merge jobs running.
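
To make the question concrete, here is a minimal sketch of how a node-wide buffer might be divided among shards. It assumes the buffer is split evenly across actively indexing shards and then clamped to a per-shard floor and ceiling. The 10% heap fraction and the 4 MB and 512 MB bounds are assumed defaults and should be checked against the docs for the version in use. The function name is made up for illustration.

```python
def per_shard_buffer_mb(heap_mb, active_shards,
                        buffer_fraction=0.10,  # assumed default for indices.memory.index_buffer_size
                        min_shard_mb=4.0,      # assumed per-shard floor
                        max_shard_mb=512.0):   # assumed per-shard ceiling
    """Return the indexing buffer (in MB) one shard would get under an even split."""
    if active_shards <= 0:
        raise ValueError("active_shards must be positive")
    total_mb = heap_mb * buffer_fraction
    share_mb = total_mb / active_shards
    # Clamp the even share to the assumed per-shard bounds.
    return max(min_shard_mb, min(max_shard_mb, share_mb))


if __name__ == "__main__":
    heap_mb = 4096  # hypothetical 4 GB heap
    for shards in (10, 100, 2000):
        share = per_shard_buffer_mb(heap_mb, shards)
        print(shards, "shards:", round(share, 2), "MB each,",
              round(share * shards, 1), "MB in total")
```

Under these assumptions the even split falls below the floor once there are enough active shards, so the combined buffers can exceed the nominal node-wide budget. That would be one way a large shard count still matters for a “total” setting.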

I’m going to test this myself, but I just wanted to understand the model better 
first so I have more accurate tests.
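
One way to structure such a test is to compare segment counts and segment memory per index. The sketch below summarizes a stats document shaped like the output of the index stats API. The `indices.<name>.total.segments` layout with `count` and `memory_in_bytes` is an assumption about that response, and the sample values are placeholders, not measurements.

```python
def summarize_segments(stats):
    """Return sorted (index_name, segment_count, segment_memory_bytes) rows."""
    rows = []
    for name, body in stats.get("indices", {}).items():
        segments = body.get("total", {}).get("segments", {})
        rows.append((name,
                     segments.get("count", 0),
                     segments.get("memory_in_bytes", 0)))
    return sorted(rows)


def totals(rows):
    """Sum segment counts and segment memory across all rows."""
    return (sum(r[1] for r in rows), sum(r[2] for r in rows))


# Placeholder input: index names and numbers are invented for illustration.
SAMPLE_STATS = {
    "indices": {
        "customer_a": {"total": {"segments": {"count": 12, "memory_in_bytes": 3000000}}},
        "customer_b": {"total": {"segments": {"count": 8, "memory_in_bytes": 2000000}}},
    }
}

if __name__ == "__main__":
    rows = summarize_segments(SAMPLE_STATS)
    for name, count, mem in rows:
        print(name, count, mem)
    print("total:", totals(rows))
```

Running the same summary against one large index and against many small ones holding the same data would show whether total segment memory really is the same in both layouts.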


Thanks again,

Drew



> On Jan 23, 2015, at 2:18 AM, Michael McCandless <m...@elasticsearch.com> 
> wrote:
> 
> There is definitely a non-trivial per-index cost.
> 
> From Lucene's standpoint, ES holds an IndexReader (for searching) and 
> IndexWriter (for indexing) open.
> 
> IndexReader requires some RAM for each segment to hold structures like live 
> docs, terms index, index data structures for doc values fields, and holds 
> open a number of file descriptors in proportion to how many segments are in 
> the index.
> 
> IndexWriter has a RAM buffer (indices.memory.index_buffer_size in ES) to hold 
> recently indexed/deleted documents, and periodically opens readers (10 at a 
> time by default) to do merging, which bumps up RAM usage and file descriptors 
> while the merge runs.
> 
> There is also a per-indexed-field cost in Lucene; if you have a great many 
> unique indexed fields that may matter.
> 
> If you use field data, it's entirely RAM resident (doc values is a better 
> choice since it uses much less RAM).
> 
> ES has common thread pools on the node which are shared for all ops across 
> all shards on that node, so I don't think more indices translates to more 
> threads.
> 
> Net/net you really should just conduct your own tests to get a feel of 
> resource consumption in your use case...
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Thu, Jan 22, 2015 at 4:07 PM, Drew Kutcharian <d...@venarc.com> wrote:
> Hi,
> 
> I just came across this blog post: 
> http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html
> 
> It seems like a lot of work has been done on Lucene to reduce its memory 
> requirements, and even more in Lucene 5.0. This is specifically interesting to 
> me since I’m working on a project that uses Elasticsearch and we are planning 
> on using a 1-index-per-customer model (each with 1 or maybe 2 shards and no 
> replicas) and shard allocation, mainly because:
> 
> 1. We are going to have a few thousand customers at most
> 
> 2. Each customer will only need access to their own data (no global queries)
> 
> 3. The indices are going to be relatively large (each with millions of small 
> docs)
> 
> 4. We are going to need to do a lot of parent/child type queries (and ES 
> doesn’t support cross-shard parent/child relationships, and the parent id 
> cache does not seem that efficient, see 
> http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/parent-child.html
> and 
> https://github.com/elasticsearch/elasticsearch/issues/3516#issuecomment-23081662).
> This is the main reason we feel we can’t use time-based (daily, monthly, …) 
> indices.
> 
> 5. Being able to easily “drop” an index if a customer leaves the initial 
> trial.
> 
> 
> I wanted to better understand the overheads of an Elasticsearch shard. Is it 
> just memory or CPU/threads too? Where can I find more information about this?
> 
> Thanks,
> 
> Drew
> 
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to elasticsearch+unsubscr...@googlegroups.com 
> <mailto:elasticsearch+unsubscr...@googlegroups.com>.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/elasticsearch/F59813A2-904C-4B29-BBC9-6174DD3C8DAF%40venarc.com
>  
> <https://groups.google.com/d/msgid/elasticsearch/F59813A2-904C-4B29-BBC9-6174DD3C8DAF%40venarc.com?utm_medium=email&utm_source=footer>.
> For more options, visit https://groups.google.com/d/optout 
> <https://groups.google.com/d/optout>.
> 
> 

