Re: Optimizing queries for a 5 node cluster with 250 M documents (causes OutOfMemory exceptions and GC pauses)

Ivan Brusic Sat, 23 Aug 2014 09:03:01 -0700

"When I kept size as Integer.MAX_VALUE, it caused all the problems"


Are you trying to return up to 2 billion documents at once? Even if that
number was only 1 million, you will face problems. Or did I perhaps
misunderstand you?
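
To put rough numbers on that, here is a small sketch of how many hits a single from/size request asks the cluster to rank. It assumes the usual query-then-fetch behaviour, where every shard collects its own top `from + size` hits and the coordinating node then merges all of those lists. The shard count of 5 is only an assumption (the thread does not state it), and the figures are upper bounds, not measurements.

```python
INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE


def hits_to_hold(size, from_=0, shards=5):
    """Upper bound on ranked entries requested by one from/size search.

    Returns (per_shard, at_coordinator). `shards=5` is an assumed value.
    """
    per_shard = from_ + size             # each shard ranks its own top from+size hits
    at_coordinator = per_shard * shards  # the coordinating node merges every shard's list
    return per_shard, at_coordinator


for size in (10, 1_000_000, INT_MAX):
    per_shard, total = hits_to_hold(size)
    print(f"size={size:>13,}: {per_shard:,} per shard, {total:,} to merge")
```

Even at size=1,000,000 the coordinating node is asked to merge millions of entries for one request, which is why a smaller page size or a scroll is the safer choice.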

Are you sorting the documents based on the score (the default)?
Lucene/Elasticsearch would need to keep all the values in memory in order
to sort them, causing memory problems. In general, Lucene is not efficient
at deep pagination. Use scan/scroll:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html
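
As a rough illustration of the scroll pattern, the sketch below walks a cursor page by page instead of asking for everything in one response. The two callables `start_search` and `next_page` are hypothetical stand-ins for whatever client sends the initial search and the follow-up scroll requests; only the `_scroll_id` and `hits.hits` response fields are taken from the actual API. The fake cluster exists purely so the example runs without a server.

```python
def scroll_all(start_search, next_page):
    """Yield every hit by following a scroll cursor until a page comes back empty.

    `start_search()` and `next_page(scroll_id)` are hypothetical callables
    that return the decoded JSON response of the corresponding request.
    """
    response = start_search()
    # A scan-type search returns only a cursor here; other types may return hits.
    yield from response["hits"]["hits"]
    while True:
        response = next_page(response["_scroll_id"])
        hits = response["hits"]["hits"]
        if not hits:  # an empty page means the scroll is exhausted
            break
        yield from hits


def make_fake_cluster(docs, page_size):
    """In-memory stand-in for a cluster, for demonstration only."""
    def start_search():
        # Mimics a scan-type search: a cursor comes back, but no hits yet.
        return {"_scroll_id": "0", "hits": {"hits": []}}

    def next_page(scroll_id):
        start = int(scroll_id)
        page = docs[start:start + page_size]
        return {"_scroll_id": str(start + len(page)), "hits": {"hits": page}}

    return start_search, next_page


start_search, next_page = make_fake_cluster([{"_id": i} for i in range(7)], page_size=3)
ids = [hit["_id"] for hit in scroll_all(start_search, next_page)]
print(ids)
```

Because the generator only ever holds one page, memory use on the client is bounded by the page size rather than by the total number of matching documents.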

-- 
Ivan


On Sat, Aug 23, 2014 at 6:46 AM, Narendra Yadala <narendra.yad...@gmail.com>
wrote:

> Hi Jörg,
>
> This query
> {
>    "query" : {
>       "bool": {
>           "must": [
>                { "match" : { "body" : "big" } },
>                { "match" : { "id" : 521 } }
>            ],
>           "must_not": {
>                "match" : { "body" : "data" }
>            }
>      }
>    }
> }
>
> and this query perform exactly the same:
> {
>    "query" : {
>       "bool": {
>           "must": {
>                "match" : { "body" : "big" }
>            },
>           "must_not": {
>                "match" : { "body" : "data" }
>            }
>      }
>    },
>    "filter" : {
>        "term" : { "id" : "521" }
>    }
> }
>
> I am not able to understand what makes a filtered query fast. Is there any
> place where I can find documentation on the internals of how different
> queries are processed by Elasticsearch?
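
One likely reason the two requests behave the same: a `filter` placed at the top level of the request body is a post filter in the 1.x line (the key was renamed `post_filter` in 1.0), so it trims hits after the query has already run. A `filtered` query instead wraps the scoring query and the non-scoring filter together, so the filter can restrict the candidate set before scoring. The sketch below only builds that request body; the helper name `filtered_search` is made up for illustration, and the field names are the ones from the queries above.

```python
import json


def filtered_search(query, filter_):
    """Wrap a scoring query and a non-scoring filter in a 1.x `filtered` query."""
    return {"query": {"filtered": {"query": query, "filter": filter_}}}


body = filtered_search(
    query={
        "bool": {
            "must": {"match": {"body": "big"}},
            "must_not": {"match": {"body": "data"}},
        }
    },
    filter_={"term": {"id": 521}},
)
print(json.dumps(body, indent=2))
```

The `filtered` query belongs to the 1.x query DSL in use when this thread was written; later releases express the same idea with a `filter` clause inside `bool`.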
>
> On Saturday, 23 August 2014 18:20:23 UTC+5:30, Jörg Prante wrote:
>
>> Before firing queries, you should consider whether the index design and query
>> choice are optimal.
>>
>> Numeric range queries are not straightforward. They were a major issue on
>> inverted index engines like Lucene/Elasticsearch and it has taken some time
>> to introduce efficient implementations. See e.g.
>> https://issues.apache.org/jira/browse/LUCENE-1673
>>
>> ES tries to compensate for the downsides of massive numeric range queries by
>> loading all the field values into memory. To achieve effective queries, you
>> have to carefully discretize the values you index.
>>
>> For example, a few hundred million distinct timestamps with
>> millisecond resolution are a real burden for searching on inverted
>> indices. A good discretization strategy for indexing is to reduce the total
>> number of values in such a field to a few hundred or a few thousand. For
>> timestamps, this means that indexing time-series data in discrete
>> intervals of days, hours, minutes, or maybe seconds is much more efficient
>> than indexing at, e.g., millisecond resolution.
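
A minimal sketch of that discretization, assuming timestamps arrive as epoch milliseconds and are rounded down to a fixed-width bucket before indexing. The function name, the bucket table, and the sample data are illustrative only.

```python
MS_PER = {"second": 1_000, "minute": 60_000, "hour": 3_600_000, "day": 86_400_000}


def discretize(epoch_ms, resolution="minute"):
    """Round an epoch-millisecond timestamp down to the start of its bucket."""
    step = MS_PER[resolution]
    return epoch_ms - (epoch_ms % step)


ts = 1408809781123  # 2014-08-23T16:03:01.123Z

# 1000 sample events, 250 ms apart, starting at ts
samples = [ts + i * 250 for i in range(1000)]

for resolution in ("second", "minute", "hour"):
    distinct = {discretize(t, resolution) for t in samples}
    print(f"{resolution}: {len(distinct)} distinct values instead of {len(set(samples))}")
```

The original millisecond value can still be kept in a second field that is stored but not used for range queries, if exact times are needed for display.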
>>
>> Another point is to use filters instead of queries for boolean (yes/no)
>> conditions. Filters skip scoring and can be cached, so they are usually
>> much faster.
>>
>> Jörg
>>
>>
>>
>> On Sat, Aug 23, 2014 at 2:19 PM, Narendra Yadala <narendr...@gmail.com>
>> wrote:
>>
>>> Hi Ivan,
>>>
>>> Thanks for the input about aggregating on strings. I do that, and those
>>> queries take time, but they do not crash the node.
>>>
>>> The queries that caused problems were pretty straightforward
>>> (such as a boolean query with two musts, one an exact match and the other
>>> a range match on a long), but the real problem was the size. When I kept
>>> size as Integer.MAX_VALUE, it caused all the problems. When I removed it,
>>> everything started working fine. I think this behavior is worth documenting
>>> somewhere (it is probably expected, but it seems strange).
>>>
>>> I did double the RAM, though, and have now allocated 5*10G RAM to
>>> the cluster. Things are looking OK as of now, except that the aggregations
>>> (on strings) are quite slow. Maybe I will run these aggregations as a batch,
>>> cache the outputs in a different type, and move on for now.
>>>
>>> Thanks
>>> NY
>>>
>>>
>>> On Fri, Aug 22, 2014 at 10:34 PM, Ivan Brusic <iv...@brusic.com> wrote:
>>>
>>>> How expensive are your queries? Are you using aggregations or sorting
>>>> on string fields that could use up your field data cache? Are you using the
>>>> defaults for the cache? Post the current usage.
>>>>
>>>> If you post an example query and mapping, perhaps the community can
>>>> help optimize it.
>>>>
>>>> Cheers,
>>>>
>>>> Ivan
>>>>
>>>>
>>>>  On Fri, Aug 22, 2014 at 12:28 AM, Narendra Yadala <
>>>> narendr...@gmail.com> wrote:
>>>>
>>>>>  I have a cluster of size 240 GB, including replicas, with 5 nodes
>>>>> in it. I allocated 5 GB RAM to each node (5*5 GB in total) and started the
>>>>> cluster. When I start continuously firing queries at the cluster, GC
>>>>> starts kicking in and eventually a node goes down because of an OutOfMemory
>>>>> exception. I add up to 200k documents every day. The indexing part works
>>>>> fine, but the querying part is causing trouble. I have the cluster on EC2
>>>>> and I use EC2 discovery mode.
>>>>>
>>>>> What is the ideal RAM size, and are there any other parameters I need to
>>>>> tune to get this cluster going?
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "elasticsearch" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to elasticsearc...@googlegroups.com.
>>>>>
>>>>> To view this discussion on the web visit https://groups.google.com/d/
>>>>> msgid/elasticsearch/5b659d11-d757-4f8e-b347-60b3807c2dfe%
>>>>> 40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/elasticsearch/5b659d11-d757-4f8e-b347-60b3807c2dfe%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>
>>

