Re: Optimizing queries for a 5 node cluster with 250 M documents (causes OutOfMemory exceptions and GC pauses)

joergpra...@gmail.com Sun, 24 Aug 2014 08:03:11 -0700

Exactly. Filters do not use scores. They also use bit sets, which makes them
reusable and fast.
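
To make the bit set point concrete, here is a toy sketch in Python. It is an
illustration under stated assumptions, not how Lucene or Elasticsearch
implement filters internally: each filter result is modeled as an int used as
a bit set over made-up document ids, and the cache dict and its keys are
hypothetical. It only shows why cached, score-free results combine cheaply
with AND and AND NOT.

```python
# Toy model only: each filter result is a bit set over document ids,
# stored here as a Python int (bit i set means document i matches).
# This is an illustration, not Lucene's or Elasticsearch's actual code.

def to_bitset(doc_ids):
    """Pack an iterable of document ids into an int used as a bit set."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits

def from_bitset(bits):
    """Unpack a bit set back into a sorted list of document ids."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

# Hypothetical per-filter results, computed once and then reused.
cache = {
    "body:big": to_bitset([0, 1, 3, 4]),
    "body:data": to_bitset([1, 2]),
    "id:521": to_bitset([3, 4, 5]),
}

# must -> AND, must_not -> AND NOT; no scores are involved at any point.
matches = cache["body:big"] & cache["id:521"] & ~cache["body:data"]
print(from_bitset(matches))
```

Because the per-filter bit sets carry no scores, the same cached entry can be
reused by any later request that contains the same filter.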


I wasn't talking about a filter added to a query; I meant filtered queries.
This is a huge difference.

This query

{
  "query": {
    "bool": {
      "must": {
        "match": { "body": "big" }
      },
      "must_not": {
        "match": { "body": "data" }
      },
      "must": {
        "match": { "id": 521 }
      }
    }
  }
}

can be turned into this filtered query

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "match": { "body": "big" } },
            { "match": { "id": 521 } }
          ],
          "must_not": {
            "match": { "body": "data" }
          }
        }
      }
    }
  }
}

(plus fixing the duplicate "must" key, which is a potential source of errors)
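
As an aside on the duplicate key: a minimal Python sketch, assuming you want
to catch such request bodies on the client before they are sent. It uses only
the standard json module; the helper name reject_duplicate_keys is made up for
this example. Python's parser silently keeps the last value for a repeated
key; how other parsers, including the one in your Elasticsearch version, treat
it may differ.

```python
import json

def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises if a JSON object repeats a key."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError("duplicate key: " + key)
        seen[key] = value
    return seen

original = """
{
  "query": {
    "bool": {
      "must": { "match": { "body": "big" } },
      "must_not": { "match": { "body": "data" } },
      "must": { "match": { "id": 521 } }
    }
  }
}
"""

# Python's default parser keeps only the last "must"; the "big" clause is lost.
lenient = json.loads(original)
print(lenient["query"]["bool"]["must"])

# With the hook, the same text is rejected instead of silently altered.
try:
    json.loads(original, object_pairs_hook=reject_duplicate_keys)
    duplicate_found = False
except ValueError as err:
    duplicate_found = True
    print(err)
```

Putting both clauses into one "must" array, as in the rewritten query, avoids
the ambiguity altogether.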

Jörg



On Sun, Aug 24, 2014 at 4:30 PM, Jonathan Foy <the...@gmail.com> wrote:

> I ran into the same issue when using Integer.MAX_VALUE as the size
> parameter (migrating from a DB-based search).  Perhaps someone can come up
> with a proper reference, I cannot, but according to a comment in this SO
> <http://stackoverflow.com/questions/8829468/elasticsearch-query-to-return-all-records>
> question, Elasticsearch/Lucene tries to allocate memory for that many
> scores.  When I switched those queries to a count/search duo, things
> improved dramatically, as you've already noticed.
>
>
> On Saturday, August 23, 2014 12:17:47 PM UTC-4, Narendra Yadala wrote:
>>
>>
>> I am not returning 2 billion documents :)
>>
>> I am returning all documents that match. The actual number can be anywhere
>> between 0 and 50k. I am just fetching documents within a given time
>> interval, such as one hour or one day, and then batch processing them.
>>
>> I fixed this by making 2 queries, one to fetch the count and the other for
>> the actual data. It is mentioned in another thread that the scroll API is
>> performance intensive, so I did not go for it.
>>
>> On Saturday, 23 August 2014 21:32:59 UTC+5:30, Ivan Brusic wrote:
>>>
>>> "When I kept size as Integer.MAX_VALUE, it caused all the problems"
>>>
>>> Are you trying to return up to 2 billion documents at once? Even if that
>>> number was only 1 million, you will face problems. Or did I perhaps
>>> misunderstand you?
>>>
>>> Are you sorting the documents based on the score (the default)?
>>> Lucene/Elasticsearch would need to keep all the values in memory in order
>>> to sort them, causing memory problems. In general, Lucene is not effective
>>> at deep pagination. Use scan/scroll:
>>>
>>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html
>>>
>>> --
>>> Ivan
>>>
>>>
>>> On Sat, Aug 23, 2014 at 6:46 AM, Narendra Yadala <narendr...@gmail.com>
>>> wrote:
>>>
>>>> Hi Jörg,
>>>>
>>>> This query
>>>> {
>>>>    "query" : {
>>>>       "bool": {
>>>>           "must": {
>>>>                "match" : { "body" : "big" }
>>>>            },
>>>>           "must_not": {
>>>>                "match" : { "body" : "data" }
>>>>            },
>>>>            "must": {
>>>>             "match" : {"id": 521}
>>>>            }
>>>>      }
>>>>    }
>>>> }
>>>>
>>>> and this query perform exactly the same:
>>>> {
>>>>    "query" : {
>>>>       "bool": {
>>>>           "must": {
>>>>                "match" : { "body" : "big" }
>>>>            },
>>>>           "must_not": {
>>>>                "match" : { "body" : "data" }
>>>>            }
>>>>      }
>>>>    },
>>>>    "filter" : {
>>>>        "term" : { "id" : "521" }
>>>>    }
>>>> }
>>>>
>>>> I am not able to understand what makes a filtered query fast. Is there any
>>>> place where I can find documentation on the internals of how different
>>>> queries are processed by Elasticsearch?
>>>>
>>>> On Saturday, 23 August 2014 18:20:23 UTC+5:30, Jörg Prante wrote:
>>>>
>>>>> Before firing queries, you should consider whether the index design and
>>>>> query choice are optimal.
>>>>>
>>>>> Numeric range queries are not straightforward. They were a major issue
>>>>> on inverted index engines like Lucene/Elasticsearch and it has taken some
>>>>> time to introduce efficient implementations. See e.g.
>>>>> https://issues.apache.org/jira/browse/LUCENE-1673
>>>>>
>>>>> ES tries to compensate for the downsides of massive numeric range queries
>>>>> by loading all the field values into memory. To achieve effective queries,
>>>>> you have to carefully discretize the values you index.
>>>>>
>>>>> For example, a few hundred million distinct timestamps with
>>>>> millisecond resolution are a real burden for searching on inverted
>>>>> indices. A good discretization strategy for indexing is to reduce the
>>>>> total number of values in such a field to a few hundred or a few
>>>>> thousand. For timestamps, this means that indexing time-based series
>>>>> data in discrete intervals of days, hours, minutes, or maybe seconds is
>>>>> much more efficient than indexing at e.g. millisecond resolution.
>>>>>
>>>>> Another topic is to use filters for boolean queries. They are much
>>>>> faster.
>>>>>
>>>>> Jörg
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Aug 23, 2014 at 2:19 PM, Narendra Yadala <narendr...@gmail.com> wrote:
>>>>>
>>>>>> Hi Ivan,
>>>>>>
>>>>>> Thanks for the input about aggregating on strings. I do that, and
>>>>>> those queries take time, but they do not crash the node.
>>>>>>
>>>>>> The queries which caused the problem were pretty straightforward
>>>>>> (such as a boolean query with two musts, one an equality match and the
>>>>>> other a range match on a long), but the real problem was the size. When
>>>>>> I set size to Integer.MAX_VALUE, it caused all the problems. When I
>>>>>> removed it, it started working fine. I think it is worth mentioning this
>>>>>> strange behavior somewhere (probably expected, but strange).
>>>>>>
>>>>>> I did double the RAM, though, and now I have allocated 5*10G RAM
>>>>>> to the cluster. Things are looking OK as of now, except that the
>>>>>> aggregations (on strings) are quite slow. Maybe I will run these
>>>>>> aggregations as a batch, cache the outputs in a different type, and
>>>>>> move on for now.
>>>>>>
>>>>>> Thanks
>>>>>> NY
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 22, 2014 at 10:34 PM, Ivan Brusic <iv...@brusic.com>
>>>>>> wrote:
>>>>>>
>>>>>>> How expensive are your queries? Are you using aggregations or
>>>>>>> sorting on string fields that could use up your field data cache? Are
>>>>>>> you using the defaults for the cache? Post the current usage.
>>>>>>>
>>>>>>> If you post an example query and mapping, perhaps the community can
>>>>>>> help optimize it.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Ivan
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 22, 2014 at 12:28 AM, Narendra Yadala <narendr...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have a cluster of size 240 GB including replica and it has 5
>>>>>>>> nodes in it. I allocated 5 GB RAM to each node (5*5 GB in total) and
>>>>>>>> started the cluster. When I start continuously firing queries at the
>>>>>>>> cluster, the GC starts kicking in and eventually a node goes down
>>>>>>>> because of an OutOfMemory exception. I add up to 200k documents every
>>>>>>>> day. The indexing part works fine, but the querying part is causing
>>>>>>>> trouble. I have the cluster on EC2 and I use EC2 discovery mode.
>>>>>>>>
>>>>>>>> What is the ideal RAM size, and are there any other parameters I need to
>>>>>>>> tune to get this cluster going?
>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFaSJjdyTm9FG2DsL9RP8kBOi2YuUNEv3yDiRzOB4cBRw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
