Exactly. Filters do not use scores. They also use bit sets, which makes them reusable and fast.
I wasn't talking about a filter added to a query, I meant filtered queries. This is a huge difference.

This query

{
  "query" : {
    "bool": {
      "must": { "match" : { "body" : "big" } },
      "must_not": { "match" : { "body" : "data" } },
      "must": { "match" : { "id": 521 } }
    }
  }
}

can be turned into this filtered query

{
  "query" : {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "match" : { "body" : "big" } },
            { "match" : { "id": 521 } }
          ],
          "must_not": { "match" : { "body" : "data" } }
        }
      }
    }
  }
}

(plus fixing the duplicate "must" key, which is a potential source of errors)

Jörg

On Sun, Aug 24, 2014 at 4:30 PM, Jonathan Foy <the...@gmail.com> wrote:

> I ran into the same issue when using Integer.MAX_VALUE as the size
> parameter (migrating from a DB-based search). Perhaps someone can come up
> with a proper reference, I cannot, but according to a comment in this SO
> question
> <http://stackoverflow.com/questions/8829468/elasticsearch-query-to-return-all-records>,
> Elasticsearch/Lucene tries to allocate memory for that many
> scores. When I switched those queries to a count/search duo, things
> improved dramatically, as you've already noticed.
>
> On Saturday, August 23, 2014 12:17:47 PM UTC-4, Narendra Yadala wrote:
>>
>> I am not returning 2 billion documents :)
>>
>> I am returning all documents that match. The actual number can be anywhere
>> between 0 and 50k. I am just fetching documents within a given time
>> interval, such as one hour, one day, and so on, and then batch processing
>> them.
>>
>> I fixed this by making 2 queries, one to fetch the count and the other for
>> the actual data. It is mentioned in some other thread that the scroll API
>> is performance intensive, so I did not go for it.
>>
>> On Saturday, 23 August 2014 21:32:59 UTC+5:30, Ivan Brusic wrote:
>>>
>>> "When I kept size as Integer.MAX_VALUE, it caused all the problems"
>>>
>>> Are you trying to return up to 2 billion documents at once? Even if that
>>> number was only 1 million, you will face problems.
>>> Or did I perhaps misunderstand you?
>>>
>>> Are you sorting the documents based on the score (the default)?
>>> Lucene/Elasticsearch would need to keep all the values in memory in order
>>> to sort them, causing memory problems. In general, Lucene is not effective
>>> at deep pagination. Use scan/scroll:
>>>
>>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html
>>>
>>> --
>>> Ivan
>>>
>>> On Sat, Aug 23, 2014 at 6:46 AM, Narendra Yadala <narendr...@gmail.com> wrote:
>>>
>>>> Hi Jörg,
>>>>
>>>> This query
>>>>
>>>> {
>>>>   "query" : {
>>>>     "bool": {
>>>>       "must": {
>>>>         "match" : { "body" : "big" }
>>>>       },
>>>>       "must_not": {
>>>>         "match" : { "body" : "data" }
>>>>       },
>>>>       "must": {
>>>>         "match" : {"id": 521}
>>>>       }
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> and this query perform exactly the same
>>>>
>>>> {
>>>>   "query" : {
>>>>     "bool": {
>>>>       "must": {
>>>>         "match" : { "body" : "big" }
>>>>       },
>>>>       "must_not": {
>>>>         "match" : { "body" : "data" }
>>>>       }
>>>>     }
>>>>   },
>>>>   "filter" : {
>>>>     "term" : { "id" : "521" }
>>>>   }
>>>> }
>>>>
>>>> I am not able to understand what makes a filtered query fast. Is there any
>>>> place where I can find documentation on the internals of how different
>>>> queries are processed by Elasticsearch?
>>>>
>>>> On Saturday, 23 August 2014 18:20:23 UTC+5:30, Jörg Prante wrote:
>>>>
>>>>> Before firing queries, you should consider if the index design and
>>>>> query choice is optimal.
>>>>>
>>>>> Numeric range queries are not straightforward. They were a major issue
>>>>> on inverted index engines like Lucene/Elasticsearch and it has taken some
>>>>> time to introduce efficient implementations. See e.g.
>>>>> https://issues.apache.org/jira/browse/LUCENE-1673
>>>>>
>>>>> ES tries to compensate for the downsides of massive numeric range queries
>>>>> by loading all the field values into memory. To achieve effective queries,
>>>>> you have to carefully discretize the values you index.
>>>>>
>>>>> For example, a few hundred million different timestamps, with
>>>>> millisecond resolution, are a real burden for searching on inverted
>>>>> indices. A good discretization strategy for indexing is to reduce the
>>>>> total number of values in such a field to a few hundred or a few
>>>>> thousand. For timestamps, this means that indexing time-based series
>>>>> data in discrete intervals of days, hours, minutes, maybe seconds is
>>>>> much more efficient than e.g. millisecond resolution.
>>>>>
>>>>> Another topic is to use filters for boolean queries. They are much
>>>>> faster.
>>>>>
>>>>> Jörg
>>>>>
>>>>> On Sat, Aug 23, 2014 at 2:19 PM, Narendra Yadala <narendr...@gmail.com> wrote:
>>>>>
>>>>>> Hi Ivan,
>>>>>>
>>>>>> Thanks for the input about aggregating on strings. I do that, and
>>>>>> those queries take time, but they do not crash the node.
>>>>>>
>>>>>> The queries which caused problems were pretty straightforward queries
>>>>>> (such as a boolean query with two musts, one must being an equality
>>>>>> match and the other a range match on a long), but the real problem was
>>>>>> with the size. When I kept size as Integer.MAX_VALUE, it caused all the
>>>>>> problems. When I removed it, it started working fine. I think it is
>>>>>> worth mentioning this strange behavior somewhere (probably expected,
>>>>>> but strange).
>>>>>>
>>>>>> I did double up on the RAM though and now I have allocated 5*10G RAM
>>>>>> to the cluster. Things are looking ok as of now, except that the
>>>>>> aggregations (on strings) are quite slow. Maybe I will run these
>>>>>> aggregations as a batch and cache the outputs in a different type and
>>>>>> move on for now.
>>>>>>
>>>>>> Thanks
>>>>>> NY
>>>>>>
>>>>>> On Fri, Aug 22, 2014 at 10:34 PM, Ivan Brusic <iv...@brusic.com> wrote:
>>>>>>
>>>>>>> How expensive are your queries? Are you using aggregations or
>>>>>>> sorting on string fields that could use up your field data cache?
>>>>>>> Are you using the defaults for the cache? Post the current usage.
>>>>>>>
>>>>>>> If you post an example query and mapping, perhaps the community can
>>>>>>> help optimize it.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Ivan
>>>>>>>
>>>>>>> On Fri, Aug 22, 2014 at 12:28 AM, Narendra Yadala <narendr...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have a cluster of size 240 GB including replicas and it has 5
>>>>>>>> nodes in it. I allocated 5 GB RAM (total 5*5 GB) to each node and
>>>>>>>> started the cluster. When I start continuously firing queries on the
>>>>>>>> cluster, the GC starts kicking in and eventually a node goes down
>>>>>>>> because of an OutOfMemory exception. I add up to 200k documents every
>>>>>>>> day. The indexing part works fine but the querying part is causing
>>>>>>>> trouble. I have the cluster on ec2 and I use ec2 discovery mode.
>>>>>>>>
>>>>>>>> What is the ideal RAM size, and are there any other parameters I need
>>>>>>>> to tune to get this cluster going?
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "elasticsearch" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to elasticsearc...@googlegroups.com.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/elasticsearch/5b659d11-d757-4f8e-b347-60b3807c2dfe%40googlegroups.com.
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
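As a rough illustration of the discretization advice quoted earlier in this thread, the following Python sketch rounds millisecond timestamps down to the start of the hour before they would be indexed. The one-hour interval, the sample timestamps, and the function name `discretize` are assumptions made for this example, not anything prescribed by Elasticsearch.

```python
# Sketch: reduce the number of distinct timestamp values by rounding
# millisecond timestamps down to a coarser interval before indexing.
# The interval choice and all names here are illustrative assumptions.

MS_PER_HOUR = 60 * 60 * 1000


def discretize(ts_millis, interval_millis=MS_PER_HOUR):
    """Round an epoch-millisecond timestamp down to the start of its interval."""
    return ts_millis - (ts_millis % interval_millis)


# Three events within the same hour plus one at the start of the next hour.
events = [1408723200123, 1408723201456, 1408726799999, 1408726800000]
buckets = [discretize(ts) for ts in events]

distinct_before = len(set(events))   # one distinct value per event
distinct_after = len(set(buckets))   # one distinct value per hour
```

The full-resolution timestamp can still be kept in the stored document; only the value used for range queries needs the coarser resolution.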