Re: Optimizing queries for a 5 node cluster with 250 M documents (causes OutOfMemory exceptions and GC pauses)

2014-08-27 Thread Narendra Yadala
Hi Jörg, This query { query : { bool: { must: { match : { body : big } }, must_not: { match : { body : data } }, must: { match : {id: 521} } } } } and this query are

Re: Optimizing queries for a 5 node cluster with 250 M documents (causes OutOfMemory exceptions and GC pauses)

2014-08-24 Thread Jonathan Foy
I ran into the same issue when using Integer.MAX_VALUE as the size parameter (migrating from a DB-based search). Perhaps someone can come up with a proper reference, I cannot, but according to a comment in this SO

Re: Optimizing queries for a 5 node cluster with 250 M documents (causes OutOfMemory exceptions and GC pauses)

2014-08-24 Thread joergpra...@gmail.com
Exactly. Filters do not use scores. They also use bit sets which makes them reusable and fast. I wasn't talking about a filter added to a query, I mean filtered queries. This is a huge difference. This query { query : { bool: { must: { match : { body : big }
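
The distinction drawn here can be sketched as two request bodies. This is a minimal illustration, assuming the Elasticsearch 1.x `filtered` query syntax that was current when this thread was written. The message is cut off above, so the second body is only a guess at the shape being described, not the author's actual query. Field names and values are taken from the query quoted in the thread.

```python
import json

# Scored form: every clause is a query, so the id clause also takes part
# in scoring. The two positive clauses are placed in one "must" list,
# because a JSON object cannot reliably carry the same key twice.
bool_query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"body": "big"}},
                {"match": {"id": 521}},
            ],
            "must_not": {"match": {"body": "data"}},
        }
    }
}

# Filtered form (Elasticsearch 1.x syntax, assumed here): the id
# restriction is a filter, so it only includes or excludes documents
# and contributes nothing to the score.
filtered_query = {
    "query": {
        "filtered": {
            "query": {
                "bool": {
                    "must": {"match": {"body": "big"}},
                    "must_not": {"match": {"body": "data"}},
                }
            },
            "filter": {"term": {"id": 521}},
        }
    }
}

print(json.dumps(filtered_query, indent=2))
```

Only the text clauses on `body` need relevance scoring in the second form; the exact-match restriction on `id` is a yes/no test, which is the kind of clause filters are meant for.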

Re: Optimizing queries for a 5 node cluster with 250 M documents (causes OutOfMemory exceptions and GC pauses)

2014-08-23 Thread Narendra Yadala
Hi Ivan, Thanks for the input about aggregating on strings. I do that, and those queries take time, but they do not crash the node. The queries which caused problems were pretty straightforward (such as a boolean query with two musts, one an exact match and the other a range match on a long)

Re: Optimizing queries for a 5 node cluster with 250 M documents (causes OutOfMemory exceptions and GC pauses)

2014-08-23 Thread joergpra...@gmail.com
Before firing queries, you should consider whether the index design and query choice are optimal. Numeric range queries are not straightforward. They were a major issue on inverted index engines like Lucene/Elasticsearch and it has taken some time to introduce efficient implementations. See e.g.

Re: Optimizing queries for a 5 node cluster with 250 M documents (causes OutOfMemory exceptions and GC pauses)

2014-08-23 Thread Narendra Yadala
Hi Jörg, This query { query : { bool: { must: { match : { body : big } }, must_not: { match : { body : data } }, must: { match : {id: 521} } } } } and this query are

Re: Optimizing queries for a 5 node cluster with 250 M documents (causes OutOfMemory exceptions and GC pauses)

2014-08-23 Thread Ivan Brusic
"When I kept size as Integer.MAX_VALUE, it caused all the problems" Are you trying to return up to 2 billion documents at once? Even if that number were only 1 million, you would face problems. Or did I perhaps misunderstand you? Are you sorting the documents based on the score (the default)?
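
A rough back-of-the-envelope estimate shows why the requested size alone can be a problem, whatever the number of actual matches. The sketch below assumes, purely for illustration, that a shard reserves one 8-byte slot per requested hit. That per-hit cost is an assumption and is not stated anywhere in the thread; the 5 GB heap figure comes from the original post (treated here as GiB).

```python
INTEGER_MAX_VALUE = 2**31 - 1   # value of Java's Integer.MAX_VALUE
BYTES_PER_SLOT = 8              # assumption: one 64-bit slot per requested hit
HEAP_BYTES = 5 * 1024**3        # 5 GB heap per node, from the original post


def reserved_bytes(size, bytes_per_slot=BYTES_PER_SLOT):
    """Bytes needed just to hold `size` result slots on one shard."""
    return size * bytes_per_slot


for size in (50_000, 1_000_000, INTEGER_MAX_VALUE):
    needed = reserved_bytes(size)
    print(
        f"size={size:>13,}  "
        f"slots={needed / 1024**2:>10.1f} MiB  "
        f"exceeds_heap={needed > HEAP_BYTES}"
    )
```

Under that assumption, a size of 50k costs well under 1 MiB per shard, while Integer.MAX_VALUE needs about 16 GiB, more than three times the heap of a single node.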

Re: Optimizing queries for a 5 node cluster with 250 M documents (causes OutOfMemory exceptions and GC pauses)

2014-08-23 Thread Narendra Yadala
I am not returning 2 billion documents :) I am returning all documents that match. The actual number can be anywhere between 0 and 50k. I am just fetching documents within a given time interval, such as one hour or one day, and then batch processing them. I fixed this by making 2 queries,
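
The message is cut off before it describes the two queries, so the following is not the fix mentioned above. It is only a sketch of one common way to avoid an unbounded `size`: page through the matches in fixed-size batches. `fake_search` is a hypothetical in-memory stand-in for a search call that takes an offset and a size; no real client API is used.

```python
from typing import Callable, Iterator, List


def fake_search(docs: List[dict], start: int, size: int) -> List[dict]:
    """Hypothetical stand-in for a search call with offset/size paging."""
    return docs[start:start + size]


def iter_batches(
    search: Callable[[int, int], List[dict]],
    batch_size: int = 1000,
) -> Iterator[List[dict]]:
    """Yield result batches until the search returns a short or empty page."""
    start = 0
    while True:
        batch = search(start, batch_size)
        if not batch:
            return
        yield batch
        if len(batch) < batch_size:
            return  # last page reached, no need for another request
        start += batch_size


# Usage: 2,500 matching documents are processed as three bounded batches.
matching = [{"id": i} for i in range(2500)]
batches = list(iter_batches(lambda start, size: fake_search(matching, start, size)))
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Each request asks for at most `batch_size` hits, so memory use per request stays bounded regardless of how many documents fall in the time interval. For large result sets, Elasticsearch's scroll API is usually a better fit than deep offsets.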

Optimizing queries for a 5 node cluster with 250 M documents (causes OutOfMemory exceptions and GC pauses)

2014-08-22 Thread Narendra Yadala
I have a 5-node cluster holding 240 GB of data, including replicas. I allocated 5 GB of RAM to each node (5*5 GB in total) and started the cluster. When I start firing queries at the cluster continuously, GC starts kicking in and eventually a node goes down because of an OutOfMemory exception.