Re: Bucket query results | top hits performance

2015-01-08 Thread Martijn v Groningen
Michael, Dustin: what should reduce the query time a lot is setting `collect_mode` to `breadth_first` on the `top-fingerprints` agg. Like this: GET /_search?search_type=count { aggs: { top-fingerprints: { terms: { field: fingerprint, size: 50, collect_mode:
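
The message above is cut off mid-request, so here is a minimal sketch of how such a request body could be assembled, assuming the same `top-fingerprints` terms agg on the `fingerprint` field. The nested `top-hit` agg name and its `size` of 1 are hypothetical placeholders, not taken from the thread; only `collect_mode: breadth_first` on the terms agg comes from the suggestion itself.

```python
import json


def build_top_fingerprints_body(size=50):
    """Build a search body with a breadth-first terms agg wrapping top_hits."""
    return {
        "aggs": {
            "top-fingerprints": {
                "terms": {
                    "field": "fingerprint",
                    "size": size,
                    # Suggested in the thread: prune to the top buckets first,
                    # then run the child aggs only for the surviving buckets.
                    "collect_mode": "breadth_first",
                },
                "aggs": {
                    # Hypothetical child agg; the original name and options
                    # are not visible in the truncated message.
                    "top-hit": {"top_hits": {"size": 1}},
                },
            }
        }
    }


if __name__ == "__main__":
    # Body to send with GET /_search?search_type=count, as in the thread.
    print(json.dumps(build_top_fingerprints_body(), indent=2))
```

Building the body in code makes it easy to time the same request with and without `collect_mode` by deleting that one key.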

Re: Bucket query results | top hits performance

2015-01-08 Thread Martijn v Groningen
Michael: I would expect that setting the `size` option on the terms agg to a smaller value would have a positive impact on the total query time. Feels like I'm missing something, can you run hot threads api (

Re: Bucket query results | top hits performance

2015-01-07 Thread Michael Irani
Martijn, thanks for thinking about this. I tried changing the `size` on the terms agg to 1, 5, 10, 25, and 50, and the timing didn't change much. Interestingly, I also set the size to 0, which in turn took down our cluster. I tried removing the `_source` option and that didn't have any noticeable effect on

Re: Bucket query results | top hits performance

2015-01-07 Thread Dustin Boswell
I'm curious what the underlying algorithm is for TopHits. My mental model for ordinary aggregations is that there's basically a hash table of (field_value -> count) maintained (for each field being aggregated), and that hash table count is incremented once per document, and then the top K
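
The mental model described above can be sketched in a few lines. This only illustrates the hash-table-of-counts idea from the message and is not a description of how Elasticsearch actually implements terms aggregations; the function name `top_k_terms` and the dict-shaped documents are assumptions made for the example.

```python
from collections import Counter


def top_k_terms(docs, field, k):
    """Count field values once per document, then return the k most common."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is not None:
            # One hash-table increment per document carrying the field.
            counts[value] += 1
    # Select the k largest (value, count) pairs.
    return counts.most_common(k)


docs = [
    {"fingerprint": "a"},
    {"fingerprint": "b"},
    {"fingerprint": "a"},
    {},  # document without the field is skipped
]
print(top_k_terms(docs, "fingerprint", 1))  # → [('a', 2)]
```

Under this model the per-document cost is a single counter update, which is why keeping a ranked list of hits per bucket, as top_hits must, is a different kind of work.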

Re: Bucket query results | top hits performance

2015-01-06 Thread Itamar Syn-Hershko
Can you share the query and example results, please?

Re: Bucket query results | top hits performance

2015-01-06 Thread Michael Irani
Sure. I simplified the query to keep things focused. This query takes about 3 seconds to run: { size: 0, aggs: { top-fingerprints: { terms: { field: fingerprint, size: 50 }, aggs: {

Re: Bucket query results | top hits performance

2015-01-06 Thread Martijn v Groningen
Hi Michael, in general, the more buckets returned by the parent aggregator that the top_hits is nested in, the more work the top_hits agg needs to do, but I didn't come across performance issues with `size` on the terms agg being set to 50 and the execution time increasing 30 times when