Hi Michael,

In general, the more buckets the parent aggregator returns, the more work
the top_hits agg nested in it has to do. That said, I haven't come across a
case where setting `size` to 50 on the terms agg made the request take 30
times longer once top_hits was added. To rule this out on your side, can
you play around with the `size` option on the terms agg?
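
As a sketch of that experiment, the first request quoted below could be
rerun with a smaller terms `size` and everything else left unchanged. The
value 10 is an arbitrary choice for comparison, not a recommendation; a few
different values would give a clearer picture.

```json
{
    "size": 0,
    "aggs": {
        "top-fingerprints": {
            "terms": {
                "field": "fingerprint",
                "size": 10
            },
            "aggs": {
                "top_tag_hits": {
                    "top_hits": {
                        "size": 1,
                        "_source": {
                            "include": ["title"]
                        }
                    }
                }
            }
        }
    }
}
```

If the response time falls roughly in line with the number of buckets, that
would suggest the cost sits in the per-bucket top_hits work rather than in
the terms agg itself.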

Also, perhaps the _source of your documents is relatively large. How does
the top_hits agg perform without the `_source` option?
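
Besides simply dropping the `_source` block, a related check is to switch
source fetching off entirely inside top_hits, so each hit comes back with
only its metadata. The sketch below assumes your Elasticsearch version
accepts `"_source": false` inside top_hits the same way source filtering
on a regular search request does; that is an assumption to verify, not
something confirmed in this thread.

```json
{
    "size": 0,
    "aggs": {
        "top-fingerprints": {
            "terms": {
                "field": "fingerprint",
                "size": 50
            },
            "aggs": {
                "top_tag_hits": {
                    "top_hits": {
                        "size": 1,
                        "_source": false
                    }
                }
            }
        }
    }
}
```

If this variant is much faster than the original request, loading and
filtering large _source documents is the likely bottleneck.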

Martijn

On 6 January 2015 at 22:29, Michael Irani <irani.mich...@gmail.com> wrote:

> Sure. I simplified the query to keep things focused.
>
> This query takes about 3 seconds to run:
>
> {
>     "size": 0,
>     "aggs": {
>         "top-fingerprints": {
>             "terms": {
>                 "field": "fingerprint",
>                 "size": 50
>             },
>             "aggs": {
>                 "top_tag_hits": {
>                     "top_hits": {
>                         "size": 1,
>                         "_source": {
>                             "include": [
>                                 "title"
>                             ]
>                         }
>                     }
>                 }
>             }
>         }
>     }
> }
>
>
> This one takes about 80 milliseconds:
>
> {
>     "size": 0,
>     "aggs": {
>         "fingerprints": {
>             "terms": {
>                 "field": "fingerprint",
>                 "size": 100
>             }
>         }
>     }
> }
>
>
> The result's a bit too big to paste here. Anything specific about it you want 
> me to expose?
>
>
> Michael.
>
>
> On Tuesday, January 6, 2015 12:14:55 PM UTC-8, Itamar Syn-Hershko wrote:
>>
>> Can you share the query and example results please?
>>
>> --
>>
>> Itamar Syn-Hershko
>> http://code972.com | @synhershko <https://twitter.com/synhershko>
>> Freelance Developer & Consultant
>> Author of RavenDB in Action <http://manning.com/synhershko/>
>>
>> On Tue, Jan 6, 2015 at 10:11 PM, Michael Irani <irani....@gmail.com>
>> wrote:
>>
>>> Hello,
>>> I'm working on a corpus of approximately 10 million documents. The
>>> issue I'm running into right now is that the top scoring documents that
>>> come back from my query are essentially all the same result. I'm trying to
>>> find a way to get back unique results.
>>>
>>> I've looked into modeling the data differently with nested objects or
>>> parent-child relationships, but neither layout seems to fit the bill. The
>>> nested model won't work because some of the documents have too many closely
>>> related objects. On the flip side there are also too many unique documents
>>> for the parent-child relationship to fit.
>>>
>>> I then tried the "top hits aggregation" and it's exactly what I'm
>>> looking for, except the running time of the query is approximately 30x
>>> slower than the query without the aggregation. Are there known performance
>>> issues with "top hits"? Any ideas on what I should use to make these
>>> queries? Here's the aggregation piece:
>>> "aggs": {
>>>     "top-fingerprints": {
>>>         "terms": {
>>>             "field": "fingerprint",
>>>             "size": 50
>>>         },
>>>         "aggs": {
>>>             "top_tag_hits": {
>>>                 "top_hits": {
>>>                     "size": 1,
>>>                     "_source": {
>>>                         "include": [
>>>                             "title"
>>>                         ]
>>>                     }
>>>                 }
>>>             }
>>>         }
>>>     }
>>> }
>>>
>>>
>>> Thanks,
>>> Michael
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to elasticsearc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/elasticsearch/29fce15c-79b7-4756-b033-93e490204095%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>



-- 
Met vriendelijke groet,

Martijn van Groningen

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CA%2BA76Tzqo48VW0xTkR3zMpZ4Ys1CxwjB7J8dGTdp19N_1rYO3Q%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
