Re: Large results sets and paging for Aggregations

Mark Harwood Tue, 10 Feb 2015 01:51:13 -0800

> Why can't aggs be based on shard based calculations 

They are. The "shard_size" setting will determine how many member 
*summaries* will be returned from each shard - we won't stream each 
member's thousands of related records back to a centralized point to 
compute a final result. The final step is to summarise the summaries from 
each shard.

> if the number of members keep on increasing, day by day ES has to keep 
more and more data into memory to calculate the aggs

This is a different point to the one above (shard-level computation vs 
memory costs). If your analysis involves summarising the behaviours of 
large numbers of people over time then you may well find the cost of doing 
this in a single query too high when the numbers of people are extremely 
large. There is a cost to any computation and in that scenario you have 
deferred all these member-summarising costs to the very last moment. A 
better strategy for large-scale analysis of behaviours over time is to use 
a "pay-as-you-go" model where you update a per-member summary document at 
regular intervals with batches of their related records. This shifts the 
bulk of the computation cost from your single query to many smaller costs 
when writing data. You can then perform efficient aggs or scan/scroll 
operations on *member* documents with pre-summarised attributes e.g. 
totalSpend rather than deriving these properties on-the-fly from records 
with a shared member ID.

On Tuesday, February 10, 2015 at 7:03:17 AM UTC, piyush goyal wrote:
>
> Well, my use case says I have tens of thousands of records for each 
> members. I want to do a simple terms aggs on member ID. If my count of 
> member ID remains same throughout .. good enough, if the number of members 
> keep on increasing, day by day ES has to keep more and more data into 
> memory to calculate the aggs. Does not sound very promising. What we do is 
> implementation of routing to put member specific data into a particular 
> shard. Why can't aggs be based on shard based calculations so that I am 
> safe from loading tons of data into memory.
>
> Any thoughts?
>
> On Sunday, 9 November 2014 22:58:12 UTC+5:30, pulkitsinghal wrote:
>>
>> Sharing a response I received from Igor Motov:
>>
>> "scroll works only to page results. paging aggs doesn't make sense since 
>>> aggs are executed on the entire result set. therefore if it managed to fit 
>>> into the memory you should just get it. paging will mean that you throw 
>>> away a lot of results that were already calculated. the only way to "page" 
>>> is by limiting the results that you are running aggs on. for example if 
>>> your data is sorted by date and you want to build histogram for the results 
>>> one date range at a time."
>>
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/486fc700-a89f-473f-a6c6-4e69e862766f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Large results sets and paging for Aggregations

Reply via email to