> Why can't aggs be based on shard based calculations They are. The "shard_size" setting will determine how many member *summaries* will be returned from each shard - we won't stream each member's thousands of related records back to a centralized point to compute a final result. The final step is to summarise the summaries from each shard.
> if the number of members keep on increasing, day by day ES has to keep more and more data into memory to calculate the aggs This is a different point to the one above (shard-level computation vs memory costs). If your analysis involves summarising the behaviours of large numbers of people over time then you may well find the cost of doing this in a single query too high when the numbers of people are extremely large. There is a cost to any computation and in that scenario you have deferred all these member-summarising costs to the very last moment. A better strategy for large-scale analysis of behaviours over time is to use a "pay-as-you-go" model where you update a per-member summary document at regular intervals with batches of their related records. This shifts the bulk of the computation cost from your single query to many smaller costs when writing data. You can then perform efficient aggs or scan/scroll operations on *member* documents with pre-summarised attributes e.g. totalSpend rather than deriving these properties on-the-fly from records with a shared member ID. On Tuesday, February 10, 2015 at 7:03:17 AM UTC, piyush goyal wrote: > > Well, my use case says I have tens of thousands of records for each > members. I want to do a simple terms aggs on member ID. If my count of > member ID remains same throughout .. good enough, if the number of members > keep on increasing, day by day ES has to keep more and more data into > memory to calculate the aggs. Does not sound very promising. What we do is > implementation of routing to put member specific data into a particular > shard. Why can't aggs be based on shard based calculations so that I am > safe from loading tons of data into memory. > > Any thoughts? > > On Sunday, 9 November 2014 22:58:12 UTC+5:30, pulkitsinghal wrote: >> >> Sharing a response I received from Igor Motov: >> >> "scroll works only to page results. paging aggs doesn't make sense since >>> aggs are executed on the entire result set. therefore if it managed to fit >>> into the memory you should just get it. paging will mean that you throw >>> away a lot of results that were already calculated. the only way to "page" >>> is by limiting the results that you are running aggs on. for example if >>> your data is sorted by date and you want to build histogram for the results >>> one date range at a time." >> >> >> -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/486fc700-a89f-473f-a6c6-4e69e862766f%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.