Re: Large results sets and paging for Aggregations

Mark Harwood Tue, 10 Feb 2015 02:45:37 -0800

>these kinds of queries are hit more for qualitative analysis.

Do you have any example queries? The "pay as you go" summarisation need not 
be about just maintaining quantities.  In the demo here [1] I derive 
"profile" names for people, categorizing them as "newbies", "fanboys" or 
"haters" based on a history of their reviewing behaviours in a marketplace.


>By the way, are there any other strategies suggested by ES for these kinds 
>of scenarios?

Igor hit on one, which is to use some criterion, e.g. a date range, to limit 
the volume of what you analyze in any one query request.
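
As a rough sketch of that date-limiting idea (not code from this thread), the 
snippet below splits a period into fixed windows and builds one terms-agg 
request body per window, so each request only analyzes one slice of the data. 
The field names "timestamp" and "member_id", the 14-day window and the top_n 
value are assumptions for illustration; sending the bodies to a cluster is 
left out.

```python
from datetime import date, timedelta


def date_windows(start, end, days):
    """Yield (window_start, window_end) pairs covering [start, end)."""
    step = timedelta(days=days)
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor, upper
        cursor = upper


def windowed_agg_body(window_start, window_end, top_n=100):
    """Build a search body that aggregates only one date window.

    "timestamp" and "member_id" are hypothetical field names.
    """
    return {
        "size": 0,  # no hits wanted, only the aggregation
        "query": {
            "range": {
                "timestamp": {
                    "gte": window_start.isoformat(),
                    "lt": window_end.isoformat(),
                }
            }
        },
        "aggs": {
            "by_member": {"terms": {"field": "member_id", "size": top_n}}
        },
    }


bodies = [
    windowed_agg_body(lo, hi)
    for lo, hi in date_windows(date(2015, 1, 1), date(2015, 2, 10), days=14)
]
```

Each body would be sent as a separate search request, and the per-window 
results combined on the client if a total across windows is needed.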

[1] http://www.elasticsearch.org/videos/entity-centric-indexing-london-meetup-sep-2014/



On Tuesday, February 10, 2015 at 10:05:24 AM UTC, piyush goyal wrote:
>
> Thanks Mark. Your suggestion of "pay-as-you-go" seems amazing. But 
> considering the dynamics of the application, these kinds of queries are hit 
> more for qualitative analysis. There are hundreds of such queries (I am not 
> exaggerating) which are being hit daily by our analytics team. Keeping count 
> of all those qualitative checks daily and maintaining them as documents is 
> a headache in itself. Additions, updates and removals of these documents 
> would cause us huge maintenance overheads. Hence I was thinking of getting 
> pagination on aggregations, which would definitely help us to keep our ES 
> memory leaks away.
>
> By the way, are there any other strategies suggested by ES for these kinds 
> of scenarios?
>
> Thanks
>
> On Tuesday, 10 February 2015 15:20:40 UTC+5:30, Mark Harwood wrote:
>>
>> > Why can't aggs be based on shard based calculations 
>>
>> They are. The "shard_size" setting will determine how many member 
>> *summaries* will be returned from each shard - we won't stream each 
>> member's thousands of related records back to a centralized point to 
>> compute a final result. The final step is to summarise the summaries from 
>> each shard.
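
A toy illustration of the "summarise the summaries" step described above, 
written as plain Python rather than anything Elasticsearch actually runs: 
each shard reduces its own records to a short list of (member, count) pairs, 
and only those short lists are merged centrally. The function names and the 
sample data are made up for the example.

```python
from collections import Counter


def shard_summary(member_ids, shard_size):
    """Reduce one shard's records to its top `shard_size` (member, count) pairs."""
    return Counter(member_ids).most_common(shard_size)


def reduce_summaries(summaries, size):
    """Merge the per-shard summaries and keep the overall top `size` members."""
    merged = Counter()
    for summary in summaries:
        for member, count in summary:
            merged[member] += count
    return merged.most_common(size)


# Two hypothetical shards, each holding one member ID per record.
shards = [
    ["a", "a", "a", "b", "c"],
    ["a", "b", "b", "b", "d"],
]

# Only these small summaries leave the shards, never the raw records.
summaries = [shard_summary(records, shard_size=2) for records in shards]
top = reduce_summaries(summaries, size=2)
```

Note that members "c" and "d" never reach the merge step because they fall 
outside each shard's top 2, which is why a larger shard-level size trades 
memory and network cost for accuracy.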
>>
>> > if the number of members keep on increasing, day by day ES has to keep 
>> > more and more data into memory to calculate the aggs
>>
>> This is a different point to the one above (shard-level computation vs 
>> memory costs). If your analysis involves summarising the behaviours of 
>> large numbers of people over time then you may well find the cost of doing 
>> this in a single query too high when the numbers of people are extremely 
>> large. There is a cost to any computation and in that scenario you have 
>> deferred all these member-summarising costs to the very last moment. A 
>> better strategy for large-scale analysis of behaviours over time is to use 
>> a "pay-as-you-go" model where you update a per-member summary document at 
>> regular intervals with batches of their related records. This shifts the 
>> bulk of the computation cost from your single query to many smaller costs 
>> when writing data. You can then perform efficient aggs or scan/scroll 
>> operations on *member* documents with pre-summarised attributes e.g. 
>> totalSpend rather than deriving these properties on-the-fly from records 
>> with a shared member ID.
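
A minimal sketch of the "pay-as-you-go" model described above, assuming a 
plain dict stands in for an index of per-member summary documents. In a real 
setup each change would be an update to a member document; the record fields 
"member_id" and "amount" and the "purchaseCount" attribute are hypothetical, 
while totalSpend is the attribute named in the text.

```python
from collections import defaultdict


def apply_batch(summaries, batch):
    """Fold a batch of raw records into the per-member summary documents."""
    for record in batch:
        doc = summaries[record["member_id"]]
        doc["totalSpend"] += record["amount"]
        doc["purchaseCount"] += 1
    return summaries


# One summary document per member, created on first sight of that member.
summaries = defaultdict(lambda: {"totalSpend": 0.0, "purchaseCount": 0})

# Batches arrive at regular intervals; each one is a small write-time cost.
apply_batch(summaries, [
    {"member_id": "m1", "amount": 20.0},
    {"member_id": "m2", "amount": 5.0},
])
apply_batch(summaries, [{"member_id": "m1", "amount": 7.5}])

# Query time is now a cheap filter over pre-summarised member documents.
big_spenders = sorted(
    member for member, doc in summaries.items() if doc["totalSpend"] > 10
)
```

The query at the end touches one small document per member instead of 
re-reading every raw record that shares a member ID.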
>>
>>
>>
>> On Tuesday, February 10, 2015 at 7:03:17 AM UTC, piyush goyal wrote:
>>>
>>> Well, my use case is that I have tens of thousands of records for each 
>>> member. I want to do a simple terms agg on member ID. If the number of 
>>> member IDs stays the same throughout, good enough; but if the number of 
>>> members keeps on increasing, day by day ES has to keep more and more data 
>>> in memory to calculate the aggs. That does not sound very promising. What 
>>> we do is use routing to put member-specific data into a particular shard. 
>>> Why can't aggs be based on shard-based calculations so that I am safe from 
>>> loading tons of data into memory?
>>>
>>> Any thoughts?
>>>
>>> On Sunday, 9 November 2014 22:58:12 UTC+5:30, pulkitsinghal wrote:
>>>>
>>>> Sharing a response I received from Igor Motov:
>>>>
>>>> "scroll works only to page results. paging aggs doesn't make sense 
>>>>> since aggs are executed on the entire result set. therefore if it managed 
>>>>> to fit into the memory you should just get it. paging will mean that you 
>>>>> throw away a lot of results that were already calculated. the only way to 
>>>>> "page" is by limiting the results that you are running aggs on. for 
>>>>> example if your data is sorted by date and you want to build histogram 
>>>>> for the results one date range at a time."
>>>>
>>>>
>>>>

