Aah! This seems to be the best explanation of how aggregation works.
Thanks a ton for that, Mark. :) A few other questions:

1.) Should I assume that as my document count increases, the time taken for 
the aggregation calculation increases as well? Reason: I am trying to figure 
out whether bucket creation happens at the individual shard level; if so, the 
document counting would happen in parallel on each shard, reducing the 
execution time significantly. Even then, at the shard level, as my document 
count (matching the query criteria) grows, and assuming this process is 
linear in time, the execution time would increase.

2.) How does this analogy relate to sub-aggregations? My observation is 
that as you increase the number of child aggregations, the execution time 
increases along with the memory utilisation. What happens in the case of 
sub-aggregations?

3.) I didn't get your last statement:
     "There is however a fixed overhead for all queries which *is* a 
function of number of docs and that is the Field Data cache required to 
hold the dates/member IDs in RAM - if this becomes a problem then you may 
want to look at on-disk alternative structure in the form of "DocValues"."
 
 4.) Off topic, but I guess it's best to ask here since we are already talking 
about it. :) DocValues: since it was introduced in 1.0.0 and most of our 
mapping was defined in ES 0.9, can I change the mapping of existing fields 
now? Maybe I should take this conversation to another thread, but I would 
love to hear your thoughts on points 1-3. You have made this thread very 
interesting for me.
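
For reference, this is roughly the kind of mapping change I have in mind 
(the index and type names below are just placeholders from my side, and my 
understanding is that an existing field cannot be switched in place, so this 
would mean creating a new index and reindexing into it):

PUT /members_v2
{
  "mappings": {
    "member_data": {
      "properties": {
        "member_id": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        },
        "response_timestamp": {
          "type": "date",
          "doc_values": true
        }
      }
    }
  }
}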

Thanks
Piyush 



On Wednesday, 11 February 2015 15:12:37 UTC+5:30, Mark Harwood wrote:
>
> 5k doesn't sound too scary.
>
> Think of the aggs tree like a "Bean Machine" [1] - one of those wooden 
> boards with pins arranged on it like a Christmas tree; you drop balls in at 
> the top of the board and they rattle down a choice of paths to the bottom.
> In the case of aggs, your buckets are the pins and documents are the balls.
>
> The memory requirement for processing the agg tree is typically the number 
> of pins, not the number of balls you drop into the tree as these just fall 
> out of the bottom of the tree.
> So in your case it is 5k members multiplied by 12 months each = 60k unique 
> buckets, each of which will maintain a counter of how many docs pass 
> through that point. So you could pass millions or billions of docs through 
> and the working memory requirement for the query would be the same.
> There is however a fixed overhead for all queries which *is* a function 
> of number of docs and that is the Field Data cache required to hold the 
> dates/member IDs in RAM - if this becomes a problem then you may want to 
> look at on-disk alternative structure in the form of "DocValues".
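>
> (As a side note, before switching it is worth checking how much RAM the 
> field data for those fields is actually using, e.g. via the node stats API - 
> something like 
> GET /_nodes/stats/indices/fielddata?fields=member_id,response_timestamp 
> - where the exact fields parameter is just illustrative.)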
>
> Hope that helps.
>
> [1] http://en.wikipedia.org/wiki/Bean_machine
>
> On Wednesday, February 11, 2015 at 7:04:04 AM UTC, piyush goyal wrote:
>>
>> Hi Mark,
>>
>> Before getting into the queries, here is a little bit of info about the project:
>>
>> 1.) A community where members keep on increasing, decreasing and 
>> changing; these are maintained in a separate type.
>> 2.) Approximately 3K to 4K documents of data per user are inserted into 
>> ES per month, in a separate type keyed by member ID.
>> 3.) The mapping is flat; there are no nested or array data types.
>>
>> Requirement:
>>
>> Here is a sample requirement:
>>
>> 1.) Getting a report of the document count against each member ID for the 
>> last three months.
>> 2.) The query used to get the data is:
>>
>> {
>>   "query": {
>>     "constant_score": {
>>       "filter": {
>>         "bool": {
>>           "must": [
>>             {
>>               "term": {
>>                 "datatype": "XYZ"
>>               }
>>             },
>>             {
>>               "range": {
>>                 "response_timestamp": {
>>                   "from": "2014-11-01",
>>                   "to": "2015-01-31"
>>                 }
>>               }
>>             }
>>           ]
>>         }
>>       }
>>     }
>>   },
>>   "aggs": {
>>     "memberIDAggs": {
>>       "terms": {
>>         "field": "member_id",
>>         "size": 0
>>       },
>>       "aggs": {
>>         "dateHistAggs": {
>>           "date_histogram": {
>>             "field": "response_timestamp",
>>             "interval": "month"
>>           }
>>         }
>>       }
>>     }
>>   },
>>   "size": 0
>> }
>>
>> Now, the current member count is approximately 1K and will increase to 5K 
>> in the next 10 months. That means roughly 5K * 4K * 3 documents would be 
>> used for this aggregation, which I guess is a major hit on the system. And 
>> this is only a two-level aggregation. The next requirement from our 
>> analysts is to break the per-month data into three different categories (a 
>> sketch of what I mean is below).
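>>
>> For illustration, what I have in mind is a third level of aggregation under 
>> the date histogram, something like this (the "category" field name is made 
>> up here, and the query part stays the same as above):
>>
>> {
>>   "aggs": {
>>     "memberIDAggs": {
>>       "terms": {
>>         "field": "member_id",
>>         "size": 0
>>       },
>>       "aggs": {
>>         "dateHistAggs": {
>>           "date_histogram": {
>>             "field": "response_timestamp",
>>             "interval": "month"
>>           },
>>           "aggs": {
>>             "categoryAggs": {
>>               "terms": {
>>                 "field": "category"
>>               }
>>             }
>>           }
>>         }
>>       }
>>     }
>>   },
>>   "size": 0
>> }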
>>
>> What is the optimum solution to this problem?
>>
>> Regards
>> Piyush
>>
>> On Tuesday, 10 February 2015 16:15:22 UTC+5:30, Mark Harwood wrote:
>>>
>>> >these kinds of queries are hit more for qualitative analysis.
>>>
>>> Do you have any example queries? The "pay as you go" summarisation need 
>>> not be about just maintaining quantities.  In the demo here [1] I derive 
>>> "profile" names for people, categorizing them as "newbies", "fanboys" or 
>>> "haters" based on a history of their reviewing behaviours in a marketplace. 
>>>
>>> >By the way, are there any other strategies suggested by ES for these 
>>> kinds of scenarios?
>>>
>>> Igor hit on one, which is to use some criterion, e.g. date, to limit the 
>>> volume of what you analyze in any one query request.
>>>
>>> [1] 
>>> http://www.elasticsearch.org/videos/entity-centric-indexing-london-meetup-sep-2014/
>>>
>>>
>>>
>>> On Tuesday, February 10, 2015 at 10:05:24 AM UTC, piyush goyal wrote:
>>>>
>>>> Thanks Mark. Your suggestion of "pay-as-you-go" sounds amazing. But 
>>>> considering the dynamics of the application, these kinds of queries are hit 
>>>> more for qualitative analysis. There are hundreds of such queries (I am not 
>>>> exaggerating) hit daily by our analytics team. Keeping count of all those 
>>>> qualitative checks daily and maintaining them as documents is a headache in 
>>>> itself; additions/updates/removals of these documents would cause us huge 
>>>> maintenance overhead. Hence I was thinking of some sort of pagination on 
>>>> aggregations, which would definitely help us keep ES memory problems at bay.
>>>>
>>>> By the way, are there any other strategies suggested by ES for these 
>>>> kinds of scenarios?
>>>>
>>>> Thanks
>>>>
>>>> On Tuesday, 10 February 2015 15:20:40 UTC+5:30, Mark Harwood wrote:
>>>>>
>>>>> > Why can't aggs be based on shard based calculations 
>>>>>
>>>>> They are. The "shard_size" setting will determine how many member 
>>>>> *summaries* will be returned from each shard - we won't stream each 
>>>>> member's thousands of related records back to a centralized point to 
>>>>> compute a final result. The final step is to summarise the summaries from 
>>>>> each shard.
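>>>>>
>>>>> For example, on the terms agg that would look something like this (the 
>>>>> numbers are only illustrative):
>>>>>
>>>>> {
>>>>>   "aggs": {
>>>>>     "memberIDAggs": {
>>>>>       "terms": {
>>>>>         "field": "member_id",
>>>>>         "size": 100,
>>>>>         "shard_size": 500
>>>>>       }
>>>>>     }
>>>>>   },
>>>>>   "size": 0
>>>>> }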
>>>>>
>>>>> > if the number of members keep on increasing, day by day ES has to 
>>>>> keep more and more data into memory to calculate the aggs
>>>>>
>>>>> This is a different point to the one above (shard-level computation vs 
>>>>> memory costs). If your analysis involves summarising the behaviours of 
>>>>> large numbers of people over time then you may well find the cost of 
>>>>> doing 
>>>>> this in a single query too high when the numbers of people are extremely 
>>>>> large. There is a cost to any computation and in that scenario you have 
>>>>> deferred all these member-summarising costs to the very last moment. A 
>>>>> better strategy for large-scale analysis of behaviours over time is to 
>>>>> use 
>>>>> a "pay-as-you-go" model where you update a per-member summary document at 
>>>>> regular intervals with batches of their related records. This shifts the 
>>>>> bulk of the computation cost from your single query to many smaller costs 
>>>>> when writing data. You can then perform efficient aggs or scan/scroll 
>>>>> operations on *member* documents with pre-summarised attributes e.g. 
>>>>> totalSpend rather than deriving these properties on-the-fly from records 
>>>>> with a shared member ID.
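>>>>>
>>>>> As a rough sketch of what I mean (index, type, field and script details 
>>>>> here are just placeholders, and this assumes scripted updates are 
>>>>> enabled), each new record could be folded into the member's summary doc 
>>>>> with a scripted upsert:
>>>>>
>>>>> POST /community/member_summary/member123/_update
>>>>> {
>>>>>   "script": "ctx._source.totalSpend += spend; ctx._source.recordCount += 1",
>>>>>   "params": {
>>>>>     "spend": 25
>>>>>   },
>>>>>   "upsert": {
>>>>>     "totalSpend": 25,
>>>>>     "recordCount": 1
>>>>>   }
>>>>> }
>>>>>
>>>>> Aggs or scan/scroll on the member_summary docs then only touch one 
>>>>> document per member.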
>>>>>
>>>>>
>>>>>
>>>>> On Tuesday, February 10, 2015 at 7:03:17 AM UTC, piyush goyal wrote:
>>>>>>
>>>>>> Well, my use case is that I have tens of thousands of records for each 
>>>>>> member. I want to do a simple terms agg on member ID. If my count of 
>>>>>> member IDs remains the same throughout, good enough; but if the number of 
>>>>>> members keeps on increasing, day by day ES has to keep more and more data 
>>>>>> in memory to calculate the aggs. That does not sound very promising. What 
>>>>>> we do is implement routing to put member-specific data into a particular 
>>>>>> shard (see the example below). Why can't aggs be based on shard based 
>>>>>> calculations, so that I am safe from loading tons of data into memory?
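>>>>>>
>>>>>> (By routing I mean indexing each document with the member ID as the 
>>>>>> routing value, e.g. something like 
>>>>>> PUT /community/data/1?routing=member123 
>>>>>> where the index/type names are just placeholders, so that all of a 
>>>>>> member's docs end up on the same shard.)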
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>> On Sunday, 9 November 2014 22:58:12 UTC+5:30, pulkitsinghal wrote:
>>>>>>>
>>>>>>> Sharing a response I received from Igor Motov:
>>>>>>>
>>>>>>> "scroll works only to page results. paging aggs doesn't make sense 
>>>>>>>> since aggs are executed on the entire result set. therefore if it 
>>>>>>>> managed 
>>>>>>>> to fit into the memory you should just get it. paging will mean that 
>>>>>>>> you 
>>>>>>>> throw away a lot of results that were already calculated. the only way 
>>>>>>>> to 
>>>>>>>> "page" is by limiting the results that you are running aggs on. for 
>>>>>>>> example 
>>>>>>>> if your data is sorted by date and you want to build histogram for the 
>>>>>>>> results one date range at a time."
>>>>>>>
>>>>>>>
>>>>>>>
