Re: Large results sets and paging for Aggregations

2015-02-11 Thread Mark Harwood

>
> 1.) Would I assume that as my document count would increase, the time for 
> aggregation calculation would as well increase? 
>

Yes - documents are processed serially through the agg tree, but this happens 
really quickly, and each shard does this work at the same time to produce its 
own summary, which is the key to scaling if things start slowing down.

 

> 2.) How would I relate this analogy with sub aggregations.
>

Each row of pins in the bean machine represents another decision point for 
the direction of travel - this is the equivalent of a sub-aggregation. The 
difference is that in the bean machine each layer offers only a choice of 2 
directions - the ball goes left or right around a pin. In your first layer 
of aggregation there are not 2 but 5k choices of direction - one bucket for 
each member. Each member bucket is then further broken down by 12 months. 

 

> My observation says that as you increase the number of child aggregations, 
> so it increases the execution time along with memory utilization. What 
> happens in case of sub aggregations?
>

Hopefully the previous comment has made that clear. Each sub-agg 
represents an additional decision point that has to be negotiated by each 
document - consider the direction based on member ID, then the next direction 
based on month.
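
As a rough sketch - assuming a hypothetical "category" field and illustrative 
agg names, not anything from your actual mapping - a third level of nesting 
would look something like this in the aggs section of the request. Each extra 
level multiplies the bucket count (e.g. 5k members x 12 months x 3 categories 
= 180k buckets):

"aggs": {
  "memberIDAggs": {
    "terms": { "field": "member_id", "size": 0 },
    "aggs": {
      "dateHistAggs": {
        "date_histogram": { "field": "response_timestamp", "interval": "month" },
        "aggs": {
          "categoryAggs": {
            "terms": { "field": "category" }
          }
        }
      }
    }
  }
}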

 

> 3.) I didn't get your last statement:
>  "There is however a fixed overhead for all queries which *is* a 
> function of number of docs and that is the Field Data cache required to 
> hold the dates/member IDs in RAM - if this becomes a problem then you may 
> want to look at on-disk alternative structure in the form of "DocValues"."
>

We are fast at routing docs through the agg tree because we can look up 
things like member ID for each matching doc really quickly and then 
determine the appropriate direction of travel. We rely on these look-up 
stores being pre-warmed (i.e. held in Field Data arrays in the JVM, or 
cached by the file system when using the disk-based DocValues equivalent) to 
make our queries fast.
They are a fixed cost shared by all queries that make use of them.
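
For reference, a minimal mapping sketch - assuming ES 1.x syntax, with the 
field definitions guessed from the query rather than taken from your actual 
mapping - showing how doc_values can be enabled on the fields the aggs use 
(a not_analyzed string and a date). Note that switching an existing field to 
doc_values generally means reindexing:

{
  "properties": {
    "member_id": {
      "type": "string",
      "index": "not_analyzed",
      "doc_values": true
    },
    "response_timestamp": {
      "type": "date",
      "doc_values": true
    }
  }
}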

 

>  
>  4.) Off the topic, but I guess best to ask it here since we are talking 
> about it. :) - DocValues - Since it was introduced in 1.0.0 and most of our 
> mapping was defined in ES 0.9, can I change the mapping of existing fields 
> now? Might be I can take this conversation in another thread but would love 
> to hear about 1-3 points. You made this thread very interesting for me.
>

I recommend you shift that off to another topic.
 

>
> Thanks
> Piyush 
>
>
>
> On Wednesday, 11 February 2015 15:12:37 UTC+5:30, Mark Harwood wrote:
>>
>> 5k doesn't sound  too scary.
>>
>> Think of the aggs tree like a "Bean Machine" [1] - one of those wooden 
>> boards with pins arranged on it like a christmas tree and you drop balls at 
>> the top of the board and they rattle down a choice of path to the bottom.
>> In the case of aggs, your buckets are the pins and documents are the balls
>>
>> The memory requirement for processing the agg tree is typically the 
>> number of pins, not the number of balls you drop into the tree as these 
>> just fall out of the bottom of the tree.
>> So in your case it is 5k members multiplied by 12 months each = 60k 
>> unique buckets, each of which will maintain a counter of how many docs pass 
>> through that point. So you could pass millions or billions of docs through 
>> and the working memory requirement for the query would be the same.
>> There is however a fixed overhead for all queries which *is* a function 
>> of number of docs and that is the Field Data cache required to hold the 
>> dates/member IDs in RAM - if this becomes a problem then you may want to 
>> look at on-disk alternative structure in the form of "DocValues".
>>
>> Hope that helps.
>>
>> [1] http://en.wikipedia.org/wiki/Bean_machine
>>
>> On Wednesday, February 11, 2015 at 7:04:04 AM UTC, piyush goyal wrote:
>>>
>>> Hi Mark,
>>>
>>> Before getting into queries, here is a little bit info about the project:
>>>
>>> 1.) A community where members keep on increasing, decreasing and 
>>> changing. Maintained in a different type.
>>> 2.) Approximately 3K to 4K documents of data of each user inserted into 
>>> ES per month in a different type maintained by member ID.
>>> 3.) Mapping is flat, there are no nested and array type of data.
>>>
>>> Requirement:
>>>
>>> Here is a sample requirement:
>>>
>>> 1.) Getting a report against each member ID against the count of data 
>>> for last three month.
>>> 2.) Query used to get the data is:
>>>
>>> {
>>>   "query": {
>>> "constant_score": {
>>>   "filter": {
>>> "bool": {
>>>   "must": [
>>> {"term": {
>>>   "datatype": "XYZ"
>>> }
>>> }, {
>>>   "range": {
>>> "response_timestamp": {
>>>   "from": "2014-11-01",
>>>   "to": "2015-01-31"
>>> }
>>>   }
>>>   

Re: Large results sets and paging for Aggregations

2015-02-11 Thread piyush goyal
aah..! This seems to be the best explanation of how aggregation works. 
Thanks a ton, Mark. :) A few other questions:

1.) Should I assume that as my document count increases, the time for the 
aggregation calculation increases as well? Reason: I am trying to figure 
out whether, since bucket creation is at the individual shard level, the 
document counting happens concurrently on each shard, thus decreasing the 
execution time significantly. Also, at the shard level, as my document 
count (satisfying the query criteria) increases, and assuming this process 
is linear in time, the execution time would increase. 

2.) How does this analogy relate to sub-aggregations? My observation is 
that as you increase the number of child aggregations, the execution time 
increases along with memory utilization. What happens in the case of sub-
aggregations?

3.) I didn't get your last statement:
 "There is however a fixed overhead for all queries which *is* a 
function of number of docs and that is the Field Data cache required to 
hold the dates/member IDs in RAM - if this becomes a problem then you may 
want to look at on-disk alternative structure in the form of "DocValues"."
 
 4.) Off topic, but I guess it is best to ask here since we are talking 
about it. :) - DocValues - since it was introduced in 1.0.0 and most of our 
mapping was defined in ES 0.9, can I change the mapping of existing fields 
now? Maybe I should take this conversation to another thread, but I would 
love to hear about points 1-3. You made this thread very interesting for me.

Thanks
Piyush 



On Wednesday, 11 February 2015 15:12:37 UTC+5:30, Mark Harwood wrote:
>
> 5k doesn't sound  too scary.
>
> Think of the aggs tree like a "Bean Machine" [1] - one of those wooden 
> boards with pins arranged on it like a christmas tree and you drop balls at 
> the top of the board and they rattle down a choice of path to the bottom.
> In the case of aggs, your buckets are the pins and documents are the balls
>
> The memory requirement for processing the agg tree is typically the number 
> of pins, not the number of balls you drop into the tree as these just fall 
> out of the bottom of the tree.
> So in your case it is 5k members multiplied by 12 months each = 60k unique 
> buckets, each of which will maintain a counter of how many docs pass 
> through that point. So you could pass millions or billions of docs through 
> and the working memory requirement for the query would be the same.
> There is however a fixed overhead for all queries which *is* a function 
> of number of docs and that is the Field Data cache required to hold the 
> dates/member IDs in RAM - if this becomes a problem then you may want to 
> look at on-disk alternative structure in the form of "DocValues".
>
> Hope that helps.
>
> [1] http://en.wikipedia.org/wiki/Bean_machine
>
> On Wednesday, February 11, 2015 at 7:04:04 AM UTC, piyush goyal wrote:
>>
>> Hi Mark,
>>
>> Before getting into queries, here is a little bit info about the project:
>>
>> 1.) A community where members keep on increasing, decreasing and 
>> changing. Maintained in a different type.
>> 2.) Approximately 3K to 4K documents of data of each user inserted into 
>> ES per month in a different type maintained by member ID.
>> 3.) Mapping is flat, there are no nested and array type of data.
>>
>> Requirement:
>>
>> Here is a sample requirement:
>>
>> 1.) Getting a report against each member ID against the count of data for 
>> last three month.
>> 2.) Query used to get the data is:
>>
>> {
>>   "query": {
>> "constant_score": {
>>   "filter": {
>> "bool": {
>>   "must": [
>> {"term": {
>>   "datatype": "XYZ"
>> }
>> }, {
>>   "range": {
>> "response_timestamp": {
>>   "from": "2014-11-01",
>>   "to": "2015-01-31"
>> }
>>   }
>> }
>>   ]
>> }
>>   }
>> }
>>   },"aggs": {
>> "memberIDAggs": {
>>   "terms": {
>> "field": "member_id",
>> "size": 0
>>   },"aggs": {
>> "dateHistAggs": {
>>   "date_histogram": {
>> "field": "response_timestamp",
>> "interval": "month"
>>   }
>> }
>>   }
>> }
>>   },"size": 0
>> }
>>
>> Now since the current member count is approximately 1K which will 
>> increase to 5K in next 10 months. 5K * 4K * 3 times of documents to be used 
>> for this aggregation. I guess a major hit on system. And this is only two 
>> level of aggregation. Next requirement by our analyst is to get per month 
>> data into three different categories. 
>>
>> What is the optimum solution to this problem?
>>
>> Regards
>> Piyush
>>
>> On Tuesday, 10 February 2015 16:15:22 UTC+5:30, Mark Harwood wrote:
>>>
>>> >these kind of queries are hit more for qualitative analysis.
>>>
>>> Do you have any example queries? The "pay 

Re: Large results sets and paging for Aggregations

2015-02-11 Thread Mark Harwood
5k doesn't sound too scary.

Think of the aggs tree like a "Bean Machine" [1] - one of those wooden 
boards with pins arranged on it like a Christmas tree, where you drop balls at 
the top of the board and they rattle down a choice of paths to the bottom.
In the case of aggs, your buckets are the pins and documents are the balls.

The memory requirement for processing the agg tree is typically driven by the 
number of pins, not the number of balls you drop into the tree, as these just 
fall out of the bottom of the tree.
So in your case it is 5k members multiplied by 12 months each = 60k unique 
buckets, each of which maintains a counter of how many docs pass 
through that point. So you could pass millions or billions of docs through 
and the working memory requirement for the query would stay the same.
There is however a fixed overhead for all queries which *is* a function of 
the number of docs, and that is the Field Data cache required to hold the 
dates/member IDs in RAM - if this becomes a problem then you may want to 
look at the on-disk alternative structure in the form of "DocValues".

Hope that helps.

[1] http://en.wikipedia.org/wiki/Bean_machine
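
As a side note, if returning all 5k member buckets in a single response ever 
becomes a concern, the terms agg also accepts "size" and "shard_size" to cap 
how many member summaries each shard returns and how many make it into the 
final response, at the cost of completeness. A rough sketch, reusing the field 
and agg names from your query (the numbers are only illustrative):

"aggs": {
  "memberIDAggs": {
    "terms": {
      "field": "member_id",
      "size": 500,
      "shard_size": 1000
    }
  }
}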

On Wednesday, February 11, 2015 at 7:04:04 AM UTC, piyush goyal wrote:
>
> Hi Mark,
>
> Before getting into queries, here is a little bit info about the project:
>
> 1.) A community where members keep on increasing, decreasing and changing. 
> Maintained in a different type.
> 2.) Approximately 3K to 4K documents of data of each user inserted into ES 
> per month in a different type maintained by member ID.
> 3.) Mapping is flat, there are no nested and array type of data.
>
> Requirement:
>
> Here is a sample requirement:
>
> 1.) Getting a report against each member ID against the count of data for 
> last three month.
> 2.) Query used to get the data is:
>
> {
>   "query": {
> "constant_score": {
>   "filter": {
> "bool": {
>   "must": [
> {"term": {
>   "datatype": "XYZ"
> }
> }, {
>   "range": {
> "response_timestamp": {
>   "from": "2014-11-01",
>   "to": "2015-01-31"
> }
>   }
> }
>   ]
> }
>   }
> }
>   },"aggs": {
> "memberIDAggs": {
>   "terms": {
> "field": "member_id",
> "size": 0
>   },"aggs": {
> "dateHistAggs": {
>   "date_histogram": {
> "field": "response_timestamp",
> "interval": "month"
>   }
> }
>   }
> }
>   },"size": 0
> }
>
> Now since the current member count is approximately 1K which will increase 
> to 5K in next 10 months. 5K * 4K * 3 times of documents to be used for this 
> aggregation. I guess a major hit on system. And this is only two level of 
> aggregation. Next requirement by our analyst is to get per month data into 
> three different categories. 
>
> What is the optimum solution to this problem?
>
> Regards
> Piyush
>
> On Tuesday, 10 February 2015 16:15:22 UTC+5:30, Mark Harwood wrote:
>>
>> >these kind of queries are hit more for qualitative analysis.
>>
>> Do you have any example queries? The "pay as you go" summarisation need 
>> not be about just maintaining quantities.  In the demo here [1] I derive 
>> "profile" names for people, categorizing them as "newbies", "fanboys" or 
>> "haters" based on a history of their reviewing behaviours in a marketplace. 
>>
>> >By the way, are there any other strategies suggested by ES for these 
>> kind of scenarios?
>>
>> Igor hit on one which is to use some criteria eg. date to limit the 
>> volume of what you analyze in any one query request.
>>
>> [1] 
>> http://www.elasticsearch.org/videos/entity-centric-indexing-london-meetup-sep-2014/
>>
>>
>>
>> On Tuesday, February 10, 2015 at 10:05:24 AM UTC, piyush goyal wrote:
>>>
>>> Thanks Mark. Your suggestion of "pay-as-you-go" seems amazing. But 
>>> considering the dynamics of the application, these kind of queries are hit 
>>> more for qualitative analysis. There are hundred of such queries(I am not 
>>> exaggerating) which are being hit daily by our analytic team. Keeping count 
>>> of all those qualitative checks daily and maintaining them as documents is 
>>> a headache itself. Addition/update/removals of these documents would cause 
>>> us huge maintenance overheads. Hence was thinking of getting something of 
>>> getting pagination on aggregations which would definitely help us to keep 
>>> our ES memory leaks away.
>>>
>>> By the way, are there any other strategies suggested by ES for these 
>>> kind of scenarios?
>>>
>>> Thanks
>>>
>>> On Tuesday, 10 February 2015 15:20:40 UTC+5:30, Mark Harwood wrote:

 > Why can't aggs be based on shard based calculations 

 They are. The "shard_size" setting will determine how many member 
 *summaries* will be returned from each shard - we won't stream each 
 member's thousands of related records b

Re: Large results sets and paging for Aggregations

2015-02-10 Thread piyush goyal
Hi Mark,

Before getting into queries, here is a little bit info about the project:

1.) A community whose members keep increasing, decreasing and changing. 
Maintained in a separate type.
2.) Approximately 3K to 4K documents of data per user are inserted into ES 
per month, in a separate type keyed by member ID.
3.) The mapping is flat; there are no nested or array types of data.

Requirement:

Here is a sample requirement:

1.) Getting a report, per member ID, of the count of data for the last 
three months.
2.) The query used to get the data is:

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "datatype": "XYZ"
              }
            },
            {
              "range": {
                "response_timestamp": {
                  "from": "2014-11-01",
                  "to": "2015-01-31"
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "memberIDAggs": {
      "terms": {
        "field": "member_id",
        "size": 0
      },
      "aggs": {
        "dateHistAggs": {
          "date_histogram": {
            "field": "response_timestamp",
            "interval": "month"
          }
        }
      }
    }
  },
  "size": 0
}

Now, the current member count is approximately 1K and will increase to 5K 
in the next 10 months. That is roughly 5K * 4K * 3 documents to be processed 
for this aggregation - I expect a major hit on the system. And this is only a 
two-level aggregation. The next requirement from our analysts is to break the 
per-month data down into three different categories. 

What is the optimum solution to this problem?

Regards
Piyush

On Tuesday, 10 February 2015 16:15:22 UTC+5:30, Mark Harwood wrote:
>
> >these kind of queries are hit more for qualitative analysis.
>
> Do you have any example queries? The "pay as you go" summarisation need 
> not be about just maintaining quantities.  In the demo here [1] I derive 
> "profile" names for people, categorizing them as "newbies", "fanboys" or 
> "haters" based on a history of their reviewing behaviours in a marketplace. 
>
> >By the way, are there any other strategies suggested by ES for these kind 
> of scenarios?
>
> Igor hit on one which is to use some criteria eg. date to limit the volume 
> of what you analyze in any one query request.
>
> [1] 
> http://www.elasticsearch.org/videos/entity-centric-indexing-london-meetup-sep-2014/
>
>
>
> On Tuesday, February 10, 2015 at 10:05:24 AM UTC, piyush goyal wrote:
>>
>> Thanks Mark. Your suggestion of "pay-as-you-go" seems amazing. But 
>> considering the dynamics of the application, these kind of queries are hit 
>> more for qualitative analysis. There are hundred of such queries(I am not 
>> exaggerating) which are being hit daily by our analytic team. Keeping count 
>> of all those qualitative checks daily and maintaining them as documents is 
>> a headache itself. Addition/update/removals of these documents would cause 
>> us huge maintenance overheads. Hence was thinking of getting something of 
>> getting pagination on aggregations which would definitely help us to keep 
>> our ES memory leaks away.
>>
>> By the way, are there any other strategies suggested by ES for these kind 
>> of scenarios?
>>
>> Thanks
>>
>> On Tuesday, 10 February 2015 15:20:40 UTC+5:30, Mark Harwood wrote:
>>>
>>> > Why can't aggs be based on shard based calculations 
>>>
>>> They are. The "shard_size" setting will determine how many member 
>>> *summaries* will be returned from each shard - we won't stream each 
>>> member's thousands of related records back to a centralized point to 
>>> compute a final result. The final step is to summarise the summaries from 
>>> each shard.
>>>
>>> > if the number of members keep on increasing, day by day ES has to keep 
>>> more and more data into memory to calculate the aggs
>>>
>>> This is a different point to the one above (shard-level computation vs 
>>> memory costs). If your analysis involves summarising the behaviours of 
>>> large numbers of people over time then you may well find the cost of doing 
>>> this in a single query too high when the numbers of people are extremely 
>>> large. There is a cost to any computation and in that scenario you have 
>>> deferred all these member-summarising costs to the very last moment. A 
>>> better strategy for large-scale analysis of behaviours over time is to use 
>>> a "pay-as-you-go" model where you update a per-member summary document at 
>>> regular intervals with batches of their related records. This shifts the 
>>> bulk of the computation cost from your single query to many smaller costs 
>>> when writing data. You can then perform efficient aggs or scan/scroll 
>>> operations on *member* documents with pre-summarised attributes e.g. 
>>> totalSpend rather than deriving these properties on-the-fly from records 
>>> with a shared member ID.
>>>
>>>
>>>
>>> On Tuesday, February 10, 2015 at 7:03:17 AM UTC, piyush goyal wrote:

>>

Re: Large results sets and paging for Aggregations

2015-02-10 Thread Mark Harwood
>these kind of queries are hit more for qualitative analysis.

Do you have any example queries? The "pay as you go" summarisation need not 
be about just maintaining quantities.  In the demo here [1] I derive 
"profile" names for people, categorizing them as "newbies", "fanboys" or 
"haters" based on a history of their reviewing behaviours in a marketplace. 

>By the way, are there any other strategies suggested by ES for these kind 
of scenarios?

Igor hit on one, which is to use some criterion, e.g. date, to limit the 
volume of what you analyze in any one query request.

[1] 
http://www.elasticsearch.org/videos/entity-centric-indexing-london-meetup-sep-2014/
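
To illustrate that date-limiting idea with a rough sketch (reusing the field 
and agg names from Piyush's query - the dates are only an example), you would 
run the same aggregation once per month window rather than over the whole 
period, so each request only buckets one month's worth of docs:

{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "response_timestamp": {
            "from": "2015-01-01",
            "to": "2015-01-31"
          }
        }
      }
    }
  },
  "aggs": {
    "memberIDAggs": {
      "terms": { "field": "member_id", "size": 0 }
    }
  },
  "size": 0
}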



On Tuesday, February 10, 2015 at 10:05:24 AM UTC, piyush goyal wrote:
>
> Thanks Mark. Your suggestion of "pay-as-you-go" seems amazing. But 
> considering the dynamics of the application, these kind of queries are hit 
> more for qualitative analysis. There are hundred of such queries(I am not 
> exaggerating) which are being hit daily by our analytic team. Keeping count 
> of all those qualitative checks daily and maintaining them as documents is 
> a headache itself. Addition/update/removals of these documents would cause 
> us huge maintenance overheads. Hence was thinking of getting something of 
> getting pagination on aggregations which would definitely help us to keep 
> our ES memory leaks away.
>
> By the way, are there any other strategies suggested by ES for these kind 
> of scenarios?
>
> Thanks
>
> On Tuesday, 10 February 2015 15:20:40 UTC+5:30, Mark Harwood wrote:
>>
>> > Why can't aggs be based on shard based calculations 
>>
>> They are. The "shard_size" setting will determine how many member 
>> *summaries* will be returned from each shard - we won't stream each 
>> member's thousands of related records back to a centralized point to 
>> compute a final result. The final step is to summarise the summaries from 
>> each shard.
>>
>> > if the number of members keep on increasing, day by day ES has to keep 
>> more and more data into memory to calculate the aggs
>>
>> This is a different point to the one above (shard-level computation vs 
>> memory costs). If your analysis involves summarising the behaviours of 
>> large numbers of people over time then you may well find the cost of doing 
>> this in a single query too high when the numbers of people are extremely 
>> large. There is a cost to any computation and in that scenario you have 
>> deferred all these member-summarising costs to the very last moment. A 
>> better strategy for large-scale analysis of behaviours over time is to use 
>> a "pay-as-you-go" model where you update a per-member summary document at 
>> regular intervals with batches of their related records. This shifts the 
>> bulk of the computation cost from your single query to many smaller costs 
>> when writing data. You can then perform efficient aggs or scan/scroll 
>> operations on *member* documents with pre-summarised attributes e.g. 
>> totalSpend rather than deriving these properties on-the-fly from records 
>> with a shared member ID.
>>
>>
>>
>> On Tuesday, February 10, 2015 at 7:03:17 AM UTC, piyush goyal wrote:
>>>
>>> Well, my use case says I have tens of thousands of records for each 
>>> members. I want to do a simple terms aggs on member ID. If my count of 
>>> member ID remains same throughout .. good enough, if the number of members 
>>> keep on increasing, day by day ES has to keep more and more data into 
>>> memory to calculate the aggs. Does not sound very promising. What we do is 
>>> implementation of routing to put member specific data into a particular 
>>> shard. Why can't aggs be based on shard based calculations so that I am 
>>> safe from loading tons of data into memory.
>>>
>>> Any thoughts?
>>>
>>> On Sunday, 9 November 2014 22:58:12 UTC+5:30, pulkitsinghal wrote:

 Sharing a response I received from Igor Motov:

 "scroll works only to page results. paging aggs doesn't make sense 
> since aggs are executed on the entire result set. therefore if it managed 
> to fit into the memory you should just get it. paging will mean that you 
> throw away a lot of results that were already calculated. the only way to 
> "page" is by limiting the results that you are running aggs on. for 
> example 
> if your data is sorted by date and you want to build histogram for the 
> results one date range at a time."






Re: Large results sets and paging for Aggregations

2015-02-10 Thread piyush goyal
Thanks Mark. Your suggestion of "pay-as-you-go" seems amazing. But 
considering the dynamics of the application, these kinds of queries are used 
more for qualitative analysis. There are hundreds of such queries (I am not 
exaggerating) being hit daily by our analytics team. Keeping counts for all 
those qualitative checks daily and maintaining them as documents is a 
headache in itself. Additions/updates/removals of these documents would cause 
us huge maintenance overheads. Hence I was thinking of getting pagination on 
aggregations, which would definitely help us keep ES memory problems at bay.

By the way, are there any other strategies suggested by ES for these kinds 
of scenarios?

Thanks

On Tuesday, 10 February 2015 15:20:40 UTC+5:30, Mark Harwood wrote:
>
> > Why can't aggs be based on shard based calculations 
>
> They are. The "shard_size" setting will determine how many member 
> *summaries* will be returned from each shard - we won't stream each 
> member's thousands of related records back to a centralized point to 
> compute a final result. The final step is to summarise the summaries from 
> each shard.
>
> > if the number of members keep on increasing, day by day ES has to keep 
> more and more data into memory to calculate the aggs
>
> This is a different point to the one above (shard-level computation vs 
> memory costs). If your analysis involves summarising the behaviours of 
> large numbers of people over time then you may well find the cost of doing 
> this in a single query too high when the numbers of people are extremely 
> large. There is a cost to any computation and in that scenario you have 
> deferred all these member-summarising costs to the very last moment. A 
> better strategy for large-scale analysis of behaviours over time is to use 
> a "pay-as-you-go" model where you update a per-member summary document at 
> regular intervals with batches of their related records. This shifts the 
> bulk of the computation cost from your single query to many smaller costs 
> when writing data. You can then perform efficient aggs or scan/scroll 
> operations on *member* documents with pre-summarised attributes e.g. 
> totalSpend rather than deriving these properties on-the-fly from records 
> with a shared member ID.
>
>
>
> On Tuesday, February 10, 2015 at 7:03:17 AM UTC, piyush goyal wrote:
>>
>> Well, my use case says I have tens of thousands of records for each 
>> members. I want to do a simple terms aggs on member ID. If my count of 
>> member ID remains same throughout .. good enough, if the number of members 
>> keep on increasing, day by day ES has to keep more and more data into 
>> memory to calculate the aggs. Does not sound very promising. What we do is 
>> implementation of routing to put member specific data into a particular 
>> shard. Why can't aggs be based on shard based calculations so that I am 
>> safe from loading tons of data into memory.
>>
>> Any thoughts?
>>
>> On Sunday, 9 November 2014 22:58:12 UTC+5:30, pulkitsinghal wrote:
>>>
>>> Sharing a response I received from Igor Motov:
>>>
>>> "scroll works only to page results. paging aggs doesn't make sense since 
 aggs are executed on the entire result set. therefore if it managed to fit 
 into the memory you should just get it. paging will mean that you throw 
 away a lot of results that were already calculated. the only way to "page" 
 is by limiting the results that you are running aggs on. for example if 
 your data is sorted by date and you want to build histogram for the 
 results 
 one date range at a time."
>>>
>>>
>>>



Re: Large results sets and paging for Aggregations

2015-02-10 Thread Mark Harwood
> Why can't aggs be based on shard based calculations 

They are. The "shard_size" setting will determine how many member 
*summaries* will be returned from each shard - we won't stream each 
member's thousands of related records back to a centralized point to 
compute a final result. The final step is to summarise the summaries from 
each shard.

> if the number of members keep on increasing, day by day ES has to keep 
more and more data into memory to calculate the aggs

This is a different point to the one above (shard-level computation vs 
memory costs). If your analysis involves summarising the behaviours of 
large numbers of people over time then you may well find the cost of doing 
this in a single query too high when the numbers of people are extremely 
large. There is a cost to any computation and in that scenario you have 
deferred all these member-summarising costs to the very last moment. A 
better strategy for large-scale analysis of behaviours over time is to use 
a "pay-as-you-go" model where you update a per-member summary document at 
regular intervals with batches of their related records. This shifts the 
bulk of the computation cost from your single query to many smaller costs 
when writing data. You can then perform efficient aggs or scan/scroll 
operations on *member* documents with pre-summarised attributes e.g. 
totalSpend rather than deriving these properties on-the-fly from records 
with a shared member ID.
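
As a rough sketch of that pay-as-you-go pattern - the index/type names, the 
totalSpend field and the numbers are hypothetical, and it assumes the standard 
update API with dynamic scripting enabled - each batch of new records would 
increment the member's summary document rather than being re-aggregated at 
query time:

POST /members/member/12345/_update
{
  "script": "ctx._source.totalSpend += spend; ctx._source.recordCount += count",
  "params": {
    "spend": 250,
    "count": 3
  },
  "upsert": {
    "member_id": "12345",
    "totalSpend": 250,
    "recordCount": 3
  }
}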



On Tuesday, February 10, 2015 at 7:03:17 AM UTC, piyush goyal wrote:
>
> Well, my use case says I have tens of thousands of records for each 
> members. I want to do a simple terms aggs on member ID. If my count of 
> member ID remains same throughout .. good enough, if the number of members 
> keep on increasing, day by day ES has to keep more and more data into 
> memory to calculate the aggs. Does not sound very promising. What we do is 
> implementation of routing to put member specific data into a particular 
> shard. Why can't aggs be based on shard based calculations so that I am 
> safe from loading tons of data into memory.
>
> Any thoughts?
>
> On Sunday, 9 November 2014 22:58:12 UTC+5:30, pulkitsinghal wrote:
>>
>> Sharing a response I received from Igor Motov:
>>
>> "scroll works only to page results. paging aggs doesn't make sense since 
>>> aggs are executed on the entire result set. therefore if it managed to fit 
>>> into the memory you should just get it. paging will mean that you throw 
>>> away a lot of results that were already calculated. the only way to "page" 
>>> is by limiting the results that you are running aggs on. for example if 
>>> your data is sorted by date and you want to build histogram for the results 
>>> one date range at a time."
>>
>>
>>



Re: Large results sets and paging for Aggregations

2015-02-09 Thread piyush goyal
Well, in my use case I have tens of thousands of records for each member. 
I want to do a simple terms agg on member ID. If my count of member IDs 
remains the same throughout, good enough; but if the number of members keeps 
increasing, then day by day ES has to keep more and more data in memory to 
calculate the aggs. That does not sound very promising. What we do is 
implement routing to put member-specific data into a particular shard. Why 
can't aggs be based on shard-based calculations, so that I am safe from 
loading tons of data into memory?

Any thoughts?

On Sunday, 9 November 2014 22:58:12 UTC+5:30, pulkitsinghal wrote:
>
> Sharing a response I received from Igor Motov:
>
> "scroll works only to page results. paging aggs doesn't make sense since 
>> aggs are executed on the entire result set. therefore if it managed to fit 
>> into the memory you should just get it. paging will mean that you throw 
>> away a lot of results that were already calculated. the only way to "page" 
>> is by limiting the results that you are running aggs on. for example if 
>> your data is sorted by date and you want to build histogram for the results 
>> one date range at a time."
>
>
>



Re: Large results sets and paging for Aggregations

2014-11-09 Thread pulkitsinghal
Sharing a response I received from Igor Motov:

"scroll works only to page results. paging aggs doesn't make sense since 
> aggs are executed on the entire result set. therefore if it managed to fit 
> into the memory you should just get it. paging will mean that you throw 
> away a lot of results that were already calculated. the only way to "page" 
> is by limiting the results that you are running aggs on. for example if 
> your data is sorted by date and you want to build histogram for the results 
> one date range at a time."




Large results sets and paging for Aggregations

2014-11-09 Thread pulkitsinghal
Based on the reference docs I couldn't figure out what happens when the 
aggregation result set is very large. Does it get cut off? What is the 
upper bound? Does ES crash?

I see closed issues that indicate that pagination for aggregations will not 
be supported (https://github.com/elasticsearch/elasticsearch/issues/4915) 
BUT does that mean we can still get the entire result set without missing 
anything in the response?

Is the best way to do this via a scroll (non-scan) query, which will 
give the entire aggregation result set in the very first response, no 
matter how huge?

Thanks!
- Pulkit
