I think aggregating 32 shards on one node is a bit degenerate. I imagine its more typical to aggregate across one of two shards per node. Don't get me wrong, you can totally have nodes store and query ~100 shards each without much trouble. If aggregating across a bunch of shards per node were a common thing I think a node level reduce step might help. I'm certainly no expert in the reduce code though.
Nik On Thu, Dec 18, 2014 at 10:48 AM, Yifan Wang <yifan.wang....@gmail.com> wrote: > > Sorry, if I did not make it clear. For sure I know aggregation is done on > the node for each shard, but here is the challenge. Say we set > shard_size=50,000. ES will aggregate on each shard and create buckets for > the matching documents, and then send top 50,000 buckets to the client node > for Reduce. Say we have 50 data nodes, and each node has 32 shards. This > means we need to send 50,000 buckets from each shard to the client node for > final aggregation. First, this may add heavy traffic to the network (what > if we have 100 nodes?). And second, the client will need to aggregate on > received 50*32*50,000 buckets. Would this cause any congestion on the > client node? However if we can aggregate on the node first, meaning reduce > from 32 buckets to only one bucket, then the client node only has to > process 50 buckets. This would significanly reduce the network traffic and > improve the scalability, plus because we can set relatively larger > shard_size, it will improve the accuracy of the final results, which is > another key issue we are facing in distributed environment on aggregations. > > So my key question is about the scalability particularly on aggregations. > It seems to be a challenge in my experience. I just want to hear other > people's experience. On heavy analytics applications, this will be a key. > > Of course, I also understand, adding node level aggregation may impact the > overall performance. I am wondering if anyone has thought about or done > anything in this aspect. > > BTW, I like ElasticSearch, but want to hear from the community on some of > the key challenges. > > > > On Thursday, December 18, 2014 9:34:07 AM UTC-5, Adrien Grand wrote: >> >> +1 to what AlexR said. I think there is indeed a bad assumption that >> shards just forward data to the coordinating node, this is not the case. >> >> On Thu, Dec 18, 2014 at 1:09 AM, AlexR <royt...@gmail.com> wrote: >>> >>> if you take a terms aggregation, the heavy lifting of the aggregation is >>> done on each node then aggregated results are combined on the master node. >>> So if you have thousands of nodes and very high cardinality nested aggs the >>> merging part may become a bottleneck but cost of doing actual aggregation >>> in most cases is far higher than cost of merging results from reasonable >>> number of shards. So in practice I think it balances pretty well. Of course >>> you are not limited to one master to handle concurrent requests >>> >>> On Wednesday, December 17, 2014 4:12:44 PM UTC-5, Yifan Wang wrote: >>>> >>>> I thought ES only "Collect" on individual shards, and "Reduce" on >>>> Client Node (master if you call it), nothing is done at the data node >>>> level. >>>> >>>> On Tuesday, December 16, 2014 1:31:30 PM UTC-5, AlexR wrote: >>>>> >>>>> ES already doing aggregations on each node. it is not like it is >>>>> shipping row level query data back to master for aggregation. >>>>> In fact, one unpleasant effect of it is that aggregation results are >>>>> not guaranteed to be precise due to distributed nature of the aggregation >>>>> for multibucket aggs ordered by count such as terms >>>> >>>> -- >>> You received this message because you are subscribed to the Google >>> Groups "elasticsearch" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to elasticsearc...@googlegroups.com. >>> To view this discussion on the web visit https://groups.google.com/d/ >>> msgid/elasticsearch/61122d28-8f62-4ee2-b9e7-6fd99048ee8e% >>> 40googlegroups.com >>> <https://groups.google.com/d/msgid/elasticsearch/61122d28-8f62-4ee2-b9e7-6fd99048ee8e%40googlegroups.com?utm_medium=email&utm_source=footer> >>> . >>> >>> For more options, visit https://groups.google.com/d/optout. >>> >> >> >> -- >> Adrien Grand >> > -- > You received this message because you are subscribed to the Google Groups > "elasticsearch" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to elasticsearch+unsubscr...@googlegroups.com. > To view this discussion on the web visit > https://groups.google.com/d/msgid/elasticsearch/3519d982-390a-4a62-9d75-0bb5e1a4654b%40googlegroups.com > <https://groups.google.com/d/msgid/elasticsearch/3519d982-390a-4a62-9d75-0bb5e1a4654b%40googlegroups.com?utm_medium=email&utm_source=footer> > . > > For more options, visit https://groups.google.com/d/optout. > -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPmjWd1J0UfSzvAX07vo9HXxBOJYnzD4%3DBwfd9-JkiEJx43StQ%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.