I think aggregating 32 shards on one node is a bit degenerate.  I imagine
its more typical to aggregate across one of two shards per node.  Don't get
me wrong, you can totally have nodes store and query ~100 shards each
without much trouble.  If aggregating across a bunch of shards per node
were a common thing I think a node level reduce step might help.  I'm
certainly no expert in the reduce code though.

Nik

On Thu, Dec 18, 2014 at 10:48 AM, Yifan Wang <yifan.wang....@gmail.com>
wrote:
>
> Sorry, if I did not make it clear. For sure I know aggregation is done on
> the node for each shard, but here is the challenge. Say we set
> shard_size=50,000. ES will aggregate on each shard and create buckets for
> the matching documents, and then send top 50,000 buckets to the client node
> for Reduce. Say we have 50 data nodes, and each node has 32 shards. This
> means we need to send 50,000 buckets from each shard to the client node for
> final aggregation. First, this may add heavy traffic to the network (what
> if we have 100 nodes?). And second, the client will need to aggregate on
> received 50*32*50,000 buckets. Would this cause any congestion on the
> client node? However if we can aggregate on the node first, meaning reduce
> from 32 buckets to only one bucket, then the client node only has to
> process 50 buckets. This would significanly reduce the network traffic and
> improve the scalability, plus because we can set relatively larger
> shard_size, it will improve the accuracy of the final results, which is
> another key issue we are facing in distributed environment on aggregations.
>
> So my key question is about the scalability particularly on aggregations.
> It seems to be a challenge in my experience. I just want to hear other
> people's experience. On heavy analytics applications, this will be a key.
>
> Of course, I also understand, adding node level aggregation may impact the
> overall performance. I am wondering if anyone has thought about or done
> anything in this aspect.
>
> BTW, I like ElasticSearch, but want to hear from the community on some of
> the key challenges.
>
>
>
> On Thursday, December 18, 2014 9:34:07 AM UTC-5, Adrien Grand wrote:
>>
>> +1 to what AlexR said. I think there is indeed a bad assumption that
>> shards just forward data to the coordinating node, this is not the case.
>>
>> On Thu, Dec 18, 2014 at 1:09 AM, AlexR <royt...@gmail.com> wrote:
>>>
>>> if you take a terms aggregation, the heavy lifting of the aggregation is
>>> done on each node then aggregated results are combined on the master node.
>>> So if you have thousands of nodes and very high cardinality nested aggs the
>>> merging part may become a bottleneck but cost of doing actual aggregation
>>> in most cases is far higher than cost of merging results from reasonable
>>> number of shards. So in practice I think it balances pretty well. Of course
>>> you are not limited to one master to handle concurrent requests
>>>
>>> On Wednesday, December 17, 2014 4:12:44 PM UTC-5, Yifan Wang wrote:
>>>>
>>>> I thought ES only "Collect" on individual shards, and "Reduce" on
>>>> Client Node (master if you call it), nothing is done at the data node 
>>>> level.
>>>>
>>>> On Tuesday, December 16, 2014 1:31:30 PM UTC-5, AlexR wrote:
>>>>>
>>>>> ES already doing aggregations on each node. it is not like it is
>>>>> shipping row level query data back to master for aggregation.
>>>>> In fact, one unpleasant effect of it is that aggregation results are
>>>>> not guaranteed to be precise due to distributed nature of the aggregation
>>>>> for multibucket aggs ordered by count such as terms
>>>>
>>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to elasticsearc...@googlegroups.com.
>>> To view this discussion on the web visit https://groups.google.com/d/
>>> msgid/elasticsearch/61122d28-8f62-4ee2-b9e7-6fd99048ee8e%
>>> 40googlegroups.com
>>> <https://groups.google.com/d/msgid/elasticsearch/61122d28-8f62-4ee2-b9e7-6fd99048ee8e%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>> --
>> Adrien Grand
>>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/3519d982-390a-4a62-9d75-0bb5e1a4654b%40googlegroups.com
> <https://groups.google.com/d/msgid/elasticsearch/3519d982-390a-4a62-9d75-0bb5e1a4654b%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAPmjWd1J0UfSzvAX07vo9HXxBOJYnzD4%3DBwfd9-JkiEJx43StQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to