Re: Is ElasticSearch truly scalable for analytics?

AlexR Thu, 18 Dec 2014 10:23:06 -0800

Nick,

I am not an expert in this area either but with multi-core processors (24, 
32, 48) it is not uncommon to have fairly large number of shards on a node 
so 30 shards is not out of question
I assumed that ES aggregate shard results on a node prior to shipping them 
to a master but I do not know if it is true. It may very well be that node 
sends per shard aggregations to the master which case it it 32xShard 
ResultSize for our 32 shard node. reducing size of network packet by 32 
(even if it were just 8) and work for master by the same ratio is not a 
chump change. Somehow I think ES already doing it :-) but who knows


Another potential benefit of doing node aggregation is that on a single 
node when aggregating multiple shards ES could resolve potential errors by 
aggregating all buckets and re-calculating buckets not present in every 
shard at a fairly low cost while doing so across nodes is costly. On the 
other hand it may amplify the error across nodes do not know


On Thursday, December 18, 2014 11:26:37 AM UTC-5, Nikolas Everett wrote:
>
> I think aggregating 32 shards on one node is a bit degenerate.  I imagine 
> its more typical to aggregate across one of two shards per node.  Don't get 
> me wrong, you can totally have nodes store and query ~100 shards each 
> without much trouble.  If aggregating across a bunch of shards per node 
> were a common thing I think a node level reduce step might help.  I'm 
> certainly no expert in the reduce code though.
>
> Nik
>
> On Thu, Dec 18, 2014 at 10:48 AM, Yifan Wang <yifan.w...@gmail.com 
> <javascript:>> wrote:
>>
>> Sorry, if I did not make it clear. For sure I know aggregation is done on 
>> the node for each shard, but here is the challenge. Say we set 
>> shard_size=50,000. ES will aggregate on each shard and create buckets for 
>> the matching documents, and then send top 50,000 buckets to the client node 
>> for Reduce. Say we have 50 data nodes, and each node has 32 shards. This 
>> means we need to send 50,000 buckets from each shard to the client node for 
>> final aggregation. First, this may add heavy traffic to the network (what 
>> if we have 100 nodes?). And second, the client will need to aggregate on 
>> received 50*32*50,000 buckets. Would this cause any congestion on the 
>> client node? However if we can aggregate on the node first, meaning reduce 
>> from 32 buckets to only one bucket, then the client node only has to 
>> process 50 buckets. This would significanly reduce the network traffic and 
>> improve the scalability, plus because we can set relatively larger 
>> shard_size, it will improve the accuracy of the final results, which is 
>> another key issue we are facing in distributed environment on aggregations.
>>
>> So my key question is about the scalability particularly on aggregations. 
>> It seems to be a challenge in my experience. I just want to hear other 
>> people's experience. On heavy analytics applications, this will be a key.
>>
>> Of course, I also understand, adding node level aggregation may impact 
>> the overall performance. I am wondering if anyone has thought about or done 
>> anything in this aspect.
>>
>> BTW, I like ElasticSearch, but want to hear from the community on some of 
>> the key challenges.
>>
>>
>>
>> On Thursday, December 18, 2014 9:34:07 AM UTC-5, Adrien Grand wrote:
>>>
>>> +1 to what AlexR said. I think there is indeed a bad assumption that 
>>> shards just forward data to the coordinating node, this is not the case.
>>>
>>> On Thu, Dec 18, 2014 at 1:09 AM, AlexR <royt...@gmail.com> wrote:
>>>>
>>>> if you take a terms aggregation, the heavy lifting of the aggregation 
>>>> is done on each node then aggregated results are combined on the master 
>>>> node. So if you have thousands of nodes and very high cardinality nested 
>>>> aggs the merging part may become a bottleneck but cost of doing actual 
>>>> aggregation in most cases is far higher than cost of merging results from 
>>>> reasonable number of shards. So in practice I think it balances pretty 
>>>> well. Of course you are not limited to one master to handle concurrent 
>>>> requests
>>>>
>>>> On Wednesday, December 17, 2014 4:12:44 PM UTC-5, Yifan Wang wrote:
>>>>>
>>>>> I thought ES only "Collect" on individual shards, and "Reduce" on 
>>>>> Client Node (master if you call it), nothing is done at the data node 
>>>>> level.
>>>>>
>>>>> On Tuesday, December 16, 2014 1:31:30 PM UTC-5, AlexR wrote:
>>>>>>
>>>>>> ES already doing aggregations on each node. it is not like it is 
>>>>>> shipping row level query data back to master for aggregation. 
>>>>>> In fact, one unpleasant effect of it is that aggregation results are 
>>>>>> not guaranteed to be precise due to distributed nature of the 
>>>>>> aggregation 
>>>>>> for multibucket aggs ordered by count such as terms
>>>>>
>>>>>  -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to elasticsearc...@googlegroups.com.
>>>> To view this discussion on the web visit https://groups.google.com/d/
>>>> msgid/elasticsearch/61122d28-8f62-4ee2-b9e7-6fd99048ee8e%
>>>> 40googlegroups.com 
>>>> <https://groups.google.com/d/msgid/elasticsearch/61122d28-8f62-4ee2-b9e7-6fd99048ee8e%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
>>> -- 
>>> Adrien Grand
>>>  
>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to elasticsearc...@googlegroups.com <javascript:>.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/3519d982-390a-4a62-9d75-0bb5e1a4654b%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/elasticsearch/3519d982-390a-4a62-9d75-0bb5e1a4654b%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/3d289a72-0d2b-45ca-a77a-f61a7d70c5b6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Is ElasticSearch truly scalable for analytics?

Reply via email to