Re: Is ElasticSearch truly scalable for analytics?

2015-01-15 Thread Mark Harwood
> Regarding the accuracy of top-k lists This is perhaps an over-simplification - we deal with far more complex scenarios than a simple, single top-K list - we have whole aggregation trees with multiple layers of aggs: geo, time, nested, parent/child, percentiles, cardinalities etc etc whic

Re: Is ElasticSearch truly scalable for analytics?

2015-01-15 Thread AlexR
I would also be very interested in node-level reduction of shard results, not for scalability but for precision. I would like to have an option for a node to do complete aggregations on its shards so the results are exact rather than approximate. There are many use cases when corpus of data

Re: Is ElasticSearch truly scalable for analytics?

2015-01-15 Thread Nils Dijk
Adding a 'node reduce phase' to aggregations is something I'm very interested in, and also investigating for the project I'm currently working on. "If you introduce an extra reduction phase (for multiple shards on the same node) you introduce further potential for inaccuracies in the final res

Re: Is ElasticSearch truly scalable for analytics?

2015-01-14 Thread Elliott Bradshaw
How hard would it be to implement such a feature? Even if there are only a handful of use cases, it could prove very helpful in these. Particularly since very large trees are the ones that will struggle the most with bandwidth issues. On Wednesday, January 14, 2015 at 1:36:53 PM UTC-5, Ma

Re: Is ElasticSearch truly scalable for analytics?

2015-01-14 Thread Mark Harwood
> > Understood, but what about cases where size is set to unlimited? > Inaccuracies are not a concern in that case, correct? > Correct. But if we only consider the scenarios where the key sets are complete and accuracy is not put at risk by merging (i.e. there is no "top N" type filtering in
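A minimal sketch of the "complete key sets" case discussed here, assuming each shard returns every bucket (no "top N" truncation). The names `merge`, `node_a` and `node_b` are hypothetical and only for illustration; this is not Elasticsearch code. Under that assumption, summing per-shard counts on each node first and then summing the node results gives the same totals as a single reduce over all shards.

```python
from collections import Counter

def merge(partials):
    """Sum per-term document counts from several partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts for a mapping
    return total

# Hypothetical per-shard bucket counts, grouped by the node holding the shards.
node_a = [{"x": 3, "y": 1}, {"x": 2, "z": 5}]
node_b = [{"y": 4, "z": 1}]

# Single reduce on the coordinating node over all shard results.
one_stage = merge(node_a + node_b)

# Hypothetical extra node-level reduce, followed by the final reduce.
two_stage = merge([merge(node_a), merge(node_b)])

print(one_stage == two_stage)
```

The equality holds only because nothing is discarded between the stages; once any stage keeps just its top N buckets, the two paths can diverge.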

Re: Is ElasticSearch truly scalable for analytics?

2015-01-14 Thread Elliott Bradshaw
Mark, Understood, but what about cases where size is set to unlimited? Inaccuracies are not a concern in that case, correct? On Wednesday, January 14, 2015 at 1:09:48 PM UTC-5, Mark Harwood wrote: > > If you introduce an extra reduction phase (for multiple shards on the same > node) you introd

Re: Is ElasticSearch truly scalable for analytics?

2015-01-14 Thread Mark Harwood
If you introduce an extra reduction phase (for multiple shards on the same node) you introduce further potential for inaccuracies in the final results. Consider the role of 'size' and 'shard_size' in the "terms" aggregation [1] and the effects they have on accuracy. You'd arguably need a 'node_si
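A small simulation of the accuracy effect mentioned above, assuming a simplified model in which each shard returns only its local top `shard_size` terms and the coordinating node sums what it receives. The functions `shard_top` and `reduce_top` and the sample counts are hypothetical; real Elasticsearch behaviour has more moving parts.

```python
from collections import Counter

def shard_top(counts, shard_size):
    """Keep only the shard's top `shard_size` terms by local count."""
    return dict(Counter(counts).most_common(shard_size))

def reduce_top(partials, size):
    """Sum the partial counts and return the overall top `size` terms."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged.most_common(size)

# Hypothetical per-shard term counts.
shards = [
    {"a": 10, "b": 9, "c": 8},
    {"a": 10, "c": 9, "b": 1},
    {"b": 9, "c": 8, "a": 1},
]

# Exact answer: every shard reports every term.
exact = reduce_top(shards, size=1)

# Approximate answer: every shard reports only its top 2 terms.
approx = reduce_top([shard_top(s, shard_size=2) for s in shards], size=1)

print(exact, approx)
```

In this data the true top term is "c" with 25 documents, but with `shard_size=2` one shard drops its 8 hits for "c", so the merged result reports "a" with 20. An additional truncating reduce per node would add another place where counts can be dropped in the same way.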

Re: Is ElasticSearch truly scalable for analytics?

2015-01-14 Thread Elliott Bradshaw
Adrien, I get the feeling that you're a pretty heavy contributor to the aggregation module. In your experience, would a shard per cpu core strategy be an effective performance solution in a pure aggregation use case? If this could proportionally reduce the aggregation time, would a node loc

Re: Is ElasticSearch truly scalable for analytics?

2015-01-14 Thread Adrien Grand
On Wed, Jan 14, 2015 at 4:16 PM, Elliott Bradshaw wrote: > Just out of curiosity, are aggregations on multiple shards on a single > node executed serially or in parallel? In my experience, it appears that > they're executed serially (my CPU usage did not change when going from 1 > shard to 2 sha

Re: Is ElasticSearch truly scalable for analytics?

2015-01-14 Thread Elliott Bradshaw
Just out of curiosity, are aggregations on multiple shards on a single node executed serially or in parallel? In my experience, it appears that they're executed serially (my CPU usage did not change when going from 1 shard to 2 shards per node, but I didn't test this extensively). I'm interes

Re: Is ElasticSearch truly scalable for analytics?

2014-12-19 Thread joergpra...@gmail.com
Yes, I have 3 nodes and each index has 3 shards, on 32 core machines. Each shard contains many segments, which can be read and written concurrently by Lucene. Since Lucene 4, there have been massive improvements in that area. Maybe you have observed the effect that many shards on a node for a sin

Re: Is ElasticSearch truly scalable for analytics?

2014-12-19 Thread AlexR
Jorg, if you have a single large index and a cluster with 3 nodes, do you suggest creating just 3 shards even though each node has, say, 16 cores? With just three shards they will be very big and not much parallelism in computations will occur. Am I missing something? -- You received this messag

Re: Is ElasticSearch truly scalable for analytics?

2014-12-19 Thread joergpra...@gmail.com
A node does not send shard aggregations to the master, but to the client node. The basic idea of sharding in Elasticsearch is that shards spread over all the nodes, and the shard count matches or comes close to the maximum number of nodes. The shard distribution should be undistorted, that means,

Re: Is ElasticSearch truly scalable for analytics?

2014-12-18 Thread AlexR
Nick, I am not an expert in this area either, but with multi-core processors (24, 32, 48) it is not uncommon to have a fairly large number of shards on a node, so 30 shards is not out of the question. I assumed that ES aggregates shard results on a node prior to shipping them to a master but I do not kno

Re: Is ElasticSearch truly scalable for analytics?

2014-12-18 Thread Nikolas Everett
I think aggregating 32 shards on one node is a bit degenerate. I imagine it's more typical to aggregate across one or two shards per node. Don't get me wrong, you can totally have nodes store and query ~100 shards each without much trouble. If aggregating across a bunch of shards per node were a

Re: Is ElasticSearch truly scalable for analytics?

2014-12-18 Thread Yifan Wang
Correction "meaning reduce from 32 buckets to only one bucket, then the client node only has to process 50 buckets." should read "reduce from 32 of 50K buckets to only 1 being sent to the client node, then the client node only has to process 50 of 50K buckets". On Thursday, December 18, 2014

Re: Is ElasticSearch truly scalable for analytics?

2014-12-18 Thread Yifan Wang
Sorry, if I did not make it clear. For sure I know aggregation is done on the node for each shard, but here is the challenge. Say we set shard_size=50,000. ES will aggregate on each shard and create buckets for the matching documents, and then send top 50,000 buckets to the client node for Redu
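A back-of-the-envelope sketch of the bucket volume in this scenario. The figure of 50 nodes is an assumption inferred from the correction's "50 of 50K buckets", and the node-level reduce is hypothetical (the thread discusses it as a feature that does not exist); it is modelled here as truncating each node's merged result to `shard_size` buckets. The function name `buckets_at_client` is made up for illustration.

```python
def buckets_at_client(nodes, shards_per_node, shard_size, node_reduce=False):
    """Upper bound on buckets the client node receives for one terms agg."""
    if node_reduce:
        # Hypothetical: each node merges its shards and sends one list.
        per_node = shard_size
    else:
        # Every shard sends its own list of up to shard_size buckets.
        per_node = shards_per_node * shard_size
    return nodes * per_node

without_reduce = buckets_at_client(nodes=50, shards_per_node=32, shard_size=50_000)
with_reduce = buckets_at_client(nodes=50, shards_per_node=32, shard_size=50_000,
                                node_reduce=True)

print(without_reduce, with_reduce)
```

Under these assumptions the client node would merge up to 80,000,000 buckets without a node-level reduce and up to 2,500,000 with one, a factor equal to the number of shards per node.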

Re: Is ElasticSearch truly scalable for analytics?

2014-12-18 Thread Adrien Grand
+1 to what AlexR said. I think there is indeed a bad assumption that shards just forward data to the coordinating node, this is not the case. On Thu, Dec 18, 2014 at 1:09 AM, AlexR wrote: > > if you take a terms aggregation, the heavy lifting of the aggregation is > done on each node then aggrega

Re: Is ElasticSearch truly scalable for analytics?

2014-12-17 Thread AlexR
if you take a terms aggregation, the heavy lifting of the aggregation is done on each node then aggregated results are combined on the master node. So if you have thousands of nodes and very high cardinality nested aggs the merging part may become a bottleneck but cost of doing actual aggregatio

Re: Is ElasticSearch truly scalable for analytics?

2014-12-17 Thread Yifan Wang
I thought ES only does "Collect" on individual shards and "Reduce" on the Client Node (master, if you call it that); nothing is done at the data node level. On Tuesday, December 16, 2014 1:31:30 PM UTC-5, AlexR wrote: > > ES already doing aggregations on each node. it is not like it is shipping > row level qu

Re: Is ElasticSearch truly scalable for analytics?

2014-12-17 Thread Yifan Wang
How will the ranking work across clusters? On Tuesday, December 16, 2014 1:31:03 PM UTC-5, Elvar Böðvarsson wrote: > > Elasticsearch supports tribe nodes, so you can combine multiple clusters, > you then query the tribe node to access data on all of them. > > On Monday, December 15, 2014 9:52:45

Re: Is ElasticSearch truly scalable for analytics?

2014-12-17 Thread Yifan Wang
"node" is referring to "individual data node". Currently "Reduce" is only done once on the "Client Node", not on each individual data node. I am just wondering how scalable it is for analytics with current architecture. I would like to hear if anyone had any experience. On Tuesday, December 1

Re: Is ElasticSearch truly scalable for analytics?

2014-12-16 Thread Elvar Böðvarsson
Elasticsearch supports tribe nodes, so you can combine multiple clusters, you then query the tribe node to access data on all of them. On Monday, December 15, 2014 9:52:45 PM UTC, Yifan Wang wrote: > > If I understand correctly, ElasticSearch directly sends query to and > collects aggregated res

Is ElasticSearch truly scalable for analytics?

2014-12-16 Thread AlexR
ES is already doing aggregations on each node. It is not like it is shipping row-level query data back to the master for aggregation. In fact, one unpleasant effect of it is that aggregation results are not guaranteed to be precise due to the distributed nature of the aggregation for multibucket aggs orde

Re: Is ElasticSearch truly scalable for analytics?

2014-12-16 Thread Adrien Grand
What do you mean by "node level reduce"? On Mon, Dec 15, 2014 at 10:52 PM, Yifan Wang wrote: > > If I understand correctly, ElasticSearch directly sends query to and > collects aggregated results from each shard. With number of shards > increases, Reduce phase on the Client node will become overwh

Is ElasticSearch truly scalable for analytics?

2014-12-15 Thread Yifan Wang
If I understand correctly, ElasticSearch sends the query directly to each shard and collects aggregated results from each shard. As the number of shards increases, the Reduce phase on the Client node will become overwhelmed. One would assume, if ElasticSearch supported node-level aggregation, the "Reduce" becomes di