> Regarding the accuracy of top-k lists
This is perhaps an over-simplification - we deal with far more complex
scenarios than a simple, single top-K list - we have whole aggregation
trees with multiple layers of aggs: geo, time, nested, parent/child,
percentiles, cardinalities, etc., which
I would also be very interested in node-level shard-results reduction, not
for scalability but for precision reasons. I would like to have an option for a
node to do complete aggregations on its shards so the results are exact rather
than approximate. There are many use cases where the corpus of data
Adding a 'node reduce phase' to aggregations is something I'm very
interested in, and also investigating for the project I'm currently working
on.
"If you introduce an extra reduction phase (for multiple shards on the same
node) you introduce further potential for inaccuracies in the final
results."
How hard would it be to implement such a feature? Even if there are
only a handful of use cases, it could prove very helpful in these.
Particularly since very large trees are the ones that will struggle the
most with bandwidth issues.
On Wednesday, January 14, 2015 at 1:36:53 PM UTC-5, Ma
>
> Understood, but what about cases where size is set to unlimited?
> Inaccuracies are not a concern in that case, correct?
>
Correct. But if we only consider the scenarios where the key sets are
complete and accuracy is not put at risk by merging (i.e. there is no "top
N" type filtering in
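To make the "complete key sets merge exactly" point concrete, here is a toy Python sketch (purely illustrative, not Elasticsearch code): when every shard returns all of its keys, with no top-N cut, merging the partial counts reproduces the exact totals.

```python
from collections import Counter

# Toy sketch (not Elasticsearch internals): each shard returns its COMPLETE
# key set -- no "top N" filtering -- so merging partial counts is exact.
shard_counts = [
    Counter({"x": 5, "y": 1, "z": 3}),   # shard 0: all of its keys
    Counter({"x": 2, "z": 3, "w": 1}),   # shard 1: all of its keys
]

# Merge the complete partial counts.
merged = sum(shard_counts, Counter())

# Recomputing over every underlying document gives the same totals:
# with complete key sets there is no source of inaccuracy.
all_docs = [term for c in shard_counts for term in c.elements()]
assert merged == Counter(all_docs)
```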
Mark,
Understood, but what about cases where size is set to unlimited?
Inaccuracies are not a concern in that case, correct?
On Wednesday, January 14, 2015 at 1:09:48 PM UTC-5, Mark Harwood wrote:
>
> If you introduce an extra reduction phase (for multiple shards on the same
> node) you introduce further potential for inaccuracies in the final results.
If you introduce an extra reduction phase (for multiple shards on the same
node) you introduce further potential for inaccuracies in the final results.
Consider the role of 'size' and 'shard_size' in the "terms" aggregation [1]
and the effects they have on accuracy. You'd arguably need a 'node_size'
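For readers unfamiliar with the two knobs being discussed, here is a minimal sketch of a "terms" aggregation request body, written as a Python dict. The `node_size` key is hypothetical (it does not exist in Elasticsearch) and only marks where a node-level cutoff would slot in by analogy with `size` and `shard_size`.

```python
# Sketch of a "terms" aggregation request body (field name "tag" is an
# arbitrary example). size / shard_size are real Elasticsearch parameters;
# "node_size" is HYPOTHETICAL and shown only to illustrate the analogy.
query = {
    "aggs": {
        "top_tags": {
            "terms": {
                "field": "tag",
                "size": 10,         # buckets kept in the final (client-side) reduce
                "shard_size": 100,  # buckets each shard returns; larger = more accurate
                # "node_size": 500  # hypothetical: buckets per node after a node reduce
            }
        }
    }
}

# shard_size should be >= size, or the per-shard cut is tighter than the final one.
assert query["aggs"]["top_tags"]["terms"]["shard_size"] >= \
       query["aggs"]["top_tags"]["terms"]["size"]
```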
Adrien,
I get the feeling that you're a pretty heavy contributor to the aggregation
module. In your experience, would a shard-per-CPU-core strategy be an
effective performance solution in a pure aggregation use case? If this
could proportionally reduce the aggregation time, would a node loc
On Wed, Jan 14, 2015 at 4:16 PM, Elliott Bradshaw wrote:
> Just out of curiosity, are aggregations on multiple shards on a single
> node executed serially or in parallel? In my experience, it appears that
> they're executed serially (my CPU usage did not change when going from 1
> shard to 2 shards per node
Just out of curiosity, are aggregations on multiple shards on a single node
executed serially or in parallel? In my experience, it appears that
they're executed serially (my CPU usage did not change when going from 1
shard to 2 shards per node, but I didn't test this extensively). I'm
interested
Yes, I have 3 nodes and each index has 3 shards, on 32-core machines.
Each shard contains many segments, which can be read and written
concurrently by Lucene. Since Lucene 4, there have been massive
improvements in that area.
Maybe you have observed the effect that many shards on a node for a single index
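The serial-vs-parallel question above can be probed with a toy model: one task per shard, merged afterwards. A minimal Python sketch (purely illustrative; this is not how Elasticsearch actually schedules shard searches):

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

# Illustrative model only: per-shard "collect" runs as one task per shard,
# then a single "reduce" merges the per-shard partial counts.
shards = [
    ["a", "b", "a"],        # documents (terms) on shard 0
    ["b", "c", "c", "c"],   # documents (terms) on shard 1
]

def aggregate_shard(docs):
    # per-shard collect phase: count terms locally
    return Counter(docs)

# run the per-shard aggregations concurrently, one worker per shard
with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    partials = list(pool.map(aggregate_shard, shards))

# reduce phase: merge the per-shard counters
merged = sum(partials, Counter())
```

Whether real shard searches on one node run concurrently is exactly what the poster is asking; the sketch only shows what "parallel per-shard collect, then merge" would mean.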
Jorg, if you have a single large index and a cluster with 3 nodes, do you
suggest creating just 3 shards even though each node has, say, 16 cores? With
just three shards they will be very big, and not much parallelism in
computations will occur.
Am I missing something?
A node does not send shard aggregations to the master, but to the client
node.
The basic idea of sharding in Elasticsearch is that shards spread over all
the nodes, and the shard count matches or comes close to the maximum number
of nodes. The shard distribution should be undistorted, that means,
Nick,
I am not an expert in this area either, but with multi-core processors (24,
32, 48) it is not uncommon to have a fairly large number of shards on a node,
so 30 shards is not out of the question.
I assumed that ES aggregates shard results on a node prior to shipping them
to a master, but I do not know.
I think aggregating 32 shards on one node is a bit degenerate. I imagine
it's more typical to aggregate across one or two shards per node. Don't get
me wrong, you can totally have nodes store and query ~100 shards each
without much trouble. If aggregating across a bunch of shards per node
were a
Correction: "meaning reduce from 32 buckets to only one bucket, then the
client node only has to process 50 buckets" should read "reduce from 32 sets of
50K buckets to only 1 being sent to the client node, then the client node
only has to process 50 sets of 50K buckets".
On Thursday, December 18, 2014
Sorry if I did not make it clear. For sure I know aggregation is done on
the node for each shard, but here is the challenge. Say we set
shard_size=50,000. ES will aggregate on each shard and create buckets for
the matching documents, and then send the top 50,000 buckets to the client node
for Reduce.
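The scale of the problem being described is easy to put in numbers. A back-of-envelope sketch, using the figures from the posts above (32 shards per node, shard_size of 50,000; the node-reduce step is the hypothetical feature under discussion, not existing behavior):

```python
# Back-of-envelope for the scenario above: 32 shards per node,
# shard_size = 50,000 buckets returned per shard.
shards_per_node = 32
shard_size = 50_000

# Today: the client node receives one result set per shard from this node.
buckets_without_node_reduce = shards_per_node * shard_size   # 1,600,000 buckets

# With a hypothetical node reduce: the node merges its shards first and
# ships a single result set of shard_size buckets to the client node.
buckets_with_node_reduce = shard_size                        # 50,000 buckets
```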
+1 to what AlexR said. I think there is indeed a bad assumption that shards
just forward data to the coordinating node; this is not the case.
On Thu, Dec 18, 2014 at 1:09 AM, AlexR wrote:
>
> if you take a terms aggregation, the heavy lifting of the aggregation is
> done on each node, then aggregated results are combined on the master node.
if you take a terms aggregation, the heavy lifting of the aggregation is
done on each node, then aggregated results are combined on the master node.
So if you have thousands of nodes and very high-cardinality nested aggs, the
merging part may become a bottleneck, but the cost of doing the actual
aggregation
I thought ES only does "Collect" on individual shards and "Reduce" on the Client
Node (master, if you call it that); nothing is done at the data-node level.
On Tuesday, December 16, 2014 1:31:30 PM UTC-5, AlexR wrote:
>
> ES is already doing aggregations on each node. It is not like it is shipping
> row level query data back to master for aggregation.
How will the ranking work across clusters?
On Tuesday, December 16, 2014 1:31:03 PM UTC-5, Elvar Böðvarsson wrote:
>
> Elasticsearch supports tribe nodes, so you can combine multiple clusters,
> you then query the tribe node to access data on all of them.
>
> On Monday, December 15, 2014 9:52:45
"Node" is referring to "individual data node". Currently, "Reduce" is only
done once, on the "Client Node", not on each individual data node.
I am just wondering how scalable it is for analytics with the current
architecture. I would like to hear if anyone has had any experience.
On Tuesday, December 1
Elasticsearch supports tribe nodes, so you can combine multiple clusters,
you then query the tribe node to access data on all of them.
On Monday, December 15, 2014 9:52:45 PM UTC, Yifan Wang wrote:
>
> If I understand correctly, Elasticsearch directly sends the query to and
> collects aggregated results from each shard.
ES is already doing aggregations on each node. It is not like it is shipping
row level query data back to the master for aggregation.
In fact, one unpleasant effect of it is that aggregation results are not
guaranteed to be precise, due to the distributed nature of the aggregation, for
multi-bucket aggs ordered
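The imprecision being described here can be reproduced with a tiny example. A toy Python sketch (not Elasticsearch internals): each shard returns only its local top N, so a term that is moderately frequent on every shard can be undercounted or dropped entirely in the merged result.

```python
from collections import Counter

# Toy demonstration of why top-N terms aggregation over shards is inexact.
shard_counts = [
    Counter({"x": 5, "y": 4, "z": 3}),   # exact counts on shard 0
    Counter({"x": 5, "w": 4, "z": 3}),   # exact counts on shard 1
]

top_n = 2
# Each shard ships only its local top-2 buckets.
partials = [Counter(dict(c.most_common(top_n))) for c in shard_counts]
merged = sum(partials, Counter())

# The true totals, for comparison:
exact = sum(shard_counts, Counter())
# "z" has a true count of 6 but never makes either shard's local top 2,
# so it is missing from the merged result entirely.
```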
What you mean with "node level reduce"?
On Mon, Dec 15, 2014 at 10:52 PM, Yifan Wang wrote:
>
> If I understand correctly, Elasticsearch directly sends the query to and
> collects aggregated results from each shard. As the number of shards
> increases, the Reduce phase on the Client node will become overwhelmed.
If I understand correctly, Elasticsearch directly sends the query to and
collects aggregated results from each shard. As the number of shards
increases, the Reduce phase on the Client node will become overwhelmed.
One would assume that, if Elasticsearch supported node-level aggregation, the
"Reduce" would become distributed
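The two-level reduce being proposed can be sketched in a few lines of Python (illustrative only; node and term names are invented): each node first merges its local shards' partial results, then the client node merges one partial per node instead of one per shard.

```python
from collections import Counter

# Sketch of the proposed two-level reduce. Each node holds a list of
# per-shard partial counts (Counters); names/values are invented examples.
cluster = {
    "node1": [Counter({"a": 2}), Counter({"a": 1, "b": 3})],  # 2 shards
    "node2": [Counter({"b": 1}), Counter({"c": 4})],          # 2 shards
}

def node_reduce(shard_partials):
    # level 1: merge this node's shard results locally
    return sum(shard_partials, Counter())

# one node-reduce per node, spread across the cluster
per_node = [node_reduce(shards) for shards in cluster.values()]

# level 2: the client node merges per-node results -- 2 inputs, not 4
final = sum(per_node, Counter())
```

The point of the sketch: the client-side merge scales with the number of nodes rather than the number of shards, which is the scalability argument made in the post above.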