The reason is the shard addressing. While it might be possible to imagine
that a distributed system can query all nodes and ask "please give me the
next free allocation for this doc" this is very slow. It would allow shard
splitting by simply adding shards, iterating over nodes, but science has
proven this is not scalable - the time of indexing and search grows linear
(if not quadratic) with the number of nodes and Elasticsearch scalability
would be severely broken.

When a new doc arrives, a hash is computed to address the shard where it
should reside in O(1) time. This is fast and scalable over the number of
nodes. The price to pay is to set a constant number of shards as the
divisor for the formula to distribute the doc by the doc ID.

If the hash computation for adding shards was changed during cluster life,
all things will be messed up, ES would not find docs any longer. To remedy
this, you have to reindex and this recomputes the hashes for a higher
number of shards.

There are systems who allow shard growth in a limited way. By using a hash
ring algorithm, you could address e.g. only each third shard, and let the
next two places in the hash ring free, maybe for adding shards later, see
e.g. Project Voldemort
http://www.project-voldemort.com/voldemort/design.html

But this would mean two things: there would be a limited number of shard
splits (only a fixed number of shards can be reserved), and it would be up
to the user to mess around with the very sensible condition when to
activate the reserved shards and instruct ES to use a more difficult
algorithm to lookup docs. I can say there is no general condition of shard
overflow. See the shard allocation deciders, there are many good criteria,
but in the end, the user will get the headache. And this is because it is
an anti-pattern: the promise is that shard split will solve a problem, but
you get a new, bigger one.

Elasticsearch philosophy is that it should scale and should be easy to use.
Having no headaches around the shard count, once it is set, is easy.

Jörg



On Fri, Dec 12, 2014 at 9:31 AM, Mark Walkom <markwal...@gmail.com> wrote:
>
> As I found out yesterday, the problem with shard splitting in ES is that
> there algorithms that are used to round robin the data allocation during
> indexing that are based on a pre-determined hash. So if you suddenly alter
> the hash you may end up with shards that are overloaded compared to others.
>
> Maybe a dev can confirm/clarify this, but that was the understanding I
> took away for not doing shard splitting within ES.
>
> On 12 December 2014 at 02:25, Kevin Burton <burtona...@gmail.com> wrote:
>
>>
>> It seems to me that most people arguing this have trivial scalability
>> requirements.  Not trying to be rude by saying that btw.  But shard
>> splitting is really the only way to scale from 250GB indexed to 500TB
>> indexed.
>>
>> On Thursday, December 11, 2014 4:58:42 PM UTC-8, Andrew Selden wrote:
>>>
>>> I would agree that shard splitting is not the best approach. Much better
>>> to design for expansion by building in layers of indirection into your
>>> application through the techniques of over-sharding, index aliasing, and
>>> multiple indices.
>>>
>>
>> Yes.. all those are lame attempts at shard splitting.
>>
>> Over sharding is wasteful, it might not have a significant performance
>> impact in practice if you only have a few shards, but if you only add a few
>> you're not goign to be able to increase your capacity.
>>
>> Using multiple indexes is just a way to cheat by adding more shards in a
>> round about fashion, your runtime query performance will suffer because of
>> this.
>>
>>>
>>> First, you can allocate more shards than you need when you create the
>>> index. If you need 5 shards today, but think you might need 10 shards in 6
>>> months, then just create the index with 10 shards. We call this
>>> over-sharding. There really is no penalty to doing this within reason.
>>>
>>
>> So you've only given yourself a 2x overhead in capacity.  That's not very
>> elastic.
>>
>> With shard splitting you can go from 2x to 10x to 100x without any wasted
>> IO in over-indexing.
>>
>>
>>> Searching against 1 index with 50 shards is exactly the same as
>>> searching against 50 indices with one shard.
>>>
>>
>> No it's not.. if the shards are on the same box you're paying a
>> performance cost there.. If the indexes are small and fit in memory you
>> won't feel it that much.
>>
>>
>>> Second, as others have mentioned, use multiple indices and hide them
>>> away behind an alias.
>>>
>>
>> If each index has say 20 shards, and you have 10 indexes, then you have
>> 200 shards to run your query against.  This means queries that use all
>> these indexes will get slower and slower.
>>
>> The ideal situation is to shard split so that when you need more shards,
>> you just split.
>>
>> If ES had this feature today, no one would be arguing against shard
>> splitting. It would just be common practice.  The only issue is that ES
>> hasn't implemented it yet so it's not a viable solution.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to elasticsearch+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/c35d0b14-46a0-4baf-b06e-b5bb3ff43e5f%40googlegroups.com
>> <https://groups.google.com/d/msgid/elasticsearch/c35d0b14-46a0-4baf-b06e-b5bb3ff43e5f%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/CAEYi1X8WBPxKW1GeJvat5%3D7AmcExDwk9SW8%3DXMqjiH-S2nvd8Q%40mail.gmail.com
> <https://groups.google.com/d/msgid/elasticsearch/CAEYi1X8WBPxKW1GeJvat5%3D7AmcExDwk9SW8%3DXMqjiH-S2nvd8Q%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
>
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHkHHmLfYyKfumFarXWzj37makn7u6cu3MdcX3zCGGGnw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to