Thanks Jörg, you've put it a lot better than I did! :)

On 12 December 2014 at 10:38, joergpra...@gmail.com <joergpra...@gmail.com> wrote:
> The reason is the shard addressing. While it might be possible to imagine that a distributed system could query all nodes and ask "please give me the next free allocation for this doc", this is very slow. It would allow shard splitting by simply adding shards and iterating over nodes, but science has proven this is not scalable: the time of indexing and search grows linearly (if not quadratically) with the number of nodes, and Elasticsearch scalability would be severely broken.
>
> When a new doc arrives, a hash is computed to address the shard where it should reside in O(1) time. This is fast and scales over the number of nodes. The price to pay is a constant number of shards as the divisor in the formula that distributes docs by doc ID.
>
> If the hash computation for adding shards were changed during the life of the cluster, everything would be messed up: ES would no longer find docs. To remedy this, you have to reindex, and this recomputes the hashes for a higher number of shards.
>
> There are systems that allow shard growth in a limited way. Using a hash ring algorithm, you could address e.g. only every third shard and leave the next two places in the hash ring free, maybe for adding shards later; see e.g. Project Voldemort: http://www.project-voldemort.com/voldemort/design.html
>
> But this would mean two things: there would be a limited number of shard splits (only a fixed number of shards can be reserved), and it would be up to the user to mess around with the very sensitive question of when to activate the reserved shards and instruct ES to use a more complicated algorithm to look up docs. I can say there is no general condition of shard overflow. See the shard allocation deciders: there are many good criteria, but in the end the user will get the headache. And this is because it is an anti-pattern: the promise is that shard splitting will solve a problem, but you get a new, bigger one.
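The fixed-divisor routing Jörg describes can be sketched in a few lines. This is a minimal illustration, not Elasticsearch's actual routing code (ES hashes the routing value with its own hash function); the md5-based hash and the doc IDs here are stand-ins chosen for determinism:

```python
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    # Hash the doc ID and take the modulo: O(1), no cluster-wide lookup.
    # (A stand-in hash; Elasticsearch uses its own hash of the routing value.)
    h = int.from_bytes(hashlib.md5(doc_id.encode()).digest()[:8], "big")
    return h % num_shards

doc_ids = [f"doc-{i}" for i in range(10_000)]

# With a fixed divisor of 5, every doc has a stable home shard,
# so both indexing and GET-by-ID are a single hash computation.
before = {d: shard_for(d, 5) for d in doc_ids}

# Change the divisor to 10 and most docs now route somewhere else,
# so lookups against the old layout would miss -- hence the need to
# reindex rather than simply add shards.
after = {d: shard_for(d, 10) for d in doc_ids}
moved = sum(1 for d in doc_ids if before[d] != after[d])
print(f"{moved / len(doc_ids):.0%} of docs changed shards")
```

With a well-distributed hash, roughly half the documents move when the divisor doubles, and nearly all of them move for an arbitrary change of divisor.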
> Elasticsearch philosophy is that it should scale and should be easy to use. Having no headaches about the shard count, once it is set, is easy.
>
> Jörg
>
> On Fri, Dec 12, 2014 at 9:31 AM, Mark Walkom <markwal...@gmail.com> wrote:
>>
>> As I found out yesterday, the problem with shard splitting in ES is that the algorithms used to round-robin the data allocation during indexing are based on a predetermined hash. So if you suddenly alter the hash you may end up with shards that are overloaded compared to others.
>>
>> Maybe a dev can confirm/clarify this, but that was the understanding I took away for why shard splitting isn't done within ES.
>>
>> On 12 December 2014 at 02:25, Kevin Burton <burtona...@gmail.com> wrote:
>>>
>>> It seems to me that most people arguing this have trivial scalability requirements. Not trying to be rude by saying that, btw. But shard splitting is really the only way to scale from 250GB indexed to 500TB indexed.
>>>
>>> On Thursday, December 11, 2014 4:58:42 PM UTC-8, Andrew Selden wrote:
>>>>
>>>> I would agree that shard splitting is not the best approach. Much better to design for expansion by building layers of indirection into your application through the techniques of over-sharding, index aliasing, and multiple indices.
>>>>
>>> Yes... all those are lame attempts at shard splitting.
>>>
>>> Over-sharding is wasteful. It might not have a significant performance impact in practice if you only have a few shards, but if you only add a few you're not going to be able to increase your capacity.
>>>
>>> Using multiple indexes is just a way to cheat by adding more shards in a roundabout fashion; your runtime query performance will suffer because of it.
>>>
>>>> First, you can allocate more shards than you need when you create the index.
>>>> If you need 5 shards today, but think you might need 10 shards in 6 months, then just create the index with 10 shards. We call this over-sharding. There really is no penalty to doing this within reason.
>>>>
>>> So you've only given yourself a 2x overhead in capacity. That's not very elastic.
>>>
>>> With shard splitting you can go from 2x to 10x to 100x without any wasted IO from over-indexing.
>>>
>>>> Searching against 1 index with 50 shards is exactly the same as searching against 50 indices with one shard.
>>>>
>>> No it's not... if the shards are on the same box you're paying a performance cost there. If the indexes are small and fit in memory you won't feel it that much.
>>>
>>>> Second, as others have mentioned, use multiple indices and hide them away behind an alias.
>>>>
>>> If each index has, say, 20 shards, and you have 10 indexes, then you have 200 shards to run your query against. This means queries that use all these indexes will get slower and slower.
>>>
>>> The ideal situation is shard splitting: when you need more shards, you just split.
>>>
>>> If ES had this feature today, no one would be arguing against shard splitting. It would just be common practice. The only issue is that ES hasn't implemented it yet, so it's not a viable solution.
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c35d0b14-46a0-4baf-b06e-b5bb3ff43e5f%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
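The "multiple indices behind an alias" pattern being debated above can be sketched as follows. This is a hedged, pure-Python illustration of the mechanics, not Elasticsearch's API; the index names, shard counts, and helper functions are made up for the example:

```python
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    # Same fixed-divisor routing as above, applied within one index.
    # (A stand-in hash for illustration, not Elasticsearch's own.)
    h = int.from_bytes(hashlib.md5(doc_id.encode()).digest()[:8], "big")
    return h % num_shards

# A hypothetical alias fanning out over two indices: capacity grew by
# adding a second index with more shards, never by resharding the first.
alias = [("logs-2014.11", 5), ("logs-2014.12", 10)]  # (index name, shards)

def route_write(doc_id: str):
    # New docs go only to the newest index, so each index's shard count
    # (its hash divisor) stays fixed for its whole life.
    index, shards = alias[-1]
    return index, shard_for(doc_id, shards)

def shards_searched() -> int:
    # A query through the alias hits every shard of every index, which
    # is the fan-out cost raised above: it grows with the index count.
    return sum(shards for _, shards in alias)

print(route_write("doc-42"))
print("shards searched per query:", shards_searched())
```

This shows both sides of the argument: writes scale by adding indices without ever re-hashing old docs, while reads pay for every shard behind the alias.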