That's what I was looking for (murmur3). I really wondered what they used, and I was going to ask about murmur3 as well. But as I see it, things are going pretty awesome.
Thanks!

On Tuesday, March 31, 2015 at 00:42:45 UTC+3, Aaron Mefford wrote:
>
> I understand that if you do not have sufficient storage space, then you
> cannot maintain a replica on every node. However, you are not limited to
> the size of a single HDD: you can have a file system that spans many HDDs.
> I am not suggesting this, but if you have a situation where you need to
> distribute all of your data, then you can. Also, as we have little info on
> your use case, and the most typical one seems to be log ingestion: in that
> scenario you can treat the hot index, the most recent one, differently
> from the others. You could set the number of replicas on your most recent
> index so that its data is spread across the entire cluster, and then, as a
> new index comes online, reduce the number of replicas. You could also
> reindex historical data into fewer shards, improving performance and
> reducing additional maintenance tasks.
>
> The reason I think you need to spend a bit more time reading is that the
> algorithm is very easy to find:
>
> http://www.elastic.co/guide/en/elasticsearch/guide/master/routing-value.html
>
> It is a very simple algorithm and a standard approach to the issue of
> sharding:
>
> shard = hash(routing) % number_of_primary_shards
>
> The routing value is the document ID by default, though you can specify
> your own routing value. The specifics of which hash is used are not that
> important except in very odd cases.
>
> A bit more research turns this up in the source:
>
> https://github.com/elastic/elasticsearch/commit/9ea25df64927172787f2ffa1049f9c7804a91053#diff-d1fcc8637b3800bf7da881b93e1de983
>
> Current implementations seem to use the DJB2 hash, which is good but has
> some cases, such as 33 shards, where it behaves poorly. In version 2.0 it
> appears they are moving to murmur3, which is a more consistent hash across
> a greater set of use cases. Note that with the default of 5 shards, DJB2
> performs ideally.
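To make that formula concrete, here is a toy Python reimplementation of DJB2-based routing. This is a sketch for intuition only, not Elasticsearch's actual code: ES hashes the routing string internally, and real implementations truncate to 32 bits, while this sketch uses Python's unbounded integers. Keeping the arithmetic exact shows why 33 is a pathological shard count for DJB2: each step multiplies by 33, so the hash modulo 33 depends only on the last byte of the routing value.

```python
def djb2_hash(routing: str) -> int:
    """Classic DJB2: start at 5381, then hash = hash * 33 + byte."""
    h = 5381
    for byte in routing.encode("utf-8"):
        h = h * 33 + byte
    return h

def shard_for(routing: str, number_of_primary_shards: int = 5) -> int:
    # shard = hash(routing) % number_of_primary_shards
    return djb2_hash(routing) % number_of_primary_shards

# With 33 primary shards, every routing value ending in the same byte
# lands on the same shard, because modulo 33 only the last byte of the
# DJB2 sum survives. This is the "behaves poorly" case mentioned above.
for routing in ("user-1", "order-1", "anything-1"):
    print(routing, "->", shard_for(routing, 33))  # all three agree
```

Note that the same routing value always maps to the same shard, which is why the number of primary shards cannot be changed after index creation without reindexing.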
>
> On Monday, March 30, 2015 at 10:04:08 AM UTC-6, MrBu wrote:
>>
>> Aaron, thanks for the reply.
>>
>> You can't distribute all of the documents if their total size is larger
>> than a usual HDD. Also, that was just an example I gave. I am simply
>> trying to figure out the "magical" things ES does beyond what Lucene
>> provides on its own.
>>
>> On Monday, March 30, 2015 at 18:55:49 UTC+3, Aaron Mefford wrote:
>>>
>>> "Automagic" routing already happens by hashing the document ID. It
>>> sounds like you may have a situation where your document ID is creating
>>> a hot spot. If that is the case, what you want is not automagic routing
>>> but more control over the routing, or a better document ID. You have the
>>> ability to code your own routing and create a more even distribution for
>>> your given keyset, but I think you would be better served by a better
>>> document key; this isn't Mongo or HBase, where the document key rules
>>> the world.
>>>
>>> The other possible reason you are hot-spotting is index creation. In a
>>> log ingestion scenario, the most recent index is almost always the
>>> hottest index: that is where all indexing is occurring, and that is
>>> where all queries start. If you have tweaked the 5-shard norm and are
>>> only creating 1 shard, that shard will be hot in this scenario.
>>>
>>> Your comment on routing a shard to another shard does not make any
>>> sense; you need to read a bit more on what shards are and how they work.
>>> That said, if you have multiple replicas of a shard, then those replicas
>>> will automatically be distributed across all of your nodes. In fact, if
>>> the number of replicas matches the number of nodes in the cluster, you
>>> will automatically have all data on all nodes; any node will be able to
>>> query local data, and no node will be hot because of query volume.
>>> However, indexing is still routed to the primary shard.
>>>
>>> As was mentioned previously, the code is open; however, it sounds like
>>> you are looking to go deep-water diving before learning to swim.
>>>
>>> On Monday, March 30, 2015 at 8:57:51 AM UTC-6, MrBu wrote:
>>>>
>>>> Jörg,
>>>>
>>>> Thanks for the input. I have read many tutorials and guides (the
>>>> official one too). I just want to re-route in a more automagic way,
>>>> such as routing evenly across the shards and maybe duplicating the
>>>> most-used shard onto other nodes.
>>>>
>>>> On Monday, March 30, 2015 at 10:33:19 UTC+3, Jörg Prante wrote:
>>>>>
>>>>> Elasticsearch is open source, so reading (and using and modifying)
>>>>> the algorithms is possible. There is also a lot of introductory
>>>>> material available online, and I recommend "Elasticsearch - The
>>>>> Definitive Guide" if you want it on paper.
>>>>>
>>>>> If you create an index, ES creates shards for this index (by default
>>>>> 5), and different nodes each receive one of these shards, so indexing
>>>>> and search are automatically distributed over the participating
>>>>> nodes. ES keeps a map of shards in the cluster state, so every node
>>>>> is able to route a query or an index command. You don't need to
>>>>> manually route queries to shards.
>>>>>
>>>>> You can force ES to put all data on the 3rd node, but in that case
>>>>> you already know what you want... there is no surprise. ES follows
>>>>> the principle of least surprise.
>>>>>
>>>>> Jörg
>>>>>
>>>>> On Mon, Mar 30, 2015 at 5:07 AM, MrBu <metin....@gmail.com> wrote:
>>>>>
>>>>>> Other than Lucene's own research papers, what research papers or
>>>>>> special algorithms does Elastic use? I couldn't find a list of them
>>>>>> in the documentation.
>>>>>>
>>>>>> Which special algorithms are used, and where? For example, what
>>>>>> algorithm is used for load distribution, or is it just round robin?
>>>>>>
>>>>>> I really want to get deep into Elastic :)
>>>>>>
>>>>>> That way I would have more knowledge. For example, suppose there are
>>>>>> 20 nodes, and surprisingly (and somehow) only the data on the 3rd
>>>>>> node is being searched all the time (say the popular documents have
>>>>>> somehow gathered only on this node). Will Elastic spread this load
>>>>>> over the whole cluster by dividing the data among other nodes, or
>>>>>> will it always use only the 3rd node? There are tons of questions in
>>>>>> my mind waiting to be answered. The only possible way is to read the
>>>>>> algorithms; it would help me a lot.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "elasticsearch" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to elasticsearc...@googlegroups.com.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/elasticsearch/75907f69-38be-49fb-bf69-2f5dbf83cc45%40googlegroups.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.