That's what I was looking for (murmur3). I really wondered what they used, and I was going to ask about murmur3 as well. But as I see it, things are going pretty awesome.
Thanks!

On Tuesday, March 31, 2015 at 00:42:45 UTC+3, Aaron Mefford wrote:
>
> I understand that if you do not have sufficient storage space, then you
> cannot maintain a replica on every node. However, you are not limited to
> the size of a single HDD: you can have a file system that spans many HDDs.
> I am not suggesting this, but if you have a situation where you need to
> distribute all of your data, then you can. Also, as we have little info on
> your use case, and the most typical one seems to be log ingestion: in that
> scenario you can treat the hot index, the most recent one, differently
> from the others. You could set the number of replicas on your most recent
> index so that its data is spread across the entire cluster, and then, as a
> new index comes online, reduce the number of replicas. You could also
> reindex historical data into fewer shards, improving performance and
> reducing additional maintenance tasks.
>
> The reason I think you need to spend a bit more time reading is that the
> algorithm is very easy to find:
>
> http://www.elastic.co/guide/en/elasticsearch/guide/master/routing-value.html
>
> It is a very simple algorithm and a standard approach to the issue of
> sharding:
>
> shard = hash(routing) % number_of_primary_shards
>
> The routing value is the document ID by default, though you can specify
> your own routing value. The specifics of which hash is used are not that
> important except in very odd cases.
>
> A bit more research turns this up in the source:
>
> https://github.com/elastic/elasticsearch/commit/9ea25df64927172787f2ffa1049f9c7804a91053#diff-d1fcc8637b3800bf7da881b93e1de983
>
> Current implementations seem to use the DJB2 hash, which is good but has
> some cases, such as 33 shards, where it behaves poorly. In version 2.0 it
> appears they are moving to murmur3, which is a more consistent hash across
> a greater set of use cases. Note that with the default of 5 shards, DJB2
> performs ideally.
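To make that formula concrete, here is a toy Python reimplementation of DJB2-based routing. This is a sketch for intuition only, not Elasticsearch's actual code: ES hashes the routing string internally, and real implementations truncate to 32 bits, while this sketch uses Python's unbounded integers. Keeping the arithmetic exact shows why 33 is a pathological shard count for DJB2: each step multiplies by 33, so the hash modulo 33 depends only on the last byte of the routing value.

```python
def djb2_hash(routing: str) -> int:
    """Classic DJB2: start at 5381, then hash = hash * 33 + byte."""
    h = 5381
    for byte in routing.encode("utf-8"):
        h = h * 33 + byte
    return h

def shard_for(routing: str, number_of_primary_shards: int = 5) -> int:
    # shard = hash(routing) % number_of_primary_shards
    return djb2_hash(routing) % number_of_primary_shards

# With 33 primary shards, every routing value ending in the same byte
# lands on the same shard, because modulo 33 only the last byte of the
# DJB2 sum survives. This is the "behaves poorly" case mentioned above.
for routing in ("user-1", "order-1", "anything-1"):
    print(routing, "->", shard_for(routing, 33))  # all three agree
```

Note that the same routing value always maps to the same shard, which is why the number of primary shards cannot be changed after index creation without reindexing.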
>
> On Monday, March 30, 2015 at 10:04:08 AM UTC-6, MrBu wrote:
>>
>> Aaron, thanks for the reply.
>>
>> You can't distribute all of the documents if their total size is larger
>> than a usual HDD. Also, that was just an example I gave. I am simply
>> trying to figure out the "magical" things ES does beyond what Lucene
>> provides on its own.
>>
>> On Monday, March 30, 2015 at 18:55:49 UTC+3, Aaron Mefford wrote:
>>>
>>> "Automagic" routing already happens by hashing the document ID. It
>>> sounds like you may have a situation where your document ID is creating
>>> a hot spot. If that is the case, what you want is not automagic routing
>>> but more control over the routing, or a better document ID. You have the
>>> ability to code your own routing and create a more even distribution for
>>> your given keyset, but I think you would be better served by a better
>>> document key; this isn't Mongo or HBase, where the document key rules
>>> the world.
>>>
>>> The other possible reason you are hot-spotting is index creation. In a
>>> log ingestion scenario, the most recent index is almost always the
>>> hottest index: that is where all indexing is occurring, and that is
>>> where all queries start. If you have tweaked the 5-shard norm and are
>>> only creating 1 shard, that shard will be hot in this scenario.
>>>
>>> Your comment on routing a shard to another shard does not make any
>>> sense; you need to read a bit more on what shards are and how they work.
>>> That said, if you have multiple replicas of a shard, then those replicas
>>> will automatically be distributed across all of your nodes. In fact, if
>>> the number of replicas matches the number of nodes in the cluster, you
>>> will automatically have all data on all nodes; any node will be able to
>>> query local data, and no node will be hot because of query volume.
>>> However, indexing is still routed to the primary shard.
>>>
>>> As was mentioned previously, the code is open; however, it sounds like
>>> you are looking to go deep-water diving before learning to swim.
>>>
>>> On Monday, March 30, 2015 at 8:57:51 AM UTC-6, MrBu wrote:
>>>>
>>>> Jörg,
>>>>
>>>> Thanks for the input. I have read many tutorials and guides (the
>>>> official one too). I just want to re-route in a more automagic way,
>>>> such as routing evenly across the shards and maybe duplicating the
>>>> most-used shard onto other nodes.
>>>>
>>>> On Monday, March 30, 2015 at 10:33:19 UTC+3, Jörg Prante wrote:
>>>>>
>>>>> Elasticsearch is open source, so reading (and using and modifying)
>>>>> the algorithms is possible. There is also a lot of introductory
>>>>> material available online, and I recommend "Elasticsearch - The
>>>>> Definitive Guide" if you want it on paper.
>>>>>
>>>>> If you create an index, ES creates shards for this index (by default
>>>>> 5), and different nodes each receive one of these shards, so indexing
>>>>> and search are automatically distributed over the participating
>>>>> nodes. ES keeps a map of shards in the cluster state, so every node
>>>>> is able to route a query or an index command. You don't need to
>>>>> manually route queries to shards.
>>>>>
>>>>> You can force ES to put all data on the 3rd node, but in that case
>>>>> you already know what you want... there is no surprise. ES follows
>>>>> the principle of least surprise.
>>>>>
>>>>> Jörg
>>>>>
>>>>> On Mon, Mar 30, 2015 at 5:07 AM, MrBu <metin....@gmail.com> wrote:
>>>>>
>>>>>> Other than Lucene's own research papers, what research papers or
>>>>>> special algorithms does Elastic use? I couldn't find a list of them
>>>>>> in the documentation.
>>>>>>
>>>>>> Which special algorithms are used, and where? For example, what
>>>>>> algorithm is used for load distribution, or is it just round robin?
>>>>>>
>>>>>> I really want to get deep into Elastic :)
>>>>>>
>>>>>> That way I would have more knowledge. For example, suppose there are
>>>>>> 20 nodes, and surprisingly (and somehow) only the data on the 3rd
>>>>>> node is being searched all the time (say the popular documents have
>>>>>> somehow gathered only on this node). Will Elastic spread this load
>>>>>> over the whole cluster by dividing the data among other nodes, or
>>>>>> will it always use only the 3rd node? There are tons of questions in
>>>>>> my mind waiting to be answered. The only possible way is to read the
>>>>>> algorithms; it would help me a lot.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "elasticsearch" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to elasticsearc...@googlegroups.com.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/elasticsearch/75907f69-38be-49fb-bf69-2f5dbf83cc45%40googlegroups.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.