Re: what are the research papers that ES relies on?

Aaron Mefford Mon, 30 Mar 2015 14:43:38 -0700

I understand that if you do not have sufficient storage space, you cannot 
keep a replica on every node.  However, you are not limited to the size of a 
"usual hdd": a file system can span many hdds.  I am not suggesting this, 
but if you need every node to hold all of your data, you can do it.  Also, 
we have little info on your use case, and the most typical one seems to be 
log ingestion.  In that scenario you can treat the hot index, the most 
recent one, differently from the others.  You could set the number of 
replicas on the most recent index high enough to spread its data across the 
entire cluster, then reduce the number of replicas when a new index comes 
online.  You could also reindex historical data into fewer shards, improving 
performance and reducing additional maintenance tasks.
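
As a rough sketch of the replica adjustment described above, the snippet 
below builds the requests for the index settings API 
(PUT /<index>/_settings with index.number_of_replicas).  The index names, 
the node count of 20, and the helper replica_settings_request are made up 
for illustration, and nothing is actually sent to a cluster.

```python
import json


def replica_settings_request(index_name, number_of_replicas):
    """Return (method, path, body) for an update-index-settings call.

    Hypothetical helper: it only builds the request, it does not send it.
    """
    body = {"index": {"number_of_replicas": number_of_replicas}}
    return "PUT", "/" + index_name + "/_settings", json.dumps(body)


# Assumed cluster size. With 20 nodes, 19 replicas plus the primary give
# one copy of every shard of the hot index to every node.
NODE_COUNT = 20

# The most recent (hot) index gets a copy on every node ...
hot = replica_settings_request("logs-2015.03.30", NODE_COUNT - 1)
# ... and the previous index is dialed back once the new one is online.
cold = replica_settings_request("logs-2015.03.29", 1)
```

The same tuple could be handed to whatever HTTP client you already use; 
the replica counts here are only an example, not a recommendation.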


The reason I think you need to spend a bit more time reading is that the 
algorithm is very easy to find:
http://www.elastic.co/guide/en/elasticsearch/guide/master/routing-value.html

It is a very simple algorithm and a standard approach to the problem of 
sharding:

shard = hash(routing) % number_of_primary_shards


By default the routing value is the document id, though you can specify 
your own routing value.  Which hash function is used matters only in very 
odd cases.
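
To make the formula concrete, here is a minimal sketch in Python.  It uses 
zlib.crc32 purely as a stand-in hash (an assumption, not the hash ES uses), 
and pick_shard is a hypothetical name.  It shows the default of routing on 
the document id and the effect of supplying your own routing value.

```python
import zlib


def pick_shard(doc_id, number_of_primary_shards, routing=None):
    """shard = hash(routing) % number_of_primary_shards.

    The routing value defaults to the document id. crc32 is only a
    stand-in hash for illustration.
    """
    key = routing if routing is not None else doc_id
    return zlib.crc32(key.encode("utf-8")) % number_of_primary_shards


# Default routing: the shard depends only on the document id.
shard_a = pick_shard("doc-1", 5)

# Custom routing: documents sharing a routing value share a shard,
# whatever their ids are.
shard_b = pick_shard("doc-1", 5, routing="user-42")
shard_c = pick_shard("doc-2", 5, routing="user-42")
```

Because the shard count is part of the formula, changing the number of 
primary shards changes where existing documents would be looked up.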

A bit more research shows this from the source:

https://github.com/elastic/elasticsearch/commit/9ea25df64927172787f2ffa1049f9c7804a91053#diff-d1fcc8637b3800bf7da881b93e1de983

Current implementations seem to use the DJB2 hash, which is good but has 
some cases, such as 33 shards, where it behaves poorly.  In version 2.0 it 
appears they are moving to murmur3, which distributes more evenly across a 
greater set of use cases.  Note that with the default of 5 shards, DJB2 
performs ideally.
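
A small experiment illustrates why 33 is an awkward shard count for DJB2.  
This is a sketch using the textbook DJB2 recurrence on unbounded Python 
integers; the ES implementation may truncate to a fixed width, so the exact 
numbers need not carry over.  The numeric ids are made up.  Since every 
step multiplies by 33, the result modulo 33 depends only on the last 
character of the key.

```python
from collections import Counter


def djb2_unbounded(text):
    """Textbook DJB2 (h = h * 33 + c, seed 5381) on unbounded ints.

    Assumption: no fixed-width truncation is applied.
    """
    h = 5381
    for ch in text:
        h = h * 33 + ord(ch)
    return h


def shard_usage(number_of_shards, ids):
    """Count how many ids land on each shard."""
    return Counter(djb2_unbounded(i) % number_of_shards for i in ids)


ids = [str(n) for n in range(10000)]  # made-up numeric document ids
usage_33 = shard_usage(33, ids)
usage_5 = shard_usage(5, ids)

print(len(usage_33), "of 33 shards used")  # prints "10 of 33 shards used"
print(len(usage_5), "of 5 shards used")    # prints "5 of 5 shards used"
```

With ids that end in a digit, only 10 of the 33 shards ever receive data 
in this simplified model, while all 5 shards are used with the default.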


On Monday, March 30, 2015 at 10:04:08 AM UTC-6, MrBu wrote:
>
> Aaron, thanks for the reply.
>
> You can't distribute all of the documents if their total size is more than a 
> usual hdd. Also, that was just an example I gave. I am just trying to figure 
> out the magic that ES adds on top of what Lucene already has.
>
> On Monday, 30 March 2015 at 18:55:49 UTC+3, Aaron Mefford wrote:
>>
>> "Automagic" routing already happens by hashing the document id.  It 
>> sounds like you may have a situation where your document id is creating a 
>> hot spot.  If so, what you want is not automagic routing but more control 
>> over the routing, or a better document id.  You can code your own routing 
>> to create a more even distribution for your given keyset, but I think you 
>> would be better served by a better document key.  This isn't Mongo or 
>> HBase, where the document key rules the world.
>>
>> The other possible reason you are hot-spotting is index creation.  In a 
>> log ingestion scenario, the most recent index is almost always the hottest 
>> index.  That is where all indexing occurs and where all queries start.  If 
>> you have changed the 5-shard default and are creating only 1 shard, that 
>> shard will be hot in this scenario.
>>
>> Your comment on routing a shard to another shard does not make any 
>> sense.  You need to read a bit more on what shards are and how they 
>> work.  That said, if you have multiple replicas of a shard, those copies 
>> will automatically be distributed across your nodes.  In fact, if the 
>> number of copies of each shard (primary plus replicas) equals the number 
>> of nodes in the cluster, all data should automatically be on all nodes, any 
>> node will be able to query local data, and no node will be hot because of 
>> query volume.  However, indexing is still routed through the primary shard.
>>
>> As was mentioned previously, the code is open; however, it sounds like 
>> you are looking to go deep-water diving before learning to swim.
>> On Monday, March 30, 2015 at 8:57:51 AM UTC-6, MrBu wrote:
>>>
>>> Jörg,
>>>
>>> Thanks for the input. I have read many tutorials and guides (the official 
>>> one too). I just want to route in a more automagic way, like routing evenly 
>>> to the shards and maybe duplicating the most-used shard to other shards.
>>>
>>> On Monday, 30 March 2015 at 10:33:19 UTC+3, Jörg Prante wrote:
>>>>
>>>> Elasticsearch is open source, so reading (and using and modifying) the 
>>>> algorithms is possible. There is also a lot of introductory material 
>>>> available online, and I recommend "Elasticsearch: The Definitive Guide" 
>>>> if you want something on paper.
>>>>
>>>> If you create an index, ES creates shards for this index (by default 
>>>> 5), and the shards are assigned to different nodes, so indexing and search 
>>>> are automatically distributed over the participating nodes. ES keeps a map 
>>>> of shards in the cluster state, so every node is able to route a query or 
>>>> an index command. You don't need to manually route queries to shards.
>>>>
>>>> You can force ES to put all data on the 3rd node, and in that case you 
>>>> already know what you want... there is no surprise. ES follows the 
>>>> principle of least surprise.
>>>>
>>>> Jörg
>>>>
>>>> On Mon, Mar 30, 2015 at 5:07 AM, MrBu <metin....@gmail.com> wrote:
>>>>
>>>>> Other than Lucene's own research papers, what research papers or 
>>>>> special algorithms does Elastic use? I couldn't find a list of them in 
>>>>> the documentation.
>>>>>
>>>>> Are special algorithms used, and which ones are used where? For example, 
>>>>> what algorithm is used for load distribution, or is it just round 
>>>>> robin?
>>>>>
>>>>> I really want to get in deep with Elastic :)
>>>>>
>>>>> This way I could gain more knowledge. For example, suppose there are 20 
>>>>> nodes and, surprisingly (and somehow), only the data on the 3rd node is 
>>>>> searched all the time (say the popular documents somehow all ended up on 
>>>>> this node). Does Elastic spread this load over the whole cluster by 
>>>>> dividing this data among the other nodes? Or will it always use only the 
>>>>> 3rd node? There are tons of questions in my mind waiting to be answered. 
>>>>> The only possible way is to read the algorithms. It would help me a lot.
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/fa934662-61b3-42db-a97f-671ade563297%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.