Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
On Thu, Sep 9, 2010 at 11:51 AM, Grant Ingersoll wrote:
> On Sep 6, 2010, at 10:41 AM, Yonik Seeley wrote:
>> For SolrCloud, I don't think we'll end up using consistent hashing -
>> we don't need it (although some of the concepts may still be useful).
>
> Can you elaborate on why we don't need it?

I guess because I can't think of a reason why we would need it - hence it seems we don't? Random node placement and virtual nodes would seem to be a disadvantage for us, since we aren't just a key-value store and we care about more than one key at a time. Larger partitions, in conjunction with user-based/directed partitioning, will allow things like querying a specific user's email box (for example) by hitting a single node (or very few nodes) in the complete cluster.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
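The user-directed partitioning Yonik mentions can be sketched in a few lines: route by a hash of the *user*, not the document, so all of one user's documents land on one shard and a per-user query only needs to hit that shard. This is an illustration, not Solr's API; `shard_for_user` and the shard count are made-up names.

```python
import hashlib

NUM_SHARDS = 8

def shard_for_user(user_id: str) -> int:
    """Route all of a user's documents to one shard so that
    per-user queries touch a single node."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every doc for the same user lands on the same shard, so a query
# scoped to that user can be sent to just that one shard.
docs = [("alice", "msg-1"), ("alice", "msg-2"), ("bob", "msg-9")]
placements = {doc: shard_for_user(user) for user, doc in docs}
```

Contrast this with a ring of virtual nodes, where a user's documents would be scattered across the cluster and every query would fan out to all shards.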
Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
On Sep 6, 2010, at 10:41 AM, Yonik Seeley wrote:
> For SolrCloud, I don't think we'll end up using consistent hashing -
> we don't need it (although some of the concepts may still be useful).

Can you elaborate on why we don't need it?
Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
I must add something to my last post: when I said it could be used together with techniques like consistent hashing, I meant it could be used at indexing time, since I assumed that the number of shards does not change frequently and therefore an ODV case becomes relatively infrequent. Furthermore, the overhead of searching for and removing those ODV documents is relatively low.
--
View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1434364.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
What if we do not care about the version of a document at index time? For distributed search, we currently aggregate documents based on their uniqueKey. But what if we additionally decided on uniqueKey plus indexingDate, so that we only aggregate the last-indexed version of a document?

The concept could look like this: when Solr aggregates the documents for a response, it could record which shard responded with an older version of document x. A crawler can then crawl through our SolrCloud, asking each shard whether it noticed a "shard y has an older version of doc x" case. The crawler aggregates that information. After it finishes crawling, it sends delete-by-query requests to those shards which hold older versions of documents than they should. For better understanding, I will call these stored document versions that are older than the newest version ODVs (Old Document Versions).

So, what can happen: before the crawler can visit shard A - which noticed that shard y stores an ODV of doc x - shard A can go down. That's okay, because either another shard noticed the same thing, or shard A will be available later on. If that information is stored on disk, it will also survive. If it was stored only in RAM, the information is lost... however, you could replicate that information over more than one shard, right? :-)

Another case: shard y can go down - so someone has to take care of storing the noticed ODV information, so that the document can be deleted when shard y comes back.

Pros:
- You can do something like consistent hashing in connection with a concept where each node has to care for its neighbour nodes, because only the neighbour nodes can store ODVs.
- Using the described concept, you can run nightly batches looking for ODVs in the neighbour nodes.
- ODVs will be found at request time, so we can avoid returning ODVs instead of newer versions.

Cons:
- We are wasting disk space.
- This works only for smaller clusters, not for large ones where the number of machines changes very frequently.

... this is just another idea - and it is very, very lazy. I must emphasize that I assume neighbour machines do not go down very frequently. Of course, it is not a question of whether a machine crashes, but when it crashes - but I assume that the same server does not crash every hour. :-)

Thoughts?

Kind regards

Andrzej Bialecki wrote:
> On 2010-09-06 16:41, Yonik Seeley wrote:
>> [...consistent hashing...]
>> Right. You still need code to handle migration.
>>
>> Consistent hashing is a way for everyone to be able to agree on the
>> mapping, and for the mapping to change incrementally. i.e. you add a
>> node and it only changes the docid->node mapping of a limited percent
>> of the mappings, rather than changing the mappings of potentially
>> everything, as a simple MOD would do.
>
> Another strategy to avoid excessive reindexing is to keep splitting the
> largest shards, and then your mapping becomes a regular MOD plus a list
> of these additional splits. Really, there's an infinite number of ways
> you could implement this...
>
>> For SolrCloud, I don't think we'll end up using consistent hashing -
>> we don't need it (although some of the concepts may still be useful).
>
> I imagine there could be situations where a simple MOD won't do ;) so I
> think it would be good to hide this strategy behind an
> interface/abstract class. It costs nothing, and gives you flexibility in
> how you implement this mapping.
>
> --
> Best regards,
> Andrzej Bialecki

--
View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1434329.html
Sent from the Solr - User mailing list archive at Nabble.com.
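The ODV bookkeeping described above might look roughly like this - a note store filled at query time and drained by the cleanup crawler. All names here are illustrative (this is not a Solr API), and the timestamps stand in for whatever "indexingDate" the shards would report:

```python
from collections import defaultdict

# Notes recorded at query time: doc_id -> shards that returned
# a version older than the newest one seen for that doc.
odv_notes = defaultdict(set)

def record_response(doc_id, shard_timestamps):
    """shard_timestamps maps shard -> indexing timestamp for doc_id.
    Every shard that is not the newest gets an ODV note."""
    newest = max(shard_timestamps.values())
    for shard, ts in shard_timestamps.items():
        if ts < newest:
            odv_notes[doc_id].add(shard)

def cleanup_requests():
    """What the cleanup crawler would send: one delete-by-query
    (shard, doc_id) pair per noted ODV."""
    return [(shard, doc_id)
            for doc_id, shards in sorted(odv_notes.items())
            for shard in sorted(shards)]

# Shard y holds an older copy of doc-x than shard z does:
record_response("doc-x", {"shard-y": 100, "shard-z": 250})
```

As the post notes, `odv_notes` would have to be persisted (or replicated) to survive a node going down; kept only in RAM, the notes are lost with the node.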
RE: Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
Oh, THAT MOD! LOL! I thought it was some search engine specific acronym.

Dennis Gearon

Signature Warning
EARTH has a Right To Life, otherwise we all die.
Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Mon, 9/6/10, Markus Jelsma wrote:
> The remainder of an arithmetic division
>
> http://en.wikipedia.org/wiki/Modulo_operation
[snip]
RE: Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
The remainder of an arithmetic division

http://en.wikipedia.org/wiki/Modulo_operation

-----Original message-----
From: Dennis Gearon
Sent: Mon 06-09-2010 22:04
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

> What is a 'simple MOD'?
>
> Dennis Gearon
[snip]
Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
On 2010-09-06 22:03, Dennis Gearon wrote:
> What is a 'simple MOD'?

md5(docId) % numShards

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
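Spelled out as code, the "simple MOD" is just this (a minimal sketch; here the modulus is taken over the full 128-bit md5 digest):

```python
import hashlib

def simple_mod_shard(doc_id: str, num_shards: int) -> int:
    """The 'simple MOD' mapping: md5(docId) % numShards."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Deterministic: the same id always maps to the same shard,
# as long as num_shards never changes.
shard = simple_mod_shard("doc-42", 5)
```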
Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
What is a 'simple MOD'?

Dennis Gearon

Signature Warning
EARTH has a Right To Life, otherwise we all die.
Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Mon, 9/6/10, Andrzej Bialecki wrote:
> I imagine there could be situations where a simple MOD won't do ;) so I
> think it would be good to hide this strategy behind an
> interface/abstract class. It costs nothing, and gives you flexibility in
> how you implement this mapping.
[snip]
Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
On 2010-09-06 16:41, Yonik Seeley wrote:
> On Mon, Sep 6, 2010 at 10:18 AM, MitchK wrote:
> [...consistent hashing...]
>> But it doesn't solve the whole problem, correct me if I am wrong: if
>> you add a new server, let's call it IP3-1, and IP3-1 is nearer to the
>> current resource X, then doc x will be indexed at IP3-1 - even if IP2-1
>> holds the older version. Am I right?
>
> Right. You still need code to handle migration.
>
> Consistent hashing is a way for everyone to be able to agree on the
> mapping, and for the mapping to change incrementally. i.e. you add a
> node and it only changes the docid->node mapping of a limited percent
> of the mappings, rather than changing the mappings of potentially
> everything, as a simple MOD would do.

Another strategy to avoid excessive reindexing is to keep splitting the largest shards, and then your mapping becomes a regular MOD plus a list of these additional splits. Really, there's an infinite number of ways you could implement this...

> For SolrCloud, I don't think we'll end up using consistent hashing -
> we don't need it (although some of the concepts may still be useful).

I imagine there could be situations where a simple MOD won't do ;) so I think it would be good to hide this strategy behind an interface/abstract class. It costs nothing, and gives you flexibility in how you implement this mapping.

--
Best regards,
Andrzej Bialecki
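The two suggestions above - hide the mapping behind an interface, and implement "regular MOD plus a list of splits" as one strategy - might be sketched like this. The class and method names are made up for illustration; the split rule (a second hash bit picks the half) is just one plausible choice:

```python
import hashlib
from abc import ABC, abstractmethod

class ShardMapper(ABC):
    """Pluggable doc-to-shard mapping, as suggested in the thread."""
    @abstractmethod
    def shard_for(self, doc_id: str) -> str: ...

class ModPlusSplits(ShardMapper):
    """Regular MOD, plus a list of base shards that were split in two."""
    def __init__(self, num_shards: int, splits: set):
        self.num_shards = num_shards
        self.splits = splits

    def _hash(self, doc_id: str) -> int:
        return int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)

    def shard_for(self, doc_id: str) -> str:
        h = self._hash(doc_id)
        base = h % self.num_shards
        if base in self.splits:
            # a further hash bit decides which half of the split shard
            half = (h // self.num_shards) % 2
            return "shard%d%s" % (base, "a" if half == 0 else "b")
        return "shard%d" % base

mapper: ShardMapper = ModPlusSplits(num_shards=4, splits={2})
```

Only documents in a split shard get remapped, which is the point: splitting the largest shard reindexes that shard alone, not the whole collection.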
Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
On Mon, Sep 6, 2010 at 10:18 AM, MitchK wrote:
[...consistent hashing...]
> But it doesn't solve the whole problem, correct me if I am wrong: if
> you add a new server, let's call it IP3-1, and IP3-1 is nearer to the
> current resource X, then doc x will be indexed at IP3-1 - even if IP2-1
> holds the older version.
> Am I right?

Right. You still need code to handle migration.

Consistent hashing is a way for everyone to be able to agree on the mapping, and for the mapping to change incrementally. i.e. you add a node and it only changes the docid->node mapping of a limited percent of the mappings, rather than changing the mappings of potentially everything, as a simple MOD would do.

For SolrCloud, I don't think we'll end up using consistent hashing - we don't need it (although some of the concepts may still be useful).

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
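The "potentially everything" remark is easy to check empirically: under `md5(docId) % n`, growing from 4 to 5 shards moves roughly 4/5 of all documents, because a doc keeps its shard only when its hash gives the same remainder mod 4 and mod 5. A quick sketch:

```python
import hashlib

def mod_shard(doc_id: str, n: int) -> int:
    return int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16) % n

ids = ["doc-%d" % i for i in range(10_000)]
moved = sum(1 for d in ids if mod_shard(d, 4) != mod_shard(d, 5))
fraction = moved / len(ids)  # roughly 0.8: most docs change shards
```

With consistent hashing, adding one node instead remaps only the documents falling into the new node's arc of the ring, i.e. on the order of 1/n of them.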
Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
Andrzej, thank you for sharing your experiences.

> b) use consistent hashing as the mapping schema to assign documents to a
> changing number of shards. There are many explanations of this schema on
> the net, here's one that is very simple:

Boom. With the given explanation, I understand it as the following: you can use Hadoop and run some map-reduce jobs per CSV file. On the reducer side, the reducer has to look up the id of the current doc and create a hash of it. Then it looks inside a SortedSet, picks the next-best server, and checks in a map whether this server has free capacity or not. That's cool.

But it doesn't solve the whole problem, correct me if I am wrong: if you add a new server, let's call it IP3-1, and IP3-1 is nearer to the current resource X, then doc x will be indexed at IP3-1 - even if IP2-1 holds the older version. Am I right?

Thank you for sharing the paper. I will look for more like this.

> In this case the lack of good docs and user-level API can be blamed on
> the fact that this functionality is still under heavy development.

I do not only mean documentation at the user level but also inside a class, if there is some complicated stuff going on.

- Mitch
--
View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1426728.html
Sent from the Solr - User mailing list archive at Nabble.com.
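The ring lookup described above - hash the id, then pick the "next-best" server from a sorted set - can be sketched with a sorted list and bisect (server names are invented; a production ring would also place many virtual nodes per server to spread load):

```python
import bisect
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class Ring:
    """Toy consistent-hash ring: one point per server, no virtual nodes."""
    def __init__(self, servers):
        # sorted (hash, server) pairs play the role of the SortedSet
        self.points = sorted((h(s), s) for s in servers)
        self.hashes = [p[0] for p in self.points]

    def server_for(self, doc_id: str) -> str:
        # first point clockwise from the doc's hash, wrapping around
        i = bisect.bisect(self.hashes, h(doc_id)) % len(self.points)
        return self.points[i][1]

ring = Ring(["IP1-1", "IP2-1"])
owner_before = ring.server_for("doc-x")
ring = Ring(["IP1-1", "IP2-1", "IP3-1"])  # add a server...
owner_after = ring.server_for("doc-x")
# ...doc-x may now map to IP3-1 even though another server holds the
# old version - exactly the migration problem raised above.
```

Note the property the thread keeps circling around: adding one server moves only the keys that fall into its arc; every other key keeps its old server, so the mapping changes incrementally but stale copies on the old owner still need cleanup.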
SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
(I adjusted the subject to better reflect the content of this discussion.)

On 2010-09-06 14:37, MitchK wrote:
> Thanks for your detailed feedback Andrzej! From what I understood,
> SOLR-1301 becomes obsolete once Solr becomes cloud-ready, right?

Who knows... I certainly didn't expect this code to become so popular ;) so even after SolrCloud becomes available it's likely that some people will continue to use it. But SolrCloud should solve the original problem that I tried to solve with this patch.

>> Looking into the future: eventually, when SolrCloud arrives we will be
>> able to index straight to a SolrCloud cluster, assigning documents to
>> shards through a hashing schema (e.g. 'md5(docId) % numShards')
>
> Hm, let's say md5(docId) would produce a value of 17 (it won't, but
> let's assume it). With a constant number of shards, the doc will be
> published to the same shard again and again, i.e.:
> 17 % numShards(5) = 2 -> doc 17 will be indexed at shard 2.
> A few days later the rest of the cluster is available, and now it is
> 17 % numShards(10) = 7 -> doc 17 will be indexed at shard 7... and what
> about the older version at shard 2? I am no expert when it comes to
> cloud computing and the other stuff.

There are several possible solutions to this, and they all boil down to the way you assign documents to shards... Keep in mind that nodes (physical machines) can manage several shards, and the aggregate collection of all unique shards across all nodes forms your whole index - so there's also a related, but different, issue of how to assign shards to nodes. Here are some scenarios for solving the doc-to-shard mapping problem (note: I removed the issue of replication from the picture to make this clearer):

a) keep the number of shards constant no matter how large the cluster is. The mapping schema is then as simple as the one above. In this scenario you create relatively small shards, so that a single physical node can manage dozens of shards (each shard using one core, or perhaps a more lightweight structure like a MultiReader). This is also known as micro-sharding. As the number of documents grows, the size of each shard will grow until you have to reduce the number of shards per node, ultimately ending up with a single shard per node. After that, if your collection continues to grow, you have to modify your hashing schema to split some shards (and reindex some shards, or use an index splitter tool).

b) use consistent hashing as the mapping schema to assign documents to a changing number of shards. There are many explanations of this schema on the net; here's one that is very simple:

http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/

In this case, you can grow/shrink the number of shards (and their size) as you see fit, incurring only a small reindexing cost.

> If you can point me to one or another reference where I can read about
> it, it would help me a lot, since at the moment I only want to
> understand how it works.

http://wiki.apache.org/solr/SolrCloud

> ... The problem with Solr is its lack of documentation in some classes
> and the lack of encapsulating some very complex things into different
> methods or extra classes. Of course, this is because it costs some extra
> time to do so, but it makes understanding and modifying things very
> complicated if you do not understand what's going on from a theoretical
> point of view.

In this case the lack of good docs and user-level API can be blamed on the fact that this functionality is still under heavy development.

> Since the cloud feature will be complex, a lack of documentation and no
> understanding of the theory behind the code will make contributing back
> very, very complicated.
For now, yes, it's an issue - though as soon as SolrCloud gets committed I'm sure people will follow up with user-level convenience components that will make it easier.

--
Best regards,
Andrzej Bialecki
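Scenario (a) - micro-sharding with a fixed shard count redistributed over a growing set of nodes - can be sketched as follows. The key property: adding nodes changes only the shard-to-node placement, never the doc-to-shard mapping, so growth costs shard copies rather than reindexing. Names and the round-robin placement are illustrative:

```python
import hashlib

NUM_SHARDS = 64  # fixed for the lifetime of the collection

def doc_to_shard(doc_id: str) -> int:
    # never changes as the cluster grows
    return int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16) % NUM_SHARDS

def shard_to_node(shard: int, nodes: list) -> str:
    # simple round-robin placement; only this mapping moves when nodes join
    return nodes[shard % len(nodes)]

shard = doc_to_shard("doc-x")
node_before = shard_to_node(shard, ["node1", "node2"])
node_after = shard_to_node(shard, ["node1", "node2", "node3", "node4"])
# The shard may land on a different node, but doc-x stays in the same
# shard - whole shards are copied between nodes, no document is rehashed.
```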