Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-09 Thread Yonik Seeley
On Thu, Sep 9, 2010 at 11:51 AM, Grant Ingersoll  wrote:
> On Sep 6, 2010, at 10:41 AM, Yonik Seeley wrote:
>> For SolrCloud, I don't think we'll end up using consistent hashing -
>> we don't need it (although some of the concepts may still be useful).
>
> Can you elaborate on why we don't need it?

I guess because I can't think of a reason why we would need it - hence
it seems we don't?

Random node placement and virtual nodes would seem to be a
disadvantage for us since we aren't just a key-value store and care
about more than one key at a time.  Larger partitions in conjunction
with user-based/directed partitioning will allow doing things like
querying a specific user's email box (for example) by hitting a single
(or very few) nodes in the complete cluster.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-09 Thread Grant Ingersoll

On Sep 6, 2010, at 10:41 AM, Yonik Seeley wrote:

> 
> For SolrCloud, I don't think we'll end up using consistent hashing -
> we don't need it (although some of the concepts may still be useful).

Can you elaborate on why we don't need it?


Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-07 Thread MitchK

I must add something to my last post:

When saying it could be used together with techniques like consistent
hashing, I mean it could be used at indexing time for indexing documents,
since I assumed that the number of shards does not change frequently and
therefore an ODV-case becomes relatively infrequent. Furthermore the
overhead of searching for and removing those ODV-documents is relatively
low. 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1434364.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-07 Thread MitchK

What if we do not care about the version of a document at index-time?

When it comes to distributed search, we currently decide aggregating
documents based on their uniqueKey. But what would be, if we decide
additionally decide on uniqueKey plus indexingDate, so that we only
aggregate the last indexed version of a document?

The concept could look like this:
When Solr aggregated the documents for a response, it could store what shard
responsed an older version of document x. 

Now a crawler can crawl through our SolrCloud and asking each shard whether
it noticed something like "shard y got an older version of doc x"-case.
The crawler aggregates those information. After he finished crawling, he
sends delete-by-query-requests to those shards which have older versions of
documents than they should have. 

I will call these "stores document versions that are older than the newest
version" ODV (Old Document Versions) for better understanding. 

So, what can happen:
Before the crawler can visit shard A - who noticed that shard y stores an
ODV of doc x - shard A can go down. That's okay, because either another
shard noticed the same, or shard A will be available later on. If those
information will we stored at HD, it will also be available. If it was
stored in RAM the information is lost... however, you could replicate those
information over more than one shard, right? :-)

Another case:
Shard y can go down - so someone has to care for storing the noticed
ODV-information, so that one can delete the document when Shard Y comes
back.

Pros:
- You can do something like consistent hashing in connection with a concept
where each node has to care for its neighbour-nodes. This is because only
the neighbour nodes can store ODVs.

- using the described concept, you can do nightly batches, looking for ODVs
in the neigbour-nodes.

- ODVs will be found at requesting time, so we can avoid to response ODVs
over newer versions.

Cons:
- We are wasting disc space.

- This works only for smaller clusters, not for large ones where the number
of machines changes very frequently

... this is just another idea - and it is very very lazy.

I must emphasize, that I assume that neighbour-machines do not go down very
frequently. Of course, it is not a question whether a machine crashes, but
when it crashes - but I assume that the same server does not crash every
hour. :-)

Thoughts?

Kind regards


Andrzej Bialecki wrote:
> 
> On 2010-09-06 16:41, Yonik Seeley wrote:
>> On Mon, Sep 6, 2010 at 10:18 AM, MitchK  wrote:
>> [...consistent hashing...]
>>> But it doesn't solve the problem at all, correct me if I am wrong, but:
>>> If
>>> you add a new server, let's call him IP3-1, and IP3-1 is nearer to the
>>> current ressource X, than doc x will be indexed at IP3-1 - even if IP2-1
>>> holds the older version.
>>> Am I right?
>>
>> Right.  You still need code to handle migration.
>>
>> Consistent hashing is a way for everyone to be able to agree on the
>> mapping, and for the mapping to change incrementally.  i.e. you add a
>> node and it only changes the docid->node mapping of a limited percent
>> of the mappings, rather than changing the mappings of potentially
>> everything, as a simple MOD would do.
> 
> Another strategy to avoid excessive reindexing is to keep splitting the 
> largest shards, and then your mapping becomes a regular MOD plus a list 
> of these additional splits. Really, there's an infinite number of ways 
> you could implement this...
> 
>>
>> For SolrCloud, I don't think we'll end up using consistent hashing -
>> we don't need it (although some of the concepts may still be useful).
> 
> I imagine there could be situations where a simple MOD won't do ;) so I 
> think it would be good to hide this strategy behind an 
> interface/abstract class. It costs nothing, and gives you flexibility in 
> how you implement this mapping.
> 
> -- 
> Best regards,
> Andrzej Bialecki <><
>   ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1434329.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-06 Thread Dennis Gearon
Oh, THAT MOD! LOL!

I thought it was some search engine specific acronym.
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Mon, 9/6/10, Markus Jelsma  wrote:

> From: Markus Jelsma 
> Subject: RE: Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
> To: solr-user@lucene.apache.org
> Date: Monday, September 6, 2010, 2:53 PM
> The remainder of an arithmetic
> division
> 
> http://en.wikipedia.org/wiki/Modulo_operation

> -Original message-
> From: Dennis Gearon 
> Sent: Mon 06-09-2010 22:04
> To: solr-user@lucene.apache.org;
> 
> Subject: Re: SolrCloud distributed indexing (Re: anyone use
> hadoop+solr?)
> 
> What is a 'simple MOD'?
> 
> Dennis Gearon
> 
> Signature Warning
> 
> EARTH has a Right To Life,
>  otherwise we all die.
> 
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php

> 
> 
> --- On Mon, 9/6/10, Andrzej Bialecki 
> wrote:
> 
> > From: Andrzej Bialecki 
> > Subject: Re: SolrCloud distributed indexing (Re:
> anyone use hadoop+solr?)
> > To: solr-user@lucene.apache.org
> > Date: Monday, September 6, 2010, 11:30 AM
> > On 2010-09-06 16:41, Yonik Seeley
> > wrote:
> > > On Mon, Sep 6, 2010 at 10:18 AM, MitchK 
> > wrote:
> > > [...consistent hashing...]
> > >> But it doesn't solve the problem at all,
> correct
> > me if I am wrong, but: If
> > >> you add a new server, let's call him IP3-1,
> and
> > IP3-1 is nearer to the
> > >> current ressource X, than doc x will be
> indexed at
> > IP3-1 - even if IP2-1
> > >> holds the older version.
> > >> Am I right?
> > > 
> > > Right.  You still need code to handle
> migration.
> > > 
> > > Consistent hashing is a way for everyone to be
> able to
> > agree on the
> > > mapping, and for the mapping to change
> > incrementally.  i.e. you add a
> > > node and it only changes the docid->node
> mapping of
> > a limited percent
> > > of the mappings, rather than changing the
> mappings of
> > potentially
> > > everything, as a simple MOD would do.
> > 
> > Another strategy to avoid excessive reindexing is to
> keep
> > splitting the largest shards, and then your mapping
> becomes
> > a regular MOD plus a list of these additional splits.
> > Really, there's an infinite number of ways you could
> > implement this...
> > 
> > > 
> > > For SolrCloud, I don't think we'll end up using
> > consistent hashing -
> > > we don't need it (although some of the concepts
> may
> > still be useful).
> > 
> > I imagine there could be situations where a simple
> MOD
> > won't do ;) so I think it would be good to hide this
> > strategy behind an interface/abstract class. It costs
> > nothing, and gives you flexibility in how you
> implement this
> > mapping.
> > 
> > -- Best regards,
> > Andrzej Bialecki     <><
> >  ___. ___ ___ ___ _
> > _   __
> > [__ || __|__/|__||\/|  Information Retrieval,
> Semantic
> > Web
> > ___|||__||  \|  ||  |  Embedded Unix,
> > System Integration
> > http://www.sigram.com  Contact: info at sigram dot
> > com
> > 
> > 
>


RE: Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-06 Thread Markus Jelsma
The remainder of an arithmetic division

http://en.wikipedia.org/wiki/Modulo_operation
-Original message-
From: Dennis Gearon 
Sent: Mon 06-09-2010 22:04
To: solr-user@lucene.apache.org; 
Subject: Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

What is a 'simple MOD'?

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
 otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Mon, 9/6/10, Andrzej Bialecki  wrote:

> From: Andrzej Bialecki 
> Subject: Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
> To: solr-user@lucene.apache.org
> Date: Monday, September 6, 2010, 11:30 AM
> On 2010-09-06 16:41, Yonik Seeley
> wrote:
> > On Mon, Sep 6, 2010 at 10:18 AM, MitchK 
> wrote:
> > [...consistent hashing...]
> >> But it doesn't solve the problem at all, correct
> me if I am wrong, but: If
> >> you add a new server, let's call him IP3-1, and
> IP3-1 is nearer to the
> >> current ressource X, than doc x will be indexed at
> IP3-1 - even if IP2-1
> >> holds the older version.
> >> Am I right?
> > 
> > Right.  You still need code to handle migration.
> > 
> > Consistent hashing is a way for everyone to be able to
> agree on the
> > mapping, and for the mapping to change
> incrementally.  i.e. you add a
> > node and it only changes the docid->node mapping of
> a limited percent
> > of the mappings, rather than changing the mappings of
> potentially
> > everything, as a simple MOD would do.
> 
> Another strategy to avoid excessive reindexing is to keep
> splitting the largest shards, and then your mapping becomes
> a regular MOD plus a list of these additional splits.
> Really, there's an infinite number of ways you could
> implement this...
> 
> > 
> > For SolrCloud, I don't think we'll end up using
> consistent hashing -
> > we don't need it (although some of the concepts may
> still be useful).
> 
> I imagine there could be situations where a simple MOD
> won't do ;) so I think it would be good to hide this
> strategy behind an interface/abstract class. It costs
> nothing, and gives you flexibility in how you implement this
> mapping.
> 
> -- Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _
> _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic
> Web
> ___|||__||  \|  ||  |  Embedded Unix,
> System Integration
> http://www.sigram.com  Contact: info at sigram dot
> com
> 
> 


Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-06 Thread Andrzej Bialecki

On 2010-09-06 22:03, Dennis Gearon wrote:

What is a 'simple MOD'?


md5(docId) % numShards

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-06 Thread Dennis Gearon
What is a 'simple MOD'?

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Mon, 9/6/10, Andrzej Bialecki  wrote:

> From: Andrzej Bialecki 
> Subject: Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
> To: solr-user@lucene.apache.org
> Date: Monday, September 6, 2010, 11:30 AM
> On 2010-09-06 16:41, Yonik Seeley
> wrote:
> > On Mon, Sep 6, 2010 at 10:18 AM, MitchK 
> wrote:
> > [...consistent hashing...]
> >> But it doesn't solve the problem at all, correct
> me if I am wrong, but: If
> >> you add a new server, let's call him IP3-1, and
> IP3-1 is nearer to the
> >> current ressource X, than doc x will be indexed at
> IP3-1 - even if IP2-1
> >> holds the older version.
> >> Am I right?
> > 
> > Right.  You still need code to handle migration.
> > 
> > Consistent hashing is a way for everyone to be able to
> agree on the
> > mapping, and for the mapping to change
> incrementally.  i.e. you add a
> > node and it only changes the docid->node mapping of
> a limited percent
> > of the mappings, rather than changing the mappings of
> potentially
> > everything, as a simple MOD would do.
> 
> Another strategy to avoid excessive reindexing is to keep
> splitting the largest shards, and then your mapping becomes
> a regular MOD plus a list of these additional splits.
> Really, there's an infinite number of ways you could
> implement this...
> 
> > 
> > For SolrCloud, I don't think we'll end up using
> consistent hashing -
> > we don't need it (although some of the concepts may
> still be useful).
> 
> I imagine there could be situations where a simple MOD
> won't do ;) so I think it would be good to hide this
> strategy behind an interface/abstract class. It costs
> nothing, and gives you flexibility in how you implement this
> mapping.
> 
> -- Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _
> _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic
> Web
> ___|||__||  \|  ||  |  Embedded Unix,
> System Integration
> http://www.sigram.com  Contact: info at sigram dot
> com
> 
>


Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-06 Thread Andrzej Bialecki

On 2010-09-06 16:41, Yonik Seeley wrote:

On Mon, Sep 6, 2010 at 10:18 AM, MitchK  wrote:
[...consistent hashing...]

But it doesn't solve the problem at all, correct me if I am wrong, but: If
you add a new server, let's call him IP3-1, and IP3-1 is nearer to the
current ressource X, than doc x will be indexed at IP3-1 - even if IP2-1
holds the older version.
Am I right?


Right.  You still need code to handle migration.

Consistent hashing is a way for everyone to be able to agree on the
mapping, and for the mapping to change incrementally.  i.e. you add a
node and it only changes the docid->node mapping of a limited percent
of the mappings, rather than changing the mappings of potentially
everything, as a simple MOD would do.


Another strategy to avoid excessive reindexing is to keep splitting the 
largest shards, and then your mapping becomes a regular MOD plus a list 
of these additional splits. Really, there's an infinite number of ways 
you could implement this...




For SolrCloud, I don't think we'll end up using consistent hashing -
we don't need it (although some of the concepts may still be useful).


I imagine there could be situations where a simple MOD won't do ;) so I 
think it would be good to hide this strategy behind an 
interface/abstract class. It costs nothing, and gives you flexibility in 
how you implement this mapping.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-06 Thread Yonik Seeley
On Mon, Sep 6, 2010 at 10:18 AM, MitchK  wrote:
[...consistent hashing...]
> But it doesn't solve the problem at all, correct me if I am wrong, but: If
> you add a new server, let's call him IP3-1, and IP3-1 is nearer to the
> current ressource X, than doc x will be indexed at IP3-1 - even if IP2-1
> holds the older version.
> Am I right?

Right.  You still need code to handle migration.

Consistent hashing is a way for everyone to be able to agree on the
mapping, and for the mapping to change incrementally.  i.e. you add a
node and it only changes the docid->node mapping of a limited percent
of the mappings, rather than changing the mappings of potentially
everything, as a simple MOD would do.

For SolrCloud, I don't think we'll end up using consistent hashing -
we don't need it (although some of the concepts may still be useful).

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-06 Thread MitchK

Andrzej,

thank you for sharing your experiences.



> b) use consistent hashing as the mapping schema to assign documents to a 
> changing number of shards. There are many explanations of this schema on 
> the net, here's one that is very simple: 
> 
Boom. 
With the given explanation, I understand it as the following:
You can use hadoop and do some map-reduce-jobs per csv-file.
At the reducer-side, the reducer has to look for the id of the current doc
and needs to create a hash of it.
Now it looks inside a SortedSet, picks the next-best server and looks in a
map, whether this server has got free capacity or not. That's cool.

But it doesn't solve the problem at all, correct me if I am wrong, but: If
you add a new server, let's call him IP3-1, and IP3-1 is nearer to the
current ressource X, than doc x will be indexed at IP3-1 - even if IP2-1
holds the older version. 
Am I right?

Thank you for sharing the paper. I will have a look for more like this. 



> In this case the lack of good docs and user-level API can be blamed on 
> the fact that this functionality is still under heavy development. 
> 
I do not only mean documentation at the user-level but also inside a class,
if there is going on some complicated stuff. 

- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1426728.html
Sent from the Solr - User mailing list archive at Nabble.com.


SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

2010-09-06 Thread Andrzej Bialecki

(I adjusted the subject to better reflect the content of this discussion).

On 2010-09-06 14:37, MitchK wrote:


Thanks for your detailed feedback Andzej!


From what I understood, SOLR-1301 becomes obsolete ones Solr becomes

cloud-ready, right?


Who knows... I certainly didn't expect this code to become so popular ;) 
so even after SolrCloud becomes available it's likely that some people 
will continue to use it. But SolrCloud should solve the original problem 
that I tried to solve with this patch.



Looking into the future: eventually, when SolrCloud arrives we will be
able to index straight to a SolrCloud cluster, assigning documents to
shards through a hashing schema (e.g. 'md5(docId) % numShards')


Hm, let's say the md5(docId) would produce a value of 10 (it won't, but
let's assume it).
If I got a constant number of shards, the doc will be published to the same
shard again and again.

i.e.: 10 % numShards(5) = 2 ->  doc 10 will be indexed at shard 2.

A few days later the rest of the cluster is available, now it looks like

10 % numShards(10) ->   1 ->  doc 10 will be indexed at shard 1... and what
about the older version at shard 2? I am no expert when it comes to
cloudComputing and the other stuff.


There are several possible solutions to this, and they all boil down to
the way how you assign documents to shards... Keep in mind that nodes 
(physical machines) can manage several shards, and the aggregate 
collection of all unique shards across all nodes forms your whole index 
- so there's also a related, but different issue, of how to assign 
shards to nodes.


Here are some scenarios how you can solve the doc-to-shard mapping 
problem (note: I removed the issue of replication from the picture to 
make this clearer):


a) keep the number of shards constant no matter how large is the 
cluster. The mapping schema is then as simple as the one above. In this 
scenario you create relatively small shards, so that a single physical 
node can manage dozens of shards (each shard using one core, or perhaps 
a more lightweight structure like MultiReader). This is also known as 
micro-sharding. As the number of documents grows the size of each shard 
will grow until you have to reduce the number of shards per node, 
ultimately ending up with a single shard per node. After that, if your 
collection continues to grow, you have to modify your hashing schema to 
split some shards (and reindex some shards, or use an index splitter tool).


b) use consistent hashing as the mapping schema to assign documents to a 
changing number of shards. There are many explanations of this schema on 
the net, here's one that is very simple:


http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/

In this case, you can grow/shrink the number of shards (and their size) 
as you see fit, incurring only a small reindexing cost.



If you can point me to one or another reference where I can read about it,
it would help me a lot, since I only want to understand how it works at the
moment.


http://wiki.apache.org/solr/SolrCloud ...



The problem with Solr is its lack of documentation in some classes and the
lack of capsulating some very complex things into different methods or
extra-classes. Of course, this is because it costs some extra time to do so,
but it makes understanding and modifying things very complicated if you do
not understand whats going on from a theoretical point of view.


In this case the lack of good docs and user-level API can be blamed on 
the fact that this functionality is still under heavy development.




Since the cloud-feature will be complex, a lack of documentation and no
understanding of the theory behind the code will make contributing back
very, very complicated.


For now, yes, it's an issue - though as soon as SolrCloud gets committed 
I'm sure people will follow up with user-level convenience components 
that will make it easier.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com