Re: Distributed Indexing

Soheb Mahmood Wed, 26 Jan 2011 14:45:00 -0800

Hey guys!

On Thu, 2011-01-27 at 10:04 +1300, Todd Nine wrote:
> Just throwing in my 2 cents.  If you're on a tight deadline have you
> had a look at Solandra?  We were already using Cassandra, so it was
> incredibly easy to get a scalable Solr installation up and running.
>


In short: We are doing this implementation for a university group
project.

In long: You could be forgiven for thinking we are implementing this
feature to use it ourselves (as in, for some sort of business), but it
really is for a group project. We are students at UCL who have been
given the task of contributing something to the Apache SolrCloud open
source project.

Our higher-level goal is to add native distributed indexing to Solr, so
that it indirectly benefits SolrCloud in the future. We are hoping to
implement this and get it contributed to the Apache Solr project.
*crosses fingers*

> Hi Soheb,
> 
> Sounds good! A few things I thought of:
> 
> With regard to #1, would the list of shards to index to (if present)
> be exclusive or would we assume that the shard the update request was
> sent to should also be included? For example, say, using the example
> you gave, an update request was sent like so:
> java -Durl=http://localhost:7574/solr/collection1/update
> -Dshards=localhost:8983/solr -jar post.jar <list of XML files>
> 
> should the documents be indexed exclusively to the 'shards list' (ie.
> just localhost:8983/solr) or the 'shards list' & the server the
> request was sent to? So specifying something like this:
> java -Durl=http://localhost:7574/solr/collection1/update
> -Dshards=localhost:7574/solr -jar post.jar <list of XML files>
> would be equivalent to:
> java -Durl=http://localhost:7574/solr/collection1/update -jar post.jar
> <list of XML files>

I reckon we should stick exclusively to the list that the user
specifies. Personally, I would find it strange if the documents were
also indexed on the shard I sent the request to.

For example, take a user like me (as thick as a brick): our user (me)
may want to index a document on shard localhost:8983 while having
pointed the update request at localhost:7574. If the document got
indexed on both, the user would get terribly confused and think he had
accidentally broken Solr.

My ideal situation is this: unless I... err... I mean unless the "user"
explicitly specifies a shard in the list, nothing should be indexed at
that shard.
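
To make the "exclusive" rule concrete, here is a rough Java sketch. The
class and method names are made up for illustration and are not existing
Solr code; I am also assuming the shards parameter is a comma-separated
list, and that an update without any shards list is simply indexed on
the shard that received it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Solr: decides which shards an update goes to.
class ShardTargets {

    /**
     * @param shardsParam    value of the "shards" parameter, or null if absent
     * @param receivingShard the shard the update request was sent to
     * @return the shards the documents should be indexed on
     */
    static List<String> resolve(String shardsParam, String receivingShard) {
        List<String> targets = new ArrayList<>();
        if (shardsParam != null) {
            // Assumption: shards are given as a comma-separated list.
            for (String part : shardsParam.split(",")) {
                String shard = part.trim();
                if (!shard.isEmpty()) {
                    targets.add(shard);
                }
            }
        }
        if (targets.isEmpty()) {
            // No usable list: index on the shard that received the request.
            return Collections.singletonList(receivingShard);
        }
        // A list was given: use exactly that list. The receiving shard is
        // NOT added unless the user listed it explicitly.
        return targets;
    }
}
```

With this, sending the request to localhost:7574 with
-Dshards=localhost:8983/solr would index on localhost:8983 only.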

> For a default interface to decide which shard to index to, we were
> thinking of using either a simple hash function on the document's
> uniqueKey modulo the number of shards specified in the list (as
> mentioned here:
> http://wiki.apache.org/solr/DistributedSearch#Distributed_Indexing) or
> some sort of round robin method, indexing a document to each shard in
> turn, until there are no more documents left to index.

Good point, I completely missed that out last time. 
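
Just so we are talking about the same thing, the two options could look
roughly like this in Java. This is only a sketch: ShardSelector and both
implementations are hypothetical names, not an existing Solr interface,
and I am using String.hashCode() as the "simple hash function".

```java
import java.util.List;

// Hypothetical interface, not an existing Solr API.
interface ShardSelector {
    String select(String uniqueKey, List<String> shards);
}

// Option 1: hash of the document's uniqueKey modulo the number of shards.
class HashShardSelector implements ShardSelector {
    @Override
    public String select(String uniqueKey, List<String> shards) {
        // floorMod keeps the index non-negative even if hashCode() is negative.
        int index = Math.floorMod(uniqueKey.hashCode(), shards.size());
        return shards.get(index);
    }
}

// Option 2: round robin, one document to each shard in turn.
// Not thread-safe; a real version would need to guard the counter.
class RoundRobinShardSelector implements ShardSelector {
    private int next = 0;

    @Override
    public String select(String uniqueKey, List<String> shards) {
        String shard = shards.get(next % shards.size());
        next = (next + 1) % shards.size();
        return shard;
    }
}
```

One difference worth keeping in mind: with the hash, the same uniqueKey
always lands on the same shard for a given shards list, so re-indexing a
document replaces the old copy. With round robin, a re-sent document can
end up on a different shard from its earlier copy.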

> Also, how will we deal with failures? Should we simply return a list
> of all documents which weren't indexed or have a retry period after
> the initial indexing?

Well, I was thinking of something along the lines of what download
managers do: either retry if the shard is busy, or fail if the shard is
somehow inaccessible. The documents that failed should possibly be spat
out by Solr.
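
In Java, the retry-or-fail idea might look something like the sketch
below. Everything in it is hypothetical (the exception types, the
ShardClient interface, the maxAttempts setting); it only shows the
control flow, not how Solr would actually detect a busy or unreachable
shard.

```java
import java.util.ArrayList;
import java.util.List;

// All names here are made up for illustration; none of this is Solr code.
class ShardBusyException extends Exception {}
class ShardUnreachableException extends Exception {}

interface ShardClient {
    void index(String shard, String docId)
            throws ShardBusyException, ShardUnreachableException;
}

class RetryingIndexer {
    private final ShardClient client;
    private final int maxAttempts;

    RetryingIndexer(ShardClient client, int maxAttempts) {
        this.client = client;
        this.maxAttempts = maxAttempts;
    }

    /** Returns the ids of the documents that could not be indexed. */
    List<String> indexAll(String shard, List<String> docIds) {
        List<String> failed = new ArrayList<>();
        for (String docId : docIds) {
            if (!indexOne(shard, docId)) {
                failed.add(docId);
            }
        }
        return failed;
    }

    private boolean indexOne(String shard, String docId) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                client.index(shard, docId);
                return true;
            } catch (ShardBusyException e) {
                // Busy: try again. A real version would wait between attempts.
            } catch (ShardUnreachableException e) {
                // Inaccessible: give up on this document straight away.
                return false;
            }
        }
        // Still busy after maxAttempts tries.
        return false;
    }
}
```

The list returned by indexAll is what would get "spat out" to the user.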

Are we planning on having a GUI front-end to this? I mean not now, given
we have two weeks to do this, but is one of the possible future goals to
implement a front-end UI so that the user can index documents
painlessly? If so, I suggest we also consider having XML output, or
some kind of output that can easily be parsed into XML.

Soheb



