Re: Distributed Indexing

Soheb Mahmood Mon, 31 Jan 2011 16:48:04 -0800

(I'm sending this on behalf of William, a member of our team who is
working on ShardDistributionPolicy):


Hi Guys

I've had a go at creating the ShardDistributionPolicy interface and a
few implementations. I've created a patch
(https://issues.apache.org/jira/browse/SOLR-2341); let me know what
needs doing.

Currently I assume that the documents passed to the policy will be
represented by some kind of identifier and that one needs only to
match the ID with a shard. This is better, I think, than reading the
document from the POST and working out some kind of unique
identifier.
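
As a rough illustration of matching an ID to a shard (a sketch only,
not the code in the SOLR-2341 patch): the method name shardFor, the
class names, and the use of String.hashCode() to turn an ID into a
number are all assumptions made for this example.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical interface shape; the signature in the actual patch may differ.
interface ShardDistributionPolicy {
    /** Returns the index (0 to shardCount - 1) of the shard for this document ID. */
    int shardFor(String docId, int shardCount);
}

/** "Mod of the doc ID": the same ID always maps to the same shard. */
final class ModShardPolicy implements ShardDistributionPolicy {
    @Override
    public int shardFor(String docId, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        // Assumption: the ID is hashed to an int first. floorMod keeps the
        // result non-negative even when hashCode() is negative.
        return Math.floorMod(docId.hashCode(), shardCount);
    }
}

/** Round robin: ignores the ID and cycles through the shards in order. */
final class RoundRobinShardPolicy implements ShardDistributionPolicy {
    private final AtomicInteger next = new AtomicInteger();

    @Override
    public int shardFor(String docId, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        // The counter is atomic so concurrent callers each get a distinct turn.
        return Math.floorMod(next.getAndIncrement(), shardCount);
    }
}
```

Note that only the mod variant is deterministic per ID, which matters if
a later update or delete has to find the shard that holds a document.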

A question we've had about this is who decides which policy to use,
and where do they specify it? I'm inclined to think that the user (the
person POSTing data) does not mind which policy is used but the
administrator might. This leads me to think that the policy should be
set in the Solr config file. My colleagues disagree that the user will
not mind and would rather see the policy specified in the URL. We've
noticed that request handlers can be specified in both, so should we
adopt this idea instead (as a kind of compromise :) )?
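
One possible shape for that compromise, sketched under assumptions: a
policy name given in the request overrides a default taken from the
config. The parameter name shard.policy and the class PolicySelector are
hypothetical, and a plain Map stands in for Solr's request parameters.

```java
import java.util.Map;

final class PolicySelector {
    // Hypothetical request parameter name, chosen for this example only.
    static final String POLICY_PARAM = "shard.policy";

    private PolicySelector() {}

    /**
     * Returns the policy name from the request if one was supplied,
     * otherwise the default configured by the administrator.
     */
    static String resolve(Map<String, String> requestParams, String configuredDefault) {
        String fromRequest = requestParams.get(POLICY_PARAM);
        if (fromRequest != null && !fromRequest.trim().isEmpty()) {
            return fromRequest.trim();
        }
        return configuredDefault;
    }
}
```

With this arrangement a request that says nothing gets the
administrator's choice, while a request carrying shard.policy=roundrobin
would select that policy by name.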

All the best

William

> On Sat, Jan 29, 2011 at 11:56 PM, Upayavira <u...@odoko.co.uk> wrote:
> > Lance,
> >
> > Firstly, we're proposing a ShardDistributionPolicy interface for
> > which there is a default (mod of the doc ID) but other
> > implementations are possible. Another easy implementation would be a
> > randomised or round robin one.
> >
> > As to threading, the first task would be to put all of the source
> > documents into "buckets", one bucket per shard, using the above
> > ShardDistributionPolicy to assign documents to buckets/shards. Then
> > all of the documents in a "bucket" could be sent to the relevant
> > shard for indexing (which would be nothing more than a normal HTTP
> > post (or solrj call?)).
> >
> > As to whether this would be single threaded or multithreaded, I
> > would guess we would aim to do it the same as the distributed search
> > code (which I have not yet reviewed). However, it could presumably be
> > single-threaded, but use asynchronous HTTP.
> >
> > Regards, Upayavira
> >
> > On Sat, 29 Jan 2011 15:09 -0800, "Lance Norskog" <goks...@gmail.com>
> > wrote:
> >> I would suggest that a DistributedRequestUpdateHandler run
> >> single-threaded, doing only one document at a time. If I want more
> >> than one, I run it twice or N times with my own program.
> >>
> >> Also, this should have a policy object which decides exactly how
> >> documents are distributed. There are different techniques for
> >> different use cases.
> >>
> >> Lance
> >>
> >> On Sat, Jan 29, 2011 at 12:34 PM, Soheb Mahmood <soheb.luc...@gmail.com> wrote:
> >> > Hello Yonik,
> >> >
> >> > On Thu, 2011-01-27 at 08:01 -0500, Yonik Seeley wrote:
> >> >> Making it easy for clients I think is key... one should be able to
> >> >> update any node in the solr cluster and have solr take care of the
> >> >> hard part about updating all relevant shards.  This will most likely
> >> >> involve an update processor.  This approach allows all existing update
> >> >> methods (including things like CSV file upload) to still work
> >> >> correctly.
> >> >>
> >> >> Also post.jar is really just for testing... a command-line replacement
> >> >> for "curl" for those who may not have it.  It's not really a
> >> >> recommended way for updating Solr servers in production.
> >> >
> >> > OK, I've abandoned the post.jar tool idea in favour of a
> >> > DistributedUpdateRequestProcessor class (I've been looking into other
> >> > classes like UpdateRequestProcessor, RunUpdateRequestProcessor,
> >> > SignatureUpdateProcessorFactory, and SolrQueryRequest to see how they
> >> > are used/what data they store - hence why I've taken some time to
> >> > respond).
> >> >
> >> > My big question now is that is it necessary to have a Factory class for
> >> > DistributedUpdateRequestProcessor? I've seen this lots of times, as in
> >> > RunUpdateProcessorFactory (where the factory class was only a few lines
> >> > of code) to SignatureUpdateProcessorFactory? At first I was thinking it
> >> > would be a good design idea to include it in (in a generic sense), but
> >> > then I thought harder and I thought that the
> >> > DistributedUpdateRequestHandler would only be running once, taking in all
> >> > the requests, so it seems sort of pointless to write one in.
> >> >
> >> > That is my "burning" question for now. I have got a few more questions,
> >> > but I'm sure that when I look further into the code, I'll either have
> >> > more or all of my questions are answered.
> >> >
> >> > Many thanks!
> >> >
> >> > Soheb Mahmood
> >> >
> >> >
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> goks...@gmail.com
> >>
> >>
> >>
> > ---
> > Enterprise Search Consultant at Sourcesense UK,
> > Making Sense of Open Source
> >
> >
> >
> >
> >


