Re: Distributed Indexing

William Mayor Tue, 01 Feb 2011 05:55:27 -0800

Hello

Thanks for your prompt reply.


With regard to using a SolrDocument instead of Strings (and I agree
that List<String> doesn't seem to be the best way of going), how do I
get a reference to a SolrDocument?

As far as I can see I have access to a List<ContentStream> that
represents all of the files being POSTed. Do I want to open these
streams, get the info and then stream them out? This seems wasteful.

I had instead thought that the DistributedUpdateRequestHandler would
take this List<ContentStream>, create some kind of mapping between each
stream and a unique id and then pass the ids to the policy.
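
As a rough illustration of that idea (this is not the code in the patch),
here is a minimal sketch of a policy that works purely on ids. Only the
ShardDistributionPolicy name comes from this thread; the shardFor method,
HashShardDistributionPolicy and ShardBuckets are hypothetical names made up
for the example, and it assumes each stream has already been given a
String id.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shape of the interface; the one in the SOLR-2341 patch may differ.
interface ShardDistributionPolicy {
    /** Returns the index (0 to shardCount - 1) of the shard that should receive the id. */
    int shardFor(String docId, int shardCount);
}

// Hash-based policy: the same id always maps to the same shard.
class HashShardDistributionPolicy implements ShardDistributionPolicy {
    @Override
    public int shardFor(String docId, int shardCount) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(docId.hashCode(), shardCount);
    }
}

class ShardBuckets {
    /** Groups ids into one bucket per shard URL, keeping the shard order. */
    static Map<String, List<String>> assign(List<String> docIds,
                                            List<String> shardUrls,
                                            ShardDistributionPolicy policy) {
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        for (String url : shardUrls) {
            buckets.put(url, new ArrayList<>());
        }
        for (String id : docIds) {
            int shard = policy.shardFor(id, shardUrls.size());
            buckets.get(shardUrls.get(shard)).add(id);
        }
        return buckets;
    }
}
```

With something like this the handler would only need the id-to-stream
mapping to forward each bucket to its shard, without parsing the streams
themselves.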

Thanks for your help

Billy

On Tue, Feb 1, 2011 at 11:27 AM, Upayavira <[email protected]> wrote:
> On Tue, 01 Feb 2011 00:26 +0000, "William Mayor"
> <[email protected]> wrote:
>> Hi Guys
>>
>> I've had a go at creating the ShardDistributionPolicy interface and a
>> few implementations. I've created a patch
>> (https://issues.apache.org/jira/browse/SOLR-2341); let me know what
>> needs doing.
>
>
>> Currently I assume that the documents passed to the policy will be
>> represented by some kind of identifier and that one needs only to
>> match the ID with a shard. This is better (I think) than reading the
>> document from the POST and figuring out some kind of unique
>> identifier?
>
> Your code looks fine to me, except it should take in a SolrDocument
> object or list of, rather than strings. Then, for your Hash version, you
> can take a hash of the "id" field.
>
>> A question we've had about this is who decides what policy to use and
>> where they specify it. I'm inclined to think that the user (the person
>> POSTing data) does not mind what policy is used but the administrator
>> might. This leads me to think that the policy should be set in the
>> solr config file. My colleagues disagree that the user will not mind
>> and would rather see the policy be specified in the url. We've noticed
>> that request handlers can be specified in both, so should we adopt this
>> idea instead (and as a kind of compromise :) )?
>
> To stick with Solr conventions, you would specify the
> ShardDistributionPolicy in the solrconfig.xml, within the configuration
> of your DistributedUpdateRequestHandler, so in that sense, it is hidden
> from your users and managed by the administrator.
>
> However, if you follow this approach, an administrator could expose
> multiple policies by having multiple DistributedUpdateRequestHandler
> definitions in solrconfig.xml, with different URLs.
>
> To give you an example, but for search rather than indexing:
>
>  <requestHandler name="/dismax" class="solr.SearchHandler"
>  default="true">
>    <!-- default values for query parameters -->
>     <lst name="defaults">
>       <str name="defType">dismax</str>
>     </lst>
>  </requestHandler>
>
> This will configure requests to http://localhost:8983/solr/dismax?q=blah
> to be handled by the dismax query parser.
>
> More relevant to you:
>
>  <requestHandler name="/distrib" class="solr.SearchHandler"
>  default="true">
>    <!-- default values for query parameters -->
>     <lst name="defaults">
>       <str name="shards">http://localhost:8983/solr,http://localhost:7983/solr</str>
>     </lst>
>  </requestHandler>
>
> This would, by default, distribute all queries to
> http://localhost:8983/solr/distrib?q=blah across two Solr instances at
> the URLs described.
>
> For now, I'd say see if you can add a
> distributionPolicyClass="org.apache.solr.blah" to define the class that
> this updateRequestHandler is going to use.
>
> To everyone else who got this far - please chip in if you see better
> ways of doing this.
>
> Upayavira
>
>> All the best
>>
>> William
>>
>> On Sat, Jan 29, 2011 at 11:56 PM, Upayavira <[email protected]> wrote:
>> > Lance,
>> >
>> > Firstly, we're proposing a ShardDistributionPolicy interface for which
>> > there is a default (mod of the doc ID) but other implementations are
>> > possible. Another easy implementation would be a randomised or round
>> > robin one.
>> >
>> > As to threading, the first task would be to put all of the source
>> > documents into "buckets", one bucket per shard, using the above
>> > ShardDistributionPolicy to assign documents to buckets/shards. Then all
>> > of the documents in a "bucket" could be sent to the relevant shard for
>> > indexing (which would be nothing more than a normal HTTP post (or solrj
>> > call?)).
>> >
>> > As to whether this would be single threaded or multithreaded, I would
>> > guess we would aim to do it the same as the distributed search code
>> > (which I have not yet reviewed). However, it could presumably be
>> > single-threaded, but use asynchronous HTTP.
>> >
>> > Regards, Upayavira
>> >
>> > On Sat, 29 Jan 2011 15:09 -0800, "Lance Norskog" <[email protected]>
>> > wrote:
>> >> I would suggest that a DistributedUpdateRequestHandler run
>> >> single-threaded, doing only one document at a time. If I want more
>> >> than one, I run it twice or N times with my own program.
>> >>
>> >> Also, this should have a policy object which decides exactly how
>> >> documents are distributed. There are different techniques for
>> >> different use cases.
>> >>
>> >> Lance
>> >>
>> >> On Sat, Jan 29, 2011 at 12:34 PM, Soheb Mahmood <[email protected]>
>> >> wrote:
>> >> > Hello Yonik,
>> >> >
>> >> > On Thu, 2011-01-27 at 08:01 -0500, Yonik Seeley wrote:
>> >> >> Making it easy for clients I think is key... one should be able to
>> >> >> update any node in the solr cluster and have solr take care of the
>> >> >> hard part about updating all relevant shards.  This will most likely
>> >> >> involve an update processor.  This approach allows all existing update
>> >> >> methods (including things like CSV file upload) to still work
>> >> >> correctly.
>> >> >>
>> >> >> Also post.jar is really just for testing... a command-line replacement
>> >> >> for "curl" for those who may not have it.  It's not really a
>> >> >> recommended way for updating Solr servers in production.
>> >> >
>> >> > OK, I've abandoned the post.jar tool idea in favour of a
>> >> > DistributedUpdateRequestProcessor class (I've been looking into other
>> >> > classes like UpdateRequestProcessor, RunUpdateRequestProcessor,
>> >> > SignatureUpdateProcessorFactory, and SolrQueryRequest to see how they
>> >> > are used/what data they store - hence why I've taken some time to
>> >> > respond).
>> >> >
>> >> > My big question now is whether it is necessary to have a Factory class for
>> >> > DistributedUpdateRequestProcessor. I've seen this lots of times, from
>> >> > RunUpdateProcessorFactory (where the factory class was only a few lines
>> >> > of code) to SignatureUpdateProcessorFactory. At first I was thinking it
>> >> > would be a good design idea to include one (in a generic sense), but
>> >> > then I thought harder and I thought that the
>> >> > DistributedUpdateRequestHandler would only be running once, taking in all
>> >> > the requests, so it seems sort of pointless to write one.
>> >> >
>> >> > That is my "burning" question for now. I have got a few more questions,
>> >> > but I'm sure that when I look further into the code, I'll either have
>> >> > more or all of my questions will be answered.
>> >> >
>> >> > Many thanks!
>> >> >
>> >> > Soheb Mahmood
>> >> >
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: [email protected]
>> >> > For additional commands, e-mail: [email protected]
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Lance Norskog
>> >> [email protected]
>> >>
>> > ---
>> > Enterprise Search Consultant at Sourcesense UK,
>> > Making Sense of Open Source
>> >
>>
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
