Hello,

Thanks for your prompt reply.

Regarding using a SolrDocument instead of Strings (and I agree that
List<String> doesn't seem to be the best way to go): how do I get a
reference to a SolrDocument? As far as I can see, I have access to a
List<ContentStream> that represents all of the files being POSTed. Do I
want to open these streams, get the info, and then stream them out again?
This seems wasteful. I had instead thought that the
DistributedUpdateRequestHandler would take this List<ContentStream>,
create some kind of mapping between each stream and a unique id, and then
pass the ids to the policy.

Thanks for your help

Billy

On Tue, Feb 1, 2011 at 11:27 AM, Upayavira <u...@odoko.co.uk> wrote:
> On Tue, 01 Feb 2011 00:26 +0000, "William Mayor"
> <m...@williammayor.co.uk> wrote:
>> Hi Guys
>>
>> I've had a go at creating the ShardDistributionPolicy interface and a
>> few implementations. I've created a patch
>> (https://issues.apache.org/jira/browse/SOLR-2341); let me know what
>> needs doing.
>
>> Currently I assume that the documents passed to the policy will be
>> represented by some kind of identifier and that one needs only to
>> match the ID with a shard. This is better (I think) than reading the
>> document from the POST and figuring out some kind of unique
>> identifier?
>
> Your code looks fine to me, except it should take in a SolrDocument
> object or list of, rather than strings. Then, for your Hash version, you
> can take a hash of the "id" field.
>
>> A question we've had about this is who decides what policy to use and
>> where do they specify it? I'm inclined to think that the user (the person
>> POSTing data) does not mind what policy is used but the administrator
>> might. This leads me to think that the policy should be set in the
>> solr config file? My colleagues disagree that the user will not mind
>> and would rather see the policy be specified in the URL.
>> We've noticed that request handlers can be specified in both, so
>> should we adopt this idea instead (and as a kind of compromise :) )?
>
> To stick with Solr conventions, you would specify the
> ShardDistributionPolicy in the solrconfig.xml, within the configuration
> of your DistributedUpdateRequestHandler, so in that sense, it is hidden
> from your users and managed by the administrator.
>
> However, if you follow this approach, an administrator could expose
> multiple policies by having multiple DistributedUpdateRequestHandler
> definitions in solrconfig.xml, with different URLs.
>
> To give you an example, but for search rather than indexing:
>
> <requestHandler name="/dismax" class="solr.SearchHandler"
>     default="true">
>   <!-- default values for query parameters -->
>   <lst name="defaults">
>     <str name="defType">dismax</str>
>   </lst>
> </requestHandler>
>
> This will configure requests to http://localhost:8983/solr/dismax?q=blah
> to be handled by the dismax query parser.
>
> More relevant to you:
>
> <requestHandler name="/distrib" class="solr.SearchHandler"
>     default="true">
>   <!-- default values for query parameters -->
>   <lst name="defaults">
>     <str name="shards">http://localhost:8983/solr,http://localhost:7983/solr</str>
>   </lst>
> </requestHandler>
>
> This would, by default, distribute all queries to
> http://localhost:8983/solr/distrib?q=blah across two Solr instances at
> the URLs described.
>
> For now, I'd say see if you can add a
> distributionPolicyClass="org.apache.solr.blah" to define the class that
> this updateRequestHandler is going to use.
>
> To everyone else who got this far - please chip in if you see better
> ways of doing this.
>
> Upayavira
>
>> All the best
>>
>> William
>>
>> On Sat, Jan 29, 2011 at 11:56 PM, Upayavira <u...@odoko.co.uk> wrote:
>> > Lance,
>> >
>> > Firstly, we're proposing a ShardDistributionPolicy interface for which
>> > there is a default (mod of the doc ID) but other implementations are
>> > possible.
>> > Another easy implementation would be a randomised or round
>> > robin one.
>> >
>> > As to threading, the first task would be to put all of the source
>> > documents into "buckets", one bucket per shard, using the above
>> > ShardDistributionPolicy to assign documents to buckets/shards. Then all
>> > of the documents in a "bucket" could be sent to the relevant shard for
>> > indexing (which would be nothing more than a normal HTTP post (or solrj
>> > call?)).
>> >
>> > As to whether this would be single threaded or multithreaded, I would
>> > guess we would aim to do it the same as the distributed search code
>> > (which I have not yet reviewed). However, it could presumably be
>> > single-threaded, but use asynchronous HTTP.
>> >
>> > Regards, Upayavira
>> >
>> > On Sat, 29 Jan 2011 15:09 -0800, "Lance Norskog" <goks...@gmail.com>
>> > wrote:
>> >> I would suggest that a DistributedRequestUpdateHandler run
>> >> single-threaded, doing only one document at a time. If I want more
>> >> than one, I run it twice or N times with my own program.
>> >>
>> >> Also, this should have a policy object which decides exactly how
>> >> documents are distributed. There are different techniques for
>> >> different use cases.
>> >>
>> >> Lance
>> >>
>> >> On Sat, Jan 29, 2011 at 12:34 PM, Soheb Mahmood <soheb.luc...@gmail.com>
>> >> wrote:
>> >> > Hello Yonik,
>> >> >
>> >> > On Thu, 2011-01-27 at 08:01 -0500, Yonik Seeley wrote:
>> >> >> Making it easy for clients I think is key... one should be able to
>> >> >> update any node in the solr cluster and have solr take care of the
>> >> >> hard part about updating all relevant shards. This will most likely
>> >> >> involve an update processor. This approach allows all existing update
>> >> >> methods (including things like CSV file upload) to still work
>> >> >> correctly.
>> >> >>
>> >> >> Also post.jar is really just for testing... a command-line replacement
>> >> >> for "curl" for those who may not have it.
>> >> >> It's not really a recommended way for updating Solr servers in
>> >> >> production.
>> >> >
>> >> > OK, I've abandoned the post.jar tool idea in favour of a
>> >> > DistributedUpdateRequestProcessor class (I've been looking into other
>> >> > classes like UpdateRequestProcessor, RunUpdateRequestProcessor,
>> >> > SignatureUpdateProcessorFactory, and SolrQueryRequest to see how they
>> >> > are used and what data they store, which is why I've taken some time
>> >> > to respond).
>> >> >
>> >> > My big question now is: is it necessary to have a Factory class for
>> >> > DistributedUpdateRequestProcessor? I've seen this lots of times, from
>> >> > RunUpdateProcessorFactory (where the factory class was only a few lines
>> >> > of code) to SignatureUpdateProcessorFactory. At first I thought it
>> >> > would be a good design idea to include one (in a generic sense), but
>> >> > then I thought harder and realised that the
>> >> > DistributedUpdateRequestHandler would only be running once, taking in
>> >> > all the requests, so it seems sort of pointless to write one.
>> >> >
>> >> > That is my "burning" question for now. I have got a few more questions,
>> >> > but I'm sure that when I look further into the code, I'll either have
>> >> > more or find that all of my questions are answered.
>> >> >
>> >> > Many thanks!
>> >> >
>> >> > Soheb Mahmood
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >>
>> >> --
>> >> Lance Norskog
>> >> goks...@gmail.com
>> > ---
>> > Enterprise Search Consultant at Sourcesense UK,
>> > Making Sense of Open Source
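
For reference, the policy idea discussed in this thread (pick a shard from a document's id, then group documents into one "bucket" per shard) could be sketched roughly as below. This is only an illustration under stated assumptions: the method signature, the class names HashPolicy and RoundRobinPolicy, and the bucket helper are hypothetical and are not taken from the SOLR-2341 patch or from any Solr API. Documents are represented by plain id strings to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: none of these names come from the SOLR-2341 patch. */
final class ShardPolicySketch {

    /** Hypothetical contract: map a document id to a shard index in [0, shardCount). */
    interface ShardDistributionPolicy {
        int shardFor(String docId, int shardCount);
    }

    /** Hash-based policy: the same id always lands on the same shard. */
    static final class HashPolicy implements ShardDistributionPolicy {
        @Override
        public int shardFor(String docId, int shardCount) {
            // floorMod keeps the result non-negative even when hashCode() is negative.
            return Math.floorMod(docId.hashCode(), shardCount);
        }
    }

    /** Round-robin policy: ignores the id and cycles through the shards in order. */
    static final class RoundRobinPolicy implements ShardDistributionPolicy {
        private int next = 0;

        @Override
        public synchronized int shardFor(String docId, int shardCount) {
            int shard = next % shardCount;
            next = (shard + 1) % shardCount;
            return shard;
        }
    }

    /** Groups ids into one bucket per shard, keeping input order within each bucket. */
    static Map<Integer, List<String>> bucket(List<String> docIds,
                                             int shardCount,
                                             ShardDistributionPolicy policy) {
        Map<Integer, List<String>> buckets = new LinkedHashMap<>();
        for (int i = 0; i < shardCount; i++) {
            buckets.put(i, new ArrayList<>());
        }
        for (String id : docIds) {
            buckets.get(policy.shardFor(id, shardCount)).add(id);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Shard URLs reused from the example configuration quoted in this thread.
        List<String> shards = Arrays.asList(
                "http://localhost:8983/solr", "http://localhost:7983/solr");
        List<String> ids = Arrays.asList("doc-1", "doc-2", "doc-3", "doc-4");
        bucket(ids, shards.size(), new HashPolicy())
                .forEach((shard, docs) ->
                        System.out.println(shards.get(shard) + " <- " + docs));
    }
}
```

Each bucket could then be sent to its shard as an ordinary update request. Whether the policy should receive ids, SolrDocument objects, or something else is exactly the open question above; the sketch takes no position on it.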