[jira] Created: (SOLR-2358) Distributing Indexing

2011-02-10 Thread William Mayor (JIRA)
Distributing Indexing
-

 Key: SOLR-2358
 URL: https://issues.apache.org/jira/browse/SOLR-2358
 Project: Solr
  Issue Type: New Feature
Reporter: William Mayor
Priority: Minor


The first steps towards creating distributed indexing functionality in Solr



[jira] Updated: (SOLR-2341) Shard distribution policy

2011-02-06 Thread William Mayor (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Mayor updated SOLR-2341:


Attachment: SOLR-2341.patch

This patch makes the implemented policy deterministic. This is missing from the 
previous patch. The policy code has also been refactored into its own package.
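
For illustration only (an editor's sketch, not the code in SOLR-2341.patch): a
deterministic, hash-based policy might look like the following. The interface name
comes from this thread, but the selectShard() method is an assumption.

  public interface ShardDistributionPolicy {
      /** Return the index (0..shardCount-1) of the shard that should receive this document. */
      int selectShard(String documentId, int shardCount);
  }

  class HashShardDistributionPolicy implements ShardDistributionPolicy {
      public int selectShard(String documentId, int shardCount) {
          // Mask the sign bit rather than using Math.abs(), which can stay negative
          // for Integer.MIN_VALUE, so the result is always a valid shard index.
          return (documentId.hashCode() & Integer.MAX_VALUE) % shardCount;
      }
  }

The same id always hashes to the same shard, which is what makes the policy deterministic.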

 Shard distribution policy
 -

 Key: SOLR-2341
 URL: https://issues.apache.org/jira/browse/SOLR-2341
 Project: Solr
  Issue Type: New Feature
Reporter: William Mayor
Priority: Minor
 Attachments: SOLR-2341.patch, SOLR-2341.patch


 A first crack at creating policies to be used for determining to which of a 
 list of shards a document should go. See discussion on Distributed Indexing 
 on dev-list.



Re: Distributed Indexing

2011-02-06 Thread William Mayor
Hi

Good call about the policies being deterministic, should've thought of that
earlier.

We've changed the patch to include this and I've removed the random
assignment one (for obvious reasons).

Take a look and let me know what's left to do
(https://issues.apache.org/jira/browse/SOLR-2341).

Cheers

William

On Thu, Feb 3, 2011 at 5:00 PM, Upayavira u...@odoko.co.uk wrote:


  On Thu, 03 Feb 2011 15:12 +, Alex Cowell alxc...@gmail.com wrote:

 Hi all,

 Just a couple of questions that have arisen.

 1. For handling non-distributed update requests (shards param is not
 present or is invalid), our code currently

- assumes the user would like the data indexed, so gets the request
handler assigned to /update
- executes the request using core.execute() for the SolrCore associated
with the original request

 Is this what we want it to do and is using core.execute() from within a
 request handler a valid method of passing on the update request?


 Take a look at how it is done in
 handler.component.SearchHandler.handleRequestBody(). I'd say try to follow
 as similar an approach as possible. E.g. it is the SearchHandler that does much
 of the work, branching depending on whether it found a shards parameter.
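
A rough, editor-added sketch of that branching, with hypothetical handler and helper
names; only ShardParams.SHARDS and SolrQueryRequest are existing Solr classes.

  import org.apache.solr.common.params.ShardParams;
  import org.apache.solr.request.SolrQueryRequest;

  class DistributedUpdateHelper {

      void handle(SolrQueryRequest req) {
          String shards = req.getParams().get(ShardParams.SHARDS);
          if (shards == null || shards.trim().length() == 0) {
              // No (or empty) shards parameter: treat this as an ordinary local
              // update, e.g. by delegating to the handler registered at /update.
              handleLocalUpdate(req);
          } else {
              // shards=host1/solr,host2/solr,...: split the documents by policy
              // and forward a sub-request to each shard.
              handleDistributedUpdate(shards.split(","), req);
          }
      }

      private void handleLocalUpdate(SolrQueryRequest req) { /* local indexing path */ }

      private void handleDistributedUpdate(String[] shardUrls, SolrQueryRequest req) { /* forwarding path */ }
  }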


 2. We have partially implemented an update processor which actually
 generates and sends the split update requests to each specified shard (as
 designated by the policy). As it stands, the code shares a lot in common
 with the HttpCommComponent class used for distributed search. Should we look
 at opening up the HttpCommComponent class so it could be used by our
 request handler as well or should we continue with our current
 implementation and worry about that later?


 I agree that you are going to want to implement an UpdateRequestProcessor.
 However, it would seem to me that, unlike search, you're not going to want
 to bother with the existing processor and associated component chain; you're
 going to want to replace the processor with a distributed version.
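
A very rough editor's sketch of that "distributed version" of the processor. The base
class and AddUpdateCommand are real Solr classes; the ShardDistributionPolicy type
(sketched earlier in this digest), the forwardToShard() helper, and the use of "id" as
the unique key are assumptions.

  import java.io.IOException;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;

  class DistributedUpdateRequestProcessor extends UpdateRequestProcessor {

      private final String[] shardUrls;
      private final ShardDistributionPolicy policy;

      DistributedUpdateRequestProcessor(String[] shardUrls,
                                        ShardDistributionPolicy policy,
                                        UpdateRequestProcessor next) {
          super(next);
          this.shardUrls = shardUrls;
          this.policy = policy;
      }

      public void processAdd(AddUpdateCommand cmd) throws IOException {
          // Ask the policy which shard should receive this document, then forward
          // the add there instead of running the local indexing chain.
          String id = String.valueOf(cmd.solrDoc.getFieldValue("id")); // assumes "id" is the unique key
          int shard = policy.selectShard(id, shardUrls.length);
          forwardToShard(shardUrls[shard], cmd);
      }

      private void forwardToShard(String shardUrl, AddUpdateCommand cmd) {
          // Hypothetical: serialise the document and POST it to shardUrl + "/update".
      }
  }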

 As to the HttpCommComponent, I'd suggest you make your own educated
 decision. How similar is the class? Could one serve both needs effectively?


 3. Our update processor uses a MultiThreadedHttpConnectionManager to send
 parallel updates to shards. Can anyone give some appropriate values to be
 used for the defaultMaxConnectionsPerHost and maxTotalConnections params?
 Won't the values used for distributed search be a little high for
 distributed indexing?


 You are right, these will likely be lower for distributed indexing; however,
 I'd suggest not worrying about it for now, as it is easy to tweak later.
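
For what it's worth, wiring up the connection manager with lower limits is a small
change; the numbers below are placeholders to tune later, not recommendations
(editor's sketch using the commons-httpclient 3.x API).

  import org.apache.commons.httpclient.HttpClient;
  import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;

  class UpdateHttpClientFactory {
      static HttpClient create() {
          MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
          // Far fewer concurrent connections than distributed search typically uses.
          mgr.getParams().setDefaultMaxConnectionsPerHost(5);
          mgr.getParams().setMaxTotalConnections(20);
          return new HttpClient(mgr);
      }
  }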

 Upayavira

  ---
 Enterprise Search Consultant at Sourcesense UK,
 Making Sense of Open Source



[jira] Issue Comment Edited: (SOLR-2341) Shard distribution policy

2011-02-06 Thread William Mayor (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991225#comment-12991225
 ] 

William Mayor edited comment on SOLR-2341 at 2/6/11 10:00 PM:
--

This patch makes the implemented policy deterministic. This is missing from the 
previous patch. The policy code has also been refactored into its own package.

  was (Author: williammayor):
This patch makes the implemented policy deterministic. This is missing from 
the previous patch. The policy code has also been refactored into it's own 
package.
  
 Shard distribution policy
 -

 Key: SOLR-2341
 URL: https://issues.apache.org/jira/browse/SOLR-2341
 Project: Solr
  Issue Type: New Feature
Reporter: William Mayor
Priority: Minor
 Attachments: SOLR-2341.patch, SOLR-2341.patch


 A first crack at creating policies to be used for determining to which of a 
 list of shards a document should go. See discussion on Distributed Indexing 
 on dev-list.



Re: Distributed Indexing

2011-02-01 Thread William Mayor
Hello

Thanks for your prompt reply.

Regarding using a SolrDocument instead of Strings (and I agree
that List<String> doesn't seem to be the best way to go), how do I
get a reference to a SolrDoc?

As far as I can see I have access to a List<ContentStream> that
represents all of the files being POSTed. Do I want to open these
streams, get the info and then stream them out? This seems wasteful.

I had instead thought that the DistributedUpdateRequestHandler would
take this List<ContentStream>, create some kind of mapping between each
stream and a unique id and then pass the ids to the policy.

Thanks for your help

Billy

On Tue, Feb 1, 2011 at 11:27 AM, Upayavira u...@odoko.co.uk wrote:
 On Tue, 01 Feb 2011 00:26 +, William Mayor
 m...@williammayor.co.uk wrote:
 Hi Guys

 I've had a go at creating the ShardDistributionPolicy interface and a
 few implementations. I've created a patch
 (https://issues.apache.org/jira/browse/SOLR-2341); let me know what
 needs doing.


 Currently I assume that the documents passed to the policy will be
 represented by some kind of identifier and that one needs only to
 match the ID with a shard. This is better (I think) than reading the
 document from the POST and figuring out some kind of unique
 identifier?

 Your code looks fine to me, except it should take in a SolrDocument
 object or list of, rather than strings. Then, for your Hash version, you
 can take a hash of the id field.
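
An editor's sketch of what "take a hash of the id field" could look like once the
policy receives documents; the method name and the assumption that "id" is the unique
key field are not from the patch.

  import org.apache.solr.common.SolrInputDocument;

  class HashOnIdField {
      static int shardFor(SolrInputDocument doc, int shardCount) {
          Object id = doc.getFieldValue("id");            // assumes "id" is the unique key field
          int hash = (id == null) ? 0 : id.hashCode();
          return (hash & Integer.MAX_VALUE) % shardCount; // keep the modulus non-negative
      }
  }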

 A question we've had about this is who decides what policy to use and
 where do they specify it? I'm inclined to think that the user (the person
 POSTing data) does not mind what policy is used but the administrator
 might. This leads me to think that the policy should be set in the
 solr config file? My colleagues disagree that the user will not mind
 and would rather see the policy be specified in the url. We've noticed
 that request handlers can be specified in both, so should we adopt this
 idea instead (and as a kind of compromise :) ).

 To stick with Solr conventions, you would specify the
 ShardDistributionPolicy in the solrconfig.xml, within the configuration
 of your DistributedUpdateRequestHandler, so in that sense, it is hidden
 from your users and managed by the administrator.

 However, if you follow this approach, an administrator could expose
 multiple policies by having multiple DistributedUpdateRequestHandler
 definitions in solrconfig.xml, with different URLs.

 To give you an example, but for search rather than indexing:

  <requestHandler name="/dismax" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="defType">dismax</str>
    </lst>
  </requestHandler>

 This will configure requests to http://localhost:8983/solr/dismax?q=blah to
 be handled by the dismax query parser.

 More relevant to you:

  <requestHandler name="/distrib" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="shards">http://localhost:8983/solr,http://localhost:7983/solr</str>
    </lst>
  </requestHandler>

 This would, by default, distribute all queries to
 http://localhost:8983/solr/distrib?q=blah across two Solr instances at
 the URLs described.

 For now, I'd say see if you can add a
 distributionPolicyClass=org.apache.solr.blah to define the class that
 this updateRequestHandler is going to use.
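
An editor's sketch of how a handler might read that parameter from its init arguments
in solrconfig.xml. The distributionPolicyClass name is part of the proposal above, and
the ShardDistributionPolicy/HashShardDistributionPolicy types are the hypothetical ones
sketched earlier in this digest.

  import org.apache.solr.common.util.NamedList;

  class PolicyLoader {
      static ShardDistributionPolicy load(NamedList args) {
          String className = (String) args.get("distributionPolicyClass");
          if (className == null) {
              return new HashShardDistributionPolicy();   // fall back to a default policy
          }
          try {
              return (ShardDistributionPolicy) Class.forName(className).newInstance();
          } catch (Exception e) {
              throw new RuntimeException("Could not load policy class " + className, e);
          }
      }
  }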

 To everyone else who got this far - please chip in if you see better
 ways of doing this.

 Upayavira

 All the best

 William

 On Sat, Jan 29, 2011 at 11:56 PM, Upayavira u...@odoko.co.uk wrote:
  Lance,
 
  Firstly, we're proposing a ShardDistributionPolicy interface for which
  there is a default (mod of the doc ID) but other implementations are
  possible. Another easy implementation would be a randomised or round
  robin one.
 
  As to threading, the first task would be to put all of the source
  documents into buckets, one bucket per shard, using the above
  ShardDistributionPolicy to assign documents to buckets/shards. Then all
  of the documents in a bucket could be sent to the relevant shard for
  indexing (which would be nothing more than a normal HTTP post (or solrj
  call?)).
 
  As to whether this would be single threaded or multithreaded, I would
  guess we would aim to do it the same as the distributed search code
  (which I have not yet reviewed). However, it could presumably be
  single-threaded, but use asynchronous HTTP.
 
  Regards, Upayavira
 
  On Sat, 29 Jan 2011 15:09 -0800, Lance Norskog goks...@gmail.com
  wrote:
  I would suggest that a DistributedRequestUpdateHandler run
  single-threaded, doing only one document at a time. If I want more
  than one, I run it twice or N times with my own program.
 
  Also, this should have a policy object which decides exactly how
  documents are distributed. There are different techniques for
  different use cases.
 
  Lance
 
  On Sat, Jan 29, 2011 at 12:34 PM, Soheb Mahmood soheb.luc...@gmail.com
  wrote

[jira] Created: (SOLR-2341) Shard distribution policy

2011-01-31 Thread William Mayor (JIRA)
Shard distribution policy
-

 Key: SOLR-2341
 URL: https://issues.apache.org/jira/browse/SOLR-2341
 Project: Solr
  Issue Type: New Feature
Reporter: William Mayor
Priority: Minor


A first crack at creating policies to be used for determining to which of a 
list of shards a document should go. See discussion on Distributed Indexing 
on dev-list.



[jira] Updated: (SOLR-2341) Shard distribution policy

2011-01-31 Thread William Mayor (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Mayor updated SOLR-2341:


Attachment: SOLR-2341.patch

We've created an interface for making policies and then implemented a few basic 
ideas. There are tests for the abstract policies but not the concrete ones.
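
As an editor's sketch of one of the simpler "basic ideas", a round-robin policy is
shown below; it reuses the hypothetical ShardDistributionPolicy interface sketched
earlier in this digest. Note that, like the random policy discussed in the thread, it
is not deterministic.

  import java.util.concurrent.atomic.AtomicInteger;

  class RoundRobinShardDistributionPolicy implements ShardDistributionPolicy {
      private final AtomicInteger counter = new AtomicInteger();

      public int selectShard(String documentId, int shardCount) {
          // Ignores the document id, so re-posting the same document can land it on
          // a different shard.
          return (counter.getAndIncrement() & Integer.MAX_VALUE) % shardCount;
      }
  }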

 Shard distribution policy
 -

 Key: SOLR-2341
 URL: https://issues.apache.org/jira/browse/SOLR-2341
 Project: Solr
  Issue Type: New Feature
Reporter: William Mayor
Priority: Minor
 Attachments: SOLR-2341.patch


 A first crack at creating policies to be used for determining to which of a 
 list of shards a document should go. See discussion on Distributed Indexing 
 on dev-list.



Re: Distributed Indexing

2011-01-31 Thread William Mayor
Hi Guys

I've had a go at creating the ShardDistributionPolicy interface and a
few implementations. I've created a patch
(https://issues.apache.org/jira/browse/SOLR-2341); let me know what
needs doing.

Currently I assume that the documents passed to the policy will be
represented by some kind of identifier and that one needs only to
match the ID with a shard. This is better (I think) than reading the
document from the POST and figuring out some kind of unique
identifier?

A question we've had about this is who decides what policy to use and
where do they specify it? I'm inclined to think that the user (the person
POSTing data) does not mind what policy is used but the administrator
might. This leads me to think that the policy should be set in the
solr config file? My colleagues disagree that the user will not mind
and would rather see the policy be specified in the url. We've noticed
that request handlers can be specified in both, so should we adopt this
idea instead (and as a kind of compromise :) ).

All the best

William

On Sat, Jan 29, 2011 at 11:56 PM, Upayavira u...@odoko.co.uk wrote:
 Lance,

 Firstly, we're proposing a ShardDistributionPolicy interface for which
 there is a default (mod of the doc ID) but other implementations are
 possible. Another easy implementation would be a randomised or round
 robin one.

 As to threading, the first task would be to put all of the source
 documents into buckets, one bucket per shard, using the above
 ShardDistributionPolicy to assign documents to buckets/shards. Then all
 of the documents in a bucket could be sent to the relevant shard for
 indexing (which would be nothing more than a normal HTTP post (or solrj
 call?)).
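
An editor's sketch of that bucketing step: group the documents per shard using the
policy, then each bucket can be posted to its shard in a single request. The
ShardDistributionPolicy interface and its selectShard() method are the hypothetical
ones sketched earlier, and "id" is assumed to be the unique key.

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import org.apache.solr.common.SolrInputDocument;

  class ShardBucketing {
      static Map<String, List<SolrInputDocument>> bucket(List<SolrInputDocument> docs,
                                                         String[] shardUrls,
                                                         ShardDistributionPolicy policy) {
          Map<String, List<SolrInputDocument>> buckets =
                  new HashMap<String, List<SolrInputDocument>>();
          for (String url : shardUrls) {
              buckets.put(url, new ArrayList<SolrInputDocument>());
          }
          for (SolrInputDocument doc : docs) {
              String id = String.valueOf(doc.getFieldValue("id"));
              int shard = policy.selectShard(id, shardUrls.length);
              buckets.get(shardUrls[shard]).add(doc);
          }
          return buckets;
      }
  }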

 As to whether this would be single threaded or multithreaded, I would
 guess we would aim to do it the same as the distributed search code
 (which I have not yet reviewed). However, it could presumably be
 single-threaded, but use asynchronous HTTP.

 Regards, Upayavira

 On Sat, 29 Jan 2011 15:09 -0800, Lance Norskog goks...@gmail.com
 wrote:
 I would suggest that a DistributedRequestUpdateHandler run
 single-threaded, doing only one document at a time. If I want more
 than one, I run it twice or N times with my own program.

 Also, this should have a policy object which decides exactly how
 documents are distributed. There are different techniques for
 different use cases.

 Lance

 On Sat, Jan 29, 2011 at 12:34 PM, Soheb Mahmood soheb.luc...@gmail.com
 wrote:
  Hello Yonik,
 
  On Thu, 2011-01-27 at 08:01 -0500, Yonik Seeley wrote:
  Making it easy for clients I think is key... one should be able to
  update any node in the solr cluster and have solr take care of the
  hard part about updating all relevant shards.  This will most likely
  involve an update processor.  This approach allows all existing update
  methods (including things like CSV file upload) to still work
  correctly.
 
  Also post.jar is really just for testing... a command-line replacement
  for curl for those who may not have it.  It's not really a
  recommended way for updating Solr servers in production.
 
  OK, I've abandoned the post.jar tool idea in favour of a
  DistributedUpdateRequestProcessor class (I've been looking into other
  classes like UpdateRequestProcessor, RunUpdateRequestProcessor,
  SignatureUpdateProcessorFactory, and SolrQueryRequest to see how they
  are used/what data they store - hence why I've taken some time to
  respond).
 
  My big question now is: is it necessary to have a Factory class for
  DistributedUpdateRequestProcessor? I've seen this lots of times, from
  RunUpdateProcessorFactory (where the factory class was only a few lines
  of code) to SignatureUpdateProcessorFactory. At first I was thinking it
  would be a good design idea to include one (in a generic sense), but
  then I thought harder and realised that the
  DistributedUpdateRequestHandler would only be running once, taking in all
  the requests, so it seems sort of pointless to write one.
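
An editor's note on why the tiny factories exist: the chain is configured once from
solrconfig.xml, but a fresh processor is created for every update request via the
factory's getInstance() call. A minimal sketch, reusing the hypothetical
DistributedUpdateRequestProcessor and HashShardDistributionPolicy sketched earlier in
this digest:

  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.response.SolrQueryResponse;
  import org.apache.solr.update.processor.UpdateRequestProcessor;
  import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

  public class DistributedUpdateRequestProcessorFactory extends UpdateRequestProcessorFactory {

      public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                                SolrQueryResponse rsp,
                                                UpdateRequestProcessor next) {
          // The factory's job is small: read any per-request parameters and hand
          // them to a new processor instance for this request.
          String shardsParam = req.getParams().get("shards");
          String[] shards = (shardsParam == null) ? new String[0] : shardsParam.split(",");
          return new DistributedUpdateRequestProcessor(shards, new HashShardDistributionPolicy(), next);
      }
  }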
 
  That is my burning question for now. I have got a few more questions,
  but I'm sure that when I look further into the code, I'll either have
  more or they'll all be answered.
 
  Many thanks!
 
  Soheb Mahmood
 
 
 
 



 --
 Lance Norskog
 goks...@gmail.com


 ---
 Enterprise Search Consultant at Sourcesense UK,
 Making Sense of Open Source

