[jira] Created: (SOLR-2358) Distributing Indexing
Distributing Indexing

Key: SOLR-2358
URL: https://issues.apache.org/jira/browse/SOLR-2358
Project: Solr
Issue Type: New Feature
Reporter: William Mayor
Priority: Minor

The first steps towards creating distributed indexing functionality in Solr.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-2341) Shard distribution policy
[ https://issues.apache.org/jira/browse/SOLR-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Mayor updated SOLR-2341:

Attachment: SOLR-2341.patch

This patch makes the implemented policy deterministic. This is missing from the previous patch. The policy code has also been refactored into its own package.

Shard distribution policy

Key: SOLR-2341
URL: https://issues.apache.org/jira/browse/SOLR-2341
Project: Solr
Issue Type: New Feature
Reporter: William Mayor
Priority: Minor
Attachments: SOLR-2341.patch, SOLR-2341.patch

A first crack at creating policies to be used for determining to which of a list of shards a document should go. See discussion on Distributed Indexing on dev-list.
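To make "deterministic" concrete here: the same document id must always map to the same shard, so re-sent or updated documents land where the original copy lives. A minimal runnable sketch, assuming an interface shaped like the one under discussion (the interface name matches the thread, but the method signature and class names are invented, not the actual SOLR-2341 patch):

```java
// Hypothetical sketch of a deterministic shard distribution policy.
// The signature and the hash-based implementation are illustrative only.
interface ShardDistributionPolicy {
    int shardFor(String docId, int numShards);
}

class HashShardDistributionPolicy implements ShardDistributionPolicy {
    // Deterministic: the same document id always maps to the same shard.
    public int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }
}

public class PolicyDemo {
    public static void main(String[] args) {
        ShardDistributionPolicy policy = new HashShardDistributionPolicy();
        int first = policy.shardFor("doc-42", 3);
        int second = policy.shardFor("doc-42", 3);
        System.out.println(first == second); // prints "true": assignment is repeatable
    }
}
```

A randomised policy fails this property, which is presumably why it was dropped from the patch.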
Re: Distributed Indexing
Hi

Good call about the policies being deterministic, should've thought of that earlier. We've changed the patch to include this and I've removed the random assignment one (for obvious reasons). Take a look and let me know what's left to do. (https://issues.apache.org/jira/browse/SOLR-2341)

Cheers

William

On Thu, Feb 3, 2011 at 5:00 PM, Upayavira u...@odoko.co.uk wrote:

On Thu, 03 Feb 2011 15:12 +, Alex Cowell alxc...@gmail.com wrote:

Hi all,

Just a couple of questions that have arisen.

1. For handling non-distributed update requests (the shards param is not present or is invalid), our code currently:

- assumes the user would like the data indexed, so gets the request handler assigned to /update
- executes the request using core.execute() for the SolrCore associated with the original request

Is this what we want it to do, and is using core.execute() from within a request handler a valid method of passing on the update request?

Take a look at how it is done in handler.component.SearchHandler.handleRequestBody(). I'd say try to follow as similar an approach as possible. E.g. it is the SearchHandler that does much of the work, branching depending on whether it found a shards parameter.

2. We have partially implemented an update processor which actually generates and sends the split update requests to each specified shard (as designated by the policy). As it stands, the code shares a lot in common with the HttpCommComponent class used for distributed search. Should we look at opening up the HttpCommComponent class so it could be used by our request handler as well, or should we continue with our current implementation and worry about that later?

I agree that you are going to want to implement an UpdateRequestProcessor. However, it would seem to me that, unlike search, you're not going to want to bother with the existing processor and associated component chain; you're going to want to replace the processor with a distributed version.
As to the HttpCommComponent, I'd suggest you make your own educated decision. How similar is the class? Could one serve both needs effectively?

3. Our update processor uses a MultiThreadedHttpConnectionManager to send parallel updates to shards. Can anyone give some appropriate values for the defaultMaxConnectionsPerHost and maxTotalConnections params? Won't the values used for distributed search be a little high for distributed indexing?

You are right, these will likely be lower for distributed indexing; however, I'd suggest not worrying about it for now, as it is easy to tweak later.

Upayavira

---
Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
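For readers following along, the shape of question 3 is: per-shard batches are posted concurrently, with the connection manager bounding how many go out at once. A runnable sketch with a plain ExecutorService standing in for Commons HttpClient's MultiThreadedHttpConnectionManager (no real HTTP is performed; the send method, shard URLs, and pool size are all illustrative, with the pool size playing the role of maxTotalConnections):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelShardUpdater {
    // Stand-in for an HTTP POST of a batch of documents to one shard.
    static String send(String shardUrl, List<String> docs) {
        return shardUrl + " indexed " + docs.size() + " docs";
    }

    public static void main(String[] args) throws Exception {
        Map<String, List<String>> batches = Map.of(
                "http://localhost:8983/solr", List.of("a", "b"),
                "http://localhost:7983/solr", List.of("c"));
        // Pool size bounds concurrency, like maxTotalConnections would.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<String>> results = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : batches.entrySet()) {
            results.add(pool.submit(() -> send(e.getKey(), e.getValue())));
        }
        for (Future<String> f : results) {
            System.out.println(f.get()); // one result line per shard
        }
        pool.shutdown();
    }
}
```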
[jira] Issue Comment Edited: (SOLR-2341) Shard distribution policy
[ https://issues.apache.org/jira/browse/SOLR-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991225#comment-12991225 ]

William Mayor edited comment on SOLR-2341 at 2/6/11 10:00 PM:

This patch makes the implemented policy deterministic. This is missing from the previous patch. The policy code has also been refactored into its own package.

was (Author: williammayor):
This patch makes the implemented policy deterministic. This is missing from the previous patch. The policy code has also been refactored into it's own package.

Shard distribution policy

Key: SOLR-2341
URL: https://issues.apache.org/jira/browse/SOLR-2341
Project: Solr
Issue Type: New Feature
Reporter: William Mayor
Priority: Minor
Attachments: SOLR-2341.patch, SOLR-2341.patch

A first crack at creating policies to be used for determining to which of a list of shards a document should go. See discussion on Distributed Indexing on dev-list.
Re: Distributed Indexing
Hello

Thanks for your prompt reply.

Regarding using a SolrDocument instead of Strings (and I agree that List<String> doesn't seem to be the best way to go): how do I get a reference to a SolrDocument? As far as I can see I have access to a List<ContentStream> that represents all of the files being POSTed. Do I want to open these streams, get the info and then stream them out? This seems wasteful. I had instead thought that the DistributedUpdateRequestHandler would take this List<ContentStream>, create some kind of mapping between each stream and a unique id and then pass the ids to the policy.

Thanks for your help

Billy

On Tue, Feb 1, 2011 at 11:27 AM, Upayavira u...@odoko.co.uk wrote:

On Tue, 01 Feb 2011 00:26 +, William Mayor m...@williammayor.co.uk wrote:

Hi Guys

I've had a go at creating the ShardDistributionPolicy interface and a few implementations. I've created a patch (https://issues.apache.org/jira/browse/SOLR-2341); let me know what needs doing. Currently I assume that the documents passed to the policy will be represented by some kind of identifier and that one needs only to match the ID with a shard. This is better (I think) than reading the document from the POST and figuring out some kind of unique identifier?

Your code looks fine to me, except it should take in a SolrDocument object, or a list of them, rather than strings. Then, for your hash version, you can take a hash of the id field.

A question we've had about this is who decides what policy to use, and where do they specify it? I'm inclined to think that the user (the person POSTing data) does not mind what policy is used but the administrator might. This leads me to think that the policy should be set in the solr config file. My colleagues disagree that the user will not mind and would rather see the policy specified in the URL. We've noticed that request handlers can be specified in both, so should we adopt this idea instead (and as a kind of compromise :) )?
To stick with Solr conventions, you would specify the ShardDistributionPolicy in the solrconfig.xml, within the configuration of your DistributedUpdateRequestHandler, so in that sense, it is hidden from your users and managed by the administrator. However, if you follow this approach, an administrator could expose multiple policies by having multiple DistributedUpdateRequestHandler definitions in solrconfig.xml, with different URLs.

To give you an example, but for search rather than indexing:

<requestHandler name="/dismax" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="defType">dismax</str>
  </lst>
</requestHandler>

This will configure requests to http://localhost:8983/solr/dismax?q=blah to be handled by the dismax query parser.

More relevant to you:

<requestHandler name="/distrib" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="shards">http://localhost:8983/solr,http://localhost:7983/solr</str>
  </lst>
</requestHandler>

This would, by default, distribute all queries to http://localhost:8983/solr/distrib?q=blah across the two Solr instances at the URLs described.

For now, I'd say see if you can add a distributionPolicyClass=org.apache.solr.blah to define the class that this updateRequestHandler is going to use.

To everyone else who got this far - please chip in if you see better ways of doing this.

Upayavira

All the best

William

On Sat, Jan 29, 2011 at 11:56 PM, Upayavira u...@odoko.co.uk wrote:

Lance,

Firstly, we're proposing a ShardDistributionPolicy interface for which there is a default (mod of the doc ID) but other implementations are possible. Another easy implementation would be a randomised or round-robin one.

As to threading, the first task would be to put all of the source documents into buckets, one bucket per shard, using the above ShardDistributionPolicy to assign documents to buckets/shards.
Then all of the documents in a bucket could be sent to the relevant shard for indexing (which would be nothing more than a normal HTTP POST (or a SolrJ call?)). As to whether this would be single-threaded or multithreaded, I would guess we would aim to do it the same as the distributed search code (which I have not yet reviewed). However, it could presumably be single-threaded, but use asynchronous HTTP.

Regards, Upayavira

On Sat, 29 Jan 2011 15:09 -0800, Lance Norskog goks...@gmail.com wrote:

I would suggest that a DistributedRequestUpdateHandler run single-threaded, doing only one document at a time. If I want more than one, I run it twice or N times with my own program. Also, this should have a policy object which decides exactly how documents are distributed. There are different techniques for different use cases.

Lance

On Sat, Jan 29, 2011 at 12:34 PM, Soheb Mahmood soheb.luc...@gmail.com wrote
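The bucketing step described above can be sketched roughly as follows (all names are invented for illustration; the mod-of-id default stands in for whatever policy is configured):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: group document ids into one bucket per shard, so
// each bucket can then be sent to its shard in a single HTTP POST (or
// SolrJ call). Not the actual patch code.
public class ShardBucketer {
    public static Map<Integer, List<String>> bucket(List<String> docIds, int numShards) {
        Map<Integer, List<String>> buckets = new HashMap<>();
        for (String id : docIds) {
            int shard = Math.floorMod(id.hashCode(), numShards); // mod-of-id default
            buckets.computeIfAbsent(shard, k -> new ArrayList<>()).add(id);
        }
        return buckets;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> buckets = bucket(List.of("a", "b", "c", "d"), 2);
        int total = buckets.values().stream().mapToInt(List::size).sum();
        System.out.println(total); // prints "4": every document lands in exactly one bucket
    }
}
```

Once bucketed, sending each bucket is independent of the others, which is what makes the single-threaded-with-async-HTTP option workable.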
[jira] Created: (SOLR-2341) Shard distribution policy
Shard distribution policy

Key: SOLR-2341
URL: https://issues.apache.org/jira/browse/SOLR-2341
Project: Solr
Issue Type: New Feature
Reporter: William Mayor
Priority: Minor

A first crack at creating policies to be used for determining to which of a list of shards a document should go. See discussion on Distributed Indexing on dev-list.
[jira] Updated: (SOLR-2341) Shard distribution policy
[ https://issues.apache.org/jira/browse/SOLR-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Mayor updated SOLR-2341:

Attachment: SOLR-2341.patch

We've created an interface for making policies and then implemented a few basic ideas. There are tests for the abstract policies but not the concrete ones.

Shard distribution policy

Key: SOLR-2341
URL: https://issues.apache.org/jira/browse/SOLR-2341
Project: Solr
Issue Type: New Feature
Reporter: William Mayor
Priority: Minor
Attachments: SOLR-2341.patch

A first crack at creating policies to be used for determining to which of a list of shards a document should go. See discussion on Distributed Indexing on dev-list.
Re: Distributed Indexing
Hi Guys

I've had a go at creating the ShardDistributionPolicy interface and a few implementations. I've created a patch (https://issues.apache.org/jira/browse/SOLR-2341); let me know what needs doing.

Currently I assume that the documents passed to the policy will be represented by some kind of identifier and that one needs only to match the ID with a shard. This is better (I think) than reading the document from the POST and figuring out some kind of unique identifier?

A question we've had about this is who decides what policy to use, and where do they specify it? I'm inclined to think that the user (the person POSTing data) does not mind what policy is used but the administrator might. This leads me to think that the policy should be set in the solr config file. My colleagues disagree that the user will not mind and would rather see the policy specified in the URL. We've noticed that request handlers can be specified in both, so should we adopt this idea instead (and as a kind of compromise :) )?

All the best

William

On Sat, Jan 29, 2011 at 11:56 PM, Upayavira u...@odoko.co.uk wrote:

Lance,

Firstly, we're proposing a ShardDistributionPolicy interface for which there is a default (mod of the doc ID) but other implementations are possible. Another easy implementation would be a randomised or round-robin one.

As to threading, the first task would be to put all of the source documents into buckets, one bucket per shard, using the above ShardDistributionPolicy to assign documents to buckets/shards. Then all of the documents in a bucket could be sent to the relevant shard for indexing (which would be nothing more than a normal HTTP POST (or a SolrJ call?)). As to whether this would be single-threaded or multithreaded, I would guess we would aim to do it the same as the distributed search code (which I have not yet reviewed). However, it could presumably be single-threaded, but use asynchronous HTTP.
Regards, Upayavira

On Sat, 29 Jan 2011 15:09 -0800, Lance Norskog goks...@gmail.com wrote:

I would suggest that a DistributedRequestUpdateHandler run single-threaded, doing only one document at a time. If I want more than one, I run it twice or N times with my own program. Also, this should have a policy object which decides exactly how documents are distributed. There are different techniques for different use cases.

Lance

On Sat, Jan 29, 2011 at 12:34 PM, Soheb Mahmood soheb.luc...@gmail.com wrote:

Hello Yonik,

On Thu, 2011-01-27 at 08:01 -0500, Yonik Seeley wrote:

Making it easy for clients I think is key... one should be able to update any node in the solr cluster and have solr take care of the hard part about updating all relevant shards. This will most likely involve an update processor. This approach allows all existing update methods (including things like CSV file upload) to still work correctly. Also post.jar is really just for testing... a command-line replacement for curl for those who may not have it. It's not really a recommended way of updating Solr servers in production.

OK, I've abandoned the post.jar tool idea in favour of a DistributedUpdateRequestProcessor class. (I've been looking into other classes like UpdateRequestProcessor, RunUpdateRequestProcessor, SignatureUpdateProcessorFactory, and SolrQueryRequest to see how they are used and what data they store, which is why I've taken some time to respond.)

My big question now is: is it necessary to have a Factory class for DistributedUpdateRequestProcessor? I've seen this lots of times, from RunUpdateProcessorFactory (where the factory class was only a few lines of code) to SignatureUpdateProcessorFactory. At first I was thinking it would be a good design idea to include one (in a generic sense), but then I thought harder: the DistributedUpdateRequestHandler would only be running once, taking in all the requests, so it seems sort of pointless to write one in.
That is my burning question for now. I have got a few more questions, but I'm sure that when I look further into the code, I'll either have more or all of my questions answered.

Many thanks!

Soheb Mahmood

--
Lance Norskog goks...@gmail.com

---
Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
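For what it's worth, the factory split Soheb asks about usually comes down to lifecycle: the factory is constructed once (from solrconfig.xml) and holds configuration, while a fresh processor is handed out per request so per-request state never leaks between updates. A minimal sketch of that shape (class and method names are invented for illustration, not Solr's actual API):

```java
import java.util.List;

// Sketch of the factory/processor lifecycle split: config is parsed once
// into the factory; each request gets its own processor instance.
interface UpdateProcessor {
    String processAdd(String docId);
}

class DistributedUpdateProcessor implements UpdateProcessor {
    private final List<String> shards;

    DistributedUpdateProcessor(List<String> shards) { this.shards = shards; }

    public String processAdd(String docId) {
        int shard = Math.floorMod(docId.hashCode(), shards.size());
        return docId + " -> " + shards.get(shard);
    }
}

class DistributedUpdateProcessorFactory {
    private final List<String> shards; // read once from config at startup

    DistributedUpdateProcessorFactory(List<String> shards) { this.shards = shards; }

    // Called per request; any per-request state lives in the processor.
    UpdateProcessor getInstance() {
        return new DistributedUpdateProcessor(shards);
    }
}

public class FactoryDemo {
    public static void main(String[] args) {
        DistributedUpdateProcessorFactory factory = new DistributedUpdateProcessorFactory(
                List.of("http://localhost:8983/solr", "http://localhost:7983/solr"));
        UpdateProcessor perRequest = factory.getInstance();
        System.out.println(perRequest.processAdd("doc-1"));
    }
}
```

So even if the processor logic itself is stateless today, keeping the factory costs a few lines and preserves the one-instance-per-request convention the other processors follow.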