[ https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178840#comment-13178840 ]
Mark Miller commented on SOLR-3001:
-----------------------------------

Also, it's worth noting that it's been a while since you needed to define your own update chain: the distrib update processor is now part of the default chain. You can of course still define a custom chain, but there is no need to.

> Documents droping when using DistributedUpdateProcessor
> -------------------------------------------------------
>
>                 Key: SOLR-3001
>                 URL: https://issues.apache.org/jira/browse/SOLR-3001
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.0
>         Environment: Windows 7, Ubuntu
>            Reporter: Rafał Kuć
>
> I have a problem with distributed indexing in the solrcloud branch. I've set up a
> cluster with three Solr servers. I'm using DistributedUpdateProcessor to do
> the distributed indexing. What I've noticed is that when indexing with
> StreamingUpdateSolrServer or CommonsHttpSolrServer and using a queue or a
> list that holds more than one document, the documents seem to be dropped. I
> did some tests which tried to index 450k documents. If I was sending the
> documents one by one, the indexing was properly executed and the three Solr
> instances were holding 450k documents (when summed up). However, when I
> tried to add documents in batches (for example with StreamingUpdateSolrServer
> and a queue of 1000), the shard I was sending the documents to had only a minimal
> number of documents (about 100) while the other shards had about 150k
> documents.
> Each Solr instance was started with a single core and in ZooKeeper mode. An example
> solr.xml file:
> {noformat}
> <?xml version="1.0" encoding="UTF-8" ?>
> <solr persistent="true">
>  <cores defaultCoreName="collection1" adminPath="/admin/cores" zkClientTimeout="10000" hostPort="8983" hostContext="solr">
>   <core shard="shard1" instanceDir="." name="collection1" />
>  </cores>
> </solr>
> {noformat}
> The solrconfig.xml file on each of the shards consisted of the following
> entries:
> {noformat}
> <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
>  <lst name="defaults">
>   <str name="update.chain">distrib</str>
>  </lst>
> </requestHandler>
> {noformat}
> {noformat}
> <updateRequestProcessorChain name="distrib">
>  <processor class="org.apache.solr.update.processor.DistributedUpdateProcessorFactory" />
>  <processor class="solr.LogUpdateProcessorFactory" />
>  <processor class="solr.RunUpdateProcessorFactory"/>
> </updateRequestProcessorChain>
> {noformat}
> I found a solution, but I don't know if it is a proper one. I've modified the
> code that is responsible for handling the replicas in:
> {{private List<String> setupRequest(int hash)}} of
> {{DistributedUpdateProcessorFactory}}
> I've added the following code:
> {noformat}
> if (urls == null) {
>   urls = new ArrayList<String>(1);
>   urls.add(leaderUrl);
> } else {
>   if (!urls.contains(leaderUrl)) {
>     urls.add(leaderUrl);
>   }
> }
> {noformat}
> after:
> {noformat}
> urls = getReplicaUrls(req, collection, shardId, nodeName);
> {noformat}
> If this is the proper approach I'll be glad to provide a patch with the
> modification.
> --
> Regards
> Rafał Kuć
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
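
For anyone who wants to try the leader-inclusion check from the issue description outside of Solr, here is a minimal stand-alone sketch. The class name LeaderUrlSketch, the method name withLeader, and the example URLs are hypothetical and exist only for illustration; this is not the actual DistributedUpdateProcessorFactory code. Unlike the snippet in the issue, it copies the replica list instead of modifying it in place. Whether adding the leader URL is the right fix at all is left open by the issue itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of Solr.
class LeaderUrlSketch {

    // Returns the list of target URLs for an update and guarantees that the
    // leader URL is present exactly once. replicaUrls may be null, which
    // mirrors the null check in the snippet from the issue description.
    static List<String> withLeader(List<String> replicaUrls, String leaderUrl) {
        List<String> urls = (replicaUrls == null)
                ? new ArrayList<String>(1)
                : new ArrayList<String>(replicaUrls); // copy, do not mutate the input
        if (!urls.contains(leaderUrl)) {
            urls.add(leaderUrl);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Example URLs are made up for the demonstration.
        List<String> replicas = new ArrayList<String>();
        replicas.add("http://host2:8983/solr");

        System.out.println(withLeader(null, "http://host1:8983/solr"));
        System.out.println(withLeader(replicas, "http://host1:8983/solr"));
    }
}
```

The two calls in main cover the two branches of the original snippet: no replica list at all, and a replica list that does not yet contain the leader.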