[ https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179400#comment-13179400 ]
Rafał Kuć commented on SOLR-3001:
---------------------------------

Mark, I've tried the newest solrcloud branch and I'm afraid the problem still exists. To test it, I indexed 425543 documents using StreamingUpdateSolrServer (queue size 10000, 3 threads). All documents were sent to shard1. After indexing finished, the three shards held the following numbers of documents:

shard1: 5 documents
shard2: 142424 documents
shard3: 141275 documents

A query like q=*:*&distrib=true returns 283704 documents in total. So Solr dropped 141839 documents, which should probably be in the first shard, the one I'm sending the documents to.

If I send the documents one by one using CommonsHttpSolrServer, the numbers are as follows:

shard1: 141725 documents
shard2: 142474 documents
shard3: 141344 documents

I'm using Solr version solr-spec-version 4.0.0.2012.01.04.10.42.06 (from Solr admin). I ran the test both with update.chain set and without it; the behavior was the same both times.

Btw., distributed indexing is now much, much faster, as the shards are getting documents in batches.

> Documents droping when using DistributedUpdateProcessor
> -------------------------------------------------------
>
>                 Key: SOLR-3001
>                 URL: https://issues.apache.org/jira/browse/SOLR-3001
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.0
>         Environment: Windows 7, Ubuntu
>            Reporter: Rafał Kuć
>
> I have a problem with distributed indexing in the solrcloud branch. I've set up a
> cluster with three Solr servers. I'm using DistributedUpdateProcessor to do
> the distributed indexing. What I've noticed is that when indexing with
> StreamingUpdateSolrServer or CommonsHttpSolrServer and using a queue or a
> list that holds more than one document, documents seem to be dropped. I
> did some tests which tried to index 450k documents.
> If I sent the
> documents one by one, the indexing was executed properly and the three Solr
> instances held 450k documents (when summed up). However, when I
> tried to add documents in batches (for example with StreamingUpdateSolrServer
> and a queue of 1000), the shard I was sending the documents to had a minimal
> number of documents (about 100) while the other shards had about 150k
> documents.
> Each Solr was started with a single core and in Zookeeper mode. An example
> solr.xml file:
> {noformat}
> <?xml version="1.0" encoding="UTF-8" ?>
> <solr persistent="true">
>  <cores defaultCoreName="collection1" adminPath="/admin/cores"
> zkClientTimeout="10000" hostPort="8983" hostContext="solr">
>   <core shard="shard1" instanceDir="." name="collection1" />
>  </cores>
> </solr>
> {noformat}
> The solrconfig.xml file on each of the shards contained the following
> entries:
> {noformat}
> <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
>  <lst name="defaults">
>   <str name="update.chain">distrib</str>
>  </lst>
> </requestHandler>
> {noformat}
> {noformat}
> <updateRequestProcessorChain name="distrib">
>  <processor
> class="org.apache.solr.update.processor.DistributedUpdateProcessorFactory" />
>  <processor class="solr.LogUpdateProcessorFactory" />
>  <processor class="solr.RunUpdateProcessorFactory"/>
> </updateRequestProcessorChain>
> {noformat}
> I found a solution, but I don't know if it is a proper one.
> I've modified the
> code that is responsible for handling the replicas in
> {{private List<String> setupRequest(int hash)}} of
> {{DistributedUpdateProcessorFactory}}.
> I've added the following code:
> {noformat}
> if (urls == null) {
>   urls = new ArrayList<String>(1);
>   urls.add(leaderUrl);
> } else {
>   if (!urls.contains(leaderUrl)) {
>     urls.add(leaderUrl);
>   }
> }
> {noformat}
> after:
> {noformat}
> urls = getReplicaUrls(req, collection, shardId, nodeName);
> {noformat}
> If this is the proper approach I'll be glad to provide a patch with the
> modification.
> --
> Regards
> Rafał Kuć
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
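
For reference, the change proposed in the issue description can be isolated from Solr and sketched as a small standalone helper. This is only an illustration of the logic, not the actual Solr code: the class name LeaderUrlSketch, the method withLeader and the example URLs are hypothetical, and the replica list stands in for whatever getReplicaUrls(...) returns (possibly null). The null check and the contains check from the proposed patch collapse into one path here.

```java
import java.util.ArrayList;
import java.util.List;

public class LeaderUrlSketch {

    // Hypothetical helper: returns the forwarding targets for a document,
    // making sure the leader URL is always among them.
    // replicaUrls may be null, mirroring the null check in the proposed patch.
    static List<String> withLeader(List<String> replicaUrls, String leaderUrl) {
        List<String> urls = (replicaUrls == null)
                ? new ArrayList<String>(1)
                : new ArrayList<String>(replicaUrls); // copy, so the caller's list is not modified
        if (!urls.contains(leaderUrl)) {
            urls.add(leaderUrl);
        }
        return urls;
    }

    public static void main(String[] args) {
        String leader = "http://host1:8983/solr/"; // assumed example URL

        // No replicas known: the leader becomes the only target.
        System.out.println(withLeader(null, leader));

        // Replicas known, leader missing: the leader is appended.
        List<String> replicas = new ArrayList<String>();
        replicas.add("http://host2:8983/solr/");
        System.out.println(withLeader(replicas, leader));

        // Leader already listed: no duplicate is added.
        replicas.add(leader);
        System.out.println(withLeader(replicas, leader));
    }
}
```

Unlike the snippet in the description, this version copies the list instead of modifying it in place, which keeps the helper free of side effects. Whether adding the leader at this point is the right fix inside Solr is still the open question raised above.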