Documents dropping when using DistributedUpdateProcessor
--------------------------------------------------------

                 Key: SOLR-3001
                 URL: https://issues.apache.org/jira/browse/SOLR-3001
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
    Affects Versions: 4.0
         Environment: Windows 7, Ubuntu
            Reporter: Rafał Kuć


I have a problem with distributed indexing on the solrcloud branch. I've set 
up a cluster with three Solr servers and I'm using DistributedUpdateProcessor 
to do the distributed indexing. What I've noticed is that when indexing with 
StreamingUpdateSolrServer or CommonsHttpSolrServer and using a queue or a 
list that holds more than one document, documents seem to be dropped. I ran 
some tests that tried to index 450k documents. If I sent the documents one by 
one, indexing was executed properly and the three Solr instances held 450k 
documents when summed up. However, when I tried to add documents in batches 
(for example with StreamingUpdateSolrServer and a queue of 1000), the shard I 
was sending the documents to ended up with only a minimal number of documents 
(about 100), while the other shards had about 150k documents.
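
For reference, the batch case can be reproduced with a SolrJ snippet along 
these lines (just a sketch; the URL, queue size, thread count and field names 
are placeholders, not the exact values from my test):
{noformat} 
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexTest {
 public static void main(String[] args) throws Exception {
  // Buffer up to 1000 documents and send them with 2 background threads.
  StreamingUpdateSolrServer server =
    new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 2);

  for (int i = 0; i < 450000; i++) {
   SolrInputDocument doc = new SolrInputDocument();
   doc.addField("id", Integer.toString(i));
   doc.addField("name", "document " + i);
   server.add(doc); // documents are queued and sent in batches
  }

  server.commit();
  server.blockUntilFinished();
 }
}
{noformat} 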

Each Solr instance was started with a single core and in ZooKeeper mode. An 
example solr.xml file:
{noformat} 
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
 <cores defaultCoreName="collection1" adminPath="/admin/cores" 
zkClientTimeout="10000" hostPort="8983" hostContext="solr">
  <core shard="shard1" instanceDir="." name="collection1" />
 </cores>
</solr>
{noformat} 

The solrconfig.xml file on each of the shards contained the following entries:
{noformat} 
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
 <lst name="defaults">
  <str name="update.chain">distrib</str>
 </lst>
</requestHandler>
{noformat} 

{noformat} 
<updateRequestProcessorChain name="distrib">
 <processor 
class="org.apache.solr.update.processor.DistributedUpdateProcessorFactory" />
 <processor class="solr.LogUpdateProcessorFactory" />
 <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
{noformat} 

I found a solution, but I don't know if it is the proper one. I've modified 
the code that is responsible for handling the replica URLs in 
{{private List<String> setupRequest(int hash)}} of 
{{DistributedUpdateProcessorFactory}}.

I've added the following code:
{noformat} 
if (urls == null) {
 urls = new ArrayList<String>(1);
 urls.add(leaderUrl);  
} else {
 if (!urls.contains(leaderUrl)) {
  urls.add(leaderUrl);  
 }
}
{noformat} 

after:
{noformat} 
urls = getReplicaUrls(req, collection, shardId, nodeName);
{noformat} 
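
For what it's worth, the same guard can also be written a bit more compactly 
(just a sketch using the variable names from the snippet above, not code from 
the branch):
{noformat} 
// Make sure the leader URL always ends up in the list we forward to.
if (urls == null) {
 urls = new ArrayList<String>(1);
}
if (!urls.contains(leaderUrl)) {
 urls.add(leaderUrl);
}
{noformat} 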

If this is the proper approach, I'll be glad to provide a patch with the 
modification.

--
Regards
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
