Documents dropping when using DistributedUpdateProcessor
--------------------------------------------------------
Key: SOLR-3001
URL: https://issues.apache.org/jira/browse/SOLR-3001
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 4.0
Environment: Windows 7, Ubuntu
Reporter: Rafał Kuć
I have a problem with distributed indexing in the solrcloud branch. I've set up
a cluster of three Solr servers and I'm using DistributedUpdateProcessor to do
the distributed indexing. What I've noticed is that when indexing with
StreamingUpdateSolrServer or CommonsHttpSolrServer using a queue or a list
that holds more than one document, documents seem to be dropped. I ran some
tests that tried to index 450k documents. When I sent the documents one by
one, indexing worked properly and the three Solr instances held 450k documents
in total. However, when I added documents in batches (for example with
StreamingUpdateSolrServer and a queue of 1000), the shard I was sending the
documents to held only a small number of documents (about 100), while the
other shards had about 150k documents.
Each Solr instance was started with a single core in ZooKeeper mode. An example
solr.xml file:
{noformat}
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores defaultCoreName="collection1" adminPath="/admin/cores"
         zkClientTimeout="10000" hostPort="8983" hostContext="solr">
    <core shard="shard1" instanceDir="." name="collection1" />
  </cores>
</solr>
{noformat}
The solrconfig.xml file on each shard contained the following entries:
{noformat}
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">distrib</str>
  </lst>
</requestHandler>
{noformat}
{noformat}
<updateRequestProcessorChain name="distrib">
  <processor
    class="org.apache.solr.update.processor.DistributedUpdateProcessorFactory" />
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
{noformat}
I found a fix, but I don't know whether it is the proper one. I modified the
code that handles the replicas in
{{private List<String> setupRequest(int hash)}} of
{{DistributedUpdateProcessorFactory}} by adding the following code:
{noformat}
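// Make sure the leader URL is always among the target URLs.
if (urls == null) {
  // No replica URLs were returned: send to the leader only.
  urls = new ArrayList<String>(1);
  urls.add(leaderUrl);
} else {
  // Replica URLs exist: append the leader if it is missing.
  if (!urls.contains(leaderUrl)) {
    urls.add(leaderUrl);
  }
}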
{noformat}
after:
{noformat}
urls = getReplicaUrls(req, collection, shardId, nodeName);
{noformat}
If this is the proper approach I'll be glad to provide a patch with the
modification.
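
To make the intent of the change easier to check in isolation, here is a minimal, self-contained sketch of the same logic. The class name {{LeaderUrlSketch}}, the helper {{withLeader}}, and the example URLs are hypothetical and not part of Solr. Unlike the snippet above, the sketch copies the list instead of modifying it in place, which is an assumption made only to keep the example free of side effects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LeaderUrlSketch {

    // Hypothetical helper, not Solr code: returns a new list that contains
    // every replica URL plus the leader URL, without duplicating the leader.
    static List<String> withLeader(List<String> replicaUrls, String leaderUrl) {
        // A null input stands for "no replica URLs were returned".
        List<String> urls = (replicaUrls == null)
                ? new ArrayList<String>(1)
                : new ArrayList<String>(replicaUrls);
        // Append the leader only when it is not already a target.
        if (!urls.contains(leaderUrl)) {
            urls.add(leaderUrl);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Example URLs are placeholders chosen for illustration.
        String leader = "http://host1:8983/solr";
        System.out.println(withLeader(null, leader));
        System.out.println(withLeader(Arrays.asList("http://host2:8983/solr"), leader));
        System.out.println(withLeader(Arrays.asList(leader), leader));
    }
}
```

In all three cases the returned list contains the leader URL exactly once, which is the property the modification is meant to guarantee.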
--
Regards
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/