[
https://issues.apache.org/jira/browse/SOLR-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180213#comment-13180213
]
Mark Miller commented on SOLR-3001:
-----------------------------------
bq. Btw. The fact is the indexing is much, much faster right now using
distributed indexing as the shards are getting documents in batches.
You mean faster after you updated to the latest rev? There was no buffering
originally, so even if you were streaming, it would use the HttpCommons server
and send docs around one by one. Late last week I added the buffering though.
Right now it buffers 10 docs per target shard, but I was thinking about whether
we should make that configurable and/or raise it.
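For illustration only (this is a hypothetical sketch, not the actual SolrCloud
code; the class and method names are made up), per-shard buffering along these
lines could look as follows, with the per-shard limit as a constructor parameter:
{noformat}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.common.SolrInputDocument;

// Hypothetical sketch only -- not the real SolrCloud classes.
public class ShardDocBuffer {

  private final int bufferSize; // e.g. 10 docs per target shard
  private final Map<String, List<SolrInputDocument>> buffers =
      new HashMap<String, List<SolrInputDocument>>();

  public ShardDocBuffer(int bufferSize) {
    this.bufferSize = bufferSize;
  }

  /** Queue a doc for a shard; flush that shard's buffer once it is full. */
  public void add(String shardUrl, SolrInputDocument doc) {
    List<SolrInputDocument> buffer = buffers.get(shardUrl);
    if (buffer == null) {
      buffer = new ArrayList<SolrInputDocument>(bufferSize);
      buffers.put(shardUrl, buffer);
    }
    buffer.add(doc);
    if (buffer.size() >= bufferSize) {
      flush(shardUrl);
    }
  }

  /** Send the buffered docs for one shard in a single update request. */
  private void flush(String shardUrl) {
    List<SolrInputDocument> buffer = buffers.get(shardUrl);
    if (buffer == null || buffer.isEmpty()) {
      return;
    }
    // forward 'buffer' to shardUrl in one request (omitted in this sketch)
    buffer.clear();
  }
}
{noformat}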
> Documents dropping when using DistributedUpdateProcessor
> --------------------------------------------------------
>
> Key: SOLR-3001
> URL: https://issues.apache.org/jira/browse/SOLR-3001
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.0
> Environment: Windows 7, Ubuntu
> Reporter: Rafał Kuć
> Assignee: Mark Miller
> Fix For: 4.0
>
>
> I have a problem with distributed indexing on the solrcloud branch. I've set
> up a cluster with three Solr servers and I'm using DistributedUpdateProcessor
> to do the distributed indexing. What I've noticed is that when indexing with
> StreamingUpdateSolrServer or CommonsHttpSolrServer using a queue or a list
> that holds more than one document, documents seem to be dropped. I ran some
> tests that tried to index 450k documents. If I sent the documents one by one,
> the indexing was executed properly and the three Solr instances held 450k
> documents (when summed up). However, when I tried to add documents in batches
> (for example with StreamingUpdateSolrServer and a queue of 1000), the shard I
> was sending the documents to had only a minimal number of documents (about
> 100) while the other shards had about 150k documents.
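> For reference (this is only an illustrative sketch added for clarity, not
> code from the original report; the URL, field names, and document count are
> placeholders), the batched indexing described above looks roughly like this
> with a SolrJ client of that era:
> {noformat}
> import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
> import org.apache.solr.common.SolrInputDocument;
>
> // Illustrative driver only; URL and field names are placeholders.
> public class BatchIndexer {
>   public static void main(String[] args) throws Exception {
>     // client-side queue of 1000 docs, 2 background threads
>     StreamingUpdateSolrServer server =
>         new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 2);
>
>     for (int i = 0; i < 450000; i++) {
>       SolrInputDocument doc = new SolrInputDocument();
>       doc.addField("id", Integer.toString(i));
>       doc.addField("name", "document " + i);
>       server.add(doc); // queued and sent to Solr in batches by the client
>     }
>
>     server.blockUntilFinished(); // wait for the queued batches to be flushed
>     server.commit();
>   }
> }
> {noformat}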
> Each Solr instance was started with a single core in ZooKeeper mode. An
> example solr.xml file:
> {noformat}
> <?xml version="1.0" encoding="UTF-8" ?>
> <solr persistent="true">
>   <cores defaultCoreName="collection1" adminPath="/admin/cores"
>          zkClientTimeout="10000" hostPort="8983" hostContext="solr">
>     <core shard="shard1" instanceDir="." name="collection1" />
>   </cores>
> </solr>
> {noformat}
> The solrconfig.xml file on each of the shards contained the following
> entries:
> {noformat}
> <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
>   <lst name="defaults">
>     <str name="update.chain">distrib</str>
>   </lst>
> </requestHandler>
> {noformat}
> {noformat}
> <updateRequestProcessorChain name="distrib">
>   <processor
>       class="org.apache.solr.update.processor.DistributedUpdateProcessorFactory" />
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory"/>
> </updateRequestProcessorChain>
> {noformat}
> I found a solution, but I don't know if it is the proper one. I've modified
> the code that is responsible for handling the replicas in
> {{private List<String> setupRequest(int hash)}} of
> {{DistributedUpdateProcessorFactory}}.
> I've added the following code:
> {noformat}
> // make sure the leader URL is always included in the list of target URLs
> if (urls == null) {
>   urls = new ArrayList<String>(1);
>   urls.add(leaderUrl);
> } else if (!urls.contains(leaderUrl)) {
>   urls.add(leaderUrl);
> }
> {noformat}
> after:
> {noformat}
> urls = getReplicaUrls(req, collection, shardId, nodeName);
> {noformat}
> If this is the proper approach, I'll be glad to provide a patch with the
> modification.
> --
> Regards
> Rafał Kuć
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/