You might find this useful:

https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/

One tricky bit: assuming docs are distributed randomly among the
shards, you should batch so at least 100 docs go to each _shard_. You
can see from the link that most of the speedup comes from going from
1 to 100 docs per batch. So if you have 5 shards, I'd create batches
of at least 500. That was a fairly simple test with stupid-simple
docs. Large complicated documents wouldn't show the same curve.
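
A minimal sketch of that arithmetic, assuming documents are routed
uniformly across shards. The class and method names are hypothetical,
and the 100-docs-per-shard floor is just the rule of thumb from the
paragraph above, not a Solr setting.

```java
// Hypothetical helper, not part of SolrJ.
final class BatchSizing {
    /**
     * Smallest client-side batch that averages at least docsPerShard
     * documents per shard, assuming uniform routing across shards.
     */
    static int minBatchSize(int numShards, int docsPerShard) {
        if (numShards <= 0 || docsPerShard <= 0) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        // multiplyExact fails loudly on int overflow instead of wrapping around
        return Math.multiplyExact(numShards, docsPerShard);
    }
}
```

With 5 shards and a floor of 100 docs per shard this gives 500, the
figure mentioned above.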

Setup for PULL and TLOG isn't hard: just specify the number of TLOG or
PULL replicas you want at collection creation time. NOTE: this is only
available in Solr 7.x. See:
https://lucene.apache.org/solr/guide/7_3/shards-and-indexing-data-in-solrcloud.html#types-of-replicas
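
As a rough illustration, the replica counts can be passed as
parameters on a Collections API CREATE request. The sketch below only
builds the request URL. The helper itself, the base URL and the
collection name are assumptions made for the example; nrtReplicas,
tlogReplicas and pullReplicas are the Solr 7.x parameter names as I
understand them, so check them against the guide linked above.

```java
// Hypothetical helper that only assembles a URL string; it sends nothing.
final class CreateCollectionUrl {
    /**
     * Builds a Collections API CREATE URL with explicit replica-type counts.
     * Assumes the collection name contains only URL-safe characters.
     */
    static String build(String solrBaseUrl, String collection, int numShards,
                        int nrtReplicas, int tlogReplicas, int pullReplicas) {
        return solrBaseUrl + "/admin/collections?action=CREATE"
                + "&name=" + collection
                + "&numShards=" + numShards
                + "&nrtReplicas=" + nrtReplicas      // default replica type
                + "&tlogReplicas=" + tlogReplicas    // replicate via transaction log
                + "&pullReplicas=" + pullReplicas;   // copy segments from the leader
    }
}
```

For example, build("http://localhost:8983/solr", "mycoll", 3, 0, 2, 1)
asks for 3 shards with two TLOG replicas and one PULL replica each.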

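About creating your own queue, mine usually look like this
("docs" stands for whatever iterator feeds you documents, X is the
batch size):

List<SolrInputDocument> list = new ArrayList<>();
while (docs.hasNext()) {
  list.add(docs.next());
  if (list.size() >= X) {
      client.add(list);
      list.clear();
  }
}
if (!list.isEmpty()) {   // send the last, partial batch
  client.add(list);
}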

Not exactly a sophisticated queue ;).....
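
For something reusable, here is a minimal self-contained sketch of the
same idea. BatchingBuffer is a hypothetical class, not part of SolrJ,
and the sink is a plain Consumer standing in for a call such as
client.add(batch); in real code that call throws checked exceptions
that the sink would have to handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: collects items and hands them to a sink in batches.
final class BatchingBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending = new ArrayList<>();

    BatchingBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one item and flushes once batchSize items are pending. */
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is pending; call once at the end for the last partial batch. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Pass a copy so the sink may keep the list after we clear ours.
        sink.accept(new ArrayList<>(pending));
        pending.clear();
    }
}
```

This is single-threaded on purpose. It does not reproduce the
background sending that ConcurrentUpdateSolrClient does with its
queue; it only groups documents before each add call.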

On Tue, May 15, 2018 at 8:15 AM, Bernd Fehling
<bernd.fehl...@uni-bielefeld.de> wrote:
> Hi Erik,
>
> yes indeed, batching solved it.
> I used ConcurrentUpdateSolrClient with queue size of 10000 but
> CloudSolrClient doesn't have this feature.
> I build my own queue now.
>
> Ah!!! So I obviously use default NRT but actually don't need it because
> I don't have any NRT data to index. A latency of several hours is OK for me.
> Currently I'm testing with a 3x3 core-cluster (3 servers, 3 cores per
> server).
>
> I also tested with a 3x3 node-cluster (3 servers, 3 nodes per server),
> which performed better, with less influence from garbage collection.
>
> I have to read more about PULL or TLOG replicas, how to set them up and
> so on.
> If it is too complex I will go with NRT; indexing happens during the
> night anyway.
> Thanks for pointing this out.
>
> Regards,
> Bernd
>
>
> On 15.05.2018 at 13:28, Erick Erickson wrote:
>>
>> What did you do to solve your performance problem?
>>
>> Batching updates is one thing that helps performance.
>>
>> bq.  I thought that only the leaders are under load
>> until any commit and then replicate to the other replicas.
>>
>> True if (and only if) you're using PULL or TLOG replicas.
>> When using the default NRT replicas, every replica indexes
>> the docs, it doesn't matter whether they are the leader or replica.
>> That's required for NRT. Using CloudSolrClient has no bearing
>> on that functionality.
>>
>> Best,
>> Erick
>>
>> On Tue, May 15, 2018 at 6:53 AM, Bernd Fehling
>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>
>>> Thanks, solved, performance is good now.
>>>
>>> Regards,
>>> Bernd
>>>
>>>
>>> On 15.05.2018 at 08:12, Bernd Fehling wrote:
>>>>
>>>>
>>>> OK, I have the CloudSolrClient with SolrJ now running but it seems
>>>> a bit slower compared to ConcurrentUpdateSolrClient.
>>>> This was not expected.
>>>> The logs show that CloudSolrClient sends the docs only to the leaders.
>>>>
>>>> So the only advantage of CloudSolrClient is that it is "Cloud aware"?
>>>>
>>>> With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
>>>> With CloudSolrClient I get only about 1200 docs/sec.
>>>>
>>>> The system monitoring shows that with CloudSolrClient all nodes and
>>>> cores are under heavy load. I thought that only the leaders are under
>>>> load until any commit and then replicate to the other replicas,
>>>> and that the replicas which are not leaders have capacity to answer
>>>> search requests.
>>>>
>>>> I think I still don't get the advantage of CloudSolrClient?
>>>>
>>>> Regards,
>>>> Bernd
>>>>
>>>>
>>>>
>>>> On 09.05.2018 at 19:15, Erick Erickson wrote:
>>>>>
>>>>>
>>>>> You may not need to deal with any of this.
>>>>>
>>>>> The default CloudSolrClient call creates a new LBHttpSolrClient for
>>>>> you. So unless you're doing something custom with any LBHttpSolrClient
>>>>> you create, you don't need to create one yourself.
>>>>>
>>>>> Second, the default for CloudSolrClient.add() is to split the list of
>>>>> documents you provide into sub-lists that consist of the docs destined
>>>>> for a particular shard and send those to the leader.
>>>>>
>>>>> Does the default not work for you?
>>>>>
>>>>> Best,
>>>>> Erick
>>>>>
>>>>> On Wed, May 9, 2018 at 2:54 AM, Bernd Fehling
>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>>
>>>>>>
>>>>>> Hi list,
>>>>>>
>>>>>> while going from single core master/slave to cloud multi core/node
>>>>>> with leader/replica I want to change my SolrJ loading, because
>>>>>> ConcurrentUpdateSolrClient isn't cloud aware and has performance
>>>>>> impacts.
>>>>>> I want to use CloudSolrClient with LBHttpSolrClient and updates
>>>>>> should only go to shard leaders.
>>>>>>
>>>>>> Question, what is the difference between sendUpdatesOnlyToShardLeaders
>>>>>> and sendDirectUpdatesToShardLeadersOnly?
>>>>>>
>>>>>> Regards,
>>>>>> Bernd
