Interesting find. I have seen other reports of very slow deleteByQuery before.
So it should be used sparingly, and under no circumstances should you bombard
Solr with multiple deleteByQuery requests on each update.

Sounds like a better plan to switch to a truly unique ID like SKU. Or if you
know the previous ID, use delete-by-id instead, which is much faster. While
you're stuck with deleteByQuery, it would probably also be more efficient to
collapse multiple deleteByQuery requests into one, i.e. (("sku: 123 AND uid !=
foo-123") OR ("sku: 124 AND uid != bar-124") ...) as one query rather than
individual ones. And try to batch them, say 100 at a time every once in a
while, rather than sending many small requests.
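Something like this untested sketch (my own, so treat it as a starting point;
I've written the `uid != x` pseudocode as Solr's negation syntax `-uid:x`, and
the batch size of 100 is just an assumption to tune):

```python
import json

def collapse_delete_queries(pairs):
    """pairs: iterable of (sku, uid) tuples; returns one OR'ed Solr query.

    Uses Solr's negation syntax -uid:"..." in place of the != pseudocode,
    since the standard query parser has no != operator.
    """
    return " OR ".join(f'(sku:{sku} AND -uid:"{uid}")' for sku, uid in pairs)

def batched_delete_queries(pairs, size=100):
    """Yield one collapsed query string per batch of `size` (sku, uid) pairs,
    so Solr sees a handful of large deletes instead of many small ones."""
    pairs = list(pairs)
    for i in range(0, len(pairs), size):
        yield collapse_delete_queries(pairs[i:i + size])

# And if you do know the old uid, delete-by-id is cheaper still --
# Solr's JSON update format accepts {"delete": [id, ...]}:
delete_by_id_body = json.dumps({"delete": ["foo-123", "bar-124"]})

print(collapse_delete_queries([("123", "foo-123"), ("124", "bar-124")]))
# (sku:123 AND -uid:"foo-123") OR (sku:124 AND -uid:"bar-124")
```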

Jan

> 16. jun. 2022 kl. 10:59 skrev Marius Grigaitis 
> <[email protected]>:
> 
> Just a followup on the topic.
> 
> * We checked settings on solr, seem quite default (especially on merge,
> commit strategies, etc)
> * We commit every 10 minutes
> * Added NewRelic to the Solr instance to gather more data and graphs
> 
> In the end what caught our eye is a few deleteByQuery lines in stacks of
> running threads while Solr is overloaded. We temporarily removed
> deleteByQuery and it had around 10x performance improvement on indexing
> speed.
> 
> How are we using deleteByQuery?
> 
> update(add=[{uid: foo-123, sku: 123, ...}, {uid: bar-124, sku: 124} ...],
> deleteByQuery=["sku: 123 AND uid != foo-123", "sku: 124 AND uid !=
> bar-124"])
> 
> UID is the uniqueKey for the index. We do this because "foo" or "bar" could
> change and we no longer want the previous document present.
> 
> Ideally we should probably change our uniqueKey to `sku` in this case,
> and we would no longer need deleteByQuery. But it would be interesting to
> know why deleteByQuery causes such a performance bottleneck, and how we
> could potentially optimize it if we wanted to keep it.
> 
> Marius
> 
> On Wed, Jun 8, 2022 at 8:41 PM David Hastings <[email protected]>
> wrote:
> 
>>> * Do NOT commit after each batch of 1000 docs. Instead, commit as seldom
>> as your requirements allows, e.g. try commitWithin=60000 to commit every
>> minute
>> 
>> This is the big one. Commit after the entire process is done, or on a
>> timer if you don't need NRT searching (rarely does anyone ever need that).
>> The commit is a heavy operation and takes about the same time whether you
>> are committing 1000 documents or 100k documents.
>> 
>> On Wed, Jun 8, 2022 at 10:40 AM Jan Høydahl <[email protected]> wrote:
>> 
>>> * Go multi threaded for each core as Shawn says. Try e.g. 2, 3 and 4
>>> threads
>>> * Experiment with different batch sizes, e.g. try 500 and 2000 - depends
>>> on your docs what is optimal
>>> * Do NOT commit after each batch of 1000 docs. Instead, commit as seldom
>>> as your requirements allows, e.g. try commitWithin=60000 to commit every
>>> minute
>>> 
>>> Tip: Try to push Solr metrics to DataDog or some other service, where you
>>> can see a dashboard with stats on requests/sec, RAM, CPU, threads, GC etc
>>> which may answer your last question.
>>> 
>>> Jan
>>> 
>>>> 8. jun. 2022 kl. 14:06 skrev Shawn Heisey <[email protected]>:
>>>> 
>>>> On 6/8/2022 3:35 AM, Marius Grigaitis wrote:
>>>>> * 9 different cores. Each weighs around ~100 MB on disk and has
>>>>> approximately 90k documents inside each.
>>>>> * Updating is performed using update method in batches of 1000, around
>>>>> 9 processes in parallel (split by core)
>>>> 
>>>> This means that indexing within each Solr core is single-threaded.  The
>>>> way to increase indexing speed is to index in parallel with multiple
>>>> threads or processes per index.  If you can increase the CPU power
>>>> available on the Solr server when you increase the number of
>>>> processes/threads sending data to Solr, that might help.
>>>> 
>>>> Thanks,
>>>> Shawn
>>>> 
>>> 
>>> 
>> 
