Just a follow-up on the topic.
* We checked the settings on Solr; they seem quite default (especially the
merge and commit strategies, etc.)
* We commit every 10 minutes
* Added New Relic to the Solr instance to gather more data and graphs
In the end, what caught our eye was a few deleteByQuery lines in the stacks
of running threads while Solr was overloaded. We temporarily removed
deleteByQuery and saw around a 10x improvement in indexing speed.
How are we using deleteByQuery?
update(add=[{uid: foo-123, sku: 123, ...}, {uid: bar-124, sku: 124}, ...],
       deleteByQuery=["sku: 123 AND uid != foo-123",
                      "sku: 124 AND uid != bar-124"])
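Concretely, building that request body looks roughly like the sketch below
(Python; build_update_body and the exact field handling are illustrative, not
our production code — note that the standard Solr query parser uses "-" for
negation rather than "!="):

```python
def build_update_body(docs):
    """Build a Solr-style update payload: add the new docs, and for each
    one, delete any older document sharing its sku under a different uid."""
    delete_queries = [
        # Solr negation syntax: "-uid:..." rather than "uid != ..."
        f"sku:{d['sku']} AND -uid:{d['uid']}"
        for d in docs
    ]
    return {"add": docs, "deleteByQuery": delete_queries}

body = build_update_body([
    {"uid": "foo-123", "sku": 123},
    {"uid": "bar-124", "sku": 124},
])
```

Each incoming document thus contributes one add plus one delete-by-query in
the same update request, which is exactly the pattern that showed up in the
thread stacks.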
UID is the uniqueKey for the index. We do this because "foo" or "bar" could
change, and we no longer want the previous document present.
Ideally, we should probably change our uniqueKey to `sku` in this case, and
then we would no longer need deleteByQuery. Still, it would be interesting to
understand why deleteByQuery causes such a performance bottleneck, and how we
could potentially optimize it if we wanted to keep it.
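For reference, if we kept uid as the uniqueKey, one commonly suggested
workaround is to resolve the affected uids first and then delete by id, which
Solr can apply much more cheaply than delete-by-query. A minimal sketch of
that idea (fetch_uids_for_sku is a hypothetical stand-in for a real lookup
against the index):

```python
def stale_uid_deletes(docs, fetch_uids_for_sku):
    """For each incoming doc, look up existing uids with the same sku and
    collect every uid that differs from the incoming one for delete-by-id."""
    stale = []
    for d in docs:
        for uid in fetch_uids_for_sku(d["sku"]):
            if uid != d["uid"]:
                stale.append(uid)
    return {"delete": stale}  # delete-by-id list, not delete-by-query

# Example with a stubbed sku -> uids lookup:
index = {123: ["old-123", "foo-123"], 124: ["bar-124"]}
payload = stale_uid_deletes(
    [{"uid": "foo-123", "sku": 123}, {"uid": "bar-124", "sku": 124}],
    lambda sku: index.get(sku, []),
)
```

This trades one extra query per batch for avoiding delete-by-query entirely,
so whether it wins depends on how often the uid for a given sku actually
changes.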
Marius
On Wed, Jun 8, 2022 at 8:41 PM David Hastings <[email protected]>
wrote:
> > * Do NOT commit after each batch of 1000 docs. Instead, commit as seldom
> as your requirements allows, e.g. try commitWithin=60000 to commit every
> minute
>
> this is the big one. commit after the entire process is done or on a
> timer, if you don't need NRT searching, rarely does anyone ever need that.
> the commit is a heavy operation and takes about the same time if you are
> committing 1000 documents or 100k documents.
>
> On Wed, Jun 8, 2022 at 10:40 AM Jan Høydahl <[email protected]> wrote:
>
> > * Go multi threaded for each core as Shawn says. Try e.g. 2, 3 and 4
> > threads
> > * Experiment with different batch sizes, e.g. try 500 and 2000 - depends
> > on your docs what is optimal
> > * Do NOT commit after each batch of 1000 docs. Instead, commit as seldom
> > as your requirements allows, e.g. try commitWithin=60000 to commit every
> > minute
> >
> > Tip: Try to push Solr metrics to DataDog or some other service, where you
> > can see a dashboard with stats on requests/sec, RAM, CPU, threads, GC etc
> > which may answer your last question.
> >
> > Jan
> >
> > > 8. jun. 2022 kl. 14:06 skrev Shawn Heisey <[email protected]>:
> > >
> > > On 6/8/2022 3:35 AM, Marius Grigaitis wrote:
> > >> * 9 different cores. Each weighs around ~100 MB on disk and has
> > >> approximately 90k documents inside each.
> > >> * Updating is performed using update method in batches of 1000,
> around 9
> > >> processes in parallel (split by core)
> > >
> > > This means that indexing within each Solr core is single-threaded. The
> > way to increase indexing speed is to index in parallel with multiple
> > threads or processes per index. If you can increase the CPU power
> > available on the Solr server when you increase the number of
> > processes/threads sending data to Solr, that might help.
> > >
> > > Thanks,
> > > Shawn
> > >
> >
> >
>