Hi Marius, if I have understood correctly, you issue a deleteByQuery for each document, am I right?
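For anyone following along, a per-document deleteByQuery like the one Marius describes below can be sketched as a Solr JSON-update body. This is only a minimal sketch: the `upsert_payload` helper is hypothetical, and the `uid`/`sku` field names are taken from Marius's message. Note also that the standard (Lucene) query parser has no `!=` operator; negation is written with a leading `-` (or `NOT`):

```python
import json

def upsert_payload(doc, id_field="uid", group_field="sku"):
    """Build a Solr JSON-update body that adds `doc` and deletes any
    older document sharing the same `group_field` value but having a
    different `id_field` value (hypothetical helper; field names are
    taken from the thread)."""
    # Lucene query syntax has no `!=`; negation is spelled with `-`.
    query = f"{group_field}:{doc[group_field]} AND -{id_field}:{doc[id_field]}"
    return json.dumps({
        "add": {"doc": doc},
        "delete": {"query": query},
    })

payload = upsert_payload({"uid": "foo-123", "sku": 123})
print(payload)
```

The resulting body would typically be POSTed to `/solr/<core>/update` with `Content-Type: application/json`.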
On Thu, 16 Jun 2022 at 11:04, Marius Grigaitis <[email protected]> wrote:

> Just a followup on the topic.
>
> * We checked the settings on Solr; they seem quite default (especially the
> merge and commit strategies, etc.)
> * We commit every 10 minutes
> * We added New Relic to the Solr instance to gather more data and graphs
>
> In the end, what caught our eye was a few deleteByQuery lines in the stacks
> of running threads while Solr was overloaded. We temporarily removed
> deleteByQuery and saw around a 10x improvement in indexing speed.
>
> How are we using deleteByQuery?
>
> update(add=[{uid: foo-123, sku: 123, ...}, {uid: bar-124, sku: 124} ...],
> deleteByQuery=["sku: 123 AND uid != foo-123", "sku: 123 AND uid != bar-124"])
>
> UID is the uniqueKey for the index. We do this because "foo" or "bar" could
> change and we no longer want the previous document present.
>
> Ideally we should probably change our uniqueKey to `sku` in this case, and
> we would no longer need deleteByQuery. But what could be interesting is why
> deleteByQuery causes such a performance bottleneck, as well as how we could
> optimize it if we wanted to keep it.
>
> Marius
>
> On Wed, Jun 8, 2022 at 8:41 PM David Hastings <[email protected]> wrote:
>
> > > * Do NOT commit after each batch of 1000 docs. Instead, commit as
> > > seldom as your requirements allow, e.g. try commitWithin=60000 to
> > > commit every minute
> >
> > This is the big one. Commit after the entire process is done, or on a
> > timer if you don't need NRT searching; rarely does anyone ever need that.
> > The commit is a heavy operation and takes about the same time whether
> > you are committing 1000 documents or 100k documents.
> >
> > On Wed, Jun 8, 2022 at 10:40 AM Jan Høydahl <[email protected]> wrote:
> >
> > > * Go multi-threaded for each core, as Shawn says. Try e.g. 2, 3 and 4
> > > threads
> > > * Experiment with different batch sizes, e.g. try 500 and 2000 - it
> > > depends on your docs what is optimal
> > > * Do NOT commit after each batch of 1000 docs. Instead, commit as
> > > seldom as your requirements allow, e.g. try commitWithin=60000 to
> > > commit every minute
> > >
> > > Tip: Try to push Solr metrics to DataDog or some other service, where
> > > you can see a dashboard with stats on requests/sec, RAM, CPU, threads,
> > > GC etc., which may answer your last question.
> > >
> > > Jan
> > >
> > > > On 8 Jun 2022, at 14:06, Shawn Heisey <[email protected]> wrote:
> > > >
> > > > On 6/8/2022 3:35 AM, Marius Grigaitis wrote:
> > > > > * 9 different cores. Each weighs around ~100 MB on disk and has
> > > > > approximately 90k documents inside.
> > > > > * Updating is performed using the update method in batches of
> > > > > 1000, around 9 processes in parallel (split by core)
> > > >
> > > > This means that indexing within each Solr core is single-threaded.
> > > > The way to increase indexing speed is to index in parallel with
> > > > multiple threads or processes per index. If you can increase the CPU
> > > > power available on the Solr server when you increase the number of
> > > > processes/threads sending data to Solr, that might help.
> > > >
> > > > Thanks,
> > > > Shawn

--
Vincenzo D'Amore
