Hi Vincenzo,

Yes.
On Thu, Jun 16, 2022 at 12:39 PM Vincenzo D'Amore <[email protected]> wrote:

> Hi Marius, if I have understood correctly, you have a deleteByQuery for
> each document, am I right?
>
> On Thu, 16 Jun 2022 at 11:04, Marius Grigaitis <[email protected]> wrote:
>
> > Just a follow-up on the topic.
> >
> > * We checked the settings on Solr; they seem quite default (especially
> >   the merge and commit strategies).
> > * We commit every 10 minutes.
> > * We added New Relic to the Solr instance to gather more data and
> >   graphs.
> >
> > In the end, what caught our eye was a few deleteByQuery lines in the
> > stacks of running threads while Solr was overloaded. We temporarily
> > removed deleteByQuery and saw roughly a 10x improvement in indexing
> > speed.
> >
> > How are we using deleteByQuery?
> >
> > update(add=[{uid: foo-123, sku: 123, ...}, {uid: bar-124, sku: 124}, ...],
> >        deleteByQuery=["sku: 123 AND uid != foo-123",
> >                       "sku: 124 AND uid != bar-124"])
> >
> > UID is the uniqueKey for the index. We do this because "foo" or "bar"
> > could change, and we no longer want the previous document present.
> >
> > Ideally we should probably change our uniqueKey to `sku` in this case,
> > and we would no longer need deleteByQuery. Still, it would be
> > interesting to know why deleteByQuery causes such a performance
> > bottleneck, and how we could optimize it if we wanted to keep it.
> >
> > Marius
> >
> > On Wed, Jun 8, 2022 at 8:41 PM David Hastings <[email protected]> wrote:
> >
> > > > * Do NOT commit after each batch of 1000 docs. Instead, commit as
> > > > seldom as your requirements allow, e.g. try commitWithin=60000 to
> > > > commit every minute.
> > >
> > > This is the big one. Commit after the entire process is done, or on a
> > > timer if you don't need NRT searching (rarely does anyone need that).
> > > The commit is a heavy operation and takes about the same time whether
> > > you are committing 1000 documents or 100k documents.
> > >
> > > On Wed, Jun 8, 2022 at 10:40 AM Jan Høydahl <[email protected]> wrote:
> > >
> > > > * Go multi-threaded for each core, as Shawn says. Try e.g. 2, 3
> > > >   and 4 threads.
> > > > * Experiment with different batch sizes, e.g. try 500 and 2000;
> > > >   what is optimal depends on your docs.
> > > > * Do NOT commit after each batch of 1000 docs. Instead, commit as
> > > >   seldom as your requirements allow, e.g. try commitWithin=60000
> > > >   to commit every minute.
> > > >
> > > > Tip: Try to push Solr metrics to DataDog or some other service
> > > > where you can see a dashboard with stats on requests/sec, RAM, CPU,
> > > > threads, GC, etc., which may answer your last question.
> > > >
> > > > Jan
> > > >
> > > > > On 8 Jun 2022, at 14:06, Shawn Heisey <[email protected]> wrote:
> > > > >
> > > > > On 6/8/2022 3:35 AM, Marius Grigaitis wrote:
> > > > >> * 9 different cores. Each weighs around 100 MB on disk and has
> > > > >>    approximately 90k documents.
> > > > >> * Updating is performed using the update method in batches of
> > > > >>    1000, with around 9 processes in parallel (split by core).
> > > > >
> > > > > This means that indexing within each Solr core is
> > > > > single-threaded. The way to increase indexing speed is to index
> > > > > in parallel with multiple threads or processes per index. If you
> > > > > can increase the CPU power available on the Solr server when you
> > > > > increase the number of processes/threads sending data to Solr,
> > > > > that might help.
> > > > >
> > > > > Thanks,
> > > > > Shawn
>
> --
> Vincenzo D'Amore
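The pattern Marius describes can usually be replaced with a delete-by-id plus commitWithin, which avoids the query execution that makes a per-document deleteByQuery expensive. Below is a minimal sketch of how the two JSON request bodies might be built, assuming the indexer tracks the previous uid for each sku; the URL, function name, and document values are illustrative, not from the thread:

```python
import json

# Hypothetical endpoint; adjust host/core to your setup. POST both bodies
# here with Content-Type: application/json and ?commitWithin=600000 so Solr
# commits on its own schedule instead of per batch.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"

def build_requests(docs, stale_uids):
    """Build JSON bodies for two POSTs to Solr's /update handler.

    Instead of one deleteByQuery per document (which forces Solr to run
    a query against the index while indexing), delete the stale documents
    by their uniqueKey (uid), which is a cheap term-level delete.
    """
    add_body = json.dumps(docs)                       # a top-level JSON array is treated as adds
    delete_body = json.dumps({"delete": stale_uids})  # delete-by-id list
    return add_body, delete_body

docs = [{"uid": "foo-123", "sku": 123}, {"uid": "bar-124", "sku": 124}]
stale_uids = ["old-foo-123", "old-bar-124"]  # previous uids, tracked by the indexer
add_body, delete_body = build_requests(docs, stale_uids)
```

If the previous uid cannot be tracked, switching the uniqueKey to sku (as Marius suggests) removes the need for any delete at all: re-adding a document with the same sku simply overwrites the old one.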

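Shawn's and Jan's advice — multiple indexing threads per core, tuned batch sizes, and no per-batch commits — can be sketched as follows. The helper names and defaults are illustrative, not from the thread; `post_batch` stands in for whatever HTTP client POSTs one batch to the core's /update handler with a commitWithin parameter:

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(docs, size):
    """Split a list of docs into batches of at most `size`."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def index_core(docs, post_batch, batch_size=1000, threads=4):
    """Send batches of documents to one Solr core from several threads.

    post_batch(batch) is expected to POST the batch to the core's /update
    handler with commitWithin set, so no explicit commits happen here.
    Jan's suggestion: experiment with batch_size (500-2000) and threads
    (2-4) to find what is optimal for your documents and hardware.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Each batch becomes one POST; the threads keep the core busy in
        # parallel instead of indexing single-threaded.
        list(pool.map(post_batch, chunked(docs, batch_size)))
```

With 9 cores, running one such loop per core (as Marius already does) times 2-4 threads each multiplies the concurrent indexing streams, so watch that the Solr server has the CPU headroom to absorb them.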