We index rarely and in bulk as we’re an organisation that deals in enabling access to leaked documents for journalists.
The indexes are mostly static for 99% of the year. We only optimise after reindexing due to schema changes or when we have a new leak.

Our workflow is to index on a staging server, optimise, then trigger replication to a production instance of Solr. We cannot index straight to production as extracting text from documents is expensive (lots of EC2 machines running Extract <https://github.com/ICIJ/extract>) and we need to really hammer the Solr server with updates (up to 250 concurrent update requests at times).

I’ve never done benchmark tests, but it’s an interesting question. I always worked on the assumption that if the optimise operation exists then there must be a reason. Also something tells me that having your index spread over 70 files must be bad.

The OOM error is certainly due to something else, as it happens when we try indexing text extracted from multi-gigabyte archives.

On 3 Mar 2017, at 17:45, Erick Erickson <erickerick...@gmail.com> wrote:

Matthew:

What load testing have you done on optimized vs. unoptimized indexes? Is there enough of a performance gain to be worth the trouble? Toke's indexes are pretty static, and in his situation it's worth the effort. Before spending a lot of cycles on making optimization work/understanding the ins and outs I'd really recommend you see if any performance gain is worth it ;)...

And as I mentioned earlier, optimizing is unlikely to be related to OOMs during indexing. You never know of course....

Best,
Erick

On Fri, Mar 3, 2017 at 3:40 AM, Caruana, Matthew <mcaru...@icij.org> wrote:

Thank you, you’re right - only one of the four cores is hitting 100%. This is the correct answer. The bottleneck is CPU exacerbated by an absence of parallelisation.

On 3 Mar 2017, at 12:32, Toke Eskildsen <t...@kb.dk> wrote:

On Thu, 2017-03-02 at 15:39 +0000, Caruana, Matthew wrote:

Thank you. The question remains however, if this is such a hefty operation then why is it walking to the destination instead of running, so to speak?

We only do optimize on an old Solr 4.10 setup, but for that we have plenty of experience. At least for single-shard, and at least for most of the work, optimize is a single-threaded process: It takes us ~8 hours to optimize a ~900GB shard using SSDs, with 1 CPU-core at near 100% and the other ones not doing anything.

The machine load number is a bit fuzzy, but if you run top during optimization, my guess is that you will see the same thing as we do: Only 1 CPU-core working.

--
Toke Eskildsen, Royal Danish Library
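
For readers following the thread, here is a rough sketch of the "optimise on staging, then trigger replication to production" workflow described above. Nothing in it comes from the posters' actual setup: the host names, the core name `leaks` and the helper names are hypothetical. It assumes the standard Solr update handler (`optimize=true`, `maxSegments`) and the replication handler's `fetchindex` command are enabled on the cores involved.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def optimize_url(base_url, core, max_segments=1):
    """Build the update URL that asks Solr to optimise (force-merge) a core."""
    query = urlencode({"optimize": "true", "maxSegments": max_segments, "wt": "json"})
    return f"{base_url}/{core}/update?{query}"


def fetchindex_url(base_url, core, master_url=None):
    """Build the replication URL that asks a core to pull the index from its master."""
    params = {"command": "fetchindex", "wt": "json"}
    if master_url:
        # Optional override of the master configured in solrconfig.xml.
        params["masterUrl"] = master_url
    return f"{base_url}/{core}/replication?{urlencode(params)}"


def call(url, timeout=None):
    """Issue a GET request and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Hypothetical usage (host and core names are assumptions):
#   call(optimize_url("http://staging:8983/solr", "leaks"))   # may block for hours
#   call(fetchindex_url("http://production:8983/solr", "leaks"))
```

The optimise call is left without a timeout because, as noted in the thread, a single-threaded merge of a large shard can take hours.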