We index rarely and in bulk, as our organisation gives journalists access to
leaked documents.

The indexes are mostly static for 99% of the year. We only optimise after
reindexing due to schema changes or when we have a new leak.

Our workflow is to index on a staging server, optimise, then trigger
replication to a production instance of Solr. We cannot index straight to
production because extracting text from documents is expensive (lots of EC2
machines running Extract <https://github.com/ICIJ/extract>) and we need to
really hammer the Solr server with updates (up to 250 concurrent update
requests at times).
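
For readers following along, the staging-then-production workflow above can be sketched roughly as below. This is only an illustration, not our actual tooling: the host names and the core name are hypothetical, and it assumes the standard Solr update handler (`optimize=true`) and replication handler (`command=fetchindex`, issued against the production instance so that it pulls from staging).

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def optimize_url(base_url, core, max_segments=1):
    """Build the URL that asks a core's update handler to optimise."""
    params = urlencode({
        "optimize": "true",
        "maxSegments": max_segments,
        "waitSearcher": "true",
    })
    return f"{base_url.rstrip('/')}/{core}/update?{params}"


def fetchindex_url(base_url, core):
    """Build the URL that asks a replica's replication handler to pull the index."""
    params = urlencode({"command": "fetchindex"})
    return f"{base_url.rstrip('/')}/{core}/replication?{params}"


def trigger(url, timeout=None):
    """Issue the request and return the HTTP status code.

    An optimise of a large index can take hours, so no timeout by default.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical hosts and core name, for illustration only.
    staging = "http://staging.example.org:8983/solr"
    production = "http://production.example.org:8983/solr"
    trigger(optimize_url(staging, "leaks"))       # 1. optimise on staging
    trigger(fetchindex_url(production, "leaks"))  # 2. production pulls the index
```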

I’ve never done benchmark tests, but it’s an interesting question. I always
worked on the assumption that if the optimise operation exists then there
must be a reason. Also something tells me that having your index spread over
70 files must be bad.

The OOM error is certainly due to something else as it happens when we try
indexing text extracted from multi-gigabyte archives.

On 3 Mar 2017, at 17:45, Erick Erickson <erickerick...@gmail.com> wrote:

Matthew:

What load testing have you done on optimized vs. unoptimized indexes?
Is there enough of a performance gain to be worth the trouble? Toke's
indexes are pretty static, and in his situation it's worth the effort.
Before spending a lot of cycles on making optimization
work/understanding the ins and outs I'd really recommend you see if
any performance gain is worth it ;)...
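
A comparison like the one suggested above could be sketched along these lines: replay the same set of queries against the optimized and the unoptimized index and compare latency summaries. This is a minimal sketch, not a proper load test (it is single-threaded and ignores cache warm-up); `run_query` is a hypothetical callable that sends one query to whichever index is under test.

```python
import statistics
import time


def time_queries(run_query, queries):
    """Run each query once and return the wall-clock time of each call in seconds."""
    timings = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        timings.append(time.perf_counter() - start)
    return timings


def summarise(timings):
    """Return the median and an approximate 95th percentile of the timings."""
    ordered = sorted(timings)
    p95_index = max(0, int(round(0.95 * len(ordered))) - 1)
    return {"median": statistics.median(ordered), "p95": ordered[p95_index]}
```

Running `summarise(time_queries(run_query, queries))` once per index, with the same query list, gives two summaries that can be compared directly.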

And as I mentioned earlier, optimizing is unlikely to be related to
OOMs during indexing. You never know of course....

Best,
Erick

On Fri, Mar 3, 2017 at 3:40 AM, Caruana, Matthew <mcaru...@icij.org> wrote:
Thank you, you’re right - only one of the four cores is hitting 100%. This is 
the correct answer. The bottleneck is CPU exacerbated by an absence of 
parallelisation.

On 3 Mar 2017, at 12:32, Toke Eskildsen <t...@kb.dk> wrote:

On Thu, 2017-03-02 at 15:39 +0000, Caruana, Matthew wrote:
Thank you. The question remains however, if this is such a hefty
operation then why is it walking to the destination instead of
running, so to speak?

We only do optimize on an old Solr 4.10 setup, but for that we have
plenty of experience. At least for single-shard, and at least for most
of the work, optimize is a single-threaded process: It takes us ~8
hours to optimize a ~900GB shard using SSDs, with 1 CPU-core at near
100% and the other ones not doing anything.

The machine load number is a bit fuzzy, but if you run top during
optimization, my guess is that you will see the same thing as we do:
Only 1 CPU-core working.
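
As an alternative to watching top, the per-core picture can be checked programmatically. The sketch below is an assumption-laden illustration for Linux only: it parses two snapshots of `/proc/stat` and reports the busy fraction of each core between them. A single-threaded optimize would be expected to show one core near 1.0 and the rest near 0.0.

```python
def parse_proc_stat(text):
    """Map each per-core line (cpu0, cpu1, ...) to (busy_ticks, total_ticks)."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line and non-CPU lines.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # user, nice, system, idle, iowait, irq, softirq, steal
        values = [int(v) for v in fields[1:9]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        total = sum(values)
        cores[fields[0]] = (total - idle, total)
    return cores


def busy_fraction(before, after):
    """Fraction of time each core was busy between two parsed snapshots."""
    result = {}
    for core, (busy_after, total_after) in after.items():
        busy_before, total_before = before[core]
        delta = total_after - total_before
        result[core] = (busy_after - busy_before) / delta if delta else 0.0
    return result


def read_snapshot(path="/proc/stat"):
    """Read and parse the current CPU counters (Linux only)."""
    with open(path) as handle:
        return parse_proc_stat(handle.read())
```

Taking one `read_snapshot()` before and one a few seconds later, then passing both to `busy_fraction`, gives the per-core load over that interval.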
--
Toke Eskildsen, Royal Danish Library

