Hi everyone, I'm looking for some help with Solr indexing issues on a large scale.
We are indexing a few terabytes per month on a sizeable Solr cluster (8 masters serving writes, 16 slaves serving reads). After a certain amount of tuning we got to the point where a single Solr instance can handle an index size of 100GB without much trouble, but beyond that we start to observe noticeable delays on index flush, and they keep getting larger. See the attached picture for details; it shows a single JVM on a single machine. We are posting data in 8 threads using the javabin format, committing every 5K documents, with merge factor 20 and a RAM buffer size of about 384MB.

From the picture it can be seen that the single-threaded index-flushing code kicks in on every commit and blocks all other indexing threads. The hardware is decent (12 physical / 24 virtual cores per machine) and it is mostly idle while the index is flushing: very little CPU utilization and disk I/O (<5%), with the exception of the single CPU core that actually does the index flush (95% CPU, 5% I/O wait).

My questions are:

1) Will the Solr changes from the real-time branch help resolve these issues? I was reading http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html and it looks like we have exactly the same problem.

2) What would be the best way to port these (and only these) changes to 3.4.0? I tried to dig into the branching and revisions, but got lost quickly. I tried something like "svn diff […]realtime_search@r953476 […]realtime_search@r1097767", but I'm not sure if it's even possible to merge these into 3.4.0.

3) What would you recommend for production 24/7 use? 3.4.0?

4) Is there a workaround that can be used?

I have also listed the stack trace below.

Thank you!
Roman

P.S. This single "index flushing" thread spends 99% of its time in org.apache.lucene.index.BufferedDeletesStream.applyDeletes, and then the merge seems to go quickly. I looked it up, and it appears the intent here is deleting old commit points (we keep only 1 non-optimized commit point per our config).
Not sure why it is taking that long.

pool-2-thread-1 [RUNNABLE] CPU time: 3:31
  java.nio.Bits.copyToByteArray(long, Object, long, long)
  java.nio.DirectByteBuffer.get(byte[], int, int)
  org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, int)
  org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
  org.apache.lucene.index.SegmentTermEnum.next()
  org.apache.lucene.index.TermInfosReader.<init>(Directory, String, FieldInfos, int, int)
  org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentReader, Directory, SegmentInfo, int, int)
  org.apache.lucene.index.SegmentReader.get(boolean, Directory, SegmentInfo, int, boolean, int)
  org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean, int, int)
  org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean)
  org.apache.lucene.index.BufferedDeletesStream.applyDeletes(IndexWriter$ReaderPool, List)
  org.apache.lucene.index.IndexWriter.doFlush(boolean)
  org.apache.lucene.index.IndexWriter.flush(boolean, boolean)
  org.apache.lucene.index.IndexWriter.closeInternal(boolean)
  org.apache.lucene.index.IndexWriter.close(boolean)
  org.apache.lucene.index.IndexWriter.close()
  org.apache.solr.update.SolrIndexWriter.close()
  org.apache.solr.update.DirectUpdateHandler2.closeWriter()
  org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)
  org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run()
  java.util.concurrent.Executors$RunnableAdapter.call()
  java.util.concurrent.FutureTask$Sync.innerRun()
  java.util.concurrent.FutureTask.run()
  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor$ScheduledFutureTask)
  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
  java.util.concurrent.ThreadPoolExecutor$Worker.run()
  java.lang.Thread.run()
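For reference, here is roughly what the relevant part of our solrconfig.xml looks like (a sketch, not a verbatim copy; exact element placement varies between Solr versions, and the deletion policy shown is how we configured the single non-optimized commit point mentioned earlier):

```xml
<!-- Index-time settings (approximate): merge factor 20, ~384MB RAM buffer -->
<indexDefaults>
  <mergeFactor>20</mergeFactor>
  <ramBufferSizeMB>384</ramBufferSizeMB>
</indexDefaults>

<mainIndex>
  <!-- Keep only one non-optimized commit point -->
  <deletionPolicy class="solr.SolrDeletionPolicy">
    <str name="maxCommitsToKeep">1</str>
    <str name="maxOptimizedCommitsToKeep">0</str>
  </deletionPolicy>
</mainIndex>
```

The commit-every-5K-documents behavior is driven by our client code, not by an autoCommit setting.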