Hi everyone, I'm looking for some help with Solr indexing issues on a large scale.
We are indexing a few terabytes per month on a sizeable Solr cluster (8 masters serving writes, 16 slaves serving reads). After a certain amount of tuning we got to the point where a single Solr instance can handle an index size of 100GB without much trouble, but beyond that we start to observe noticeable delays on index flush, and they keep getting larger. See the attached picture for details; it shows a single JVM on a single machine. We are posting data in 8 threads using the javabin format, committing every 5K documents, with merge factor 20 and a RAM buffer size of about 384MB.

From the picture it can be seen that the single-threaded index-flushing code kicks in on every commit and blocks all other indexing threads. The hardware is decent (12 physical / 24 virtual cores per machine) and it is mostly idle while the index is flushing: very little CPU utilization and disk I/O (<5%), with the exception of the single CPU core that actually does the index flush (95% CPU, 5% I/O wait).

My questions are:

1) Will the Solr changes from the real-time branch help resolve these issues? I was reading http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html and it looks like we have exactly the same problem.

2) What would be the best way to port these (and only these) changes to 3.4.0? I tried to dig into the branching and revisions, but got lost quickly. I tried something like "svn diff […]realtime_search@r953476 […]realtime_search@r1097767", but I'm not sure if it's even possible to merge these into 3.4.0.

3) What would you recommend for production 24/7 use? 3.4.0?

4) Is there a workaround that can be used?

I have also listed the stack trace below.

Thank you!
Roman

P.S. This single "index flushing" thread spends 99% of its time in org.apache.lucene.index.BufferedDeletesStream.applyDeletes, and then the merge seems to go quickly. I looked it up, and it appears the intent here is deleting old commit points (we keep only 1 non-optimized commit point per our config).
Not sure why it is taking that long.

pool-2-thread-1 [RUNNABLE] CPU time: 3:31
  java.nio.Bits.copyToByteArray(long, Object, long, long)
  java.nio.DirectByteBuffer.get(byte[], int, int)
  org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, int)
  org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
  org.apache.lucene.index.SegmentTermEnum.next()
  org.apache.lucene.index.TermInfosReader.<init>(Directory, String, FieldInfos, int, int)
  org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentReader, Directory, SegmentInfo, int, int)
  org.apache.lucene.index.SegmentReader.get(boolean, Directory, SegmentInfo, int, boolean, int)
  org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean, int, int)
  org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean)
  org.apache.lucene.index.BufferedDeletesStream.applyDeletes(IndexWriter$ReaderPool, List)
  org.apache.lucene.index.IndexWriter.doFlush(boolean)
  org.apache.lucene.index.IndexWriter.flush(boolean, boolean)
  org.apache.lucene.index.IndexWriter.closeInternal(boolean)
  org.apache.lucene.index.IndexWriter.close(boolean)
  org.apache.lucene.index.IndexWriter.close()
  org.apache.solr.update.SolrIndexWriter.close()
  org.apache.solr.update.DirectUpdateHandler2.closeWriter()
  org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)
  org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run()
  java.util.concurrent.Executors$RunnableAdapter.call()
  java.util.concurrent.FutureTask$Sync.innerRun()
  java.util.concurrent.FutureTask.run()
  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor$ScheduledFutureTask)
  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
  java.util.concurrent.ThreadPoolExecutor$Worker.run()
  java.lang.Thread.run()
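For reference, here is roughly what the relevant part of our solrconfig.xml looks like (a sketch, not a verbatim copy; exact element placement varies between Solr versions, and the deletion policy shown is how we configured the single non-optimized commit point mentioned earlier):

```xml
<!-- Index-time settings (approximate): merge factor 20, ~384MB RAM buffer -->
<indexDefaults>
  <mergeFactor>20</mergeFactor>
  <ramBufferSizeMB>384</ramBufferSizeMB>
</indexDefaults>

<mainIndex>
  <!-- Keep only one non-optimized commit point -->
  <deletionPolicy class="solr.SolrDeletionPolicy">
    <str name="maxCommitsToKeep">1</str>
    <str name="maxOptimizedCommitsToKeep">0</str>
  </deletionPolicy>
</mainIndex>
```

The commit-every-5K-documents behavior is driven by our client code, not by an autoCommit setting.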