After enhancing the server with SSDs I'm trying to speed up indexing.

The server has 16 CPUs and more than 100G of RAM.
Java (1.8.0_92) has a 24G heap.
Solr is 4.10.4.
The plain XML data to load is 218G, about 96M records.
This results in a single index of 299G.

I tried 4, 8, 12 and 16 concurrent DIHs.
16 and 12 were too much for 16 CPUs, so my tests continued with 8
concurrent DIHs.
I then tried different <indexConfig> and <updateHandler> settings, but now
I'm stuck: I can't figure out the best settings for bulk indexing.
What I see is that indexing is "falling asleep" after some time.
It then only produces .del files, like _11_1.del, _w_2.del, _h_3.del, ...

<indexConfig>
    <maxIndexingThreads>8</maxIndexingThreads>
    <ramBufferSizeMB>1024</ramBufferSizeMB>
    <maxBufferedDocs>-1</maxBufferedDocs>
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">8</int>
      <int name="segmentsPerTier">100</int>
      <int name="maxMergedSegmentMB">512</int>
    </mergePolicy>
    <mergeFactor>8</mergeFactor>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
    <lockType>${solr.lock.type:native}</lockType>
    ...
</indexConfig>
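For reference, I also considered a variant of this block without the separate <mergeFactor> (as I understand it, TieredMergePolicy takes its parameters directly, so <mergeFactor> next to an explicit <mergePolicy> is redundant) and with the merge scheduler's thread limits set explicitly. This is an untested sketch; the thread and merge counts are guesses for my 16-CPU box:

```xml
<indexConfig>
    <maxIndexingThreads>8</maxIndexingThreads>
    <ramBufferSizeMB>1024</ramBufferSizeMB>
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">8</int>
      <int name="segmentsPerTier">100</int>
      <int name="maxMergedSegmentMB">512</int>
    </mergePolicy>
    <!-- no separate <mergeFactor>; TieredMergePolicy is configured above -->
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <!-- guessed values: cap concurrent merge work so it does not
           compete with the 8 DIH indexing threads -->
      <int name="maxThreadCount">4</int>
      <int name="maxMergeCount">8</int>
    </mergeScheduler>
    <lockType>${solr.lock.type:native}</lockType>
</indexConfig>
```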

<updateHandler class="solr.DirectUpdateHandler2">
     <!-- no autocommit at all -->
     <autoSoftCommit>
       <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
     </autoSoftCommit>
</updateHandler>
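Another variant I have been considering but have not tested yet: keep soft commits off, but add a periodic hard autoCommit with openSearcher=false, so the transaction log does not grow without bound while staying invisible to searchers. The maxTime value is just a guess:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- untested variant: hard commit only to flush the transaction log -->
  <autoCommit>
    <maxTime>600000</maxTime>           <!-- 10 minutes; guessed value -->
    <openSearcher>false</openSearcher>  <!-- no new searcher on commit -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
  </autoSoftCommit>
</updateHandler>
```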


The DIH request uses:
command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
After indexing finishes, a final optimize is run.
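For completeness, each of the concurrent DIH requests looks roughly like this; host, port and core name are placeholders for my setup:

```shell
# Placeholders: adjust host/port/core to the actual setup.
BASE="http://localhost:8983/solr/core1/dataimport"
PARAMS="command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false"
echo "${BASE}?${PARAMS}"      # the URL each DIH request is sent to
# curl -s "${BASE}?${PARAMS}"
```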

My idea is: if 8 DIHs use 8 CPUs, then I have 8 CPUs left for merging
(maxIndexingThreads / maxMergeAtOnce / mergeFactor).
It should do no commits and no optimize during indexing.
ramBufferSizeMB is high because I have plenty of RAM and want to make use
of its speed.
segmentsPerTier is high to reduce merging.

But somewhere there is a misconfiguration, because indexing stalls.

Any idea what's going wrong?


Bernd



