So writing a SolrJ program that does the same job as the DIH script,
and running it concurrently, will solve my problem?
I'm not using Tika.

I don't think that DIH is my problem, even if it is not the best solution right
now.
Nevertheless, you are right that SolrJ has higher performance, but what
if I run into the same problems with SolrJ as with DIH?

If it runs with DIH, it should run with SolrJ as well, with an additional performance boost.
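For reference, a minimal SolrJ bulk-indexing sketch for the Solr 4.x line could look like the following. It uses ConcurrentUpdateSolrServer, which queues documents and sends them from background threads; the URL, field names, queue size, and thread count here are placeholder assumptions, not values from this thread, and the loop stands in for a real XML record parser:

```java
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL and sizes -- tune queueSize and threadCount to your hardware.
        ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
                "http://localhost:8983/solr/collection1", 10000, 8);
        for (int i = 0; i < 1000; i++) {      // replace with your own record source
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title_t", "record " + i);
            server.add(doc);                  // queued; sent in batches by background threads
        }
        server.blockUntilFinished();          // drain the queue
        server.commit();                      // single commit at the end (matches commit=false + final commit)
        server.shutdown();
    }
}
```

This mirrors the "no intermediate commits" approach of the DIH setup below: nothing is committed until the queue is drained, then one commit at the end. It needs a running Solr instance and the solrj jars on the classpath.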

Bernd


On 27.07.2016 at 16:03, Erick Erickson wrote:
> I'd actually recommend you move to a SolrJ solution
> or similar. Currently, you're putting extra load on the Solr
> servers (especially if you're also using Tika) on top of
> all the indexing work.
> 
> Here's a sample:
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
> 
> Dodging the question I know, but DIH sometimes isn't
> the best solution.
> 
> Best,
> Erick
> 
> On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
> <bernd.fehl...@uni-bielefeld.de> wrote:
>> After enhancing the server with SSDs I'm trying to speed up indexing.
>>
>> The server has 16 CPUs and more than 100G RAM.
>> JAVA (1.8.0_92) has 24G.
>> SOLR is 4.10.4.
>> Plain XML data to load is 218G with about 96M records.
>> This will result in a single index of 299G.
>>
>> I tried with 4, 8, 12 and 16 concurrent DIHs.
>> 16 and 12 were too much for 16 CPUs, so my tests continued with 8
>> concurrent DIHs.
>> Then I tried different <indexConfig> and <updateHandler> settings, but
>> now I'm stuck.
>> I can't figure out what is the best setting for bulk indexing.
>> What I see is that indexing "falls asleep" after running for some time.
>> It then only produces del-files, like _11_1.del, _w_2.del, _h_3.del, ...
>>
>> <indexConfig>
>>     <maxIndexingThreads>8</maxIndexingThreads>
>>     <ramBufferSizeMB>1024</ramBufferSizeMB>
>>     <maxBufferedDocs>-1</maxBufferedDocs>
>>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>       <int name="maxMergeAtOnce">8</int>
>>       <int name="segmentsPerTier">100</int>
>>       <int name="maxMergedSegmentMB">512</int>
>>     </mergePolicy>
>>     <mergeFactor>8</mergeFactor>
>>     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>>     <lockType>${solr.lock.type:native}</lockType>
>>     ...
>> </indexConfig>
>>
>> <updateHandler class="solr.DirectUpdateHandler2">
>>      <!-- no autocommit at all -->
>>      <autoSoftCommit>
>>        <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
>>      </autoSoftCommit>
>> </updateHandler>
>>
>>
>> command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
>> After indexing finishes there is a final optimize.
>>
>> My idea is, if 8 DIHs use 8 CPUs then I have 8 CPUs left for merging
>> (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
>> It should do no commit, no optimize.
>> ramBufferSizeMB is high because I have plenty of RAM and I want to make
>> use of the speed of RAM.
>> segmentsPerTier is high to reduce merging.
>>
>> But there must be a misconfiguration somewhere, because indexing stalls.
>>
>> Any idea what's going wrong?
>>
>>
>> Bernd
>>
