If there is a problem with a single index then it might also show up in SolrCloud.
As far as I can figure out from the INFOSTREAM output, documents are added to
segments and terms are collected. Duplicate terms are then deleted (applyDeletes),
and these deletes are not concurrent. I have lines like:

BD 0 [Wed Jul 27 13:28:48 GMT+01:00 2016; Thread-27879]: applyDeletes: infos=...
BD 0 [Wed Jul 27 13:31:48 GMT+01:00 2016; Thread-27879]: applyDeletes took 180028 msec
...
BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
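(For reference: these "BD ... applyDeletes" lines are Lucene's internal infoStream
diagnostics, which Solr exposes via the <infoStream> option in the <indexConfig>
section of solrconfig.xml. A minimal sketch of how the same output can be produced
in plain Lucene 4.10, assuming placeholder paths and a standard analyzer:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.PrintStreamInfoStream;
import org.apache.lucene.util.Version;

public class InfoStreamDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/testindex")); // placeholder path
        IndexWriterConfig iwc = new IndexWriterConfig(
            Version.LUCENE_4_10_4, new StandardAnalyzer(Version.LUCENE_4_10_4));
        // Route Lucene's internal diagnostics, including the
        // "BD ... applyDeletes" messages, to stdout.
        iwc.setInfoStream(new PrintStreamInfoStream(System.out));
        IndexWriter writer = new IndexWriter(dir, iwc);
        writer.commit();
        writer.close();
    }
}
)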
3411845 msec is about 56 minutes during which the system is doing what???
At least not indexing, because there is only one busy JAVA process and no I/O
at all!

How can SolrJ help me now with this problem?

Best
Bernd

On 27.07.2016 at 16:41, Erick Erickson wrote:
> Well, at least it'll be easier to debug, in my experience. Simple example:
> at some point you'll call CloudSolrClient.add(doc list). Comment just that
> out and you'll be able to isolate whether the issue is querying the DB or
> sending to Solr.
>
> Then CloudSolrClient (assuming SolrCloud) has efficiencies in terms of
> routing...
>
> Best
> Erick
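(A minimal sketch of the pattern Erick describes: batch documents and send them
with a single client. Note he uses the 5.x name CloudSolrClient; on Solr 4.10 the
class is CloudSolrServer, used the same way. The ZooKeeper address, collection
name, batch size and source reader below are all placeholders:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("mycollection"); // placeholder
        List<SolrInputDocument> batch = new ArrayList<>();
        for (SolrInputDocument doc : fetchFromSource()) { // hypothetical source reader
            batch.add(doc);
            if (batch.size() >= 1000) {
                client.add(batch); // comment out this line to isolate the source side
                batch.clear();
            }
        }
        if (!batch.isEmpty()) client.add(batch);
        client.commit(); // one commit at the very end
        client.close();
    }

    // Placeholder for whatever reads the records (DB, XML files, ...).
    private static Iterable<SolrInputDocument> fetchFromSource() {
        return new ArrayList<SolrInputDocument>();
    }
}
)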
> On Jul 27, 2016 7:24 AM, "Bernd Fehling" <bernd.fehl...@uni-bielefeld.de>
> wrote:
>
>> So writing some SolrJ doing the same job as the DIH script
>> and using it concurrently will solve my problem?
>> I'm not using Tika.
>>
>> I don't think that DIH is my problem, even if it is not the best
>> solution right now.
>> Nevertheless, you are right that SolrJ has higher performance, but what
>> if I have the same problems with SolrJ as with DIH?
>>
>> If it runs with DIH it should run with SolrJ, with an additional
>> performance boost.
>>
>> Bernd
>>
>>
>> On 27.07.2016 at 16:03, Erick Erickson wrote:
>>> I'd actually recommend you move to a SolrJ solution
>>> or similar. Currently, you're putting a load on the Solr
>>> servers (especially if you're also using Tika) in addition
>>> to all indexing etc.
>>>
>>> Here's a sample:
>>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>>>
>>> Dodging the question I know, but DIH sometimes isn't
>>> the best solution.
>>>
>>> Best,
>>> Erick
>>>
>>> On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>> After enhancing the server with SSDs I'm trying to speed up indexing.
>>>>
>>>> The server has 16 CPUs and more than 100G RAM.
>>>> JAVA (1.8.0_92) has 24G.
>>>> SOLR is 4.10.4.
>>>> Plain XML data to load is 218G with about 96M records.
>>>> This will result in a single index of 299G.
>>>>
>>>> I tried with 4, 8, 12 and 16 concurrent DIHs.
>>>> 16 and 12 were too much for 16 CPUs, so my tests continued with
>>>> 8 concurrent DIHs.
>>>> Then I tried different <indexConfig> and <updateHandler> settings,
>>>> but now I'm stuck.
>>>> I can't figure out what the best settings for bulk indexing are.
>>>> What I see is that the indexing is "falling asleep" after some time.
>>>> It is only producing del-files, like _11_1.del, _w_2.del, _h_3.del, ...
>>>>
>>>> <indexConfig>
>>>>   <maxIndexingThreads>8</maxIndexingThreads>
>>>>   <ramBufferSizeMB>1024</ramBufferSizeMB>
>>>>   <maxBufferedDocs>-1</maxBufferedDocs>
>>>>   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>>>     <int name="maxMergeAtOnce">8</int>
>>>>     <int name="segmentsPerTier">100</int>
>>>>     <int name="maxMergedSegmentMB">512</int>
>>>>   </mergePolicy>
>>>>   <mergeFactor>8</mergeFactor>
>>>>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>>>>   <lockType>${solr.lock.type:native}</lockType>
>>>>   ...
>>>> </indexConfig>
>>>>
>>>> <updateHandler class="solr.DirectUpdateHandler2">
>>>>   <!-- no autocommit at all -->
>>>>   <autoSoftCommit>
>>>>     <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
>>>>   </autoSoftCommit>
>>>> </updateHandler>
>>>>
>>>> command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
>>>>
>>>> After indexing finishes there is a final optimize.
>>>>
>>>> My idea is: if 8 DIHs use 8 CPUs then I have 8 CPUs left for merging
>>>> (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
>>>> It should do no commit, no optimize.
>>>> ramBufferSizeMB is high because I have plenty of RAM and I want to
>>>> make use of the speed of RAM.
>>>> segmentsPerTier is high to reduce merging.
>>>>
>>>> But somewhere there is a misconfiguration, because indexing stalls.
>>>>
>>>> Any idea what's going wrong?
>>>>
>>>>
>>>> Bernd
>>>>

--
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************
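(For completeness, the concurrent SolrJ variant Bernd asks about could look
roughly like this on SolrJ 4.10, using ConcurrentUpdateSolrServer, which queues
documents internally and streams them to Solr from a pool of sender threads.
The URL, queue size, thread count and the record parser are placeholders, not
from the thread:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ConcurrentBulkIndexer {
    public static void main(String[] args) throws Exception {
        ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
            "http://localhost:8983/solr/core1", // placeholder URL
            10000,  // internal queue size
            8);     // sender threads, e.g. one per spare CPU
        try {
            for (SolrInputDocument doc : parseXmlRecords()) { // hypothetical parser
                server.add(doc);         // non-blocking: queued, sent in background
            }
            server.blockUntilFinished(); // drain the queue
            server.commit();             // one commit at the end, as in the DIH run
        } finally {
            server.shutdown();
        }
    }

    // Placeholder for whatever produces documents from the 218G XML source.
    private static Iterable<SolrInputDocument> parseXmlRecords() {
        return java.util.Collections.<SolrInputDocument>emptyList();
    }
}
)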