Well, concurrent DIH is very simple, just one little shell script :-)
5K-10K docs per second says nothing by itself. Is that plain data pushed
straight into the index, or a complex schema with many analyzers?

Bernd
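Such a launcher is little more than one full-import request per DIH handler. A minimal Java sketch of the same idea, under the assumption of one handler per import (the host, core name, and /dataimport1 ... /dataimport8 handler paths are placeholders; the request parameters are the ones Bernd posts further down):

    import java.io.InputStream;
    import java.net.URL;

    public class LaunchDih {
        public static void main(String[] args) throws Exception {
            // One full-import request per DIH handler. The command returns
            // immediately, so all eight imports run concurrently inside Solr.
            for (int i = 1; i <= 8; i++) {
                String url = "http://localhost:8983/solr/core1/dataimport" + i
                        + "?command=full-import&optimize=false&clean=false"
                        + "&commit=false&waitSearcher=false";
                try (InputStream in = new URL(url).openStream()) {
                    while (in.read() != -1) { /* drain the status response */ }
                }
            }
        }
    }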
On 02.08.2016 at 15:44, Susheel Kumar wrote:
> My experience with DIH was that we couldn't scale to the level we wanted.
> SolrJ with multi-threading & batch updates (parallel threads pushing data
> into Solr) worked, and we were able to ingest 5K-10K docs per second.
>
> Thanks,
> Susheel
>
> On Tue, Aug 2, 2016 at 9:15 AM, Mikhail Khludnev <m...@apache.org> wrote:
>
>> Bernd,
>> But why do you have so many deletes? Is that expected?
>> When you run DIHs concurrently, do you shard the input data by uniqueKey?
>>
>> On Wed, Jul 27, 2016 at 6:20 PM, Bernd Fehling <
>> bernd.fehl...@uni-bielefeld.de> wrote:
>>
>>> If there is a problem in a single index then it might also be in SolrCloud.
>>> As far as I could figure out from the INFOSTREAM, documents are added to
>>> segments and terms are "collected". Duplicate terms are "deleted" (or
>>> whatever). These deletes (or whatever) are not concurrent.
>>> I have lines like:
>>> BD 0 [Wed Jul 27 13:28:48 GMT+01:00 2016; Thread-27879]: applyDeletes: infos=...
>>> BD 0 [Wed Jul 27 13:31:48 GMT+01:00 2016; Thread-27879]: applyDeletes took 180028 msec
>>> ...
>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
>>>
>>> 3411845 msec is about 57 minutes during which the system is doing what???
>>> At least not indexing, because there is only one JAVA process and no I/O at all!
>>>
>>> How can SolrJ help me now with this problem?
>>>
>>> Best
>>> Bernd
>>>
>>> On 27.07.2016 at 16:41, Erick Erickson wrote:
>>>> Well, at least it'll be easier to debug, in my experience. Simple
>>>> example: at some point you'll call CloudSolrClient.add(doc list).
>>>> Comment just that out and you'll be able to isolate whether the issue
>>>> is querying the DB or sending to Solr.
>>>>
>>>> Then CloudSolrClient (assuming SolrCloud) has efficiencies in terms of
>>>> routing...
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Jul 27, 2016 7:24 AM, "Bernd Fehling" <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>
>>>>> So writing some SolrJ doing the same job as the DIH script
>>>>> and using that concurrently will solve my problem?
>>>>> I'm not using Tika.
>>>>>
>>>>> I don't think that DIH is my problem, even if it is not the best
>>>>> solution right now.
>>>>> Nevertheless, you are right that SolrJ has higher performance, but what
>>>>> if I have the same problems with SolrJ as with DIH?
>>>>>
>>>>> If it runs with DIH it should run with SolrJ, with an additional
>>>>> performance boost.
>>>>>
>>>>> Bernd
>>>>>
>>>>> On 27.07.2016 at 16:03, Erick Erickson wrote:
>>>>>> I'd actually recommend you move to a SolrJ solution
>>>>>> or similar. Currently, you're putting a load on the Solr
>>>>>> servers (especially if you're also using Tika) in addition
>>>>>> to all the indexing etc.
>>>>>>
>>>>>> Here's a sample:
>>>>>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>>>>>>
>>>>>> Dodging the question I know, but DIH sometimes isn't
>>>>>> the best solution.
>>>>>>
>>>>>> Best,
>>>>>> Erick
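A minimal sketch of the SolrJ approach Susheel and Erick describe, assuming SolrJ 4.x to match the Solr 4.10.4 in this thread: worker threads share one thread-safe HttpSolrServer and push batched updates, with a single commit at the end. The URL, thread count, batch size, field names, and the readSlice() data source are illustrative placeholders, not details from the thread.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelIndexer {
        static final int THREADS = 8;    // parallel indexing threads
        static final int BATCH = 1000;   // documents per update request

        public static void main(String[] args) throws Exception {
            // HttpSolrServer (SolrJ 4.x) is thread-safe; all workers share one instance.
            final HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/core1");
            ExecutorService pool = Executors.newFixedThreadPool(THREADS);
            for (int t = 0; t < THREADS; t++) {
                final int slice = t;
                pool.submit(() -> {
                    List<SolrInputDocument> batch = new ArrayList<>();
                    for (SolrInputDocument doc : readSlice(slice)) {
                        batch.add(doc);
                        if (batch.size() == BATCH) {
                            solr.add(batch);   // one HTTP request per batch
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) solr.add(batch);
                    return null;               // Callable, so add() may throw
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
            solr.commit();   // single commit at the end, matching commit=false during the load
            solr.shutdown();
        }

        // Placeholder for the real XML reader: each thread gets its own
        // slice of the input, sharded by uniqueKey, so no two threads
        // ever write the same document.
        static List<SolrInputDocument> readSlice(int slice) {
            List<SolrInputDocument> docs = new ArrayList<>();
            for (int i = 0; i < 10000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", slice + "-" + i);   // "id" assumed to be the uniqueKey
                doc.addField("title", "record " + i);  // placeholder field
                docs.add(doc);
            }
            return docs;
        }
    }

Note that each thread owns a disjoint slice of ids: if two loaders re-add the same uniqueKey, every re-add turns into a delete, which is presumably why Mikhail asks above whether the input is sharded by uniqueKey.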
>>>>>> On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
>>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>>> After enhancing the server with SSDs I'm trying to speed up indexing.
>>>>>>>
>>>>>>> The server has 16 CPUs and more than 100G RAM.
>>>>>>> JAVA (1.8.0_92) has 24G.
>>>>>>> SOLR is 4.10.4.
>>>>>>> Plain XML data to load is 218G with about 96M records.
>>>>>>> This will result in a single index of 299G.
>>>>>>>
>>>>>>> I tried with 4, 8, 12 and 16 concurrent DIHs.
>>>>>>> 16 and 12 were too much for 16 CPUs, so my test continued with 8
>>>>>>> concurrent DIHs.
>>>>>>> Then I tried different <indexConfig> and <updateHandler> settings,
>>>>>>> but now I'm stuck.
>>>>>>> I can't figure out the best settings for bulk indexing.
>>>>>>> What I see is that the indexing is "falling asleep" after some time.
>>>>>>> It is only producing del-files, like _11_1.del, _w_2.del, _h_3.del, ...
>>>>>>>
>>>>>>> <indexConfig>
>>>>>>>   <maxIndexingThreads>8</maxIndexingThreads>
>>>>>>>   <ramBufferSizeMB>1024</ramBufferSizeMB>
>>>>>>>   <maxBufferedDocs>-1</maxBufferedDocs>
>>>>>>>   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>>>>>>     <int name="maxMergeAtOnce">8</int>
>>>>>>>     <int name="segmentsPerTier">100</int>
>>>>>>>     <int name="maxMergedSegmentMB">512</int>
>>>>>>>   </mergePolicy>
>>>>>>>   <mergeFactor>8</mergeFactor>
>>>>>>>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>>>>>>>   <lockType>${solr.lock.type:native}</lockType>
>>>>>>>   ...
>>>>>>> </indexConfig>
>>>>>>>
>>>>>>> <updateHandler class="solr.DirectUpdateHandler2">
>>>>>>>   <!-- no autocommit at all -->
>>>>>>>   <autoSoftCommit>
>>>>>>>     <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
>>>>>>>   </autoSoftCommit>
>>>>>>> </updateHandler>
>>>>>>>
>>>>>>> command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
>>>>>>>
>>>>>>> After indexing finishes there is a final optimize.
>>>>>>>
>>>>>>> My idea is: if 8 DIHs use 8 CPUs, then I have 8 CPUs left for merging
>>>>>>> (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
>>>>>>> It should do no commit and no optimize.
>>>>>>> ramBufferSizeMB is high because I have plenty of RAM and want to make
>>>>>>> use of its speed.
>>>>>>> segmentsPerTier is high to reduce merging.
>>>>>>>
>>>>>>> But somewhere there is a misconfiguration, because indexing gets stalled.
>>>>>>>
>>>>>>> Any idea what's going wrong?
>>>>>>>
>>>>>>> Bernd
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev

--
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************
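As a footnote on the configuration above: Bernd's <mergePolicy> block corresponds roughly to the following Lucene 4.x TieredMergePolicy setup (a sketch of the mapping only, not code taken from Solr or this thread):

    import org.apache.lucene.index.TieredMergePolicy;

    public class MergeSettings {
        public static void main(String[] args) {
            TieredMergePolicy mp = new TieredMergePolicy();
            mp.setMaxMergeAtOnce(8);         // <int name="maxMergeAtOnce">8</int>
            mp.setSegmentsPerTier(100.0);    // <int name="segmentsPerTier">100</int>
            mp.setMaxMergedSegmentMB(512.0); // <int name="maxMergedSegmentMB">512</int>
            // 100 segments per tier defers merging, but with merged segments
            // capped at 512 MB a 299G index carries hundreds of segments, and
            // each one is a place the applyDeletes pass in Bernd's BD log
            // lines has to visit when duplicate uniqueKeys arrive.
        }
    }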