Well, in my experience it'll at least be easier to debug. Simple example:
at some point you'll call CloudSolrClient.add(docList). Comment just that
call out and you'll be able to isolate whether the issue is querying the
back end or sending to Solr.

Then CloudSolrClient (assuming SolrCloud) has efficiencies in terms of
routing: it hashes each document's ID and sends it directly to the right
shard leader, instead of taking an extra hop through whichever node
happens to receive the update.
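
And since a single CloudSolrClient instance is thread-safe, you can share
it across worker threads. Another rough sketch (readSlice() is a stand-in
for however you partition your source data, and the client is built as in
the sketch above):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ConcurrentIndexer {
  // One shared client for all threads; it takes care of shard routing.
  static CloudSolrClient client; // build as in BulkIndexer above

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(8);
    for (int slice = 0; slice < 8; slice++) {
      final int s = slice;
      pool.submit(() -> {
        for (List<SolrInputDocument> batch : readSlice(s)) {
          client.add(batch); // each batch goes straight to the shard leaders
        }
        return null;
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.DAYS);
    client.commit();
  }

  // Stand-in: yields batches of documents for slice s of the source data.
  static Iterable<List<SolrInputDocument>> readSlice(int s) {
    throw new UnsupportedOperationException("replace with your own reader");
  }
}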

Best
Erick

On Jul 27, 2016 7:24 AM, "Bernd Fehling" <bernd.fehl...@uni-bielefeld.de>
wrote:

> So writing some SolrJ code that does the same job as the DIH script,
> and running it concurrently, will solve my problem?
> I'm not using Tika.
>
> I don't think that DIH is my problem, even if it is not the best solution
> right now.
> Nevertheless, you are right that SolrJ has higher performance, but what
> if I have the same problems with SolrJ as with DIH?
>
> If it runs with DIH it should run with SolrJ, with an additional
> performance boost.
>
> Bernd
>
>
> On 27.07.2016 at 16:03, Erick Erickson wrote:
> > I'd actually recommend you move to a SolrJ solution
> > or similar. Currently the DIH work runs on the Solr
> > servers themselves (especially costly if you're also
> > using Tika), on top of all the indexing work.
> >
> > Here's a sample:
> > https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
> >
> > Dodging the question I know, but DIH sometimes isn't
> > the best solution.
> >
> > Best,
> > Erick
> >
> > On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
> > <bernd.fehl...@uni-bielefeld.de> wrote:
> >> After upgrading the server with SSDs I'm trying to speed up indexing.
> >>
> >> The server has 16 CPUs and more than 100G RAM.
> >> Java (1.8.0_92) has a 24G heap.
> >> Solr is 4.10.4.
> >> The plain XML data to load is 218G, with about 96M records.
> >> This will result in a single index of 299G.
> >>
> >> I tried with 4, 8, 12 and 16 concurrent DIHs.
> >> 16 and 12 were too much for 16 CPUs, so my tests continued with 8
> >> concurrent DIHs.
> >> Then I tried different <indexConfig> and <updateHandler> settings,
> >> but now I'm stuck.
> >> I can't figure out the best settings for bulk indexing.
> >> What I see is that the indexing "falls asleep" after some time.
> >> It then only produces del-files, like _11_1.del, _w_2.del, _h_3.del, ...
> >>
> >> <indexConfig>
> >>     <maxIndexingThreads>8</maxIndexingThreads>
> >>     <ramBufferSizeMB>1024</ramBufferSizeMB>
> >>     <maxBufferedDocs>-1</maxBufferedDocs>
> >>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> >>       <int name="maxMergeAtOnce">8</int>
> >>       <int name="segmentsPerTier">100</int>
> >>       <int name="maxMergedSegmentMB">512</int>
> >>     </mergePolicy>
> >>     <mergeFactor>8</mergeFactor>
> >>     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
> >>     <lockType>${solr.lock.type:native}</lockType>
> >>     ...
> >> </indexConfig>
> >>
> >> <updateHandler class="solr.DirectUpdateHandler2">
> >>      <!-- no autocommit at all -->
> >>      <autoSoftCommit>
> >>        <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> >>      </autoSoftCommit>
> >> </updateHandler>
> >>
> >>
> >> command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
> >> After indexing finishes there is a final optimize.
> >>
> >> My idea is: if 8 DIHs use 8 CPUs, then I have 8 CPUs left for merging
> >> (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
> >> It should do no commit and no optimize during indexing.
> >> ramBufferSizeMB is high because I have plenty of RAM and want to make
> >> use of its speed.
> >> segmentsPerTier is high to reduce merging.
> >>
> >> But there must be a misconfiguration somewhere, because indexing stalls.
> >>
> >> Any idea what's going wrong?
> >>
> >>
> >> Bernd
> >>
>
