If there is a problem with a single index then it might also show up in SolrCloud.
As far as I can figure out from the INFOSTREAM output, documents are added to
segments and terms are collected. Duplicate terms are then deleted (applyDeletes),
and these deletes are not concurrent. I have lines like:

BD 0 [Wed Jul 27 13:28:48 GMT+01:00 2016; Thread-27879]: applyDeletes: infos=...
BD 0 [Wed Jul 27 13:31:48 GMT+01:00 2016; Thread-27879]: applyDeletes took 180028 msec
...
BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
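(For reference: these "BD ... applyDeletes" lines are Lucene's internal infoStream
diagnostics, which Solr exposes via the <infoStream> option in the <indexConfig>
section of solrconfig.xml. A minimal sketch of how the same output can be produced
in plain Lucene 4.10, assuming placeholder paths and a standard analyzer:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.PrintStreamInfoStream;
import org.apache.lucene.util.Version;

public class InfoStreamDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/testindex")); // placeholder path
        IndexWriterConfig iwc = new IndexWriterConfig(
            Version.LUCENE_4_10_4, new StandardAnalyzer(Version.LUCENE_4_10_4));
        // Route Lucene's internal diagnostics, including the
        // "BD ... applyDeletes" messages, to stdout.
        iwc.setInfoStream(new PrintStreamInfoStream(System.out));
        IndexWriter writer = new IndexWriter(dir, iwc);
        writer.commit();
        writer.close();
    }
}
)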
3411845 msec is about 56 minutes during which the system is doing what???
At least not indexing, because there is only one busy JAVA process and no I/O
at all!

How can SolrJ help me now with this problem?

Best
Bernd

On 27.07.2016 at 16:41, Erick Erickson wrote:
> Well, at least it'll be easier to debug, in my experience. Simple example:
> at some point you'll call CloudSolrClient.add(doc list). Comment just that
> out and you'll be able to isolate whether the issue is querying the DB or
> sending to Solr.
>
> Then CloudSolrClient (assuming SolrCloud) has efficiencies in terms of
> routing...
>
> Best
> Erick
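(A minimal sketch of the pattern Erick describes: batch documents and send them
with a single client. Note he uses the 5.x name CloudSolrClient; on Solr 4.10 the
class is CloudSolrServer, used the same way. The ZooKeeper address, collection
name, batch size and source reader below are all placeholders:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("mycollection"); // placeholder
        List<SolrInputDocument> batch = new ArrayList<>();
        for (SolrInputDocument doc : fetchFromSource()) { // hypothetical source reader
            batch.add(doc);
            if (batch.size() >= 1000) {
                client.add(batch); // comment out this line to isolate the source side
                batch.clear();
            }
        }
        if (!batch.isEmpty()) client.add(batch);
        client.commit(); // one commit at the very end
        client.close();
    }

    // Placeholder for whatever reads the records (DB, XML files, ...).
    private static Iterable<SolrInputDocument> fetchFromSource() {
        return new ArrayList<SolrInputDocument>();
    }
}
)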
> On Jul 27, 2016 7:24 AM, "Bernd Fehling" <bernd.fehl...@uni-bielefeld.de>
> wrote:
>
>> So writing some SolrJ doing the same job as the DIH script
>> and using it concurrently will solve my problem?
>> I'm not using Tika.
>>
>> I don't think that DIH is my problem, even if it is not the best
>> solution right now.
>> Nevertheless, you are right that SolrJ has higher performance, but what
>> if I have the same problems with SolrJ as with DIH?
>>
>> If it runs with DIH it should run with SolrJ, with an additional
>> performance boost.
>>
>> Bernd
>>
>>
>> On 27.07.2016 at 16:03, Erick Erickson wrote:
>>> I'd actually recommend you move to a SolrJ solution
>>> or similar. Currently, you're putting a load on the Solr
>>> servers (especially if you're also using Tika) in addition
>>> to all indexing etc.
>>>
>>> Here's a sample:
>>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>>>
>>> Dodging the question I know, but DIH sometimes isn't
>>> the best solution.
>>>
>>> Best,
>>> Erick
>>>
>>> On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>> After enhancing the server with SSDs I'm trying to speed up indexing.
>>>>
>>>> The server has 16 CPUs and more than 100G RAM.
>>>> JAVA (1.8.0_92) has 24G.
>>>> SOLR is 4.10.4.
>>>> Plain XML data to load is 218G with about 96M records.
>>>> This will result in a single index of 299G.
>>>>
>>>> I tried with 4, 8, 12 and 16 concurrent DIHs.
>>>> 16 and 12 were too much for 16 CPUs, so my tests continued with
>>>> 8 concurrent DIHs.
>>>> Then I tried different <indexConfig> and <updateHandler> settings,
>>>> but now I'm stuck.
>>>> I can't figure out what the best settings for bulk indexing are.
>>>> What I see is that the indexing is "falling asleep" after some time.
>>>> It is only producing del-files, like _11_1.del, _w_2.del, _h_3.del, ...
>>>>
>>>> <indexConfig>
>>>>   <maxIndexingThreads>8</maxIndexingThreads>
>>>>   <ramBufferSizeMB>1024</ramBufferSizeMB>
>>>>   <maxBufferedDocs>-1</maxBufferedDocs>
>>>>   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>>>     <int name="maxMergeAtOnce">8</int>
>>>>     <int name="segmentsPerTier">100</int>
>>>>     <int name="maxMergedSegmentMB">512</int>
>>>>   </mergePolicy>
>>>>   <mergeFactor>8</mergeFactor>
>>>>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>>>>   <lockType>${solr.lock.type:native}</lockType>
>>>>   ...
>>>> </indexConfig>
>>>>
>>>> <updateHandler class="solr.DirectUpdateHandler2">
>>>>   <!-- no autocommit at all -->
>>>>   <autoSoftCommit>
>>>>     <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
>>>>   </autoSoftCommit>
>>>> </updateHandler>
>>>>
>>>> command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
>>>>
>>>> After indexing finishes there is a final optimize.
>>>>
>>>> My idea is: if 8 DIHs use 8 CPUs then I have 8 CPUs left for merging
>>>> (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
>>>> It should do no commit, no optimize.
>>>> ramBufferSizeMB is high because I have plenty of RAM and I want to
>>>> make use of the speed of RAM.
>>>> segmentsPerTier is high to reduce merging.
>>>>
>>>> But somewhere there is a misconfiguration, because indexing stalls.
>>>>
>>>> Any idea what's going wrong?
>>>>
>>>>
>>>> Bernd
>>>>

--
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************
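(For completeness, the concurrent SolrJ variant Bernd asks about could look
roughly like this on SolrJ 4.10, using ConcurrentUpdateSolrServer, which queues
documents internally and streams them to Solr from a pool of sender threads.
The URL, queue size, thread count and the record parser are placeholders, not
from the thread:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ConcurrentBulkIndexer {
    public static void main(String[] args) throws Exception {
        ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
            "http://localhost:8983/solr/core1", // placeholder URL
            10000,  // internal queue size
            8);     // sender threads, e.g. one per spare CPU
        try {
            for (SolrInputDocument doc : parseXmlRecords()) { // hypothetical parser
                server.add(doc);         // non-blocking: queued, sent in background
            }
            server.blockUntilFinished(); // drain the queue
            server.commit();             // one commit at the end, as in the DIH run
        } finally {
            server.shutdown();
        }
    }

    // Placeholder for whatever produces documents from the 218G XML source.
    private static Iterable<SolrInputDocument> parseXmlRecords() {
        return java.util.Collections.<SolrInputDocument>emptyList();
    }
}
)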