Well, concurrent DIH is very simple, just one little shell script :-)
5K-10K docs per second says nothing by itself. Is that plain data pushed
straight into the index, or a complex schema with many analyzers?

Bernd
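Such a launcher is little more than one full-import request per DIH handler. A minimal Java sketch of the same idea, under the assumption of one handler per import (the host, core name, and /dataimport1 ... /dataimport8 handler paths are placeholders; the request parameters are the ones Bernd posts further down):

    import java.io.InputStream;
    import java.net.URL;

    public class LaunchDih {
        public static void main(String[] args) throws Exception {
            // One full-import request per DIH handler. The command returns
            // immediately, so all eight imports run concurrently inside Solr.
            for (int i = 1; i <= 8; i++) {
                String url = "http://localhost:8983/solr/core1/dataimport" + i
                        + "?command=full-import&optimize=false&clean=false"
                        + "&commit=false&waitSearcher=false";
                try (InputStream in = new URL(url).openStream()) {
                    while (in.read() != -1) { /* drain the status response */ }
                }
            }
        }
    }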
On 02.08.2016 at 15:44, Susheel Kumar wrote:
> My experience with DIH was that we couldn't scale to the level we wanted.
> SolrJ with multi-threading & batch updates (parallel threads pushing data
> into Solr) worked, and we were able to ingest 5K-10K docs per second.
>
> Thanks,
> Susheel
>
> On Tue, Aug 2, 2016 at 9:15 AM, Mikhail Khludnev <m...@apache.org> wrote:
>
>> Bernd,
>> But why do you have so many deletes? Is that expected?
>> When you run DIHs concurrently, do you shard the input data by uniqueKey?
>>
>> On Wed, Jul 27, 2016 at 6:20 PM, Bernd Fehling <
>> bernd.fehl...@uni-bielefeld.de> wrote:
>>
>>> If there is a problem in a single index then it might also be in SolrCloud.
>>> As far as I could figure out from the INFOSTREAM, documents are added to
>>> segments and terms are "collected". Duplicate terms are "deleted" (or
>>> whatever). These deletes (or whatever) are not concurrent.
>>> I have lines like:
>>> BD 0 [Wed Jul 27 13:28:48 GMT+01:00 2016; Thread-27879]: applyDeletes: infos=...
>>> BD 0 [Wed Jul 27 13:31:48 GMT+01:00 2016; Thread-27879]: applyDeletes took 180028 msec
>>> ...
>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
>>>
>>> 3411845 msec is about 57 minutes during which the system is doing what???
>>> At least not indexing, because there is only one JAVA process and no I/O at all!
>>>
>>> How can SolrJ help me now with this problem?
>>>
>>> Best
>>> Bernd
>>>
>>> On 27.07.2016 at 16:41, Erick Erickson wrote:
>>>> Well, at least it'll be easier to debug, in my experience. Simple
>>>> example: at some point you'll call CloudSolrClient.add(doc list).
>>>> Comment just that out and you'll be able to isolate whether the issue
>>>> is querying the DB or sending to Solr.
>>>>
>>>> Then CloudSolrClient (assuming SolrCloud) has efficiencies in terms of
>>>> routing...
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Jul 27, 2016 7:24 AM, "Bernd Fehling" <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>
>>>>> So writing some SolrJ doing the same job as the DIH script
>>>>> and using that concurrently will solve my problem?
>>>>> I'm not using Tika.
>>>>>
>>>>> I don't think that DIH is my problem, even if it is not the best
>>>>> solution right now.
>>>>> Nevertheless, you are right that SolrJ has higher performance, but what
>>>>> if I have the same problems with SolrJ as with DIH?
>>>>>
>>>>> If it runs with DIH it should run with SolrJ, with an additional
>>>>> performance boost.
>>>>>
>>>>> Bernd
>>>>>
>>>>> On 27.07.2016 at 16:03, Erick Erickson wrote:
>>>>>> I'd actually recommend you move to a SolrJ solution
>>>>>> or similar. Currently, you're putting a load on the Solr
>>>>>> servers (especially if you're also using Tika) in addition
>>>>>> to all the indexing etc.
>>>>>>
>>>>>> Here's a sample:
>>>>>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>>>>>>
>>>>>> Dodging the question I know, but DIH sometimes isn't
>>>>>> the best solution.
>>>>>>
>>>>>> Best,
>>>>>> Erick
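A minimal sketch of the SolrJ approach Susheel and Erick describe, assuming SolrJ 4.x to match the Solr 4.10.4 in this thread: worker threads share one thread-safe HttpSolrServer and push batched updates, with a single commit at the end. The URL, thread count, batch size, field names, and the readSlice() data source are illustrative placeholders, not details from the thread.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelIndexer {
        static final int THREADS = 8;    // parallel indexing threads
        static final int BATCH = 1000;   // documents per update request

        public static void main(String[] args) throws Exception {
            // HttpSolrServer (SolrJ 4.x) is thread-safe; all workers share one instance.
            final HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/core1");
            ExecutorService pool = Executors.newFixedThreadPool(THREADS);
            for (int t = 0; t < THREADS; t++) {
                final int slice = t;
                pool.submit(() -> {
                    List<SolrInputDocument> batch = new ArrayList<>();
                    for (SolrInputDocument doc : readSlice(slice)) {
                        batch.add(doc);
                        if (batch.size() == BATCH) {
                            solr.add(batch);   // one HTTP request per batch
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) solr.add(batch);
                    return null;               // Callable, so add() may throw
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
            solr.commit();   // single commit at the end, matching commit=false during the load
            solr.shutdown();
        }

        // Placeholder for the real XML reader: each thread gets its own
        // slice of the input, sharded by uniqueKey, so no two threads
        // ever write the same document.
        static List<SolrInputDocument> readSlice(int slice) {
            List<SolrInputDocument> docs = new ArrayList<>();
            for (int i = 0; i < 10000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", slice + "-" + i);   // "id" assumed to be the uniqueKey
                doc.addField("title", "record " + i);  // placeholder field
                docs.add(doc);
            }
            return docs;
        }
    }

Note that each thread owns a disjoint slice of ids: if two loaders re-add the same uniqueKey, every re-add turns into a delete, which is presumably why Mikhail asks above whether the input is sharded by uniqueKey.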
>>>>>> On Wed, Jul 27, 2016 at 6:59 AM, Bernd Fehling
>>>>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>>>>> After enhancing the server with SSDs I'm trying to speed up indexing.
>>>>>>>
>>>>>>> The server has 16 CPUs and more than 100G RAM.
>>>>>>> JAVA (1.8.0_92) has 24G.
>>>>>>> SOLR is 4.10.4.
>>>>>>> Plain XML data to load is 218G with about 96M records.
>>>>>>> This will result in a single index of 299G.
>>>>>>>
>>>>>>> I tried with 4, 8, 12 and 16 concurrent DIHs.
>>>>>>> 16 and 12 were too much for 16 CPUs, so my test continued with 8
>>>>>>> concurrent DIHs.
>>>>>>> Then I tried different <indexConfig> and <updateHandler> settings,
>>>>>>> but now I'm stuck.
>>>>>>> I can't figure out the best settings for bulk indexing.
>>>>>>> What I see is that the indexing is "falling asleep" after some time.
>>>>>>> It is only producing del-files, like _11_1.del, _w_2.del, _h_3.del, ...
>>>>>>>
>>>>>>> <indexConfig>
>>>>>>>   <maxIndexingThreads>8</maxIndexingThreads>
>>>>>>>   <ramBufferSizeMB>1024</ramBufferSizeMB>
>>>>>>>   <maxBufferedDocs>-1</maxBufferedDocs>
>>>>>>>   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>>>>>>     <int name="maxMergeAtOnce">8</int>
>>>>>>>     <int name="segmentsPerTier">100</int>
>>>>>>>     <int name="maxMergedSegmentMB">512</int>
>>>>>>>   </mergePolicy>
>>>>>>>   <mergeFactor>8</mergeFactor>
>>>>>>>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>>>>>>>   <lockType>${solr.lock.type:native}</lockType>
>>>>>>>   ...
>>>>>>> </indexConfig>
>>>>>>>
>>>>>>> <updateHandler class="solr.DirectUpdateHandler2">
>>>>>>>   <!-- no autocommit at all -->
>>>>>>>   <autoSoftCommit>
>>>>>>>     <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
>>>>>>>   </autoSoftCommit>
>>>>>>> </updateHandler>
>>>>>>>
>>>>>>> command=full-import&optimize=false&clean=false&commit=false&waitSearcher=false
>>>>>>>
>>>>>>> After indexing finishes there is a final optimize.
>>>>>>>
>>>>>>> My idea is: if 8 DIHs use 8 CPUs, then I have 8 CPUs left for merging
>>>>>>> (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
>>>>>>> It should do no commit and no optimize.
>>>>>>> ramBufferSizeMB is high because I have plenty of RAM and want to make
>>>>>>> use of its speed.
>>>>>>> segmentsPerTier is high to reduce merging.
>>>>>>>
>>>>>>> But somewhere there is a misconfiguration, because indexing gets stalled.
>>>>>>>
>>>>>>> Any idea what's going wrong?
>>>>>>>
>>>>>>> Bernd
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev

--
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************
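As a footnote on the configuration above: Bernd's <mergePolicy> block corresponds roughly to the following Lucene 4.x TieredMergePolicy setup (a sketch of the mapping only, not code taken from Solr or this thread):

    import org.apache.lucene.index.TieredMergePolicy;

    public class MergeSettings {
        public static void main(String[] args) {
            TieredMergePolicy mp = new TieredMergePolicy();
            mp.setMaxMergeAtOnce(8);         // <int name="maxMergeAtOnce">8</int>
            mp.setSegmentsPerTier(100.0);    // <int name="segmentsPerTier">100</int>
            mp.setMaxMergedSegmentMB(512.0); // <int name="maxMergedSegmentMB">512</int>
            // 100 segments per tier defers merging, but with merged segments
            // capped at 512 MB a 299G index carries hundreds of segments, and
            // each one is a place the applyDeletes pass in Bernd's BD log
            // lines has to visit when duplicate uniqueKeys arrive.
        }
    }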