Phew! Thank you for raising this... it was a sneaky one. Mike
On Tue, Aug 11, 2009 at 4:13 PM, Jibo John<jiboj...@mac.com> wrote: > Mike, > > Yes, it works perfect ! > > I did observe a dip in the indexing throughput (1855 recs/sec vs. 2200 > recs/sec previously), but, more importantly, no data is lost this time. > > Thanks for helping me nail this down. > > -Jibo > > > > On Aug 11, 2009, at 11:12 AM, Michael McCandless wrote: > >> OK I found the problem! >> >> It was losing docs from the queue, when shutting down the thread pool, >> because we were calling super's addDocument(doc) not addDocument(doc, >> analyzer). IndexWriter was simply forwarding that call to >> ThreadedIndexWriter's addDocument(doc, analyzer) which in turn would >> do nothing because the thread pool was already told to shut down. >> Larger queues made it much more likely to happen. >> >> Can you try the new version (attached)? >> >> Also, make sure you add 'doc.reuse.fields=false' to your alg (on >> trunk). >> >> Mike >> >> On Tue, Aug 11, 2009 at 12:39 PM, Jibo John<jiboj...@mac.com> wrote: >>> >>> Mike, >>> >>> I wasn't exactly using the lucene core jar from MEAP. >>> >>> I have been building lucene from the source, and running the tests under >>> lucene/java/trunk/contrib/benchmark/ (checked out 2 weeks ago, I guess) >>> and, also under lucene/java/tags/lucene_2_4_1/contrib/benchmark/. >>> In both cases, copied CreateThreadedIndexTask to >>> org.apache.lucene.benchmark.byTask.tasks and ThreadedIndexWriter to >>> org.apache.lucene.index. >>> >>> I have observed the issue in both the versions of lucene. >>> >>> Indexes were optimized separately using Lucli. >>> >>> >>> PFA the classes and the alg. >>> >>> >>> >>> >>> >>> >>> Thank you for your help with this one. >>> >>> -Jibo >>> >>> >>> >>> >>> On Aug 11, 2009, at 3:13 AM, Michael McCandless wrote: >>> >>>> I'm baffled why you're losing docs w/ ThreadedIndexWriter. >>>> >>>> One question: your Lucene core JAR seems to be newer than the last >>>> MEAP update. Did you update it manually? >>>> >>>> Also, your indexes were optimized, but your algs don't have an >>>> optimize step -- did you separately run an optimize? >>>> >>>> Could you zip up the whole shebang (ThreadedIndexWriter.java, >>>> CreateThreadedIndexTask.java, the algs) & post? Please CC me directly >>>> so I can grab the zip file... thanks. >>>> >>>> Mike >>>> >>>> On Mon, Aug 3, 2009 at 12:37 PM, Jibo John<jiboj...@mac.com> wrote: >>>>> >>>>> Mike, >>>>> >>>>> Verified that I have the latest source code. >>>>> Here are the alg files and the checkindexer output. >>>>> >>>>> >>>>> ----------------------------------------- indexwriter >>>>> alg---------------------------------------------------------------- >>>>> >>>>> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer >>>>> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker >>>>> directory=FSDirectory >>>>> >>>>> doc.stored = true #A >>>>> docs.file=wikipedia.lines.txt >>>>> ram.flush.mb=50 >>>>> compound=false >>>>> merge.factor=5 >>>>> doc.add.log.step=1000 >>>>> doc.term.vector=false >>>>> doc.term.vector.positions=false >>>>> doc.term.vector.offsets=false >>>>> >>>>> { "Rounds" #B >>>>> ResetSystemErase >>>>> { "BuildIndex" >>>>> -CreateIndex() >>>>> [ { "AddDocs" AddDoc > : 40000 ] : 5 >>>>> #C >>>>> -CloseIndex() >>>>> } >>>>> NewRound >>>>> } : 1 >>>>> >>>>> RepSumByPrefRound BuildIndex #D >>>>> >>>>> -----------------------------------------threadedindexwriter alg >>>>> ---------------------------------------------------------------- >>>>> >>>>> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer >>>>> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker >>>>> directory=FSDirectory >>>>> >>>>> doc.stored = true #A >>>>> docs.file=wikipedia.lines.txt >>>>> ram.flush.mb=50 >>>>> compound=false >>>>> merge.factor=5 >>>>> doc.add.log.step=1000 >>>>> doc.term.vector=false >>>>> doc.term.vector.positions=false >>>>> doc.term.vector.offsets=false >>>>> writer.num.threads=15 >>>>> writer.max.thread.queue.size=75 >>>>> work.dir=work_t >>>>> >>>>> >>>>> { "Rounds" #B >>>>> ResetSystemErase >>>>> { "BuildIndex" >>>>> -CreateThreadedIndex() >>>>> { "AddDocs" AddDoc > : 200000 >>>>> -CloseIndex() >>>>> } >>>>> NewRound >>>>> } : 1 >>>>> >>>>> RepSumByPrefRound BuildIndex #D >>>>> >>>>> >>>>> -----------------------------------------------threadedindexwriter >>>>> checkindex ---------------------------------------------------------- >>>>> >>>>> >>>>> $ java -classpath >>>>> >>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar >>>>> org.apache.lucene.index.CheckIndex >>>>> >>>>> >>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index >>>>> >>>>> NOTE: testing will be more thorough if you run java with >>>>> '-ea:org.apache.lucene...', so assertions are enabled >>>>> >>>>> Opening index @ >>>>> >>>>> >>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index >>>>> >>>>> Segments file=segments_3 numSegments=1 version=FORMAT_DIAGNOSTICS >>>>> [Lucene >>>>> 2.9] >>>>> 1 of 1: name=_p docCount=199941 >>>>> compound=true >>>>> hasProx=true >>>>> numFiles=3 >>>>> size (MB)=317.1 >>>>> diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M - >>>>> 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true, >>>>> mergeDocStores=false, java.vendor=Apple Inc., os.version=10.5.7, >>>>> source=merge, mergeFactor=5} >>>>> docStoreOffset=0 >>>>> docStoreSegment=_0 >>>>> docStoreIsCompoundFile=false >>>>> no deletions >>>>> test: open reader.........OK >>>>> test: fields, norms.......OK [4 fields] >>>>> test: terms, freq, prox...OK [1269552 terms; 67887116 terms/docs >>>>> pairs; >>>>> 133241176 tokens] >>>>> test: stored fields.......OK [199941 total field count; avg 1 fields >>>>> per >>>>> doc] >>>>> test: term vectors........OK [0 total vector count; avg 0 term/freq >>>>> vector >>>>> fields per doc] >>>>> >>>>> No problems were detected with this index. >>>>> >>>>> ------------------------------------------indexwriter checkindex >>>>> --------------------------------------------------------------- >>>>> >>>>> $ java -classpath >>>>> >>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar >>>>> org.apache.lucene.index.CheckIndex >>>>> >>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index >>>>> >>>>> NOTE: testing will be more thorough if you run java with >>>>> '-ea:org.apache.lucene...', so assertions are enabled >>>>> >>>>> Opening index @ >>>>> >>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index >>>>> >>>>> Segments file=segments_a numSegments=1 version=FORMAT_DIAGNOSTICS >>>>> [Lucene >>>>> 2.9] >>>>> 1 of 1: name=_18 docCount=200000 >>>>> compound=true >>>>> hasProx=true >>>>> numFiles=1 >>>>> size (MB)=427.445 >>>>> diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M - >>>>> 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true, >>>>> mergeDocStores=true, java.vendor=Apple Inc., os.version=10.5.7, >>>>> source=merge, mergeFactor=4} >>>>> no deletions >>>>> test: open reader.........OK >>>>> test: fields, norms.......OK [4 fields] >>>>> test: terms, freq, prox...OK [3512343 terms; 80020204 terms/docs >>>>> pairs; >>>>> 163219760 tokens] >>>>> test: stored fields.......OK [200000 total field count; avg 1 fields >>>>> per >>>>> doc] >>>>> test: term vectors........OK [0 total vector count; avg 0 term/freq >>>>> vector >>>>> fields per doc] >>>>> >>>>> No problems were detected with this index. >>>>> >>>>> >>>>> >>>>> --------------------------------------------------------------------------------------------------------- >>>>> >>>>> >>>>> >>>>> >>>>> Thanks, >>>>> -Jibo >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Aug 1, 2009, at 2:08 AM, Michael McCandless wrote: >>>>> >>>>>> (Please note that ThreadedIndexWriter is source code available with >>>>>> the upcoming revision to Lucene in Action.) >>>>>> >>>>>> Phil, is it possible you are using an older version of the book's >>>>>> source code? In particular, can you check whether your version of >>>>>> ThreadedIndexWriter.java has this: >>>>>> >>>>>> public void close(boolean doWait) throws CorruptIndexException, >>>>>> IOException { >>>>>> finish(); >>>>>> super.close(doWait); >>>>>> } >>>>>> >>>>>> (I vaguely remember that being missing from earlier releases, which >>>>>> could explain what you're seeing). If you are missing that, can you >>>>>> download the current code from http://www.manning.com/hatcher3 and try >>>>>> again? >>>>>> >>>>>> If that's not the problem... can you post the benchmark alg you are >>>>>> using in each case? >>>>>> >>>>>> Mike >>>>>> >>>>>> On Fri, Jul 31, 2009 at 8:26 PM, Jibo John<jiboj...@mac.com> wrote: >>>>>>> >>>>>>> Hi Phil, >>>>>>> >>>>>>> It's 5 threads for IndexWriter. >>>>>>> >>>>>>> For ThreadedIndexWriter, I used: >>>>>>> >>>>>>> writer.num.threads=16 >>>>>>> writer.max.thread.queue.size=80 >>>>>>> >>>>>>> Thanks, >>>>>>> -Jibo >>>>>>> >>>>>>> On Jul 31, 2009, at 5:01 PM, Phil Whelan wrote: >>>>>>> >>>>>>>> Hi Jibo, >>>>>>>> >>>>>>>> Your mergeFactor is different, and the resulting numFiles (segment >>>>>>>> files) is different. Maybe each thread is responsible for a segment >>>>>>>> file. Just curious - do you have 3 threads? >>>>>>>> >>>>>>>> Phil >>>>>>>> >>>>>>>> >>>>>>>> --------------------------------------------------------------------- >>>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>>>>> >>>>>>> >>>>>>> >>>>>>> --------------------------------------------------------------------- >>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>>>> >>>>>>> >>>>>> >>>>>> --------------------------------------------------------------------- >>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>>> >>>>> >>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>> >>>>> >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >>> >>> >>> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org