Re: ThreadedIndexWriter vs. IndexWriter

Michael McCandless Tue, 11 Aug 2009 16:22:03 -0700

Phew!  Thank you for raising this... it was a sneaky one.

Mike


On Tue, Aug 11, 2009 at 4:13 PM, Jibo John<[email protected]> wrote:
> Mike,
>
> Yes, it works perfect !
>
> I did observe a dip in the indexing throughput (1855 recs/sec vs. 2200
> recs/sec previously), but, more importantly, no data is lost this time.
>
> Thanks for helping me nail this down.
>
> -Jibo
>
>
>
> On Aug 11, 2009, at 11:12 AM, Michael McCandless wrote:
>
>> OK I found the problem!
>>
>> It was losing docs from the queue, when shutting down the thread pool,
>> because we were calling super's addDocument(doc) not addDocument(doc,
>> analyzer).  IndexWriter was simply forwarding that call to
>> ThreadedIndexWriter's addDocument(doc, analyzer) which in turn would
>> do nothing because the thread pool was already told to shut down.
>> Larger queues made it much more likely to happen.
>>
>> Can you try the new version (attached)?
>>
>> Also, make sure you add 'doc.reuse.fields=false' to your alg (on
>> trunk).
>>
>> Mike
>>
>> On Tue, Aug 11, 2009 at 12:39 PM, Jibo John<[email protected]> wrote:
>>>
>>> Mike,
>>>
>>> I wasn't exactly using the lucene core jar from MEAP.
>>>
>>> I have been building lucene from the source, and running the tests under
>>> lucene/java/trunk/contrib/benchmark/ (checked out 2 weeks ago, I guess)
>>>  and, also under  lucene/java/tags/lucene_2_4_1/contrib/benchmark/.
>>> In both cases, copied CreateThreadedIndexTask to
>>> org.apache.lucene.benchmark.byTask.tasks and ThreadedIndexWriter to
>>> org.apache.lucene.index.
>>>
>>> I have observed the issue in both the versions of lucene.
>>>
>>> Indexes were optimized separately using Lucli.
>>>
>>>
>>> PFA the classes and the alg.
>>>
>>>
>>>
>>>
>>>
>>>
>>> Thank you for your help with this one.
>>>
>>> -Jibo
>>>
>>>
>>>
>>>
>>> On Aug 11, 2009, at 3:13 AM, Michael McCandless wrote:
>>>
>>>> I'm baffled why you're losing docs w/ ThreadedIndexWriter.
>>>>
>>>> One question: your Lucene core JAR seems to be newer than the last
>>>> MEAP update.  Did you update it manually?
>>>>
>>>> Also, your indexes were optimized, but your algs don't have an
>>>> optimize step -- did you separately run an optimize?
>>>>
>>>> Could you zip up the whole shebang (ThreadedIndexWriter.java,
>>>> CreateThreadedIndexTask.java, the algs) & post?  Please CC me directly
>>>> so I can grab the zip file... thanks.
>>>>
>>>> Mike
>>>>
>>>> On Mon, Aug 3, 2009 at 12:37 PM, Jibo John<[email protected]> wrote:
>>>>>
>>>>> Mike,
>>>>>
>>>>> Verified that I have the latest source code.
>>>>> Here are the alg files and the checkindexer output.
>>>>>
>>>>>
>>>>> ----------------------------------------- indexwriter
>>>>> alg----------------------------------------------------------------
>>>>>
>>>>> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
>>>>> directory=FSDirectory
>>>>>
>>>>> doc.stored = true                                                    #A
>>>>> docs.file=wikipedia.lines.txt
>>>>> ram.flush.mb=50
>>>>> compound=false
>>>>> merge.factor=5
>>>>> doc.add.log.step=1000
>>>>> doc.term.vector=false
>>>>> doc.term.vector.positions=false
>>>>> doc.term.vector.offsets=false
>>>>>
>>>>> { "Rounds"                                                           #B
>>>>>  ResetSystemErase
>>>>>  { "BuildIndex"
>>>>>  -CreateIndex()
>>>>>  [ { "AddDocs" AddDoc > : 40000 ] : 5
>>>>>  #C
>>>>>  -CloseIndex()
>>>>>  }
>>>>>  NewRound
>>>>> } : 1
>>>>>
>>>>> RepSumByPrefRound BuildIndex                                         #D
>>>>>
>>>>> -----------------------------------------threadedindexwriter alg
>>>>> ----------------------------------------------------------------
>>>>>
>>>>> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
>>>>> directory=FSDirectory
>>>>>
>>>>> doc.stored = true                                                    #A
>>>>> docs.file=wikipedia.lines.txt
>>>>> ram.flush.mb=50
>>>>> compound=false
>>>>> merge.factor=5
>>>>> doc.add.log.step=1000
>>>>> doc.term.vector=false
>>>>> doc.term.vector.positions=false
>>>>> doc.term.vector.offsets=false
>>>>> writer.num.threads=15
>>>>> writer.max.thread.queue.size=75
>>>>> work.dir=work_t
>>>>>
>>>>>
>>>>> { "Rounds"                                                           #B
>>>>>  ResetSystemErase
>>>>>  { "BuildIndex"
>>>>>  -CreateThreadedIndex()
>>>>>  { "AddDocs" AddDoc > : 200000
>>>>>  -CloseIndex()
>>>>>  }
>>>>>  NewRound
>>>>> } : 1
>>>>>
>>>>> RepSumByPrefRound BuildIndex                                         #D
>>>>>
>>>>>
>>>>> -----------------------------------------------threadedindexwriter
>>>>> checkindex ----------------------------------------------------------
>>>>>
>>>>>
>>>>> $ java -classpath
>>>>>
>>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar
>>>>> org.apache.lucene.index.CheckIndex
>>>>>
>>>>>
>>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index
>>>>>
>>>>> NOTE: testing will be more thorough if you run java with
>>>>> '-ea:org.apache.lucene...', so assertions are enabled
>>>>>
>>>>> Opening index @
>>>>>
>>>>>
>>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index
>>>>>
>>>>> Segments file=segments_3 numSegments=1 version=FORMAT_DIAGNOSTICS
>>>>> [Lucene
>>>>> 2.9]
>>>>>  1 of 1: name=_p docCount=199941
>>>>>  compound=true
>>>>>  hasProx=true
>>>>>  numFiles=3
>>>>>  size (MB)=317.1
>>>>>  diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M -
>>>>> 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true,
>>>>> mergeDocStores=false, java.vendor=Apple Inc., os.version=10.5.7,
>>>>> source=merge, mergeFactor=5}
>>>>>  docStoreOffset=0
>>>>>  docStoreSegment=_0
>>>>>  docStoreIsCompoundFile=false
>>>>>  no deletions
>>>>>  test: open reader.........OK
>>>>>  test: fields, norms.......OK [4 fields]
>>>>>  test: terms, freq, prox...OK [1269552 terms; 67887116 terms/docs
>>>>> pairs;
>>>>> 133241176 tokens]
>>>>>  test: stored fields.......OK [199941 total field count; avg 1 fields
>>>>> per
>>>>> doc]
>>>>>  test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>>> vector
>>>>> fields per doc]
>>>>>
>>>>> No problems were detected with this index.
>>>>>
>>>>> ------------------------------------------indexwriter checkindex
>>>>> ---------------------------------------------------------------
>>>>>
>>>>> $ java -classpath
>>>>>
>>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar
>>>>> org.apache.lucene.index.CheckIndex
>>>>>
>>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index
>>>>>
>>>>> NOTE: testing will be more thorough if you run java with
>>>>> '-ea:org.apache.lucene...', so assertions are enabled
>>>>>
>>>>> Opening index @
>>>>>
>>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index
>>>>>
>>>>> Segments file=segments_a numSegments=1 version=FORMAT_DIAGNOSTICS
>>>>> [Lucene
>>>>> 2.9]
>>>>>  1 of 1: name=_18 docCount=200000
>>>>>  compound=true
>>>>>  hasProx=true
>>>>>  numFiles=1
>>>>>  size (MB)=427.445
>>>>>  diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M -
>>>>> 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true,
>>>>> mergeDocStores=true, java.vendor=Apple Inc., os.version=10.5.7,
>>>>> source=merge, mergeFactor=4}
>>>>>  no deletions
>>>>>  test: open reader.........OK
>>>>>  test: fields, norms.......OK [4 fields]
>>>>>  test: terms, freq, prox...OK [3512343 terms; 80020204 terms/docs
>>>>> pairs;
>>>>> 163219760 tokens]
>>>>>  test: stored fields.......OK [200000 total field count; avg 1 fields
>>>>> per
>>>>> doc]
>>>>>  test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>>> vector
>>>>> fields per doc]
>>>>>
>>>>> No problems were detected with this index.
>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> -Jibo
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Aug 1, 2009, at 2:08 AM, Michael McCandless wrote:
>>>>>
>>>>>> (Please note that ThreadedIndexWriter is source code available with
>>>>>> the upcoming revision to Lucene in Action.)
>>>>>>
>>>>>> Phil, is it possible you are using an older version of the book's
>>>>>> source code?  In particular, can you check whether your version of
>>>>>> ThreadedIndexWriter.java has this:
>>>>>>
>>>>>>  public void close(boolean doWait) throws CorruptIndexException,
>>>>>> IOException {
>>>>>>  finish();
>>>>>>  super.close(doWait);
>>>>>>  }
>>>>>>
>>>>>> (I vaguely remember that being missing from earlier releases, which
>>>>>> could explain what you're seeing).  If you are missing that, can you
>>>>>> download the current code from http://www.manning.com/hatcher3 and try
>>>>>> again?
>>>>>>
>>>>>> If that's not the problem... can you post the benchmark alg you are
>>>>>> using in each case?
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> On Fri, Jul 31, 2009 at 8:26 PM, Jibo John<[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Phil,
>>>>>>>
>>>>>>> It's 5 threads for IndexWriter.
>>>>>>>
>>>>>>> For ThreadedIndexWriter, I used:
>>>>>>>
>>>>>>> writer.num.threads=16
>>>>>>> writer.max.thread.queue.size=80
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Jibo
>>>>>>>
>>>>>>> On Jul 31, 2009, at 5:01 PM, Phil Whelan wrote:
>>>>>>>
>>>>>>>> Hi Jibo,
>>>>>>>>
>>>>>>>> Your mergeFactor is different, and the resulting numFiles (segment
>>>>>>>> files) is different. Maybe each thread is responsible for a segment
>>>>>>>> file. Just curious - do you have 3 threads?
>>>>>>>>
>>>>>>>> Phil
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]
>>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>>
>>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>>
>>>
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: ThreadedIndexWriter vs. IndexWriter

Reply via email to