Re: ThreadedIndexWriter vs. IndexWriter

Michael McCandless Tue, 11 Aug 2009 03:14:11 -0700

I'm baffled why you're losing docs w/ ThreadedIndexWriter.

One question: your Lucene core JAR seems to be newer than the last
MEAP update.  Did you update it manually?


Also, your indexes were optimized, but your algs don't have an
optimize step -- did you separately run an optimize?

Could you zip up the whole shebang (ThreadedIndexWriter.java,
CreateThreadedIndexTask.java, the algs) & post?  Please CC me directly
so I can grab the zip file... thanks.

Mike

On Mon, Aug 3, 2009 at 12:37 PM, Jibo John<[email protected]> wrote:
> Mike,
>
> Verified that I have the latest source code.
> Here are the alg files and the checkindexer output.
>
>
> ----------------------------------------- indexwriter
> alg----------------------------------------------------------------
>
> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
> directory=FSDirectory
>
> doc.stored = true                                                    #A
> docs.file=wikipedia.lines.txt
> ram.flush.mb=50
> compound=false
> merge.factor=5
> doc.add.log.step=1000
> doc.term.vector=false
> doc.term.vector.positions=false
> doc.term.vector.offsets=false
>
> { "Rounds"                                                           #B
>  ResetSystemErase
>  { "BuildIndex"
>  -CreateIndex()
>  [ { "AddDocs" AddDoc > : 40000 ] : 5                                    #C
>  -CloseIndex()
>  }
>  NewRound
> } : 1
>
> RepSumByPrefRound BuildIndex                                         #D
>
> -----------------------------------------threadedindexwriter alg
> ----------------------------------------------------------------
>
> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
> directory=FSDirectory
>
> doc.stored = true                                                    #A
> docs.file=wikipedia.lines.txt
> ram.flush.mb=50
> compound=false
> merge.factor=5
> doc.add.log.step=1000
> doc.term.vector=false
> doc.term.vector.positions=false
> doc.term.vector.offsets=false
> writer.num.threads=15
> writer.max.thread.queue.size=75
> work.dir=work_t
>
>
> { "Rounds"                                                           #B
>  ResetSystemErase
>  { "BuildIndex"
>  -CreateThreadedIndex()
>   { "AddDocs" AddDoc > : 200000
>  -CloseIndex()
>  }
>  NewRound
> } : 1
>
> RepSumByPrefRound BuildIndex                                         #D
>
>
> -----------------------------------------------threadedindexwriter
> checkindex ----------------------------------------------------------
>
>
> $ java -classpath
> /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar
> org.apache.lucene.index.CheckIndex
> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index
>
> NOTE: testing will be more thorough if you run java with
> '-ea:org.apache.lucene...', so assertions are enabled
>
> Opening index @
> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index
>
> Segments file=segments_3 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene
> 2.9]
>  1 of 1: name=_p docCount=199941
>   compound=true
>   hasProx=true
>   numFiles=3
>   size (MB)=317.1
>   diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M -
> 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true,
> mergeDocStores=false, java.vendor=Apple Inc., os.version=10.5.7,
> source=merge, mergeFactor=5}
>   docStoreOffset=0
>   docStoreSegment=_0
>   docStoreIsCompoundFile=false
>   no deletions
>   test: open reader.........OK
>   test: fields, norms.......OK [4 fields]
>   test: terms, freq, prox...OK [1269552 terms; 67887116 terms/docs pairs;
> 133241176 tokens]
>   test: stored fields.......OK [199941 total field count; avg 1 fields per
> doc]
>   test: term vectors........OK [0 total vector count; avg 0 term/freq vector
> fields per doc]
>
> No problems were detected with this index.
>
> ------------------------------------------indexwriter checkindex
> ---------------------------------------------------------------
>
> $ java -classpath
> /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar
> org.apache.lucene.index.CheckIndex
> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index
>
> NOTE: testing will be more thorough if you run java with
> '-ea:org.apache.lucene...', so assertions are enabled
>
> Opening index @
> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index
>
> Segments file=segments_a numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene
> 2.9]
>  1 of 1: name=_18 docCount=200000
>   compound=true
>   hasProx=true
>   numFiles=1
>   size (MB)=427.445
>   diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M -
> 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true,
> mergeDocStores=true, java.vendor=Apple Inc., os.version=10.5.7,
> source=merge, mergeFactor=4}
>   no deletions
>   test: open reader.........OK
>   test: fields, norms.......OK [4 fields]
>   test: terms, freq, prox...OK [3512343 terms; 80020204 terms/docs pairs;
> 163219760 tokens]
>   test: stored fields.......OK [200000 total field count; avg 1 fields per
> doc]
>   test: term vectors........OK [0 total vector count; avg 0 term/freq vector
> fields per doc]
>
> No problems were detected with this index.
>
> ---------------------------------------------------------------------------------------------------------
>
>
>
>
> Thanks,
> -Jibo
>
>
>
>
>
>
>
> On Aug 1, 2009, at 2:08 AM, Michael McCandless wrote:
>
>> (Please note that ThreadedIndexWriter is source code available with
>> the upcoming revision to Lucene in Action.)
>>
>> Phil, is it possible you are using an older version of the book's
>> source code?  In particular, can you check whether your version of
>> ThreadedIndexWriter.java has this:
>>
>>  public void close(boolean doWait) throws CorruptIndexException,
>> IOException {
>>   finish();
>>   super.close(doWait);
>>  }
>>
>> (I vaguely remember that being missing from earlier releases, which
>> could explain what you're seeing).  If you are missing that, can you
>> download the current code from http://www.manning.com/hatcher3 and try
>> again?
>>
>> If that's not the problem... can you post the benchmark alg you are
>> using in each case?
>>
>> Mike
>>
>> On Fri, Jul 31, 2009 at 8:26 PM, Jibo John<[email protected]> wrote:
>>>
>>> Hi Phil,
>>>
>>> It's 5 threads for IndexWriter.
>>>
>>> For ThreadedIndexWriter, I used:
>>>
>>> writer.num.threads=16
>>> writer.max.thread.queue.size=80
>>>
>>> Thanks,
>>> -Jibo
>>>
>>> On Jul 31, 2009, at 5:01 PM, Phil Whelan wrote:
>>>
>>>> Hi Jibo,
>>>>
>>>> Your mergeFactor is different, and the resulting numFiles (segment
>>>> files) is different. Maybe each thread is responsible for a segment
>>>> file. Just curious - do you have 3 threads?
>>>>
>>>> Phil
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: ThreadedIndexWriter vs. IndexWriter

Reply via email to