Re: ThreadedIndexWriter vs. IndexWriter

Jibo John Tue, 11 Aug 2009 13:14:17 -0700

Mike,

Yes, it works perfectly!

I did observe a dip in indexing throughput (1855 recs/sec vs. 2200 recs/sec previously), but, more importantly, no data is lost this time.


Thanks for helping me nail this down.

-Jibo



On Aug 11, 2009, at 11:12 AM, Michael McCandless wrote:

OK I found the problem!

It was losing docs from the queue, when shutting down the thread pool,
because we were calling super's addDocument(doc) not addDocument(doc,
analyzer).  IndexWriter was simply forwarding that call to
ThreadedIndexWriter's addDocument(doc, analyzer) which in turn would
do nothing because the thread pool was already told to shut down.
Larger queues made it much more likely to happen.
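
A minimal sketch of the failure mode described above, using hypothetical stand-in classes (BaseWriter for IndexWriter, QueueingWriter for ThreadedIndexWriter) rather than the real Lucene or book source. It is single-threaded and drops every queued doc in the buggy variant, whereas the real run lost only the docs still queued at shutdown; the point is only to show why calling super's one-argument overload re-enters the subclass override:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for IndexWriter: the one-argument overload
// forwards to the two-argument one through a virtual call.
class BaseWriter {
    final List<String> indexed = new ArrayList<>();

    void addDocument(String doc) {
        addDocument(doc, "default-analyzer"); // dispatches to a subclass override if one exists
    }

    void addDocument(String doc, String analyzer) {
        indexed.add(doc); // the "real" indexing work
    }
}

// Hypothetical stand-in for ThreadedIndexWriter: queues docs, then drains them on finish().
class QueueingWriter extends BaseWriter {
    private final List<String> queue = new ArrayList<>();
    private final boolean buggy;
    private boolean shutDown = false;

    QueueingWriter(boolean buggy) {
        this.buggy = buggy;
    }

    @Override
    void addDocument(String doc, String analyzer) {
        if (shutDown) {
            return; // mimics a pool that silently ignores work after shutdown
        }
        queue.add(doc);
    }

    void finish() {
        shutDown = true;
        for (String doc : queue) {
            if (buggy) {
                // Runs BaseWriter.addDocument(doc), which calls the override above: doc is dropped.
                super.addDocument(doc);
            } else {
                // Runs BaseWriter's two-argument body directly: doc is indexed.
                super.addDocument(doc, "default-analyzer");
            }
        }
        queue.clear();
    }
}

class DispatchPitfallDemo {
    // Adds `docs` documents, finishes the writer, and returns how many were indexed.
    static int indexedCount(boolean buggy, int docs) {
        QueueingWriter writer = new QueueingWriter(buggy);
        for (int i = 0; i < docs; i++) {
            writer.addDocument("doc" + i);
        }
        writer.finish();
        return writer.indexed.size();
    }

    public static void main(String[] args) {
        System.out.println("buggy: " + indexedCount(true, 5));  // prints "buggy: 0"
        System.out.println("fixed: " + indexedCount(false, 5)); // prints "fixed: 5"
    }
}
```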

Can you try the new version (attached)?

Also, make sure you add 'doc.reuse.fields=false' to your alg (on
trunk).

Mike

On Tue, Aug 11, 2009 at 12:39 PM, Jibo John <[email protected]> wrote:

Mike,

I wasn't exactly using the Lucene core jar from MEAP.
I have been building Lucene from source and running the tests under
lucene/java/trunk/contrib/benchmark/ (checked out 2 weeks ago, I guess)
and also under lucene/java/tags/lucene_2_4_1/contrib/benchmark/.
In both cases, I copied CreateThreadedIndexTask to
org.apache.lucene.benchmark.byTask.tasks and ThreadedIndexWriter to
org.apache.lucene.index.

I have observed the issue in both versions of Lucene.

Indexes were optimized separately using Lucli.


Please find attached the classes and the alg.






Thank you for your help with this one.

-Jibo




On Aug 11, 2009, at 3:13 AM, Michael McCandless wrote:
I'm baffled why you're losing docs w/ ThreadedIndexWriter.

One question: your Lucene core JAR seems to be newer than the last
MEAP update.  Did you update it manually?

Also, your indexes were optimized, but your algs don't have an
optimize step -- did you separately run an optimize?

Could you zip up the whole shebang (ThreadedIndexWriter.java,
CreateThreadedIndexTask.java, the algs) & post? Please CC me directly
so I can grab the zip file... thanks.

Mike

On Mon, Aug 3, 2009 at 12:37 PM, Jibo John <[email protected]> wrote:
Mike,

I verified that I have the latest source code.
Here are the alg files and the CheckIndex output.


----------------------------------------- indexwriter alg -----------------------------------------

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
directory=FSDirectory
doc.stored =true #A
docs.file=wikipedia.lines.txt
ram.flush.mb=50
compound=false
merge.factor=5
doc.add.log.step=1000
doc.term.vector=false
doc.term.vector.positions=false
doc.term.vector.offsets=false
{ "Rounds" #B
 ResetSystemErase
 { "BuildIndex"
 -CreateIndex()
 [ { "AddDocs" AddDoc > : 40000 ] : 5
 #C
 -CloseIndex()
 }
 NewRound
} : 1
RepSumByPrefRound BuildIndex #D

----------------------------------------- threadedindexwriter alg -----------------------------------------

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
directory=FSDirectory
doc.stored =true #A
docs.file=wikipedia.lines.txt
ram.flush.mb=50
compound=false
merge.factor=5
doc.add.log.step=1000
doc.term.vector=false
doc.term.vector.positions=false
doc.term.vector.offsets=false
writer.num.threads=15
writer.max.thread.queue.size=75
work.dir=work_t
{ "Rounds" #B
 ResetSystemErase
 { "BuildIndex"
 -CreateThreadedIndex()
 { "AddDocs" AddDoc > : 200000
 -CloseIndex()
 }
 NewRound
} : 1
RepSumByPrefRound BuildIndex #D

----------------------------------------- threadedindexwriter checkindex -----------------------------------------

$ java -classpath
/Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar
org.apache.lucene.index.CheckIndex
/Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index
NOTE: testing will be more thorough if you run java with
'-ea:org.apache.lucene...', so assertions are enabled

Opening index @
/Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index
Segments file=segments_3 numSegments=1 version=FORMAT_DIAGNOSTICS[Lucene
2.9]
 1 of 1: name=_p docCount=199941
 compound=true
 hasProx=true
 numFiles=3
 size (MB)=317.1
diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev779767M -
2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true,
mergeDocStores=false, java.vendor=Apple Inc., os.version=10.5.7,
source=merge, mergeFactor=5}
 docStoreOffset=0
 docStoreSegment=_0
 docStoreIsCompoundFile=false
 no deletions
 test: open reader.........OK
 test: fields, norms.......OK [4 fields]
 test: terms, freq, prox...OK [1269552 terms; 67887116 terms/docs pairs; 133241176 tokens]
 test: stored fields.......OK [199941 total field count; avg 1 fields per doc]
 test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

No problems were detected with this index.

----------------------------------------- indexwriter checkindex -----------------------------------------

$ java -classpath
/Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar
org.apache.lucene.index.CheckIndex
/Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index
NOTE: testing will be more thorough if you run java with
'-ea:org.apache.lucene...', so assertions are enabled

Opening index @
/Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index
Segments file=segments_a numSegments=1 version=FORMAT_DIAGNOSTICS[Lucene
2.9]
 1 of 1: name=_18 docCount=200000
 compound=true
 hasProx=true
 numFiles=1
 size (MB)=427.445
diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev779767M -
2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true,
mergeDocStores=true, java.vendor=Apple Inc., os.version=10.5.7,
source=merge, mergeFactor=4}
 no deletions
 test: open reader.........OK
 test: fields, norms.......OK [4 fields]
 test: terms, freq, prox...OK [3512343 terms; 80020204 terms/docs pairs; 163219760 tokens]
 test: stored fields.......OK [200000 total field count; avg 1 fields per doc]
 test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

No problems were detected with this index.


---------------------------------------------------------------------------------------------------------




Thanks,
-Jibo







On Aug 1, 2009, at 2:08 AM, Michael McCandless wrote:
(Please note that ThreadedIndexWriter is source code available with
the upcoming revision to Lucene in Action.)

Phil, is it possible you are using an older version of the book's
source code?  In particular, can you check whether your version of
ThreadedIndexWriter.java has this:

  public void close(boolean doWait) throws CorruptIndexException, IOException {
    finish();
    super.close(doWait);
  }

(I vaguely remember that being missing from earlier releases, which
could explain what you're seeing). If you are missing that, can you
download the current code from http://www.manning.com/hatcher3 and try
again?

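
To illustrate why a finish() step before super.close() matters, here is a rough sketch of the drain-then-close idea. It assumes finish() shuts down a thread pool and waits for queued jobs to complete; the class and method names (DrainBeforeCloseSketch, runAndDrain) are hypothetical and this is not the book's actual code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class DrainBeforeCloseSketch {
    // Submits `jobs` small tasks, then drains the pool the way a finish() step
    // might, and returns how many tasks actually completed.
    static int runAndDrain(int threads, int jobs) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < jobs; i++) {
            pool.execute(() -> completed.incrementAndGet()); // stands in for one addDocument job
        }

        pool.shutdown(); // stop accepting new work; already queued jobs still run
        try {
            // Block until every queued job has finished before "closing the writer".
            while (!pool.awaitTermination(1, TimeUnit.SECONDS)) {
                // keep waiting
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(runAndDrain(4, 1000)); // prints 1000
    }
}
```

Without the wait, the caller could go on to close the underlying writer while jobs are still sitting in the queue, which is one way documents could go missing.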
If that's not the problem... can you post the benchmark alg you are
using in each case?

Mike

On Fri, Jul 31, 2009 at 8:26 PM, Jibo John <[email protected]> wrote:
Hi Phil,

It's 5 threads for IndexWriter.

For ThreadedIndexWriter, I used:

writer.num.threads=16
writer.max.thread.queue.size=80
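
As a rough illustration of what these two settings could control, the sketch below builds a fixed-size pool with a bounded queue using java.util.concurrent. That the writer constructs its pool this way, and with CallerRunsPolicy, is an assumption; BoundedPoolSketch and its methods are hypothetical names:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class BoundedPoolSketch {
    // numThreads plays the role of writer.num.threads,
    // maxQueueSize the role of writer.max.thread.queue.size (assumed mapping).
    static ThreadPoolExecutor newPool(int numThreads, int maxQueueSize) {
        return new ThreadPoolExecutor(
            numThreads, numThreads,
            0L, TimeUnit.SECONDS,
            new ArrayBlockingQueue<Runnable>(maxQueueSize),
            // When the queue is full, the submitting thread runs the job itself.
            new ThreadPoolExecutor.CallerRunsPolicy());
    }

    // Returns true if a task handed to the pool after shutdown() was executed.
    static boolean runsAfterShutdown() {
        ThreadPoolExecutor pool = newPool(16, 80);
        pool.shutdown();
        AtomicBoolean ran = new AtomicBoolean(false);
        // The task is rejected; CallerRunsPolicy discards it once the pool is shut down.
        pool.execute(() -> ran.set(true));
        return ran.get();
    }

    public static void main(String[] args) {
        System.out.println(runsAfterShutdown()); // prints false
    }
}
```

Note that CallerRunsPolicy discards a rejected task without any error once the executor has been shut down, so work submitted late disappears quietly.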

Thanks,
-Jibo

On Jul 31, 2009, at 5:01 PM, Phil Whelan wrote:
Hi Jibo,
Your mergeFactor is different, and the resulting numFiles (segment
files) is different. Maybe each thread is responsible for a segment
file. Just curious - do you have 3 threads?

Phil

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: java-user-[email protected]