Michael McCandless wrote: > OK, when you say "fair" I think you mean because you already had a > previous run that used compound file, you had to use compound file in > the run with the LUCENE-843 patch (etc)?
Yes, that's true. > The recommendations above should speed up Lucene with or without my patch. Sure, makes sense. > > When trying with MEM=512MB, it at first seemed faster, but then there > > were now and then local slow-downs, and eventually it became > a bit slower > > than the previous run. I know these are not merges, so they are either > > flushes (RAM directed), or GC activity. I will perhaps run > with GC debug > > flags and perhaps add a print at flush so to tell the culpritfor these > > local slow-downs. > > Hurm, odd. I haven't pushed RAM buffer up to 512 MB so it could be GC > cost somehow makes things worse ... curious. Ok I tried this with -verbose=gc and GC seems ok here - Here is a log snippet for the RAM=512MB setting run: ---------- log start ----------------- Speller: added 3290000 words in 241 seconds = 4 minutes 1 seconds Speller: added 3300000 words in 241 seconds = 4 minutes 1 seconds Speller: added 3310000 words in 242 seconds = 4 minutes 2 seconds Speller: added 3320000 words in 242 seconds = 4 minutes 2 seconds RAM: now flush @ usedMB=512.012 allocMB=512.012 triggerMB=512 flush: flushDocs=true flushDeletes=false flushDocStores=true numDocs=3328167 closeDocStore: 2 files to flush to segment _0 flush postings as segment _0 numDocs=3328167 <af type="tenured" id="40" timestamp="Mon Jun 25 22:50:18 2007" intervalms="7451.237"> <minimum requested_bytes="955824" /> <time exclusiveaccessms="0.039" /> <tenured freebytes="59197944" totalbytes="1572864000" percent="3" > <soa freebytes="59197944" totalbytes="1572864000" percent="3" /> <loa freebytes="0" totalbytes="0" percent="0" /> </tenured> <gc type="global" id="40" totalid="40" intervalms="7451.371"> <compaction movecount="7497075" movebytes="1061942632" reason="compact to meet allocation" /> <refs_cleared soft="0" threshold="32" weak="0" phantom="0" /> <finalization objectsqueued="1" /> <timesms mark="2507.521" sweep="25.223" compact="3992.243" total="6525.114" /> <tenured freebytes="297819120" totalbytes="1572864000" percent="18" > <soa freebytes="297819120" totalbytes="1572864000" percent="18" /> <loa freebytes="0" totalbytes="0" percent="0" /> </tenured> </gc> <tenured freebytes="296863296" totalbytes="1572864000" percent="18" > <soa freebytes="296863296" totalbytes="1572864000" percent="18" /> <loa freebytes="0" totalbytes="0" percent="0" /> </tenured> <time totalms="6525.340" /> </af> <af type="tenured" id="41" timestamp="Mon Jun 25 22:51:16 2007" intervalms="51348.671"> <minimum requested_bytes="24" /> <time exclusiveaccessms="0.045" /> <tenured freebytes="0" totalbytes="1572864000" percent="0" > <soa freebytes="0" totalbytes="1572864000" percent="0" /> <loa freebytes="0" totalbytes="0" percent="0" /> </tenured> <gc type="global" id="41" totalid="41" intervalms="51348.857"> <refs_cleared soft="0" threshold="32" weak="0" phantom="0" /> <finalization objectsqueued="0" /> <timesms mark="1369.963" sweep="22.935" compact="0.000" total="1392.988" /> <tenured freebytes="294024904" totalbytes="1572864000" percent="18" > <soa freebytes="294024904" totalbytes="1572864000" percent="18" /> <loa freebytes="0" totalbytes="0" percent="0" /> </tenured> </gc> <tenured freebytes="294024392" totalbytes="1572864000" percent="18" > <soa freebytes="294024392" totalbytes="1572864000" percent="18" /> <loa freebytes="0" totalbytes="0" percent="0" /> </tenured> <time totalms="1393.264" /> </af> oldRAMSize=339488720 newFlushedSize=299712406 docs/MB=11,643.949 new/old=88.283% [EMAIL PROTECTED] main: now checkpoint "segments_2" [1 segments ; isCommit = true] [EMAIL PROTECTED] main: IncRef "_0.fnm": pre-incr count is 0 [EMAIL PROTECTED] main: IncRef "_0.frq": pre-incr count is 0 [EMAIL PROTECTED] main: IncRef "_0.prx": pre-incr count is 0 [EMAIL PROTECTED] main: IncRef "_0.tis": pre-incr count is 0 [EMAIL PROTECTED] main: IncRef "_0.tii": pre-incr count is 0 [EMAIL PROTECTED] main: IncRef "_0.nrm": pre-incr count is 0 [EMAIL PROTECTED] main: IncRef "_0.fdx": pre-incr count is 0 [EMAIL PROTECTED] main: IncRef "_0.fdt": pre-incr count is 0 [EMAIL PROTECTED] main: deleteCommits: now remove commit "segments_1" [EMAIL PROTECTED] main: DecRef "segments_1": pre-decr count is 1 [EMAIL PROTECTED] main: delete "segments_1" [EMAIL PROTECTED] main: now checkpoint "segments_3" [1 segments ; isCommit = true] [EMAIL PROTECTED] main: IncRef "_0.cfs": pre-incr count is 0 [EMAIL PROTECTED] main: deleteCommits: now remove commit "segments_2" [EMAIL PROTECTED] main: DecRef "_0.fnm": pre-decr count is 1 [EMAIL PROTECTED] main: delete "_0.fnm" [EMAIL PROTECTED] main: DecRef "_0.frq": pre-decr count is 1 [EMAIL PROTECTED] main: delete "_0.frq" [EMAIL PROTECTED] main: DecRef "_0.prx": pre-decr count is 1 [EMAIL PROTECTED] main: delete "_0.prx" [EMAIL PROTECTED] main: DecRef "_0.tis": pre-decr count is 1 [EMAIL PROTECTED] main: delete "_0.tis" [EMAIL PROTECTED] main: DecRef "_0.tii": pre-decr count is 1 [EMAIL PROTECTED] main: delete "_0.tii" [EMAIL PROTECTED] main: DecRef "_0.nrm": pre-decr count is 1 [EMAIL PROTECTED] main: delete "_0.nrm" [EMAIL PROTECTED] main: DecRef "_0.fdx": pre-decr count is 1 [EMAIL PROTECTED] main: delete "_0.fdx" [EMAIL PROTECTED] main: DecRef "_0.fdt": pre-decr count is 1 [EMAIL PROTECTED] main: delete "_0.fdt" [EMAIL PROTECTED] main: DecRef "segments_2": pre-decr count is 1 [EMAIL PROTECTED] main: delete "segments_2" Speller: added 3330000 words in 339 seconds = 5 minutes 39 seconds Speller: added 3340000 words in 340 seconds = 5 minutes 40 seconds Speller: added 3350000 words in 341 seconds = 5 minutes 41 seconds ---------- log start ----------------- So there is about a 100 seconds gap, out of which 8 seconds are GC, rest is the flush. I am not saying this is a problem, just bringing the info. The behavior along the run seems similar - that was the first flush, after adding 3.3M docs (words). The next flush was after adding 6.5M docs, ~100 secs again, similar GC/flush times. So I guess it makes sense one have to pay some time for flushing that large number of added docs. It is interesting that beyond a certain value there's no point in allowing more RAM, question is what would be the recommended value... I sort of followed (all the way:-)) the "let it have as much memory as possible" - guess the best recommendation should be lower than that. > > > Other than that, I will perhaps try to index .GOV2 (25 > Million HTML docs) > > with this patch. The way I indexed it before it took about 4 days - > > running in 4 threads, and creating 36 indexes. This is even more a real > > life scenario, it involves HTML parsing, standard analysis, and merging > > (to some extent). Since there are 4 threads each one will get, say, > > 250MB. Again, for a "fair" comparison, I will remain with compound. > > OK, because you're doing StandardAnalyzer and HTML parsing and > presumably loading one-doc-per-file, most of your time is spent > outside of Lucene indexing so I'd expect less that 50% speedup in > this case. ~25M docs are in ~27K zip files, so IO side should be brighter... --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]