Michael McCandless wrote:

> OK, when you say "fair" I think you mean because you already had a
> previous run that used compound file, you had to use compound file in
> the run with the LUCENE-843 patch (etc)?

Yes, that's true.

> The recommendations above should speed up Lucene with or without my
> patch.

Sure, makes sense.

> > When trying with MEM=512MB, it at first seemed faster, but then there
> > were now and then local slow-downs, and eventually it became a bit
> > slower than the previous run. I know these are not merges, so they are
> > either flushes (RAM directed) or GC activity. I will perhaps run with
> > GC debug flags and perhaps add a print at flush so as to tell the
> > culprit for these local slow-downs.
>
> Hurm, odd.  I haven't pushed the RAM buffer up to 512 MB, so it could be
> that GC cost somehow makes things worse ... curious.

OK, I tried this with -verbose:gc, and GC seems OK here.

Here is a log snippet from the RAM=512MB run:

---------- log start -----------------
Speller: added 3290000 words in 241 seconds = 4 minutes 1 seconds
Speller: added 3300000 words in 241 seconds = 4 minutes 1 seconds
Speller: added 3310000 words in 242 seconds = 4 minutes 2 seconds
Speller: added 3320000 words in 242 seconds = 4 minutes 2 seconds

  RAM: now flush @ usedMB=512.012 allocMB=512.012 triggerMB=512
  flush: flushDocs=true flushDeletes=false flushDocStores=true numDocs=3328167

closeDocStore: 2 files to flush to segment _0

flush postings as segment _0 numDocs=3328167
<af type="tenured" id="40" timestamp="Mon Jun 25 22:50:18 2007"
intervalms="7451.237">
  <minimum requested_bytes="955824" />
  <time exclusiveaccessms="0.039" />
  <tenured freebytes="59197944" totalbytes="1572864000" percent="3" >
    <soa freebytes="59197944" totalbytes="1572864000" percent="3" />
    <loa freebytes="0" totalbytes="0" percent="0" />
  </tenured>
  <gc type="global" id="40" totalid="40" intervalms="7451.371">
    <compaction movecount="7497075" movebytes="1061942632" reason="compact to meet allocation" />
    <refs_cleared soft="0" threshold="32" weak="0" phantom="0" />
    <finalization objectsqueued="1" />
    <timesms mark="2507.521" sweep="25.223" compact="3992.243"
total="6525.114" />
    <tenured freebytes="297819120" totalbytes="1572864000" percent="18" >
      <soa freebytes="297819120" totalbytes="1572864000" percent="18" />
      <loa freebytes="0" totalbytes="0" percent="0" />
    </tenured>
  </gc>
  <tenured freebytes="296863296" totalbytes="1572864000" percent="18" >
    <soa freebytes="296863296" totalbytes="1572864000" percent="18" />
    <loa freebytes="0" totalbytes="0" percent="0" />
  </tenured>
  <time totalms="6525.340" />
</af>

<af type="tenured" id="41" timestamp="Mon Jun 25 22:51:16 2007"
intervalms="51348.671">
  <minimum requested_bytes="24" />
  <time exclusiveaccessms="0.045" />
  <tenured freebytes="0" totalbytes="1572864000" percent="0" >
    <soa freebytes="0" totalbytes="1572864000" percent="0" />
    <loa freebytes="0" totalbytes="0" percent="0" />
  </tenured>
  <gc type="global" id="41" totalid="41" intervalms="51348.857">
    <refs_cleared soft="0" threshold="32" weak="0" phantom="0" />
    <finalization objectsqueued="0" />
    <timesms mark="1369.963" sweep="22.935" compact="0.000"
total="1392.988" />
    <tenured freebytes="294024904" totalbytes="1572864000" percent="18" >
      <soa freebytes="294024904" totalbytes="1572864000" percent="18" />
      <loa freebytes="0" totalbytes="0" percent="0" />
    </tenured>
  </gc>
  <tenured freebytes="294024392" totalbytes="1572864000" percent="18" >
    <soa freebytes="294024392" totalbytes="1572864000" percent="18" />
    <loa freebytes="0" totalbytes="0" percent="0" />
  </tenured>
  <time totalms="1393.264" />
</af>

  oldRAMSize=339488720 newFlushedSize=299712406 docs/MB=11,643.949 new/old=88.283%
[EMAIL PROTECTED] main: now checkpoint "segments_2" [1 segments ; isCommit = true]
[EMAIL PROTECTED] main:   IncRef "_0.fnm": pre-incr count is 0
[EMAIL PROTECTED] main:   IncRef "_0.frq": pre-incr count is 0
[EMAIL PROTECTED] main:   IncRef "_0.prx": pre-incr count is 0
[EMAIL PROTECTED] main:   IncRef "_0.tis": pre-incr count is 0
[EMAIL PROTECTED] main:   IncRef "_0.tii": pre-incr count is 0
[EMAIL PROTECTED] main:   IncRef "_0.nrm": pre-incr count is 0
[EMAIL PROTECTED] main:   IncRef "_0.fdx": pre-incr count is 0
[EMAIL PROTECTED] main:   IncRef "_0.fdt": pre-incr count is 0
[EMAIL PROTECTED] main: deleteCommits: now remove commit "segments_1"
[EMAIL PROTECTED] main:   DecRef "segments_1": pre-decr count is 1
[EMAIL PROTECTED] main: delete "segments_1"
[EMAIL PROTECTED] main: now checkpoint "segments_3" [1 segments ; isCommit = true]
[EMAIL PROTECTED] main:   IncRef "_0.cfs": pre-incr count is 0
[EMAIL PROTECTED] main: deleteCommits: now remove commit "segments_2"
[EMAIL PROTECTED] main:   DecRef "_0.fnm": pre-decr count is 1
[EMAIL PROTECTED] main: delete "_0.fnm"
[EMAIL PROTECTED] main:   DecRef "_0.frq": pre-decr count is 1
[EMAIL PROTECTED] main: delete "_0.frq"
[EMAIL PROTECTED] main:   DecRef "_0.prx": pre-decr count is 1
[EMAIL PROTECTED] main: delete "_0.prx"
[EMAIL PROTECTED] main:   DecRef "_0.tis": pre-decr count is 1
[EMAIL PROTECTED] main: delete "_0.tis"
[EMAIL PROTECTED] main:   DecRef "_0.tii": pre-decr count is 1
[EMAIL PROTECTED] main: delete "_0.tii"
[EMAIL PROTECTED] main:   DecRef "_0.nrm": pre-decr count is 1
[EMAIL PROTECTED] main: delete "_0.nrm"
[EMAIL PROTECTED] main:   DecRef "_0.fdx": pre-decr count is 1
[EMAIL PROTECTED] main: delete "_0.fdx"
[EMAIL PROTECTED] main:   DecRef "_0.fdt": pre-decr count is 1
[EMAIL PROTECTED] main: delete "_0.fdt"
[EMAIL PROTECTED] main:   DecRef "segments_2": pre-decr count is 1
[EMAIL PROTECTED] main: delete "segments_2"
Speller: added 3330000 words in 339 seconds = 5 minutes 39 seconds
Speller: added 3340000 words in 340 seconds = 5 minutes 40 seconds
Speller: added 3350000 words in 341 seconds = 5 minutes 41 seconds
---------- log end -----------------

So there is a gap of about 100 seconds, of which ~8 seconds are GC; the
rest is the flush. I am not saying this is a problem, just reporting the
info. The behavior along the run seems similar - that was the first
flush, after adding 3.3M docs (words). The next flush came after adding
6.5M docs, again ~100 secs, with similar GC/flush times. So I guess it
makes sense that one has to pay some time for flushing that large a
number of added docs. It is interesting that beyond a certain value
there is no point in allowing more RAM; the question is what the
recommended value would be... I sort of followed (all the way :-)) the
"let it have as much memory as possible" advice - I guess the best
recommendation is lower than that.
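
For concreteness, a minimal sketch of the writer setup for such a run,
with a more moderate buffer (the index path and the 48 MB value are just
illustrative guesses, not measured optima; setRAMBufferSizeMB is the
setter added by LUCENE-843, setUseCompoundFile keeps the runs "fair"):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class SpellerIndexer {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory("/path/to/index"),  // illustrative path
        new StandardAnalyzer(), true);
    writer.setUseCompoundFile(true);   // keep the runs comparable ("fair")
    writer.setRAMBufferSizeMB(48.0);   // flush by RAM usage, not doc count
    // ... addDocument() loop over the words ...
    writer.close();
  }
}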

>
> > Other than that, I will perhaps try to index .GOV2 (25 Million HTML
> > docs) with this patch. The way I indexed it before, it took about 4
> > days - running in 4 threads, and creating 36 indexes. This is even
> > more of a real-life scenario: it involves HTML parsing, standard
> > analysis, and merging (to some extent). Since there are 4 threads,
> > each one will get, say, 250MB. Again, for a "fair" comparison, I will
> > remain with compound.
>
> OK, because you're doing StandardAnalyzer and HTML parsing and
> presumably loading one-doc-per-file, most of your time is spent
> outside of Lucene indexing, so I'd expect less than a 50% speedup in
> this case.

The ~25M docs are in ~27K zip files, so the I/O side should be brighter...
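
If I do run it, the threading would look roughly like the sketch below -
4 threads, each feeding its own writer over its slice of the zip files.
This is simplified to one index per thread (rather than the 36 of the
original run), and the directory names and nextParsedDoc() are
hypothetical stand-ins for the HTML-parsing pipeline:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class Gov2Indexer {
  public static void main(String[] args) throws Exception {
    final int NUM_THREADS = 4;
    Thread[] threads = new Thread[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
      final int id = i;
      threads[i] = new Thread() {
        public void run() {
          try {
            IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("gov2-index-" + id),  // illustrative
                new StandardAnalyzer(), true);
            writer.setUseCompoundFile(true);    // "fair" vs. earlier runs
            writer.setRAMBufferSizeMB(250.0);   // ~250MB per thread, as above
            Document doc;
            while ((doc = nextParsedDoc(id)) != null) {
              writer.addDocument(doc);
            }
            writer.close();
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      };
      threads[i].start();
    }
    for (int i = 0; i < NUM_THREADS; i++) {
      threads[i].join();
    }
  }

  // Hypothetical stand-in: parses the next HTML doc from this thread's
  // share of the ~27K zip files, or returns null when exhausted.
  static Document nextParsedDoc(int threadId) { return null; }
}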

