[jira] Resolved: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-08-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-843.
---

   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

> improve how IndexWriter uses RAM to buffer added documents
> --
>
> Key: LUCENE-843
> URL: https://issues.apache.org/jira/browse/LUCENE-843
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.2
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: index.presharedstores.cfs.zip, 
> index.presharedstores.nocfs.zip, LUCENE-843.patch, LUCENE-843.take2.patch, 
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch, 
> LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch, 
> LUCENE-843.take9.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
> use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
> in-RAM merges.  Once RAM is full, flush buffers to disk (and
> merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
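
The flush-by-RAM idea described above can be sketched in plain Java. This is a toy illustration, not Lucene's actual DocumentsWriter accounting: the class name, the fixed 64-byte per-document overhead, and the 2-bytes-per-char estimate are all assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch: buffer documents and flush when estimated RAM usage crosses
// a byte threshold, rather than after a fixed number of documents.
public class RamBufferSketch {
    private final long triggerBytes;
    private long usedBytes = 0;
    private final List<String> buffered = new ArrayList<>();
    public int flushCount = 0;

    public RamBufferSketch(long triggerBytes) {
        this.triggerBytes = triggerBytes;
    }

    public void addDocument(String doc) {
        buffered.add(doc);
        // Rough per-doc cost: 2 bytes per char plus an assumed fixed overhead.
        usedBytes += 2L * doc.length() + 64;
        if (usedBytes >= triggerBytes) {
            flush();
        }
    }

    private void flush() {
        // A real implementation would write a segment here; the sketch just resets.
        buffered.clear();
        usedBytes = 0;
        flushCount++;
    }

    public static void main(String[] args) {
        RamBufferSketch w = new RamBufferSketch(10_000);
        for (int i = 0; i < 1000; i++) {
            w.addDocument("some short document text " + i);
        }
        System.out.println("flushes=" + w.flushCount);
    }
}
```

The point of the design is that uniform-size thresholds (maxBufferedDocs) under- or over-use RAM depending on document size, while a byte trigger adapts automatically.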

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-07-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512275
 ] 

Michael McCandless commented on LUCENE-843:
---

Woops ... you are right; thanks for catching it!  I will add a unit
test & fix it.  I will also make the flush(boolean triggerMerge,
boolean flushDocStores) protected, not public, and move the javadoc
back to the public flush().





[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-07-12 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512264
 ] 

Steven Parkes commented on LUCENE-843:
--

Did we lose the triggered merge stuff from LUCENE-887? I.e., should it be

if (triggerMerge) {
  /* new merge policy
  if (0 == docWriter.getMaxBufferedDocs())
    maybeMergeSegments(mergeFactor * numDocs / 2);
  else
    maybeMergeSegments(docWriter.getMaxBufferedDocs());
  */
  maybeMergeSegments(docWriter.getMaxBufferedDocs());
}
 




[jira] Reopened: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-07-06 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened LUCENE-843:
---

Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Re-opening this issue: I saw one failure of the contrib/benchmark
TestPerfTasksLogic.testParallelDocMaker() tests due to an intermittent
thread-safety issue.  It's hard to get the failure to happen (it's
happened only once in ~20 runs of contrib/benchmark) but I see where
the issue is.  Will commit a fix shortly.




[jira] Resolved: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-07-04 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-843.
---

   Resolution: Fixed
Fix Version/s: 2.3
Lucene Fields: [New, Patch Available]  (was: [New])




Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-25 Thread Doron Cohen
Michael McCandless wrote:

> OK, when you say "fair" I think you mean because you already had a
> previous run that used compound file, you had to use compound file in
> the run with the LUCENE-843 patch (etc)?

Yes, that's true.

> The recommendations above should speed up Lucene with or without my
patch.

Sure, makes sense.

> > When trying with  MEM=512MB, it at first seemed faster, but then there
> > were now and then local slow-downs, and eventually it became
> a bit slower
> > than the previous run. I know these are not merges, so they are either
> > flushes (RAM directed), or GC activity. I will perhaps run
> with GC debug
> > flags and perhaps add a print at flush so to tell the culprit for these
> > local slow-downs.
>
> Hurm, odd.  I haven't pushed RAM buffer up to 512 MB so it could be GC
> cost somehow makes things worse ... curious.

Ok I tried this with -verbose=gc and GC seems ok here -

Here is a log snippet for the RAM=512MB setting run:

-- log start -
Speller: added 329 words in 241 seconds = 4 minutes 1 seconds
Speller: added 330 words in 241 seconds = 4 minutes 1 seconds
Speller: added 331 words in 242 seconds = 4 minutes 2 seconds
Speller: added 332 words in 242 seconds = 4 minutes 2 seconds

  RAM: now flush @ usedMB=512.012 allocMB=512.012 triggerMB=512
  flush: flushDocs=true flushDeletes=false flushDocStores=true
numDocs=3328167

closeDocStore: 2 files to flush to segment _0

flush postings as segment _0 numDocs=3328167
  oldRAMSize=339488720 newFlushedSize=299712406 docs/MB=11,643.949
new/old=88.283%
[EMAIL PROTECTED] main: now checkpoint
"segments_2" [1 segments ; isCommit = true]
[EMAIL PROTECTED] main:   IncRef "_0.fnm":
pre-incr count is 0
[EMAIL PROTECTED] main:   IncRef "_0.frq":
pre-incr count is 0
[EMAIL PROTECTED] main:   IncRef "_0.prx":
pre-incr count is 0
[EMAIL PROTECTED] main:   IncRef "_0.tis":
pre-incr count is 0
[EMAIL PROTECTED] main:   IncRef "_0.tii":
pre-incr count is 0
[EMAIL PROTECTED] main:   IncRef "_0.nrm":
pre-incr count is 0
[EMAIL PROTECTED] main:   IncRef "_0.fdx":
pre-incr count is 0
[EMAIL PROTECTED] main:   IncRef "_0.fdt":
pre-incr count is 0
[EMAIL PROTECTED] main: deleteCommits: now
remove commit "segments_1"
[EMAIL PROTECTED] main:   DecRef
"segments_1": pre-decr count is 1
[EMAIL PROTECTED] main: delete "segments_1"
[EMAIL PROTECTED] main: now checkpoint
"segments_3" [1 segments ; isCommit = true]
[EMAIL PROTECTED] main:   IncRef "_0.cfs":
pre-incr count is 0
[EMAIL PROTECTED] main: deleteCommits: now
remove commit "segments_2"
[EMAIL PROTECTED] main:   DecRef "_0.fnm":
pre-decr count is 1
[EMAIL PROTECTED] main: delete "_0.fnm"
[EMAIL PROTECTED] main:   DecRef "_0.frq":
pre-decr count is 1
[EMAIL PROTECTED] main: delete "_0.frq"
[EMAIL PROTECTED] main:   DecRef "_0.prx":
pre-decr count is 1
[EMAIL PROTECTED] main: delete "_0.prx"
[EMAIL PROTECTED] main:   DecRef "_0.tis":
pre-decr count is 1
[EMAIL PROTECTED] main: delete "_0.tis"
[EMAIL PROTECTED] main:   DecRef "_0.tii":
pre-decr count is 1
[EMAIL PROTECTED] main: delete "_0.tii"
[EMAIL PROTECTED] main:   DecRef "_0.nrm":
pre-decr count is 1
[EMAIL PROTECTED] main: delete "_0.nrm"
[EMAIL PROTECTED] main:   DecRef "_0.fdx":
pre-decr count is 1
[EMAIL PROTECTED] main: delete "_0.fdx"
[EMAIL PROTECTED] main:   DecRef "_0.fdt":
pre-decr count is 1
[EMAIL PROTECTED] main: delete "_0.fdt"
[EMAIL PROTECTED] main:   DecRef
"segments_2": pre-decr count is 1
[EMAIL PROTECTED] main: delete "segments_2"
Speller: added 333 words in 339 seconds = 5 minutes 39 seconds
Speller: added 334 words in 340 seconds = 5 minutes 40 seconds
Speller: added 335 words in 341 seconds = 5 minutes 41 seconds
-- log end -

So there is about a 100-second gap, of which 8 seconds are GC; the rest is
the flush. I am not saying this is a problem, just reporting the info. The
behavior along the run seems similar - that was the first flush, after
adding 3.3M docs (words). The next flush was after adding 6.5M docs, ~100
secs again, with similar GC/flush times. So I guess it makes sense that one
has to pay some time for flushing that large a number of added docs. It is
interesting that beyond a certain value there's no point in allowing more
RAM; the question is what the recommended value would be... I sort of followed
(all the way :-)) the "let it have as much memory as possible" advice - I guess
the best recommendation should be lower than that.
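
As a side note, the docs/MB figure in the flush log above is reproducible from the other logged numbers: it is numDocs divided by the flushed segment size in megabytes. A quick check (not Lucene code; the values are copied from the log):

```java
// Reproduce the "docs/MB=11,643.949" line from the flush log:
// docs/MB uses the *flushed* segment size (newFlushedSize), not the
// in-RAM size (oldRAMSize).
public class DocsPerMb {
    static double docsPerMb(long numDocs, long flushedSizeBytes) {
        return numDocs / (flushedSizeBytes / (1024.0 * 1024.0));
    }

    public static void main(String[] args) {
        // numDocs and newFlushedSize as logged above
        double v = docsPerMb(3_328_167L, 299_712_406L);
        System.out.printf("docs/MB = %,.3f%n", v); // the log reports 11,643.949
    }
}
```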

>
> > Other than that, I will perhaps try to index .GOV2 (25
> Million HTML docs)
> > with this patch. The way I indexed it before it took about 4 days -
> > running in 4 threads, and creating 36 indexes. This is even more a real
> > life scenario, it involves HTML parsing, standard analysis, and merging
> > (to some extent). Since there are 4 threads each

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507716
 ] 

Michael McCandless commented on LUCENE-843:
---


> Just to clarify your comment on reusing field and doc instances - to my
> understanding reusing a field instance is ok *only* after the containing
> doc was added to the index.

Right, if your documents are very "regular" you should get a sizable
speedup (especially for tiny docs), with or without this patch, if you
make a single Document and add *separate* Field instances to it for
each field, and then reuse the Document and Field instances for all
the docs you want to add.

It's not easy to reuse Field instances now (there's no
setStringValue()).  I made a ReusableStringReader to do this but you
could also make your own class that implements Fieldable.
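
The resettable-reader idea can be sketched like this. Note this is an illustration of the approach, not the actual ReusableStringReader class from the patch:

```java
import java.io.Reader;

// A Reader whose backing String can be swapped in place, so a Field
// constructed once around it can feed a fresh value for every document.
public class ResettableStringReader extends Reader {
    private String s = "";
    private int pos;

    public void setValue(String s) {
        this.s = s;
        this.pos = 0;
    }

    @Override
    public int read(char[] cbuf, int off, int len) {
        if (pos >= s.length()) return -1; // end of the current value
        int n = Math.min(len, s.length() - pos);
        s.getChars(pos, pos + n, cbuf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() { }
}
```

A Field would then be created once wrapping this reader, and setValue(...) called before each addDocument, so no new Field (or per-doc Reader) is allocated.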

> For a "fair" comparison I ended up not following most of your
> recommendations, including the reuse field/docs one and the non-compound
> one (apologies:-)), but I might use them later.

OK, when you say "fair" I think you mean because you already had a
previous run that used compound file, you had to use compound file in
the run with the LUCENE-843 patch (etc)?  The recommendations above
should speed up Lucene with or without my patch.

> For the first 100,000,000 docs (==speller words) the speed-up is quite
> amazing:
> Orig: Speller: added 1 words in 58490 seconds = 16 hours 14
> minutes 50 seconds
> New:  Speller: added 1 words in 10912 seconds = 3 hours 1
> minutes 52 seconds
> This is 5.3 times faster !!!

Wow!  I think the speedup might be even more if both of your runs followed
the suggestions above.

> This btw was with maxBufDocs=100,000 (I forgot to set the MEM param).
> I stopped the run now, I don't expect to learn anything new by letting it
> continue.
>
> When trying with  MEM=512MB, it at first seemed faster, but then there
> were now and then local slow-downs, and eventually it became a bit slower
> than the previous run. I know these are not merges, so they are either
> flushes (RAM directed), or GC activity. I will perhaps run with GC debug
> flags and perhaps add a print at flush so to tell the culprit for these
> local slow-downs.

Hurm, odd.  I haven't pushed RAM buffer up to 512 MB so it could be GC
cost somehow makes things worse ... curious.

> Other than that, I will perhaps try to index .GOV2 (25 Million HTML docs)
> with this patch. The way I indexed it before it took about 4 days -
> running in 4 threads, and creating 36 indexes. This is even more a real
> life scenario, it involves HTML parsing, standard analysis, and merging
> (to some extent). Since there are 4 threads each one will get, say,
> 250MB. Again, for a "fair" comparison, I will remain with compound.

OK, because you're doing StandardAnalyzer and HTML parsing and
presumably loading one-doc-per-file, most of your time is spent
outside of Lucene indexing so I'd expect less than a 50% speedup in
this case.



[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-24 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507708
 ] 

Doron Cohen commented on LUCENE-843:


Just to clarify your comment on reusing field and doc instances - to my 
understanding reusing a field instance is ok *only* after the containing doc 
was added to the index.

For a "fair" comparison I ended up not following most of your recommendations, 
including the reuse field/docs one and the non-compound one (apologies:-)), but 
I might use them later. 

For the first 100,000,000 docs (==speller words) the speed-up is quite amazing:
Orig: Speller: added 1 words in 58490 seconds = 16 hours 14 
minutes 50 seconds
New:  Speller: added 1 words in 10912 seconds = 3 hours 1 
minutes 52 seconds
This is 5.3 times faster !!!

This btw was with maxBufDocs=100,000 (I forgot to set the MEM param). 
I stopped the run now, I don't expect to learn anything new by letting it 
continue.

When trying with  MEM=512MB, it at first seemed faster, but then there were now 
and then local slow-downs, and eventually it became a bit slower than the 
previous run. I know these are not merges, so they are either flushes (RAM 
directed), or GC activity. I will perhaps run with GC debug flags and perhaps 
add a print at flush so to tell the culprit for these local slow-downs.

Other than that, I will perhaps try to index .GOV2 (25 Million HTML docs) with 
this patch. The way I indexed it before it took about 4 days - running in 4 
threads, and creating 36 indexes. This is even more a real life scenario, it 
involves HTML parsing, standard analysis, and merging (to some extent). Since 
there are 4 threads each one will get, say, 250MB. Again, for a "fair" 
comparison, I will remain with compound.





[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-23 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507587
 ] 

Michael McCandless commented on LUCENE-843:
---


> I thought it would be interesting to see how the new code performs in this 
> scenario, what do you think?

Yes I'd be very interested to see the results of this.  It's a
somewhat "unusual" indexing situation (such tiny docs) but it's a real
world test case.  Thanks!

>  - what settings do you recommend?

I think these are likely the important ones in this case:

  * Flush by RAM instead of doc count
(writer.setRAMBufferSizeMB(...)).

  * Give it as much RAM as you can.

  * Use maybe 3 indexing threads (if you can).

  * Turn off compound file.

  * If you have stored fields/vectors (seems not in this case) use
autoCommit=false.

  * Use a trivial analyzer that doesn't create new String/new Token
(re-use the same Token, and use the char[] based term text
storage instead of the String one).

  * Re-use Document/Field instances.  The DocumentsWriter is fine with
this and it saves substantial time from GC especially because your
docs are so tiny (per-doc overhead is otherwise a killer).  In
IndexLineFiles I made a StringReader that lets me reset its String
value; this way I didn't have to change the Field instances stored
in the Document.
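
The "don't create new String/new Token" suggestion above can be sketched outside Lucene as a whitespace tokenizer that hands back every token through one reused char[] buffer. The class and method names are illustrative, not the actual analyzer from the benchmark code:

```java
// Toy whitespace tokenizer that allocates no per-token objects: each
// token is exposed via a single reused char[] plus a length, in the
// spirit of the char[]-based term text storage mentioned above.
// Assumption for the sketch: tokens are at most 256 chars.
public class ReusingWhitespaceTokenizer {
    private final char[] termBuffer = new char[256]; // reused for every token
    private int termLength;
    private String input = "";
    private int pos;

    public void reset(String input) {
        this.input = input;
        this.pos = 0;
    }

    /** Advances to the next token; returns false at end of input. */
    public boolean next() {
        while (pos < input.length() && input.charAt(pos) == ' ') pos++;
        if (pos >= input.length()) return false;
        termLength = 0;
        while (pos < input.length() && input.charAt(pos) != ' ') {
            termBuffer[termLength++] = input.charAt(pos++);
        }
        return true;
    }

    public char[] termBuffer() { return termBuffer; }
    public int termLength() { return termLength; }
}
```

For tiny documents like the speller words here, skipping the per-token String and Token allocations is exactly where the GC savings come from.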

>  - is there any chance for speed-up in optimize()?  I didn't read
>your new code yet, but at least from some comments here it seems
>that on disk merging was not changed... is this (still) so? I would

Correct: my patch doesn't touch merging and optimizing.  All it does
now is gather many docs in RAM and then flush a new segment when it's
time.  I've opened a separate issue (LUCENE-856) for optimizations
in segment merging.




[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-23 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507567
 ] 

Doron Cohen commented on LUCENE-843:


Mike, I am considering testing the performance of this patch on a somewhat 
different use case, real one I think. After indexing 25M docs of TREC .gov2 
(~500GB of docs) I pushed the index terms to create a spell correction index, 
by using the contrib spell checker. Docs here are *very* short - For each index 
term a document is created, containing some N-GRAMS. On the specific machine I 
used there are 2 CPUs but the SpellChecker indexing does not take advantage of 
that. Anyhow, 126,684,685 words==documents were indexed. 
For the docs adding step I had:
mergeFactor = 100,000
maxBufferedDocs = 10,000
So no merging took place.
This step took 21 hours, and created 12,685 segments, total size 15 - 20 GB. 
Then I optimized the index with
mergeFactor = 400
(Larger values were hard on the open files limits.)
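
A rough back-of-the-envelope for these numbers (not Lucene's actual merge logic): an optimize over S segments with merge factor M takes about ceil(log_M(S)) passes, and a non-compound merge holds roughly M times files-per-segment files open at once. The files-per-segment value below is an assumption for a non-compound index with norms and stored fields:

```java
// Estimate optimize() passes and peak open files for a given merge factor.
public class MergeEstimate {
    static int passes(int segments, int mergeFactor) {
        int p = 0;
        while (segments > 1) {
            // one pass merges every group of mergeFactor segments into one
            segments = (segments + mergeFactor - 1) / mergeFactor;
            p++;
        }
        return p;
    }

    public static void main(String[] args) {
        int segments = 12_685, mergeFactor = 400;
        int filesPerSegment = 8; // assumed: .fnm .frq .prx .tis .tii .nrm .fdx .fdt
        System.out.println("passes = " + passes(segments, mergeFactor));
        System.out.println("peak open files ~ " + mergeFactor * filesPerSegment);
    }
}
```

With 12,685 segments, mergeFactor=400 finishes in two passes but already implies thousands of simultaneously open files, which is why larger values ran into the open-files limit.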

I thought it would be interesting to see how the new code performs in this 
scenario, what do you think?

If you too find this comparison interesting, I have two more questions:
  - what settings do you recommend? 
  - is there any chance for speed-up in optimize()?  I didn't read your 
new code yet, but at least from some comments here it seems that 
on disk merging was not changed... is this (still) so? I would skip the 
optimize part if this is not of interest for the comparison. (In fact I am 
still waiting for my optimize() to complete, but if it is not of interest I 
will just interrupt it...)

Thanks,
Doron





Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-22 Thread Michael McCandless

Hi Grant,

The benchmarking code I've been using is in all but the first & last
patches I attached on LUCENE-843.  Really it's just a modified version
of the demo IndexFiles code, plus a new analyzer (SimpleSpaceAnalyzer)
that is the same as WhitespaceAnalyzer except it re-uses Token/String
instead of allocating a new one for each term.

But, I'd also like to port these into the benchmark contrib framework.
My plan is to make a new DocMaker that knows how to read documents
"line by line" from a previously created file, to not pay the IO cost
of opening a separate file per document, and then make a new class
(maybe a task?) that can read documents from a DocMaker and write a
single file with one document per line.

I just haven't quite gotten to this yet, but I will :)
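The line-per-file idea can be sketched as follows; `LineDocFile` and its methods are hypothetical names for illustration, not the actual benchmark-contrib classes:

```java
import java.io.*;
import java.util.*;

// Sketch: write many documents into one file, one per line, then stream
// them back without opening a separate file per document (hypothetical
// class, not the actual benchmark-contrib API).
public class LineDocFile {

    // One pass: each document's text becomes a single line.
    static void write(File file, List<String> docs) throws IOException {
        try (Writer w = new BufferedWriter(new FileWriter(file))) {
            for (String doc : docs) {
                w.write(doc.replace('\n', ' '));  // keep one doc per line
                w.write('\n');
            }
        }
    }

    // Streaming read: a single open file handle for the whole run.
    static List<String> readAll(File file) throws IOException {
        List<String> docs = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = r.readLine()) != null) {
                docs.add(line);
            }
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("linedocs", ".txt");
        f.deleteOnExit();
        write(f, Arrays.asList("first doc", "second\ndoc"));
        List<String> back = readAll(f);
        System.out.println(back.size());   // 2
        System.out.println(back.get(1));   // second doc
    }
}
```

This avoids the per-document open/close cost, so the benchmark measures indexing rather than file-system overhead.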

Mike

"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:
> Hi Michael,
> 
> I know you've got your hands full, but was wondering if you could  
> either post your benchmark code, or better yet, hook it into the  
> benchmarker contrib (it is quite easy).
> 
> Let me know if I can help,
> Grant
> 
> On Jun 21, 2007, at 10:01 AM, Michael McCandless (JIRA) wrote:
> 
> >
> > [ https://issues.apache.org/jira/browse/LUCENE-843? 
> > page=com.atlassian.jira.plugin.system.issuetabpanels:comment- 
> > tabpanel#action_12506907 ]
> >
> > Michael McCandless commented on LUCENE-843:
> > ---
> >
> > OK I ran tests comparing analyzer performance.
> >
> > It's the same test framework as above, using the ~5,500 byte Europarl
> > docs with autoCommit=true, 32 MB RAM buffer, no stored fields nor
> > vectors, and CFS=false, indexing 200,000 documents.
> >
> > The SimpleSpaceAnalyzer is my own whitespace analyzer that minimizes
> > GC cost by not allocating a Term or String for every token in every
> > document.
> >
> > Each run is best time of 2 runs:
> >
> >   ANALYZER             PATCH (sec)  TRUNK (sec)  SPEEDUP
> >   SimpleSpaceAnalyzer   79.0         326.5        4.1 X
> >   StandardAnalyzer     449.0         674.1        1.5 X
> >   WhitespaceAnalyzer   104.0         338.9        3.3 X
> >   SimpleAnalyzer       104.7         328.0        3.1 X
> >
> > StandardAnalyzer is definitely rather time consuming!

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-22 Thread Grant Ingersoll

Hi Michael,

I know you've got your hands full, but was wondering if you could  
either post your benchmark code, or better yet, hook it into the  
benchmarker contrib (it is quite easy).


Let me know if I can help,
Grant




--
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/






[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506974
 ] 

Michael McCandless commented on LUCENE-843:
---

> Do you think your code is easily extensible in this regard? I'm
> wondering because of all the optimizations you're doing, e.g.
> sharing byte arrays. But I'm certainly not familiar enough with your code
> yet, so I'm only guessing here.

Good question!

DocumentsWriter is definitely more complex than DocumentWriter, but it
doesn't prevent extensibility and I think will work very well when we
do flexible indexing.

The patch now has dedicated methods for writing into the freq/prox/etc
streams ('writeFreqByte', 'writeFreqVInt', 'writeProxByte',
'writeProxVInt', etc.), but, this could easily be changed to instead
use true IndexOutput streams.  This would then hide all details of
shared byte arrays from whoever is doing the writing.

The way I roughly see flexible indexing working in the future is
DocumentsWriter will be responsible for keeping track of unique terms
seen (in its hash table), holding the Posting instance (which could be
subclassed in the future) for each term, flushing a real segment when
full, handling shared byte arrays, etc.  Ie all the "infrastructure".

But then the specific logic of what bytes are written into which
streams (freq/prox/vectors/others) will be handled by a separate class
or classes that we can plug/unplug according to some "schema".
DocumentsWriter would call on these classes and provide the
IndexOutput's for all streams for the Posting, per position, and these
classes write their own format into the IndexOutputs.

I think a separation like that would work well: we could have good
performance and also extensibility.  Devil is in the details of
course...

I obviously haven't factored DocumentsWriter in this way (it has its
own addPosition that writes the current Lucene index format) but I
think this is very doable in the future.
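As a rough illustration of that separation, here is a minimal plain-Java sketch; `PostingFormat` and `VIntPostingFormat` are hypothetical names, and the encoding mirrors Lucene's 7-bits-per-byte VInt scheme:

```java
import java.io.ByteArrayOutputStream;

// Sketch of the separation described above: the infrastructure owns the
// buffers and hands opaque output streams to a pluggable format class
// that decides which bytes go into the freq/prox streams. The names
// here (PostingFormat, VIntPostingFormat) are hypothetical.
interface PostingFormat {
    // Called once per term occurrence; the format decides the encoding.
    void addPosition(ByteArrayOutputStream prox, int deltaPosition);
}

class VIntPostingFormat implements PostingFormat {
    // Lucene-style VInt: 7 data bits per byte, high bit = "more follows".
    public void addPosition(ByteArrayOutputStream prox, int deltaPosition) {
        int v = deltaPosition;
        while ((v & ~0x7F) != 0) {
            prox.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        prox.write(v);
    }
}

public class PostingFormatDemo {
    public static void main(String[] args) {
        ByteArrayOutputStream prox = new ByteArrayOutputStream();
        new VIntPostingFormat().addPosition(prox, 300);
        byte[] b = prox.toByteArray();
        // 300 = 0b1_0010_1100 -> bytes 0xAC, 0x02
        System.out.println(b.length);                         // 2
        System.out.printf("%02X %02X%n", b[0] & 0xFF, b[1] & 0xFF);  // AC 02
    }
}
```

Swapping in a different `PostingFormat` implementation would change the on-disk encoding without touching the shared-buffer infrastructure.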





[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-21 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506961
 ] 

Michael Busch commented on LUCENE-843:
--

> OK I ran tests comparing analyzer performance.

Thanks for the numbers Mike. Yes the gain is less with StandardAnalyzer
but 1.5X faster is still very good!


I have some question about the extensibility of your code. For flexible
indexing we want to be able in the future to implement different posting
formats and we might even want to allow our users to implement own 
posting formats.

When I implemented multi-level skipping I tried to keep this in mind. 
Therefore I put most of the functionality in the two abstract classes
MultiLevelSkipListReader/Writer. Subclasses implement the actual format
of the skip data. I think with this design it should be quite easy to
implement different formats in the future while limiting the code
complexity.
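The abstract-class pattern described here might be sketched like this; the names and the toy encoding are illustrative only, not the actual MultiLevelSkipListWriter API:

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

// Sketch of the pattern: the abstract class owns the skip-point
// bookkeeping, subclasses define only the format of one skip entry.
abstract class SkipListWriter {
    protected final int skipInterval;

    SkipListWriter(int skipInterval) { this.skipInterval = skipInterval; }

    // Subclass hook: how a single skip point is encoded.
    protected abstract void writeSkipData(int doc, ByteArrayOutputStream out);

    // Shared logic: pick every skipInterval-th doc as a skip point.
    byte[] buildSkipData(List<Integer> docs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = skipInterval - 1; i < docs.size(); i += skipInterval) {
            writeSkipData(docs.get(i), out);
        }
        return out.toByteArray();
    }
}

// One concrete format: single-byte doc ids (toy encoding).
class ByteSkipListWriter extends SkipListWriter {
    ByteSkipListWriter(int skipInterval) { super(skipInterval); }
    protected void writeSkipData(int doc, ByteArrayOutputStream out) {
        out.write(doc & 0xFF);
    }
}

public class SkipDemo {
    public static void main(String[] args) {
        SkipListWriter w = new ByteSkipListWriter(2);
        byte[] skips = w.buildSkipData(java.util.Arrays.asList(3, 7, 11, 19, 25));
        System.out.println(skips.length);               // 2 skip points
        System.out.println(skips[0] + " " + skips[1]);  // 7 19
    }
}
```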

With the old DocumentWriter I think this is quite simple to do too by
adding a class like PostingListWriter, where subclasses define the actual 
format (because DocumentWriter is so simple).

Do you think your code is easily extensible in this regard? I'm 
wondering because of all the optimizations you're doing, e.g.
sharing byte arrays. But I'm certainly not familiar enough with your code 
yet, so I'm only guessing here.





[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506907
 ] 

Michael McCandless commented on LUCENE-843:
---

OK I ran tests comparing analyzer performance.

It's the same test framework as above, using the ~5,500 byte Europarl
docs with autoCommit=true, 32 MB RAM buffer, no stored fields nor
vectors, and CFS=false, indexing 200,000 documents.

The SimpleSpaceAnalyzer is my own whitespace analyzer that minimizes
GC cost by not allocating a Term or String for every token in every
document.

Each run is best time of 2 runs:

  ANALYZER             PATCH (sec)  TRUNK (sec)  SPEEDUP
  SimpleSpaceAnalyzer   79.0         326.5        4.1 X
  StandardAnalyzer     449.0         674.1        1.5 X
  WhitespaceAnalyzer   104.0         338.9        3.3 X
  SimpleAnalyzer       104.7         328.0        3.1 X

StandardAnalyzer is definitely rather time consuming!





[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506811
 ] 

Michael McCandless commented on LUCENE-843:
---

> Does DocumentsWriter also solve the problem DocumentWriter had
> before LUCENE-880? I believe the answer is yes. Even though you
> close the TokenStreams in the finally clause of invertField() like
> DocumentWriter did before 880 this is safe, because addPosition()
> serializes the term strings and payload bytes into the posting hash
> table right away. Is that right?

That's right.  When I merged in the fix for LUCENE-880, I realized
with this patch it's fine to close the token stream immediately after
processing all of its tokens because everything about the token stream
has been "absorbed" into postings hash.
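A minimal stand-alone sketch of why close-in-finally is safe once every token has been copied into the postings (all classes below are tiny mock stand-ins, not Lucene's actual API):

```java
import java.util.*;

// Each token's term text is copied into the postings structure as soon
// as it is seen, so nothing references the stream after the loop and
// closing it in finally cannot lose data.
class MockTokenStream {
    private final Iterator<String> it;
    boolean closed = false;
    MockTokenStream(List<String> terms) { it = terms.iterator(); }
    String next() { return it.hasNext() ? it.next() : null; }
    void close() { closed = true; }
}

public class InvertFieldDemo {
    // Postings: term -> positions, filled eagerly per token.
    static Map<String, List<Integer>> invert(MockTokenStream ts) {
        Map<String, List<Integer>> postings = new HashMap<>();
        try {
            int pos = 0;
            String term;
            while ((term = ts.next()) != null) {
                // "absorbed" immediately: the postings own their own copy
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(pos++);
            }
        } finally {
            ts.close();  // safe: postings no longer need the stream
        }
        return postings;
    }

    public static void main(String[] args) {
        MockTokenStream ts = new MockTokenStream(Arrays.asList("a", "b", "a"));
        Map<String, List<Integer>> p = invert(ts);
        System.out.println(ts.closed);   // true
        System.out.println(p.get("a"));  // [0, 2]
    }
}
```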

> the benchmarks you run focus on measuring the pure indexing
> performance. I think it would be interesting to know how big the
> speedup is in real-life scenarios, i.e. with StandardAnalyzer and
> maybe even HTML parsing. For sure the speedup will be less, but it
> should still be a significant improvement. Did you run those kinds
> of benchmarks already?

Good question ... I haven't measured the performance cost of using
StandardAnalyzer or HTML parsing but I will test & post back.




[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-20 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506778
 ] 

Michael Busch commented on LUCENE-843:
--

Mike,

the benchmarks you run focus on measuring the pure indexing performance. I 
think it would be interesting to know how big the speedup is in real-life 
scenarios, i.e. with StandardAnalyzer and maybe even HTML parsing. For sure 
the speedup will be less, but it should still be a significant improvement. Did 
you run those kinds of benchmarks already?




[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-20 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506752
 ] 

Michael Busch commented on LUCENE-843:
--

Hi Mike,

my first comment on this patch is: Impressive!

It's also quite overwhelming at the beginning, but I'm trying to dig into it. 
I'll probably have more questions; here's the first one:

Does DocumentsWriter also solve the problem DocumentWriter had before 
LUCENE-880? I believe the answer is yes. Even though you close the TokenStreams 
in the finally clause of invertField() like DocumentWriter did before 880 this 
is safe, because addPosition() serializes the term strings and payload bytes 
into the posting hash table right away. Is that right?




[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506718
 ] 

Michael McCandless commented on LUCENE-843:
---

> Yeah, that was it.

Phew!

> I'll be delving more into the code as I try to figure out how it will
> dovetail with the merge policy factoring.

OK, thanks.  I am very eager to get some other eyeballs looking for
issues with this patch!

I *think* this patch and the merge policy refactoring should be fairly
separate.

With this patch, "flushing" RAM -> Lucene segment is no longer a
"mergeSegments" call which I think simplifies IndexWriter.  Previously
mergeSegments had lots of extra logic to tell if it was merging RAM
segments (= a flush) vs merging "real" segments but now it's simpler
because mergeSegments really only merges segments.





[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-20 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506609
 ] 

Steven Parkes commented on LUCENE-843:
--

Yeah, that was it.

I'll be delving more into the code as I try to figure out how it will dovetail 
with the merge policy factoring.




[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-843:
--

Attachment: index.presharedstores.nocfs.zip
index.presharedstores.cfs.zip

Oh, were the test failures only in TestBackwardsCompatibility?

Because I changed the index file format, I added 2 more ZIP files to
that unit test, but "svn diff" doesn't pick up new zip files.  So
I'm attaching them.  Can you drop these zip files into your
src/test/org/apache/lucene/index and test again?  Thanks.






[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-20 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506576
 ] 

Steven Parkes commented on LUCENE-843:
--

I've started looking at this, i.e. what it would take to merge it with the 
merge policy factoring (LUCENE-847). I noticed there are a couple of test failures?




[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-843:
--

Attachment: LUCENE-843.take9.patch

OK, I attached a new version (take9) of the patch that reverts to the
default of "flush after every 10 added documents" in IndexWriter.
This removes the dependency on LUCENE-845.

However, I still think we should later (once LUCENE-845 is done)
default IndexWriter to flush by RAM usage since this will generally
give the best "out of the box" performance.  I will open a separate
issue to change the default after this issue is resolved.





[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505418
 ] 

Michael McCandless commented on LUCENE-843:
---

> > When merging segments we don't merge the "doc store" files when all 
> > segments are sharing the same ones (big performance gain),
> 
> Is this only in the case where the segments have no deleted docs? 

Right.  Also, the segments must be contiguous, which the current merge
policy ensures but future merge policies may not.





[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-15 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505373
 ] 

Yonik Seeley commented on LUCENE-843:
-

> When merging segments we don't merge the "doc store" files when all segments 
> are sharing the same ones (big performance gain), 

Is this only in the case where the segments have no deleted docs?





[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-15 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-843:
--

Attachment: LUCENE-843.take8.patch

Attached latest patch.

I think this patch is ready to commit.  I will let it sit for a while
so people can review it.

We still need to do LUCENE-845 before it can be committed as is.

However, one option would be to commit this patch but leave
IndexWriter flushing by doc count by default, and then switch it
to flush by net RAM usage once LUCENE-845 is done.  I like this option
best.

All tests pass (I've re-enabled the disk full tests and fixed error
handling so they now pass) on Windows XP, Debian Linux and OS X.

Summary of the changes in this rev:

  * Finished cleaning up & commenting code

  * Exception handling: if there is a disk full or any other exception
while adding a document or flushing then the index is rolled back
to the last commit point.

  * Added more unit tests

  * Removed my profiling tool from the patch (not intended to be
committed)

  * Fixed a thread safety issue where, if you flush by doc count, you
    would sometimes get more docs at flush time than the count you
    requested.  I moved the thread synchronization for determining
    flush time down into DocumentsWriter.

  * Also fixed thread safety of calling flush with one thread while
other threads are still adding documents.

  * The biggest change is: absorbed all merging logic back into
IndexWriter.

Previously in DocumentsWriter I was tracking my own
flushed/partial segments and merging them on my own (but using
SegmentMerger).  This makes DocumentsWriter much simpler: now its
sole purpose is to gather added docs and write a new segment.

This turns out to be a big win:

  - Code is much simpler (no duplication of "merging"
policy/logic)

  - 21-25% additional performance gain for autoCommit=false case
when stored fields & vectors are used

  - IndexWriter.close() no longer takes an unexpectedly long time to
    close in the autoCommit=false case

However I had to make a change to the index format to do this.
The basic idea is to allow multiple segments to share access to
the "doc store" (stored fields, vectors) index files.

The change is quite simple: FieldsReader/VectorsReader are now
told the doc offset that they should start from when seeking in
the index stream (this info is stored in SegmentInfo).  When
merging segments we don't merge the "doc store" files when all
segments are sharing the same ones (big performance gain), else,
we make a private copy of the "doc store" files (ie as segments
normally are on the trunk today).

The change is fully backwards compatible (I added a test case to
the backwards compatibility unit test to be sure) and the change
is only used when autoCommit=false.

When autoCommit=false, the writer will append stored fields /
vectors to a single set of files even though it is flushing normal
segments whenever RAM is full.  These normal segments all refer to
the single shared set of "doc store" files.  Then when segments
are merged, the newly merged segment has its own "private" doc
stores again.  So the sharing only occurs for the "level 0"
    segments.

    I still need to update fileformats doc with this change.
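The sharing rule above (skip merging the doc-store files only when all segments use the same store and are contiguous) can be sketched as a simple check. The field names here (docStore, docStoreOffset) are invented for the example and need not match the patch's actual SegmentInfo fields.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the shared-doc-store rule described above.
class SegInfo {
    final String docStore;     // which set of stored-fields/vectors files
    final int docStoreOffset;  // first doc of this segment within that store
    final int docCount;

    SegInfo(String docStore, int docStoreOffset, int docCount) {
        this.docStore = docStore;
        this.docStoreOffset = docStoreOffset;
        this.docCount = docCount;
    }
}

class DocStoreMerge {
    /**
     * The doc-store files can be left un-merged only when every segment
     * points at the same store AND the segments are contiguous within it
     * (each one starts exactly where the previous one ends).
     */
    static boolean canShareDocStore(List<SegInfo> segs) {
        String store = segs.get(0).docStore;
        int next = segs.get(0).docStoreOffset;
        for (SegInfo s : segs) {
            if (!store.equals(s.docStore) || s.docStoreOffset != next) return false;
            next += s.docCount;
        }
        return true;
    }
}
```

When the check fails (e.g. a gap left by deleted docs, or segments from different stores), the merge falls back to making a private copy of the doc-store files, as segments normally have on trunk.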



[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502793
 ] 

Michael McCandless commented on LUCENE-843:
---

I ran a benchmark using more than 1 thread to do indexing, in order to
test & compare the concurrency of the trunk and the patch.  The test is
the same as above, and runs on a 4-core Mac Pro (OS X) box with a
4-drive RAID 0 IO system.

Here are the raw results:

DOCS = ~5,500 bytes plain text
RAM = 32 MB
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)

NUM THREADS = 1

new
  20 docs in 172.3 secs
  index size = 1.7G

old
  20 docs in 539.5 secs
  index size = 1.7G

Total Docs/sec: old   370.7; new  1161.0 [  213.2% faster]
Docs/MB @ flush:        old    47.9; new   334.6 [  598.7% more]
Avg RAM used (MB) @ flush:  old   131.9; new    33.1 [   74.9% less]


NUM THREADS = 2

new
  21 docs in 130.8 secs
  index size = 1.7G

old
  21 docs in 452.8 secs
  index size = 1.7G

Total Docs/sec: old   441.7; new  1529.3 [  246.2% faster]
Docs/MB @ flush:        old    47.9; new   301.5 [  529.7% more]
Avg RAM used (MB) @ flush:  old   226.1; new    35.2 [   84.4% less]


NUM THREADS = 3

new
  22 docs in 105.4 secs
  index size = 1.7G

old
  22 docs in 428.4 secs
  index size = 1.7G

Total Docs/sec: old   466.8; new  1897.9 [  306.6% faster]
Docs/MB @ flush:        old    47.9; new   277.8 [  480.2% more]
Avg RAM used (MB) @ flush:  old   289.8; new    37.0 [   87.2% less]


NUM THREADS = 4

new
  23 docs in 104.8 secs
  index size = 1.7G

old
  23 docs in 440.4 secs
  index size = 1.7G

Total Docs/sec: old   454.1; new  1908.5 [  320.3% faster]
Docs/MB @ flush:        old    47.9; new   259.9 [  442.9% more]
Avg RAM used (MB) @ flush:  old   293.7; new    37.1 [   87.3% less]


NUM THREADS = 5

new
  24 docs in 99.5 secs
  index size = 1.7G

old
  24 docs in 425.0 secs
  index size = 1.7G

Total Docs/sec: old   470.6; new  2010.5 [  327.2% faster]
Docs/MB @ flush:        old    47.9; new   245.3 [  412.6% more]
Avg RAM used (MB) @ flush:  old   390.9; new    38.3 [   90.2% less]


NUM THREADS = 6

new
  25 docs in 106.3 secs
  index size = 1.7G

old
  25 docs in 427.1 secs
  index size = 1.7G

Total Docs/sec: old   468.2; new  1882.3 [  302.0% faster]
Docs/MB @ flush:        old    47.8; new   248.5 [  419.3% more]
Avg RAM used (MB) @ flush:  old   340.9; new    38.7 [   88.6% less]


NUM THREADS = 7

new
  26 docs in 106.1 secs
  index size = 1.7G

old
  26 docs in 435.2 secs
  index size = 1.7G

Total Docs/sec: old   459.6; new  1885.3 [  310.2% faster]
Docs/MB @ flush:        old    47.8; new   248.7 [  420.0% more]
Avg RAM used (MB) @ flush:  old   408.6; new    39.1 [   90.4% less]


NUM THREADS = 8

new
  27 docs in 109.0 secs
  index size = 1.7G

old
  27 docs in 469.2 secs
  index size = 1.7G

Total Docs/sec: old   426.3; new  1835.2 [  330.5% faster]
Docs/MB @ flush:        old    47.8; new   251.3 [  425.5% more]
Avg RAM used (MB) @ flush:  old   448.9; new    39.0 [   91.3% less]



Some quick comments:

  * Both trunk & the patch show speedups if you use more than 1 thread
to do indexing.  This is expected since the machine has concurrency. 

  * The biggest speedup is from 1->2 threads, but there are still good
gains from 2->5 threads.

  * Best seems to be 5 threads.

  * The patch allows better concurrency: relatively speaking, it speeds
up more than the trunk as you add threads (the % faster increases
with the thread count).  I think this makes sense because we flush
less often with the patch, and flushing is time consuming and
single-threaded.
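The summary percentages in the tables follow directly from the raw rates; shown here for the 1-thread run:

```java
// How the summary percentages in the tables are derived from the raw
// numbers (1-thread case): "% faster" compares throughputs, "% less"
// compares average RAM at flush.
public class SpeedupMath {
    static double pctFaster(double oldRate, double newRate) {
        return (newRate / oldRate - 1.0) * 100.0;
    }

    static double pctLess(double oldVal, double newVal) {
        return (1.0 - newVal / oldVal) * 100.0;
    }

    public static void main(String[] args) {
        System.out.printf("%.1f%% faster%n", pctFaster(370.7, 1161.0)); // 213.2% faster
        System.out.printf("%.1f%% less%n", pctLess(131.9, 33.1));       // 74.9% less
    }
}
```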



[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-06-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-843:
--

Attachment: LUCENE-843.take7.patch

Latest working patch attached.

I've cut over to using Lucene's normal segment merging for all merging
(ie, I no longer use a different merge-efficient format for segments
when autoCommit=false); this has substantially simplified the code.

All unit tests pass except the disk-full test and certain contrib tests
(gdata-server, lucli, similarity, wordnet) whose failures I don't think
I'm causing.

Other changes:

  * Consolidated flushing of a new segment back into IndexWriter
(previously DocumentsWriter would do its own flushing when
autoCommit=false).

I would also like to consolidate merging entirely into
IndexWriter; right now DocumentsWriter does its own merging of the
flushed segments when autoCommit=false (this is because those
segments are "partial" meaning they do not have their own stored
fields or term vectors).  I'm trying to find a clean way to do
this...

  * Thread concurrency now works: each thread writes into a separate
    Postings hash, up to a limit (currently 5) at which point the
    threads share the Postings hashes, and then when flushing the
    segment I merge the docIDs together.  I flush when the total RAM
    used across threads is over the limit.  I ran a test comparing
    thread concurrency on the current trunk vs this patch, which I'll
    post next.

  * Reduced the bytes used per unique term to below what current
    Lucene uses.  This means the worst-case document (many terms, all
    of which are unique) should use less RAM overall than Lucene trunk
    does.

  * Added some new unit test cases; added missing "writer.close()" to
one of the contrib tests.

  * Cleanup, comments, etc.  I think the code is getting more
"approachable" now.





[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-05-21 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-843:
--

Attachment: LUCENE-843.take6.patch

Attached latest patch.

I'm now working towards simplifying & cleaning up the code & design:
I eliminated dead code left over from the previous iterations, used the
existing RAMFile instead of my own new class, refactored
duplicate/confusing code, added comments, etc.  It's getting closer to
a committable state but still has a ways to go.

I also renamed the class from MultiDocumentWriter to DocumentsWriter.

To summarize the current design:

  1. Write stored fields & term vectors to files in the Directory
 immediately (don't buffer these in RAM).

  2. Write freq & prox postings to RAM directly as a byte stream
     instead of a first pass as int[] and then a second pass as a byte
     stream.  This single pass instead of double pass is a big
     savings.  I use slices into shared byte[] arrays to efficiently
     allocate bytes to the postings that need them.

  3. Build a Postings hash that holds the Postings for many documents
     at once instead of a single doc, keyed by unique term.  Not
     tearing down & rebuilding the Postings hash with every doc saves
     a lot of time.  Also, when term vectors are off this avoids a
     quicksort for every doc, which gives a very good performance gain.

 When the Postings hash is full (used up the allowed RAM usage) I
 then create a real Lucene segment when autoCommit=true, else a
 "partial segment".

  4. Use my own "partial segment" format that differs from Lucene's
     normal segments in that it is optimized for merging (and unusable
     for searching).  This format, and the merger I created to work
     with it, performs merging mostly by copying blocks of bytes
     instead of reinterpreting every vInt in each Postings list.
     These partial segments are only created when IndexWriter has
     autoCommit=false, and then on commit they are merged into the
     real Lucene segment format.

  5. Reuse the Posting, PostingVector, char[] and byte[] objects that
 are used by the Postings hash.
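Point 2's "slices into shared byte[] arrays" can be illustrated with a minimal bump allocator. The real patch chains variable-size slices per posting, which this sketch omits; the class and method names are invented for the example.

```java
// Minimal illustration of writing postings into slices of one shared
// byte[] pool instead of per-term arrays: a bump allocator hands out
// regions of the pool, so per-posting allocation is nearly free and the
// whole pool can be recycled between flushes.
public class BytePoolSketch {
    private final byte[] pool;
    private int upto;  // next free byte in the pool

    public BytePoolSketch(int size) {
        pool = new byte[size];
    }

    /** Reserve a slice of the pool; returns its start offset. */
    public int allocSlice(int len) {
        if (upto + len > pool.length) {
            throw new IllegalStateException("pool full: flush first");
        }
        int start = upto;
        upto += len;
        return start;
    }

    /** Write a vInt (variable-length int) at offset; returns bytes written. */
    public int writeVInt(int offset, int value) {
        int i = offset;
        while ((value & ~0x7F) != 0) {
            pool[i++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        pool[i++] = (byte) value;
        return i - offset;
    }

    /** Recycle the whole pool for the next batch of documents (point 5). */
    public void reset() {
        upto = 0;
    }
}
```

Recycling the pool with reset() rather than discarding it is what keeps GC pressure low across flushes.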

I plan to keep simplifying the design & implementation.  Specifically,
I'm going to test removing #4 above entirely (using my own "partial
segment" format that's optimized for merging not searching).

While doing this may give back some of the performance gains, that
code is the source of much added complexity in the patch, and, it
duplicates the current SegmentMerger code.  It was more necessary
before (when we would merge thousands of single-doc segments in
memory) but now that each segment contains many docs I think we are no
longer gaining as much performance from it.

I plan instead to write all segments in the "real" Lucene segment
format and use the current SegmentMerger, possibly w/ some small
changes, to do the merges even when autoCommit=false.  Since we have
another issue (LUCENE-856) to optimize segment merging I can carry
over any optimizations that we may want to keep into that issue.  If
this doesn't lose much performance it will make the approach here even
simpler.



[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492748
 ] 

Michael McCandless commented on LUCENE-843:
---

> How does this work with pending deletes?
> I assume that if autocommit is false, then you need to wait until the end 
> when you get a real lucene segment to delete the pending terms?

Yes, all of this sits "below" the pending deletes layer since this
change writes a single segment either when RAM is full
(autoCommit=true) or when writer is closed (autoCommit=false).  Then
the deletes get applied like normal (I haven't changed that part).

> Also, how has the merge policy (or index invariants) of lucene segments 
> changed?
> If autocommit is off, then you wait until the end to create a big lucene 
> segment.  This new segment may be much larger than segments to its "left".
> I suppose the idea of merging rightmost segments should just be dropped in 
> favor of merging the smallest adjacent segments?  Sorry if this has already 
> been covered... as I said, I'm trying to follow along at a high level.

Has not been covered, and as usual these are excellent questions
Yonik!

I haven't yet changed anything about merge policy, but you're right
the current invariants won't hold anymore.  In fact they already don't
hold if you "flush by RAM" now (APIs are exposed in 2.1 to let you do
this).  So we need to do something.

I like your idea to relax merge policy (& invariants) to allow
"merging of any adjacent segments" (not just rightmost ones) and then
make the policy merge the smallest ones / most similarly sized ones,
measuring size by net # bytes in the segment.  This would preserve the
"docID monotonicity invariance".

If we take that approach then it would automatically resolve
LUCENE-845 as well (which would otherwise block this issue).
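As a sketch of that relaxed policy (hypothetical class and method names, not Lucene's actual MergePolicy API): with the rightmost-only restriction dropped, the writer can scan all adjacent pairs and merge whichever pair has the smallest combined size in bytes; because only adjacent segments are merged, docID order stays monotonic.

```java
// Sketch only: selects the cheapest adjacent pair of segments to merge,
// measuring size by net bytes, as discussed above.
public class AdjacentMergeSketch {
    /** Returns i such that merging segments i and i+1 is cheapest,
     *  or -1 if there are fewer than two segments. */
    public static int pickSmallestAdjacentPair(long[] segmentSizeBytes) {
        int best = -1;
        long bestSize = Long.MAX_VALUE;
        for (int i = 0; i + 1 < segmentSizeBytes.length; i++) {
            long combined = segmentSizeBytes[i] + segmentSizeBytes[i + 1];
            if (combined < bestSize) {
                bestSize = combined;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // A huge flushed segment on the "left" no longer blocks merging
        // the small ones to its right:
        long[] sizes = {1_000_000L, 10_000L, 12_000L, 900_000L};
        System.out.println(pickSmallestAdjacentPair(sizes)); // prints 1
    }
}
```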



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-30 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492674
 ] 

Yonik Seeley commented on LUCENE-843:
-

How does this work with pending deletes?
I assume that if autocommit is false, then you need to wait until the end when 
you get a real lucene segment to delete the pending terms?

Also, how has the merge policy (or index invariants) of lucene segments changed?
If autocommit is off, then you wait until the end to create a big lucene 
segment.  This new segment may be much larger than segments to its "left".  I
suppose the idea of merging rightmost segments should just be dropped in favor 
of merging the smallest adjacent segments?  Sorry if this has already been 
covered... as I said, I'm trying to follow along at a high level.





[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-30 Thread Michael McCandless (JIRA)
Docs/MB @ flush:            old    25.0; new    47.5 [   89.7% more]
Avg RAM used (MB) @ flush:  old   111.1; new    42.5 [   61.7% less]



With term vectors (positions + offsets) and 2 small stored fields

  AUTOCOMMIT = true (commit whenever RAM is full)

old
  20000 docs in 323.1 secs
  index size = 1.4G

new
  20000 docs in 183.9 secs
  index size = 1.4G

Total Docs/sec:             old    61.9; new   108.7 [   75.7% faster]
Docs/MB @ flush:            old    10.4; new    46.8 [  348.3% more]
Avg RAM used (MB) @ flush:  old    74.2; new    44.9 [   39.5% less]


  AUTOCOMMIT = false (commit only once at the end)

old
  20000 docs in 323.5 secs
  index size = 1.4G

new
  20000 docs in 135.6 secs
  index size = 1.4G

Total Docs/sec:             old    61.8; new   147.5 [  138.5% faster]
Docs/MB @ flush:            old    10.4; new    47.5 [  354.8% more]
Avg RAM used (MB) @ flush:  old    74.3; new    42.9 [   42.2% less]






Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-30 Thread Yonik Seeley

On 4/30/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:

After discussion on java-dev last time, I decided to retry the
"persistent hash" approach, where the Postings hash lasts across many
docs and then a single flush produces a partial segment containing all
of those docs.  This is in contrast to the previous approach where
each doc makes its own segment and then they are merged.

It turns out this is even faster than my previous approach,


Go, Mike, go!


With this new approach, as I process each term in the document I
immediately write the prox/freq in their compact (vints) format into
shared byte[] buffers, rather than accumulating int[] arrays that then
need to be re-processed into the vint encoding.  This speeds things up
because we don't double-process the postings.


Good idea!


 It also uses less
per-document RAM overhead because intermediate postings are stored as
vints not as ints.


I'm just trying to follow along at a high level...how do you handle
intermediate termdocs?

-Yonik




[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492658
 ] 

Michael McCandless commented on LUCENE-843:
---


> How are you writing the frq data in compressed format? That works fine for
> prx data, because the deltas are all within a single doc -- but for the freq
> data, the deltas are tied up in doc num deltas, so you have to decompress it
> when performing merges.

For each Posting I keep track of the last docID that its term occurred
in; when this differs from the current docID I record the "delta code"
that needs to be written and then I later write it with the final freq
for this document.
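The bookkeeping described here can be sketched as follows (hypothetical helper, but the shift-and-or layout matches the freq-file encoding discussed in this thread): when a term next occurs in a later document, the previous document's entry is emitted as the doc delta shifted left one bit, with the low bit flagging freq == 1 so that the freq can be omitted in the common case.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the per-document freq entry: docCode = delta << 1, with the
// low bit set when freq == 1 (freq omitted), clear when freq follows.
public class FreqEncodingSketch {
    public static List<Integer> encodeDoc(int lastDocID, int docID, int freq) {
        List<Integer> out = new ArrayList<>();
        int docCode = (docID - lastDocID) << 1;
        if (freq == 1) {
            out.add(docCode | 1);   // low bit set: freq is implicitly 1
        } else {
            out.add(docCode);       // low bit clear: freq written next
            out.add(freq);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encodeDoc(0, 5, 1));  // prints [11]  (5<<1 | 1)
        System.out.println(encodeDoc(5, 7, 3));  // prints [4, 3] (2<<1, then freq)
    }
}
```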

> * I haven't been able to come up with a file format tweak that
>   gets around this doc-num-delta-decompression problem to enhance the speed
>   of frq data merging. I toyed with splitting off the freq from the
>   doc_delta, at the price of increasing the file size in the common case of
>   freq == 1, but went back to the old design. It's not worth the size
>   increase for what's at best a minor indexing speedup.

I'm just doing the "stitching" approach here: it's only the very first
docCode (& freq when freq==1) that must be re-encoded on merging.  The
one catch is you must store the last docID of the previous segment so
you can compute the new delta at the boundary.  Then I do a raw
"copyBytes" for the remainder of the freq postings.
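A sketch of that stitching step, under the assumption that the appended segment's first docCode is a delta from docID 0 within its own segment (the names are illustrative, not the patch's actual code): only this one code is decoded and re-encoded against the previous segment's last docID; everything after it can be copied verbatim with copyBytes.

```java
// Sketch only: re-encode the boundary docCode when appending one segment's
// freq posting after another's, as described above.
public class StitchSketch {
    /** firstDocCode: first docCode of the appended segment (delta from
     *  docID 0 within that segment, low bit = freq==1).
     *  docBase: number of docs in all earlier segments of the merge.
     *  prevLastDocID: last (remapped) docID of this term in the previous
     *  segment -- the value that must be stored to compute the new delta. */
    public static int restitchFirstDocCode(int firstDocCode, int docBase,
                                           int prevLastDocID) {
        int freqIsOne = firstDocCode & 1;        // preserve the freq==1 flag bit
        int docIDInSegment = firstDocCode >>> 1; // decode just this one delta
        int newDelta = (docIDInSegment + docBase) - prevLastDocID;
        return (newDelta << 1) | freqIsOne;      // re-encode at the boundary
    }

    public static void main(String[] args) {
        // Term last seen in doc 98; appended segment starts at docBase 100,
        // and its first posting is doc 3 (within-segment) with freq == 1:
        int oldCode = (3 << 1) | 1;                                  // 7
        System.out.println(restitchFirstDocCode(oldCode, 100, 98));  // prints 11
    }
}
```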

Note that I'm only doing this for the "internal" merges (of partial
RAM segments and flushed partial segments) I do before creating a real
Lucene segment.  I haven't changed how the "normal" Lucene segment
merging works (though I think we should look into it -- I opened a
separate issue): it still re-interprets and then re-encodes all
docID/freq's.

> * I've added a custom MemoryPool class to KS which grabs memory in 1 meg
>   chunks, allows resizing (downwards) of only the last allocation, and can
>   only release everything at once. From one of these pools, I'm allocating
>   RawPosting objects, each of which is a doc_num, a freq, the term_text, and
>   the pre-packed prx data (which varies based on which Posting subclass
>   created the RawPosting object). I haven't got things 100% stable yet, but
>   preliminary results seem to indicate that this technique, which is a riff
>   on your persistent arrays, improves indexing speed by about 15%.

Fabulous!!

I think it's the custom memory management I'm doing with slices into
shared byte[] arrays for the postings that made the persistent hash
approach work well, this time around (when I had previously tried this
it was slower).
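A minimal sketch of that memory management (illustrative, not the patch's actual classes): small postings slices are carved out of one shared byte[], so per-term writes allocate no per-object garbage, and the whole block is recycled in one step at flush.

```java
// Sketch only: slices are handed out from one shared byte[] and the whole
// block is reused at flush, avoiding GC pressure from per-posting objects.
public class ByteBlockSketch {
    private final byte[] block;
    private int upto;                 // next free offset in the shared block

    public ByteBlockSketch(int size) { block = new byte[size]; }

    /** Reserve a slice of the shared block; returns its start offset. */
    public int allocSlice(int size) {
        if (upto + size > block.length)
            throw new IllegalStateException("block full: time to flush");
        int start = upto;
        upto += size;
        return start;
    }

    /** Recycle the whole block at flush time instead of freeing pieces. */
    public void reset() { upto = 0; }

    public int bytesUsed() { return upto; }

    public static void main(String[] args) {
        ByteBlockSketch pool = new ByteBlockSketch(32 * 1024);
        int a = pool.allocSlice(5);
        int b = pool.allocSlice(14);
        System.out.println(a + " " + b + " " + pool.bytesUsed()); // prints 0 5 19
        pool.reset();                                             // reuse after flush
    }
}
```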





[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-30 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492655
 ] 

Marvin Humphrey commented on LUCENE-843:


How are you writing the frq data in compressed format?  That works fine for
prx data, because the deltas are all within a single doc -- but for the freq
data, the deltas are tied up in doc num deltas, so you have to decompress it
when performing merges.

To continue our discussion from java-dev... 

 * I haven't been able to come up with a file format tweak that 
   gets around this doc-num-delta-decompression problem to enhance the speed
   of frq data merging. I toyed with splitting off the freq from the
   doc_delta, at the price of increasing the file size in the common case of
   freq == 1, but went back to the old design.  It's not worth the size
   increase for what's at best a minor indexing speedup.
 * I've added a custom MemoryPool class to KS which grabs memory in 1 meg
   chunks, allows resizing (downwards) of only the last allocation, and can
   only release everything at once.  From one of these pools, I'm allocating
   RawPosting objects, each of which is a doc_num, a freq, the term_text, and
   the pre-packed prx data (which varies based on which Posting subclass
   created the RawPosting object).  I haven't got things 100% stable yet, but
   preliminary results seem to indicate that this technique, which is a riff
   on your persistent arrays, improves indexing speed by about 15%.




[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-30 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-843:
--

Attachment: LUCENE-843.take5.patch

I attached a new iteration of the patch.  It's quite different from
the last patch.

After discussion on java-dev last time, I decided to retry the
"persistent hash" approach, where the Postings hash lasts across many
docs and then a single flush produces a partial segment containing all
of those docs.  This is in contrast to the previous approach where
each doc makes its own segment and then they are merged.

It turns out this is even faster than my previous approach, especially
for smaller docs and especially when term vectors are off (because no
quicksort() is needed until the segment is flushed).  I will attach
new benchmark results.

Other changes:

  * Changed my benchmarking tool / testing (IndexLineFiles):

- I turned off compound file (to reduce time NOT spent on
  indexing).

- I noticed I was not downcasing the terms, so I fixed that

- I now do my own line processing to reduce GC cost of
  "BufferedReader.readLine" (to reduce time NOT spent on
  indexing).

  * Norms now properly flush to disk in the autoCommit=false case

  * All unit tests pass except disk full

  * I turned on asserts for unit tests (jvm arg -ea added to junit ant
task).  I think we should use asserts when running tests.  I have
quite a few asserts now.

With this new approach, as I process each term in the document I
immediately write the prox/freq in their compact (vints) format into
shared byte[] buffers, rather than accumulating int[] arrays that then
need to be re-processed into the vint encoding.  This speeds things up
because we don't double-process the postings.  It also uses less
per-document RAM overhead because intermediate postings are stored as
vints not as ints.
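The shared-buffer write can be sketched with a standard VInt encoder (7 data bits per byte, high bit marking continuation); the class name and fixed buffer are illustrative:

```java
// Sketch only: append a value directly into a shared byte[] in its compact
// VInt form, instead of buffering int[] arrays for later re-encoding.
public class VIntSketch {
    /** Write v at buf[offset] as a VInt; returns the new offset. */
    public static int writeVInt(byte[] buf, int offset, int v) {
        while ((v & ~0x7F) != 0) {              // more than 7 bits remain
            buf[offset++] = (byte) ((v & 0x7F) | 0x80); // continuation bit set
            v >>>= 7;
        }
        buf[offset++] = (byte) v;               // final byte, high bit clear
        return offset;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[16];
        int end = writeVInt(buf, 0, 300);  // 300 fits in 2 bytes as a VInt
        System.out.println(end);           // prints 2 (vs 4 for a raw int)
    }
}
```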

When enough RAM is used by the Posting entries plus the byte[]
buffers, I flush them to a partial RAM segment.  When enough of these
RAM segments have accumulated I flush to a real Lucene segment
(autoCommit=true) or to on-disk partial segments (autoCommit=false)
which are then merged in the end to create a real Lucene segment.





Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-07 Thread Otis Gospodnetic
Mike - thanks for explanation, it makes perfect sense!

Otis

- Original Message 
From: Michael McCandless <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 8:03:44 PM
Subject: Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to 
  buffer added documents


Hi Otis!

"Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:
> You talk about a RAM buffer from 1MB - 96MB, but then you have the amount
> of RAM @ flush time (e.g. Avg RAM used (MB) @ flush:  old 34.5; new 3.4
> [90.1% less]).
> 
> I don't follow 100% of what you are doing in LUCENE-843, so could you
> please explain what these 2 different amounts of RAM are?
> Is the first (1-96) the RAM you use for in-memory merging of segments?
> What is the RAM used @ flush?  More precisely, why is it that that amount
> of RAM exceeds the RAM buffer?

Very good questions!

When I say "the RAM buffer size is set to 96 MB", what I mean is I
flush the writer when the in-memory segments are using 96 MB RAM.  On
trunk, I just call ramSizeInBytes().  I do the analogous thing with my
patch (sum up size of RAM buffers used by segments).  I call this part
of the RAM usage the "indexed documents RAM".  With every added
document, this grows.

But: this does not account for all data structures (Posting instances,
HashMap, FieldsWriter, TermVectorsWriter, int[] arrays, etc.) used,
but not saved away, during the indexing of a single document.  All the
"things" used temporarily while indexing a document take up RAM too.
I call this part of the RAM usage the "document processing RAM".  This
RAM does not grow with every added document, though its size is in
proportion to how big each document is.  This memory is always
re-used (does not grow with time).  But with the trunk, this is done
by creating garbage, whereas with my patch, I explicitly reuse it.

When I measure "amount of RAM @ flush time", I'm calling
MemoryMXBean.getHeapMemoryUsage().getUsed().  So, this measures actual
process memory usage which should be (for my tests) around the sum of
the above two types of RAM usage.
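The probe behind these measurements is the standard java.lang.management API; a minimal version:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Measures actual heap in use, as opposed to the writer's own
// ramSizeInBytes() accounting.
public class HeapProbe {
    public static long heapUsedBytes() {
        MemoryMXBean mx = ManagementFactory.getMemoryMXBean();
        return mx.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        System.out.println("heap used: "
            + (heapUsedBytes() / (1024 * 1024)) + " MB");
    }
}
```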

With the trunk, the actual process memory usage tends to be quite a
bit higher than the RAM buffer size and also tends to be very "noisy"
(jumps around with each flush).  I think this is because of
delays/unpredictability on when GC kicks in to reclaim the garbage
created during indexing of the doc.  Whereas with my patch, it's
usually quite a bit closer to the "indexed documents RAM" and does not
jump around nearly as much.

So the "actual process RAM used" will always exceed my "RAM buffer
size".  The amount of excess is a measure of the "overhead" required
to process the document.  The trunk has far worse overhead than with
my patch, which I think means a given application will be able to use
a *larger* RAM buffer size with LUCENE-843.

Does that make sense?

Mike




Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-06 Thread Michael McCandless
"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
>
> On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:
> 
> >> What we need to do is cut down on decompression and conflict
> >> resolution costs when reading from one segment to another.  KS has
> >> solved this problem for stored fields.  Field defs are global and
> >> field values are keyed by name rather than field number in the field
> >> data file.  Benefits:
> >>
> >>* Whole documents can be read from one segment to
> >>  another as blobs.
> >>* No flags byte.
> >>* No remapping of field numbers.
> >>* No conflict resolution at all.
> >>* Compressed, uncompressed... doesn't matter.
> >>* Less code.
> >>* The possibility of allowing the user to provide their
> >>  own subclass for reading and writing fields. (For
> >>  Lucy, in the language of your choice.)
> >
> > I hear you, and I really really love those benefits, but, we just
> > don't have this freedom with Lucene.
> 
> Yeah, too bad.  This is one area where Lucene and Lucy are going to  
> differ.  Balmain and I are of one mind about global field defs.
> 
> > I think the ability to suddenly birth a new field,
> 
> You can do that in KS as of version 0.20_02.  :)

Excellent!

> > or change a field's attributes like "has vectors", "stores norms",
> > etc., with a new document,
> 
> Can't do that, though, and I make no apologies.  I think it's a  
> misfeature.

Alas, I don't think we (Lucene) can change this now.

> > I suppose if we had a
> > single mapping of field names -> numbers in the index, that would gain
> > us many of the above benefits?  Hmmm.
> 
> You'll still have to be able to remap field numbers when adding  
> entire indexes.

True, but it'd still be good progress for the frequent case of
adding/deleting docs to an existing index.  Progress not perfection...

> > Here's one idea I just had: assuming there are no deletions, you can
> > almost do a raw bytes copy from input segment to output (merged)
> > segment of the postings for a given term X.  I think for prox postings
> > you can.
> 
> You can probably squeeze out some nice gains using a skipVint()  
> function, even with deletions.

Good point.  I think likewise with copyVInt(int numToCopy).
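A sketch of such a skipVInt helper (hypothetical names): because the high bit of each byte marks continuation, a VInt's end can be found without decoding its value, so runs of postings can be stepped over or bulk-copied cheaply.

```java
// Sketch only: step over VInts without decoding them, as suggested above.
public class VIntSkipSketch {
    /** Advance past one VInt; returns the new offset. */
    public static int skipVInt(byte[] buf, int offset) {
        while ((buf[offset] & 0x80) != 0) offset++;  // continuation bytes
        return offset + 1;                           // plus the final byte
    }

    /** Skip n consecutive VInts. */
    public static int skipVInts(byte[] buf, int offset, int n) {
        for (int i = 0; i < n; i++) offset = skipVInt(buf, offset);
        return offset;
    }

    public static void main(String[] args) {
        byte[] buf = {(byte) 0xAC, 0x02, 0x05};   // VInt 300, then VInt 5
        System.out.println(skipVInt(buf, 0));     // prints 2 (2-byte VInt)
        System.out.println(skipVInts(buf, 0, 2)); // prints 3 (both values)
    }
}
```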

> > But for freq postings, you can't, because they are delta coded.
> 
> I'm working on this task right now for KS.
> 
> KS implements the "Flexible Indexing" paradigm, so all posting data  
> goes in a single file.
> 
> I've applied an additional constraint to KS:  Every binary file must  
> consist of one type of record repeated over and over.  Every indexed  
> field gets its own dedicated posting file with the suffix .pNNN to  
> allow per-field posting formats.
> 
> The I/O code is isolated in subclasses of a new class called  
> "Stepper":  You can turn any Stepper loose on its file and read it  
> from top to tail.  When the file format changes, Steppers will get  
> archived, like old plugins.
> 
> My present task is to write the code for the Stepper subclasses  
> MatchPosting, ScorePosting, and RichPosting. (PayloadPosting can  
> wait.)  As I write them, I will see if I can figure out a format that
> can be merged as speedily as possible.  Perhaps the precise variant  
> of delta encoding used in Lucene's .frq file should be avoided.

Neat!  Yes, designing the file format to accommodate "merging"
efficiently (plus searching of course) is a good idea since we lose so
much indexing time to this.

> Except: it's only the first entry of the incoming segment's freq
> > postings that needs to be re-interpreted?  So you could read that one,
> > encode the delta based on "last docID" for previous segment (I think
> > we'd have to store this in index, probably only if termFreq >
> > threshold), and then copyBytes the rest of the posting?  I will try
> > this out on the merges I'm doing in LUCENE-843; I think it should
> > work and make merging faster (assuming no deletes)?
> 
> Ugh, more special case code.
> 
> I have to say, I started trying to go over your patch, and the  
> overwhelming impression I got coming back to this part of the Lucene  
> code base in earnest for the first time since using 1.4.3 as a  
> porting reference was: simplicity seems to be nobody's priority these  
> days.

Unfortunately this is just a tough tradeoff... higher performance code
is often not "simple".  I also still need to clean up the code, add
comments, etc, but even after that, it's not going to look "simple".
I think this is just the reality of performance optimization.

Mike




Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Marvin Humphrey


On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:


What we need to do is cut down on decompression and conflict
resolution costs when reading from one segment to another.  KS has
solved this problem for stored fields.  Field defs are global and
field values are keyed by name rather than field number in the field
data file.  Benefits:

   * Whole documents can be read from one segment to
 another as blobs.
   * No flags byte.
   * No remapping of field numbers.
   * No conflict resolution at all.
   * Compressed, uncompressed... doesn't matter.
   * Less code.
   * The possibility of allowing the user to provide their
 own subclass for reading and writing fields. (For
 Lucy, in the language of your choice.)


I hear you, and I really really love those benefits, but, we just
don't have this freedom with Lucene.


Yeah, too bad.  This is one area where Lucene and Lucy are going to  
differ.  Balmain and I are of one mind about global field defs.



I think the ability to suddenly birth a new field,


You can do that in KS as of version 0.20_02.  :)


or change a field's attributes like "has vectors", "stores norms",
etc., with a new document,


Can't do that, though, and I make no apologies.  I think it's a  
misfeature.



I suppose if we had a
single mapping of field names -> numbers in the index, that would gain
us many of the above benefits?  Hmmm.


You'll still have to be able to remap field numbers when adding  
entire indexes.



Here's one idea I just had: assuming there are no deletions, you can
almost do a raw bytes copy from input segment to output (merged)
segment of the postings for a given term X.  I think for prox postings
you can.


You can probably squeeze out some nice gains using a skipVint()  
function, even with deletions.



But for freq postings, you can't, because they are delta coded.


I'm working on this task right now for KS.

KS implements the "Flexible Indexing" paradigm, so all posting data  
goes in a single file.


I've applied an additional constraint to KS:  Every binary file must  
consist of one type of record repeated over and over.  Every indexed  
field gets its own dedicated posting file with the suffix .pNNN to  
allow per-field posting formats.


The I/O code is isolated in subclasses of a new class called  
"Stepper":  You can turn any Stepper loose on its file and read it  
from top to tail.  When the file format changes, Steppers will get  
archived, like old plugins.


My present task is to write the code for the Stepper subclasses  
MatchPosting, ScorePosting, and RichPosting. (PayloadPosting can  
wait.)  As I write them, I will see if I can figure out a format that
can be merged as speedily as possible.  Perhaps the precise variant
of delta encoding used in Lucene's .frq file should be avoided.



Except: it's only the first entry of the incoming segment's freq
postings that needs to be re-interpreted?  So you could read that one,
encode the delta based on "last docID" for previous segment (I think
we'd have to store this in index, probably only if termFreq >
threshold), and then copyBytes the rest of the posting?  I will try
this out on the merges I'm doing in LUCENE-843; I think it should
work and make merging faster (assuming no deletes)?


Ugh, more special case code.

I have to say, I started trying to go over your patch, and the  
overwhelming impression I got coming back to this part of the Lucene  
code base in earnest for the first time since using 1.4.3 as a  
porting reference was: simplicity seems to be nobody's priority these  
days.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless
"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:
> 
> Michael, like everyone else, I am watching this very closely.  So far  
> it sounds great!
> 
> On Apr 5, 2007, at 8:03 PM, Michael McCandless wrote:
> 
> > When I measure "amount of RAM @ flush time", I'm calling
> > MemoryMXBean.getHeapMemoryUsage().getUsed().  So, this measures actual
> > process memory usage which should be (for my tests) around the sum of
> > the above two types of RAM usage.
> 
> One thing caught my eye, though, MemoryMXBean is JDK 1.5.  :-(
> 
> http://java.sun.com/j2se/1.5.0/docs/api/java/lang/management/ 
> MemoryMXBean.html

Yeah, thanks for pointing this out.  I'm only using that to do my
benchmarking, not to actually measure RAM usage for "when to flush",
so I will definitely remove it before committing (I always go to a
1.4.2 environment and do an "ant clean test" to be certain I didn't do
something like this by accident :).

Mike




Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Grant Ingersoll


Michael, like everyone else, I am watching this very closely.  So far  
it sounds great!


On Apr 5, 2007, at 8:03 PM, Michael McCandless wrote:


When I measure "amount of RAM @ flush time", I'm calling
MemoryMXBean.getHeapMemoryUsage().getUsed().  So, this measures actual
process memory usage which should be (for my tests) around the sum of
the above two types of RAM usage.


One thing caught my eye, though, MemoryMXBean is JDK 1.5.  :-(

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/management/ 
MemoryMXBean.html





Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless

"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
> On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote:
> 
> >>> (I think for KS you "add" a previous segment not that
> >>> differently from how you "add" a document)?
> >>
> >> Yeah.  KS has to decompress and serialize posting content, which sux.
> >>
> >> The one saving grace is that with the Fibonacci merge schedule and
> >> the seg-at-a-time indexing strategy, segments don't get merged nearly
> >> as often as they do in Lucene.
> >
> > Yeah we need to work on this one.
> 
> What we need to do is cut down on decompression and conflict  
> resolution costs when reading from one segment to another.  KS has  
> solved this problem for stored fields.  Field defs are global and  
> field values are keyed by name rather than field number in the field  
> data file.  Benefits:
> 
>* Whole documents can be read from one segment to
>  another as blobs.
>* No flags byte.
>* No remapping of field numbers.
>* No conflict resolution at all.
>* Compressed, uncompressed... doesn't matter.
>* Less code.
>* The possibility of allowing the user to provide their
>  own subclass for reading and writing fields. (For
>  Lucy, in the language of your choice.)

I hear you, and I really really love those benefits, but, we just
don't have this freedom with Lucene.

I think the ability to suddenly birth a new field, or change a field's
attributes like "has vectors", "stores norms", etc., with a new
document, is something we just can't break at this point with Lucene?

If we could get those benefits without breaking backwards
compatibility then that would be awesome.  I suppose if we had a
single mapping of field names -> numbers in the index, that would gain
us many of the above benefits?  Hmmm.
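A hedged sketch of what that "single mapping of field names -> numbers" might look like as an index-wide, write-once registry, so every segment agrees on the numbering and merges need no remapping (class and method names here are hypothetical, not Lucene API):

```java
import java.util.HashMap;
import java.util.Map;

public class GlobalFieldMap {
    private final Map<String, Integer> numbers = new HashMap<String, Integer>();

    // Returns the field's global number, assigning the next free
    // number the first time a name is seen.  Numbers are never
    // reassigned, so segments written at different times agree.
    synchronized int numberFor(String fieldName) {
        Integer n = numbers.get(fieldName);
        if (n == null) {
            n = numbers.size();
            numbers.put(fieldName, n);
        }
        return n;
    }
}
```

The open problem Marvin raises next still applies: merging in a whole external index would require reconciling its map with this one.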

> What I haven't got yet is a way to move terms and postings  
> economically from one segment to another.  But I'm working on it.  :)

Here's one idea I just had: assuming there are no deletions, you can
almost do a raw bytes copy from input segment to output (merged)
segment of the postings for a given term X.  I think for prox postings
you can.  But for freq postings, you can't, because they are delta
coded.

Except: it's only the first entry of the incoming segment's freq
postings that needs to be re-interpreted?  So you could read that one,
encode the delta based on "last docID" for previous segment (I think
we'd have to store this in index, probably only if termFreq >
threshold), and then copyBytes the rest of the posting?  I will try
this out on the merges I'm doing in LUCENE-843; I think it should
work and make merging faster (assuming no deletes)?
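A tiny illustration of that "re-encode only the first delta" idea, with int arrays standing in for the encoded stream (the real .frq file interleaves freqs and uses VInts, so this is a simplification; `merge` and `toAbsolute` are hypothetical names):

```java
import java.util.Arrays;

public class DeltaMergeDemo {
    // Decode delta-coded docIDs into absolute docIDs.
    static int[] toAbsolute(int[] deltas) {
        int[] docs = new int[deltas.length];
        int last = 0;
        for (int i = 0; i < deltas.length; i++) {
            last += deltas[i];
            docs[i] = last;
        }
        return docs;
    }

    // Append segment B's delta-coded postings after segment A's.
    // B's docIDs are offset by docBase (the doc count of A), so only
    // B's FIRST entry must be rewritten relative to A's last docID;
    // every remaining delta is copied over verbatim (a bytes copy,
    // in the real file format).
    static int[] merge(int[] aDeltas, int lastDocA, int[] bDeltas, int docBase) {
        int[] merged = Arrays.copyOf(aDeltas, aDeltas.length + bDeltas.length);
        int firstDocB = docBase + bDeltas[0];        // renumbered first doc of B
        merged[aDeltas.length] = firstDocB - lastDocA;  // the one re-based delta
        System.arraycopy(bDeltas, 1, merged, aDeltas.length + 1,
                         bDeltas.length - 1);
        return merged;
    }
}
```

This also shows why "last docID" per term would need to be stored (or re-read) at merge time: it is the only piece of A's state the splice depends on.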

> > One thing that irks me about the
> > current Lucene merge policy (besides that it gets confused when you
> > flush-by-RAM-usage) is that it's a "pay it forward" design so you're
> > always over-paying when you build a given index size.  With KS's
> > Fibonacci merge policy, you don't.  LUCENE-854 has some more details.
> 
> However, even under Fibo, when you get socked with a big merge, you  
> really get socked.  It bothers me that the time for adding to your  
> index can vary so unpredictably.

Yeah, I think that's best solved by concurrency (either with threads
or with our own "scheduling" eg on adding a doc you go and merge
another N terms in the running merge)?  There have been several
proposals recently for making Lucene's merging concurrent
(backgrounded), as part of LUCENE-847.

> > Segment merging really is costly.  In building a large (86 GB, 10 MM
> > docs) index, 65.6% of the time was spent merging!  Details are in
> > LUCENE-856...
> 
> > This is a great model.  Are there Python bindings to Lucy yet/coming?
> 
> I'm sure that they will appear once the C core is ready.  The  
> approach I am taking is to make some high-level design decisions  
> collaboratively on lucy-dev, then implement them in KS.  There's a  
> large amount of code that has been written according to our specs  
> that is working in KS and ready to commit to Lucy after trivial  
> changes.  There's more that's ready for review.  However, release of  
> KS 0.20 is taking priority, so code flow into the Lucy repository has  
> slowed.

OK, good to hear.

> I'll also be looking for a job in about a month.  That may slow us  
> down some more, though it won't stop things --  I've basically  
> decided that I'll do what it takes to get Lucy off the ground.  I'll go  
> with something stopgap if nothing materializes which is compatible  
> with that commitment.

Whoa, I'm sorry to hear that :(  I hope you land, quickly, somewhere
that takes Lucy/KS seriously.  It's clearly excellent work.

Mike




Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless

Hi Otis!

"Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:
> You talk about a RAM buffer from 1MB - 96MB, but then you have the amount
> of RAM @ flush time (e.g. "Avg RAM used (MB) @ flush: old 34.5; new
> 3.4 [90.1% less]").
> 
> I don't follow 100% of what you are doing in LUCENE-843, so could you
> please explain what these 2 different amounts of RAM are?
> Is the first (1-96) the RAM you use for in-memory merging of segments?
> What is the RAM used @ flush?  More precisely, why is it that that amount
> of RAM exceeds the RAM buffer?

Very good questions!

When I say "the RAM buffer size is set to 96 MB", what I mean is I
flush the writer when the in-memory segments are using 96 MB RAM.  On
trunk, I just call ramSizeInBytes().  I do the analogous thing with my
patch (sum up size of RAM buffers used by segments).  I call this part
of the RAM usage the "indexed documents RAM".  With every added
document, this grows.

But: this does not account for all data structures (Posting instances,
HashMap, FieldsWriter, TermVectorsWriter, int[] arrays, etc.) used,
but not saved away, during the indexing of a single document.  All the
"things" used temporarily while indexing a document take up RAM too.
I call this part of the RAM usage the "document processing RAM".  This
RAM does not grow with every added document, though its size is
proportional to how big each document is.  This memory is always
re-used (it does not grow over time), but the trunk "re-uses" it by
making garbage and re-allocating, whereas my patch explicitly recycles it.
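The accounting Mike describes might be sketched like this (illustrative only, not Lucene's actual internals; the class and thresholds are hypothetical, mirroring the setRAMBufferSize/ramSizeInBytes behavior discussed above):

```java
public class RamFlushPolicy {
    private final long ramBufferBytes;  // configured "RAM buffer size"
    private long bytesUsed;             // the "indexed documents RAM"
    private int flushCount;

    RamFlushPolicy(long ramBufferBytes) {
        this.ramBufferBytes = ramBufferBytes;
    }

    // Analogous to IndexWriter.ramSizeInBytes() on trunk.
    long ramSizeInBytes() {
        return bytesUsed;
    }

    // Called once per added document with an estimate of the bytes
    // its buffered postings consumed; flush when the buffer fills.
    void addDocument(long estimatedBytes) {
        bytesUsed += estimatedBytes;
        if (bytesUsed >= ramBufferBytes) {
            flush();
        }
    }

    private void flush() {
        flushCount++;   // a real writer would write a segment here
        bytesUsed = 0;  // buffers are recycled, not discarded
    }

    int flushCount() {
        return flushCount;
    }
}
```

Note what this deliberately does not count: the transient "document processing RAM" described next, which is overhead on top of the tracked buffer.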

When I measure "amount of RAM @ flush time", I'm calling
MemoryMXBean.getHeapMemoryUsage().getUsed().  So, this measures actual
process memory usage which should be (for my tests) around the sum of
the above two types of RAM usage.
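That measurement uses the real java.lang.management API introduced in JDK 1.5 (which is why it had to stay out of the committed code at the time):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapProbe {
    // Heap bytes currently used by this JVM, as sampled in the benchmarks.
    static long heapUsed() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        return bean.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        System.out.println("heap used @ flush: " + heapUsed() + " bytes");
    }
}
```

Because the figure includes not-yet-collected garbage, it is exactly the "noisy" number Mike describes below for the trunk.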

With the trunk, the actual process memory usage tends to be quite a
bit higher than the RAM buffer size and also tends to be very "noisy"
(jumps around with each flush).  I think this is because of
delays/unpredictability on when GC kicks in to reclaim the garbage
created during indexing of the doc.  Whereas with my patch, it's
usually quite a bit closer to the "indexed documents RAM" and does not
jump around nearly as much.

So the "actual process RAM used" will always exceed my "RAM buffer
size".  The amount of excess is a measure of the "overhead" required
to process the document.  The trunk has far worse overhead than with
my patch, which I think means a given application will be able to use
a *larger* RAM buffer size with LUCENE-843.

Does that make sense?

Mike




Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless

"Mike Klaas" <[EMAIL PROTECTED]> wrote:
> On 4/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> >
> > : Thanks!  But remember many Lucene apps won't see these speedups since I've
> > : carefully minimized cost of tokenization and cost of document
> > : retrieval.  I think for many Lucene apps these are a sizable part of
> > : time spent indexing.
> >
> > true, but as long as the changes you are making have no impact on the
> > tokenization/docbuilding times, that shouldn't be a factor -- that should
> > be considered a "constant time" adjunct to the code you are varying ...
> > people with expensive analysis may not see any significant increases, but
> > that's their own problem -- people concerned about performance will
> > already have that as fast as they can get it, and now the internals of
> > document adding will get faster as well.
> 
> Especially since it is relatively easy for users to tweak the analysis
> bits for performance--compared to the messy guts of index creation.
> 
> I am eagerly tracking the progress of your work.

Thanks Mike (and Hoss).

Hoss, what you said is correct: I'm only affecting the actual indexing of
a document, nothing before that.

I just want to make sure I get that disclaimer out, as much as possible, so
nobody tries the patch and says "Hey!  My app only got 10% faster!  This was
false advertising!".

People who indeed have minimized their doc retrieval and tokenization time
should see speedups around what I'm seeing with the benchmarks (I hope!).

Mike




Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Mike Klaas

On 4/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:


: Thanks!  But remember many Lucene apps won't see these speedups since I've
: carefully minimized cost of tokenization and cost of document retrieval.  I
: think for many Lucene apps these are a sizable part of time spent indexing.

true, but as long as the changes you are making have no impact on the
tokenization/docbuilding times, that shouldn't be a factor -- that should
be considered a "constant time" adjunct to the code you are varying ...
people with expensive analysis may not see any significant increases, but
that's their own problem -- people concerned about performance will
already have that as fast as they can get it, and now the internals of
document adding will get faster as well.


Especially since it is relatively easy for users to tweak the analysis
bits for performance--compared to the messy guts of index creation.

I am eagerly tracking the progress of your work.

-Mike




Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Chris Hostetter

: Thanks!  But remember many Lucene apps won't see these speedups since I've
: carefully minimized cost of tokenization and cost of document retrieval.  I
: think for many Lucene apps these are a sizable part of time spent indexing.

true, but as long as the changes you are making have no impact on the
tokenization/docbuilding times, that shouldn't be a factor -- that should
be considered a "constant time" adjunct to the code you are varying ...
people with expensive analysis may not see any significant increases, but
that's their own problem -- people concerned about performance will
already have that as fast as they can get it, and now the internals of
document adding will get faster as well.



-Hoss





Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Marvin Humphrey


On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote:


(I think for KS you "add" a previous segment not that
differently from how you "add" a document)?


Yeah.  KS has to decompress and serialize posting content, which sux.

The one saving grace is that with the Fibonacci merge schedule and
the seg-at-a-time indexing strategy, segments don't get merged nearly
as often as they do in Lucene.


Yeah we need to work on this one.


What we need to do is cut down on decompression and conflict  
resolution costs when reading from one segment to another.  KS has  
solved this problem for stored fields.  Field defs are global and  
field values are keyed by name rather than field number in the field  
data file.  Benefits:


  * Whole documents can be read from one segment to
another as blobs.
  * No flags byte.
  * No remapping of field numbers.
  * No conflict resolution at all.
  * Compressed, uncompressed... doesn't matter.
  * Less code.
  * The possibility of allowing the user to provide their
own subclass for reading and writing fields. (For
Lucy, in the language of your choice.)

What I haven't got yet is a way to move terms and postings  
economically from one segment to another.  But I'm working on it.  :)



One thing that irks me about the
current Lucene merge policy (besides that it gets confused when you
flush-by-RAM-usage) is that it's a "pay it forward" design so you're
always over-paying when you build a given index size.  With KS's
Fibonacci merge policy, you don't.  LUCENE-854 has some more details.


However, even under Fibo, when you get socked with a big merge, you  
really get socked.  It bothers me that the time for adding to your  
index can vary so unpredictably.



Segment merging really is costly.  In building a large (86 GB, 10 MM
docs) index, 65.6% of the time was spent merging!  Details are in
LUCENE-856...



This is a great model.  Are there Python bindings to Lucy yet/coming?


I'm sure that they will appear once the C core is ready.  The  
approach I am taking is to make some high-level design decisions  
collaboratively on lucy-dev, then implement them in KS.  There's a  
large amount of code that has been written according to our specs  
that is working in KS and ready to commit to Lucy after trivial  
changes.  There's more that's ready for review.  However, release of  
KS 0.20 is taking priority, so code flow into the Lucy repository has  
slowed.


I'll also be looking for a job in about a month.  That may slow us  
down some more, though it won't stop things --  I've basically  
decided that I'll do what it takes to get Lucy off the ground.  I'll go  
with something stopgap if nothing materializes which is compatible  
with that commitment.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Otis Gospodnetic
Quick question, Mike:

You talk about a RAM buffer from 1MB - 96MB, but then you have the amount of
RAM @ flush time (e.g. "Avg RAM used (MB) @ flush: old 34.5; new 3.4
[90.1% less]").

I don't follow 100% of what you are doing in LUCENE-843, so could you please 
explain what these 2 different amounts of RAM are?
Is the first (1-96) the RAM you use for in-memory merging of segments?
What is the RAM used @ flush?  More precisely, why is it that that amount of 
RAM exceeds the RAM buffer?

Thanks,
Otis



- Original Message 
From: Michael McCandless (JIRA) <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 9:22:32 AM
Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to 
buffer added documents


[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486942
 ] 

Michael McCandless commented on LUCENE-843:
---


OK I ran old (trunk) vs new (this patch) with increasing RAM buffer
sizes up to 96 MB.

I used the "normal" sized docs (~5,500 bytes plain text), left stored
fields and term vectors (positions + offsets) on, and
autoCommit=false.

Here're the results:

NUM THREADS = 1
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)


1 MB

  old
200000 docs in 862.2 secs
index size = 1.7G

  new
200000 docs in 297.1 secs
index size = 1.7G

  Total Docs/sec:             old 232.0; new 673.2 [190.2% faster]
  Docs/MB @ flush:            old  47.2; new 278.4 [489.6% more]
  Avg RAM used (MB) @ flush:  old  34.5; new   3.4 [ 90.1% less]



2 MB

  old
200000 docs in 828.7 secs
index size = 1.7G

  new
200000 docs in 279.0 secs
index size = 1.7G

  Total Docs/sec:             old 241.3; new 716.8 [197.0% faster]
  Docs/MB @ flush:            old  47.0; new 322.4 [586.7% more]
  Avg RAM used (MB) @ flush:  old  37.9; new   4.5 [ 88.0% less]



4 MB

  old
200000 docs in 840.5 secs
index size = 1.7G

  new
200000 docs in 260.8 secs
index size = 1.7G

  Total Docs/sec:             old 237.9; new 767.0 [222.3% faster]
  Docs/MB @ flush:            old  46.8; new 363.1 [675.4% more]
  Avg RAM used (MB) @ flush:  old  33.9; new   6.5 [ 80.9% less]



8 MB

  old
200000 docs in 678.8 secs
index size = 1.7G

  new
200000 docs in 248.8 secs
index size = 1.7G

  Total Docs/sec:             old 294.6; new 803.7 [172.8% faster]
  Docs/MB @ flush:            old  46.8; new 392.4 [739.1% more]
  Avg RAM used (MB) @ flush:  old  60.3; new  10.7 [ 82.2% less]



16 MB

  old
200000 docs in 660.6 secs
index size = 1.7G

  new
200000 docs in 247.3 secs
index size = 1.7G

  Total Docs/sec:             old 302.8; new 808.7 [167.1% faster]
  Docs/MB @ flush:            old  46.7; new 415.4 [788.8% more]
  Avg RAM used (MB) @ flush:  old  47.1; new  19.2 [ 59.3% less]



24 MB

  old
200000 docs in 658.1 secs
index size = 1.7G

  new
200000 docs in 243.0 secs
index size = 1.7G

  Total Docs/sec:             old 303.9; new 823.0 [170.8% faster]
  Docs/MB @ flush:            old  46.7; new 430.9 [822.2% more]
  Avg RAM used (MB) @ flush:  old  70.0; new  27.5 [ 60.8% less]



32 MB

  old
200000 docs in 714.2 secs
index size = 1.7G

  new
200000 docs in 239.2 secs
index size = 1.7G

  Total Docs/sec:             old 280.0; new 836.0 [198.5% faster]
  Docs/MB @ flush:            old  46.7; new 432.2 [825.2% more]
  Avg RAM used (MB) @ flush:  old  92.5; new  36.7 [ 60.3% less]



48 MB

  old
200000 docs in 640.3 secs
index size = 1.7G

  new
200000 docs in 236.0 secs
index size = 1.7G

  Total Docs/sec:             old 312.4; new 847.5 [171.3% faster]
  Docs/MB @ flush:            old  46.7; new 438.5 [838.8% more]
  Avg RAM used (MB) @ flush:  old 138.9; new  52.8 [ 62.0% less]



64 MB

  old
200000 docs in 649.3 secs
index size = 1.7G

  new
200000 docs in 238.3 secs
index size = 1.7G

  Total Docs/sec:             old 308.0; new 839.3 [172.5% faster]
  Docs/MB @ flush:            old  46.7; new 441.3 [844.7% more]
  Avg RAM used (MB) @ flush:  old 302.6; new  72.7 [ 76.0% less]



80 MB

  old
200000 docs in 670.2 secs
index size = 1.7G

  new
200000 docs in 227.2 secs
index size = 1.7G

  Total Docs/sec:             old 298.4; new 880.5 [195.0% faster]
  Docs/MB @ flush:            old  46.7; new 446.2 [855.2% more]
  Avg RAM used (MB) @ flush:  old 231.7; new  94.3 [ 59.3% less]



96 MB

  old
200000 docs in 683.4 secs
index size = 1.7G

  new
200000 docs in 226.8 secs
index size = 1.7G

Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless

"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:

> > (I think for KS you "add" a previous segment not that
> > differently from how you "add" a document)?
> 
> Yeah.  KS has to decompress and serialize posting content, which sux.
> 
> The one saving grace is that with the Fibonacci merge schedule and  
> the seg-at-a-time indexing strategy, segments don't get merged nearly  
> as often as they do in Lucene.

Yeah we need to work on this one.  One thing that irks me about the
current Lucene merge policy (besides that it gets confused when you
flush-by-RAM-usage) is that it's a "pay it forward" design so you're
always over-paying when you build a given index size.  With KS's
Fibonacci merge policy, you don't.  LUCENE-854 has some more details.

Segment merging really is costly.  In building a large (86 GB, 10 MM
docs) index, 65.6% of the time was spent merging!  Details are in
LUCENE-856...

> > On C) I think it is important so the many ports of Lucene can "compare
> > notes" and "cross fertilize".
> 
> Well, if you port Lucene's benchmarking stuff to Perl/C, I'll apply  
> the patch. ;)

I hear you!

> Cross-fertilization is a powerful tool for stimulating algorithmic  
> innovation.  Exhibit A: our unfolding collaborative successes.

Couldn't agree more.

> That's why it was built into the Lucy proposal:
> 
>  [Lucy's C engine] will provide core, performance-critical
>  functionality, but leave as much up to the higher-level
>  language as possible.
> 
> Users from diverse communities approach problems from different  
> angles and come up with different solutions.  The best ones will  
> propagate across Lucy bindings.
> 
> The only problem is that since Dave Balmain has been much less  
> available than we expected, it's been largely up to me to get Lucy to  
> critical mass where other people can start writing bindings.

This is a great model.  Are there Python bindings to Lucy yet/coming?

> > But does KS give its users a choice in Tokenizer?
> 
> You supply a regular expression which matches one token.
> 
># Presto! A WhiteSpaceTokenizer:
>my $tokenizer = KinoSearch::Analysis::Tokenizer->new(
>token_re => qr/\S+/
>);
> 
> > Or, can users pre-tokenize their fields themselves?
> 
> TokenBatch provides an API for bulk addition of tokens; you can  
> subclass Analyzer to exploit that.

Ahh, I get it.  Nice!

Mike




Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Marvin Humphrey


On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote:


Marvin do you have any sense of what the equivalent cost is
in KS


It's big.  I don't have any good optimizations to suggest in this area.


(I think for KS you "add" a previous segment not that
differently from how you "add" a document)?


Yeah.  KS has to decompress and serialize posting content, which sux.

The one saving grace is that with the Fibonacci merge schedule and  
the seg-at-a-time indexing strategy, segments don't get merged nearly  
as often as they do in Lucene.



I share large int[] blocks and char[] blocks
across Postings and re-use them.  Etc.


Interesting.  I will have to try something like that!


On C) I think it is important so the many ports of Lucene can "compare
notes" and "cross fertilize".


Well, if you port Lucene's benchmarking stuff to Perl/C, I'll apply  
the patch. ;)


Cross-fertilization is a powerful tool for stimulating algorithmic  
innovation.  Exhibit A: our unfolding collaborative successes.


That's why it was built into the Lucy proposal:

[Lucy's C engine] will provide core, performance-critical
functionality, but leave as much up to the higher-level
language as possible.

Users from diverse communities approach problems from different  
angles and come up with different solutions.  The best ones will  
propagate across Lucy bindings.


The only problem is that since Dave Balmain has been much less  
available than we expected, it's been largely up to me to get Lucy to  
critical mass where other people can start writing bindings.



Performance certainly isn't everything.


That's a given in scripting language culture.  Most users are  
concerned with minimizing developer time above all else.  Ergo, my  
emphasis on API design and simplicity.



But does KS give its users a choice in Tokenizer?


You supply a regular expression which matches one token.

  # Presto! A WhiteSpaceTokenizer:
  my $tokenizer = KinoSearch::Analysis::Tokenizer->new(
  token_re => qr/\S+/
  );


Or, can users pre-tokenize their fields themselves?


TokenBatch provides an API for bulk addition of tokens; you can  
subclass Analyzer to exploit that.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless

"eks dev" <[EMAIL PROTECTED]> wrote:
> wow, impressive numbers, congrats !

Thanks!  But remember many Lucene apps won't see these speedups since I've
carefully minimized cost of tokenization and cost of document retrieval.  I
think for many Lucene apps these are a sizable part of time spent indexing.

Next up I'm going to test thread concurrency of new vs old.

And then still a fair number of things to resolve before committing...

Mike




Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread eks dev
wow, impressive numbers, congrats !

- Original Message 
From: Michael McCandless (JIRA) <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, 5 April, 2007 3:22:32 PM
Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to 
buffer added documents


[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486942
 ] 

Michael McCandless commented on LUCENE-843:
---


OK I ran old (trunk) vs new (this patch) with increasing RAM buffer
sizes up to 96 MB.

I used the "normal" sized docs (~5,500 bytes plain text), left stored
fields and term vectors (positions + offsets) on, and
autoCommit=false.

Here're the results:

NUM THREADS = 1
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)


1 MB

  old
200000 docs in 862.2 secs
index size = 1.7G

  new
200000 docs in 297.1 secs
index size = 1.7G

  Total Docs/sec:             old 232.0; new 673.2 [190.2% faster]
  Docs/MB @ flush:            old  47.2; new 278.4 [489.6% more]
  Avg RAM used (MB) @ flush:  old  34.5; new   3.4 [ 90.1% less]



2 MB

  old
200000 docs in 828.7 secs
index size = 1.7G

  new
200000 docs in 279.0 secs
index size = 1.7G

  Total Docs/sec:             old 241.3; new 716.8 [197.0% faster]
  Docs/MB @ flush:            old  47.0; new 322.4 [586.7% more]
  Avg RAM used (MB) @ flush:  old  37.9; new   4.5 [ 88.0% less]



4 MB

  old
200000 docs in 840.5 secs
index size = 1.7G

  new
200000 docs in 260.8 secs
index size = 1.7G

  Total Docs/sec:             old 237.9; new 767.0 [222.3% faster]
  Docs/MB @ flush:            old  46.8; new 363.1 [675.4% more]
  Avg RAM used (MB) @ flush:  old  33.9; new   6.5 [ 80.9% less]



8 MB

  old
200000 docs in 678.8 secs
index size = 1.7G

  new
200000 docs in 248.8 secs
index size = 1.7G

  Total Docs/sec:             old 294.6; new 803.7 [172.8% faster]
  Docs/MB @ flush:            old  46.8; new 392.4 [739.1% more]
  Avg RAM used (MB) @ flush:  old  60.3; new  10.7 [ 82.2% less]



16 MB

  old
200000 docs in 660.6 secs
index size = 1.7G

  new
200000 docs in 247.3 secs
index size = 1.7G

  Total Docs/sec:             old 302.8; new 808.7 [167.1% faster]
  Docs/MB @ flush:            old  46.7; new 415.4 [788.8% more]
  Avg RAM used (MB) @ flush:  old  47.1; new  19.2 [ 59.3% less]



24 MB

  old
200000 docs in 658.1 secs
index size = 1.7G

  new
200000 docs in 243.0 secs
index size = 1.7G

  Total Docs/sec:             old 303.9; new 823.0 [170.8% faster]
  Docs/MB @ flush:            old  46.7; new 430.9 [822.2% more]
  Avg RAM used (MB) @ flush:  old  70.0; new  27.5 [ 60.8% less]



32 MB

  old
200000 docs in 714.2 secs
index size = 1.7G

  new
200000 docs in 239.2 secs
index size = 1.7G

  Total Docs/sec:             old 280.0; new 836.0 [198.5% faster]
  Docs/MB @ flush:            old  46.7; new 432.2 [825.2% more]
  Avg RAM used (MB) @ flush:  old  92.5; new  36.7 [ 60.3% less]



48 MB

  old
200000 docs in 640.3 secs
index size = 1.7G

  new
200000 docs in 236.0 secs
index size = 1.7G

  Total Docs/sec:             old 312.4; new 847.5 [171.3% faster]
  Docs/MB @ flush:            old  46.7; new 438.5 [838.8% more]
  Avg RAM used (MB) @ flush:  old 138.9; new  52.8 [ 62.0% less]



64 MB

  old
200000 docs in 649.3 secs
index size = 1.7G

  new
200000 docs in 238.3 secs
index size = 1.7G

  Total Docs/sec:             old 308.0; new 839.3 [172.5% faster]
  Docs/MB @ flush:            old  46.7; new 441.3 [844.7% more]
  Avg RAM used (MB) @ flush:  old 302.6; new  72.7 [ 76.0% less]



80 MB

  old
200000 docs in 670.2 secs
index size = 1.7G

  new
200000 docs in 227.2 secs
index size = 1.7G

  Total Docs/sec:             old 298.4; new 880.5 [195.0% faster]
  Docs/MB @ flush:            old  46.7; new 446.2 [855.2% more]
  Avg RAM used (MB) @ flush:  old 231.7; new  94.3 [ 59.3% less]



96 MB

  old
    200000 docs in 683.4 secs
    index size = 1.7G

  new
    200000 docs in 226.8 secs
    index size = 1.7G

  Total Docs/sec:             old   292.7; new   882.0 [201.4% faster]
  Docs/MB @ flush:            old    46.7; new   448.0 [859.1% more]
  Avg RAM used (MB) @ flush:  old   274.5; new   112.7 [59.0% less]
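The bracketed percentages are plain ratios of the raw columns. As a sketch (helper names are mine, not from the benchmark code), recomputing from the rounded table values lands within a fraction of a point of the printed figures, since the original numbers were computed before rounding:

```java
public class SpeedupMath {
    // "% faster": how much higher the new rate is relative to the old one.
    static double pctFaster(double oldRate, double newRate) {
        return (newRate / oldRate - 1.0) * 100.0;
    }

    // "% less": how much lower the new value is relative to the old one.
    static double pctLess(double oldVal, double newVal) {
        return (1.0 - newVal / oldVal) * 100.0;
    }

    public static void main(String[] args) {
        // 96 MB run: old 292.7 docs/sec vs new 882.0 docs/sec (table: 201.4% faster)
        System.out.printf("%.1f%% faster%n", pctFaster(292.7, 882.0));
        // 96 MB run: old 274.5 MB vs new 112.7 MB avg RAM (table: 59.0% less)
        System.out.printf("%.1f%% less%n", pctLess(274.5, 112.7));
    }
}
```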


Some observations:

  * Remember the test is already biased against "new" because with the
patch you get an optimized index in the end but with "old" you
don't.

  * Sweet spot for old (trunk) seems to be 48 MB: that is the peak
docs/sec @ 

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless (JIRA)
 and better RAM efficiency, the more RAM you give.
This makes sense: it's better able to compress the terms dict, the
more docs are merged in RAM before having to flush to disk.  I
would also expect this curve to be somewhat content dependent.


> improve how IndexWriter uses RAM to buffer added documents
> --
>
> Key: LUCENE-843
> URL: https://issues.apache.org/jira/browse/LUCENE-843
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.2
>Reporter: Michael McCandless
> Assigned To: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, 
> LUCENE-843.take3.patch, LUCENE-843.take4.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
> use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
> in-RAM merges.  Once RAM is full, flush buffers to disk (and
> merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number documents added.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless
"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
> On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote:
> 
> >> (: Ironically, the numbers for Lucene on that page are a little
> >> better than they should be because of a sneaky bug.  I would have
> >> made updating the results a priority if they'd gone the other  
> >> way.  :)
> >
> > Hrm.  It would be nice to have hard comparison of the Lucene, KS (and
> > Ferret and others?).
> 
> Doing honest, rigorous benchmarking is exacting and labor-intensive.   
> Publishing results tends to ignite flame wars I don't have time for.
> 
> The main point that I wanted to make with that page was that KS was a  
> lot faster than Plucene, and that it was in Lucene's ballpark.   
> Having made that point, I've moved on.  The benchmarking code is  
> still very useful for internal development and I use it frequently.

Agreed.  Though, if the benchmarking is done in a way that anyone
could download & re-run it (eg as part of Lucene's new & developing
benchmark framework), it should help to keep flaming in check.

Accurate & well-communicated benchmark results, both within each
variant/port of Lucene and across them, are crucial for all of us
making iterative progress on performance.

> At some point I would like to port the benchmarking work that has  
> been contributed to Lucene of late, but I'm waiting for that code  
> base to settle down first.  After that happens, I'll probably make a  
> pass and publish some results.  Better to spend the time preparing  
> one definitive presentation than to have to rebut every idiot's  
> latest wildly inaccurate shootout.

Excellent!

> >> ... However, Lucene has been tuned by an army of developers over the
> >> years, while KS is young yet and still had many opportunities for
> >> optimization.  Current svn trunk for KS is about twice as fast for
> >> indexing as when I did those benchmarking tests.
> >
> > Wow, that's an awesome speedup!
> 
> The big bottleneck for KS has been its Tokenizer class.  There's only  
> one such class in KS, and it's regex-based.  A few weeks ago, I  
> finally figured out how to hook it into Perl's regex engine at the C  
> level.  The regex engine is not an official part of Perl's C API, so  
> I wouldn't do this if I didn't have to, but the tokenizing loop is  
> only about 100 lines of code and the speedup is dramatic.

Tokenization is a very big part of Lucene's indexing time as well.

StandardAnalyzer is very time consuming.  When I switched to testing
with WhitespaceAnalyzer, it was quite a bit faster (I don't have exact
numbers).  Then when I created and switched to SimpleSpaceAnalyzer
(which just splits on the space character and, instead of doing new
String(...) for every token, makes offset+length slices into a char[]
array), it was even faster.
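SimpleSpaceAnalyzer was a private test class, not something in the Lucene codebase; a minimal sketch of the offset+length-slice idea (all names here are hypothetical) might look like:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split on the space character and hand out
// (offset, length) slices into a shared char[] rather than allocating
// a new String per token.
public class SpaceSlicer {
    public static final class Slice {
        public final int offset, length;
        Slice(int offset, int length) { this.offset = offset; this.length = length; }
    }

    /** Returns slices into 'text'; no per-token String is created. */
    public static List<Slice> tokenize(char[] text, int len) {
        List<Slice> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= len; i++) {
            if (i == len || text[i] == ' ') {
                if (i > start) out.add(new Slice(start, i - start));
                start = i + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        char[] buf = "hello lucene world".toCharArray();
        for (Slice s : tokenize(buf, buf.length))
            System.out.println(new String(buf, s.offset, s.length)); // String only for display
    }
}
```

A consumer would hash or compare tokens directly against the char[] using the slice bounds, which is where the per-token allocation savings come from.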

This is why the "your mileage will vary" caveat is extremely important.
For most users of Lucene, I'd expect that 1) retrieving the doc from
whatever its source is, and 2) tokenizing, take a substantial amount
of time.  So the gains I'm seeing in my benchmarks won't usually be
seen by normal applications unless these applications have already
optimized their doc retrieval/tokenization.

And now that indexing each document is so fast, segment merging has
become a BIG part (66% in my "large index" test in LUCENE-856) of
indexing.  Marvin do you have any sense of what the equivalent cost is
in KS (I think for KS you "add" a previous segment not that
differently from how you "add" a document)?
 
> I've also squeezed out another 30-40% by changing the implementation  
> in ways which have gradually winnowed down the number of malloc()  
> calls.  Some of the techniques may be applicable to Lucene; I'll get  
> around to firing up JIRA issues describing them someday.

This generally was my approach in LUCENE-843 (minimize "new
Object()").  I re-use Posting objects, the hash for Posting objects,
byte buffers, etc.  I share large int[] blocks and char[] blocks
across Postings and re-use them.  Etc.
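As a hedged illustration of that recycling pattern (illustrative classes, not the patch's actual ones), a free list hands back pooled Posting objects instead of letting GC churn through per-document garbage:

```java
import java.util.ArrayDeque;

// Sketch of the "minimize new Object()" approach: recycle Posting
// objects via a free list rather than allocating fresh ones per document.
public class PostingPool {
    static final class Posting {
        char[] term;
        int freq;
        void reset() { term = null; freq = 0; }
    }

    private final ArrayDeque<Posting> free = new ArrayDeque<>();
    private int allocations = 0;  // how many times we actually hit "new"

    Posting get() {
        Posting p = free.poll();
        if (p == null) { p = new Posting(); allocations++; }
        return p;
    }

    void recycle(Posting p) { p.reset(); free.push(p); }

    int allocations() { return allocations; }
}
```

The same idea extends to the shared int[] and char[] blocks mentioned above: hand out fixed-size blocks from a pool and return them when a flush empties the buffers.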

The one thing that still baffles me is: I can't get a persistent
Posting hash to be any faster.  I still reset the Posting hash with
every document, but I had variants in my iterations that kept the
Postings hash between documents (just flushing the int[]'s
periodically).  I had expected that leaving Posting instances in the
hash, esp. for frequent terms, would be a win, but so far I haven't
seen that empirically.

> > So KS is faster than Lucene today?
> 
> I haven't tested recent versions of Lucene.  I believe that the  
> current svn trunk for KS is faster for indexing than Lucene 1.9.1.   
> But... A) I don't have an official release out with the current  
> Tokenizer code, B) I have no immediate plans to prepare further  
> published benchmarks, and C) it's not really important, because so  
> long as the numbers are close you'd be nuts to choose one engine or  
> the other based on that criteria rather than, say, what language your  
> development team speaks.  Kin

Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-04 Thread Marvin Humphrey


On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote:


(: Ironically, the numbers for Lucene on that page are a little
better than they should be because of a sneaky bug.  I would have
made updating the results a priority if they'd gone the other  
way.  :)


Hrm.  It would be nice to have hard comparison of the Lucene, KS (and
Ferret and others?).


Doing honest, rigorous benchmarking is exacting and labor-intensive.   
Publishing results tends to ignite flame wars I don't have time for.


The main point that I wanted to make with that page was that KS was a  
lot faster than Plucene, and that it was in Lucene's ballpark.   
Having made that point, I've moved on.  The benchmarking code is  
still very useful for internal development and I use it frequently.


At some point I would like to port the benchmarking work that has  
been contributed to Lucene of late, but I'm waiting for that code  
base to settle down first.  After that happens, I'll probably make a  
pass and publish some results.  Better to spend the time preparing  
one definitive presentation than to have to rebut every idiot's  
latest wildly inaccurate shootout.



... However, Lucene has been tuned by an army of developers over the
years, while KS is young yet and still had many opportunities for
optimization.  Current svn trunk for KS is about twice as fast for
indexing as when I did those benchmarking tests.


Wow, that's an awesome speedup!


The big bottleneck for KS has been its Tokenizer class.  There's only  
one such class in KS, and it's regex-based.  A few weeks ago, I  
finally figured out how to hook it into Perl's regex engine at the C  
level.  The regex engine is not an official part of Perl's C API, so  
I wouldn't do this if I didn't have to, but the tokenizing loop is  
only about 100 lines of code and the speedup is dramatic.


I've also squeezed out another 30-40% by changing the implementation  
in ways which have gradually winnowed down the number of malloc()  
calls.  Some of the techniques may be applicable to Lucene; I'll get  
around to firing up JIRA issues describing them someday.



So KS is faster than Lucene today?


I haven't tested recent versions of Lucene.  I believe that the  
current svn trunk for KS is faster for indexing than Lucene 1.9.1.   
But... A) I don't have an official release out with the current  
Tokenizer code, B) I have no immediate plans to prepare further  
published benchmarks, and C) it's not really important, because so  
long as the numbers are close you'd be nuts to choose one engine or  
the other based on that criteria rather than, say, what language your  
development team speaks.  KinoSearch scales to multiple machines, too.


Looking to the future, I wouldn't be surprised if Lucene edged ahead  
and stayed slightly ahead speed-wise, because I'm prepared to make  
some sacrifices for the sake of keeping KinoSearch's core API simple  
and the code base as small as possible.  I'd rather maintain a  
single, elegant, useful, flexible, plenty fast regex-based Tokenizer  
than the slew of Tokenizers Lucene offers, for instance.  It might be  
at a slight disadvantage going mano a mano against Lucene's  
WhiteSpaceTokenizer, but that's fine.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-04 Thread Michael McCandless
"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
> 
> On Apr 3, 2007, at 9:52 AM, Michael McCandless wrote:
> 
> > "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> >> Wow, very nice results Mike!
> >
> > Thanks :)  I'm just praying I don't have some sneaky bug making
> > the results far better than they really are!!
> 
> That's possible, but I'm confident that the model you're using is  
> capable of the gains you're seeing.  When I benched KinoSearch a year  
> ago against Lucene, KS was getting close, but was still a little  
> behind... 

OK glad to hear that :)  I *think* I don't have such bugs.

> (: Ironically, the numbers for Lucene on that page are a little  
> better than they should be because of a sneaky bug.  I would have  
> made updating the results a priority if they'd gone the other way.  :)

Hrm.  It would be nice to have hard comparison of the Lucene, KS (and
Ferret and others?).
 
> ... However, Lucene has been tuned by an army of developers over the  
> years, while KS is young yet and still had many opportunities for  
> optimization.  Current svn trunk for KS is about twice as fast for  
> indexing as when I did those benchmarking tests.

Wow, that's an awesome speedup!  So KS is faster than Lucene today?

> I look forward to studying your patch in detail at some point to see  
> what you've done differently.  It sounds like you only familiarized  
> yourself with the high-level details of how KS has been working,  
> yes?  Hopefully, you misunderstood and came up with something better. ;)

Exactly!  I very carefully didn't look closely at how KS does
indexing.  I did read your posts on this list and did read the Wiki
page and I think a few other pages describing KS's merge model but
stopped there.  We can compare our approaches in detail at some point
and then cross-fertilize :)

Mike




Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-04 Thread Marvin Humphrey


On Apr 3, 2007, at 9:52 AM, Michael McCandless wrote:


"Yonik Seeley" <[EMAIL PROTECTED]> wrote:

Wow, very nice results Mike!


Thanks :)  I'm just praying I don't have some sneaky bug making
the results far better than they really are!!


That's possible, but I'm confident that the model you're using is  
capable of the gains you're seeing.  When I benched KinoSearch a year  
ago against Lucene, KS was getting close, but was still a little  
behind... 


(: Ironically, the numbers for Lucene on that page are a little  
better than they should be because of a sneaky bug.  I would have  
made updating the results a priority if they'd gone the other way.  :)


... However, Lucene has been tuned by an army of developers over the  
years, while KS is young yet and still had many opportunities for  
optimization.  Current svn trunk for KS is about twice as fast for  
indexing as when I did those benchmarking tests.


I look forward to studying your patch in detail at some point to see  
what you've done differently.  It sounds like you only familiarized  
yourself with the high-level details of how KS has been working,  
yes?  Hopefully, you misunderstood and came up with something better. ;)


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless

"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> Wow, very nice results Mike!

Thanks :)  I'm just praying I don't have some sneaky bug making
the results far better than they really are!!  And still plenty
to do...

Mike




Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless

"Ning Li" <[EMAIL PROTECTED]> wrote:
> On 4/3/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
> 
> >  * With term vectors and/or stored fields, the new patch has
> >substantially better RAM efficiency.
> 
> Impressive numbers! The new patch improves RAM efficiency quite a bit
> even with no term vectors nor stored fields, because of the periodic
> in-RAM merges of posting lists & term infos etc. The frequency of the
> in-RAM merges is controlled by flushedMergeFactor, which measures in
> doc count, right? How sensitive is performance to the value of
> flushedMergeFactor?

Right, the in-RAM merges seem to help *a lot* because you get great
compression of the terms dictionary, and also some compression of the
freq postings since the docIDs are delta encoded.  Also, you waste
less end-of-buffer space (buffers are fixed sizes) when you merge
into a large segment.
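To see why delta-encoding the docIDs compresses well: gaps between ascending docIDs are small, and small values take fewer bytes under Lucene-style variable-length ints. A self-contained sketch (illustrative, not Lucene's actual postings writer):

```java
import java.io.ByteArrayOutputStream;

public class DeltaVInt {
    /** Write v as a VInt: 7 payload bits per byte, high bit = "more bytes follow". */
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    /** Encode ascending docIDs as gaps from the previous docID. */
    static byte[] encode(int[] docIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int id : docIds) {
            writeVInt(out, id - prev);
            prev = id;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Deltas are 1000000, 3, 4, 5: the first takes 3 bytes, the rest 1 byte each.
        int[] ids = { 1000000, 1000003, 1000007, 1000012 };
        System.out.println(encode(ids).length + " bytes"); // prints "6 bytes"
    }
}
```

Absolute 4-byte ints for the same list would take 16 bytes, so dense posting lists shrink substantially.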

The in-RAM merges are triggered by number of bytes used vs RAM buffer
size.  Each doc is indexed to its own RAM segment, then once these
level 0 segments use > 1/Nth of the RAM buffer size, I merge into
level 1.  Then once level 1 segments are using > 1/Mth of the RAM
buffer size, I merge into level 2.  I don't do any merges beyond that.
Right now N = 14 and M = 7 but I haven't really tuned them yet ...

Once RAM is full, all of those segments are merged into a single
on-disk segment.  Once enough on-disk segments accumulate they are
periodically merged (based on flushedMergeFactor) as well.  Finally
when it's time to commit a real segment I merge all RAM segments and
flushed segments into a real Lucene segment.

I haven't done much testing to find sweet spot for these merge
settings just yet.  Still plenty to do!
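A rough sketch of that tiered trigger logic (names and bookkeeping are illustrative; the real patch tracks RAM usage differently):

```java
import java.util.ArrayList;
import java.util.List;

// Tiered in-RAM merge policy as described above: each doc becomes a
// level-0 segment; once a level's total bytes exceed its fraction of
// the RAM buffer, all segments at that level merge into one at the
// next level. N=14 and M=7 as in the description; no merges past level 2.
public class TieredRamMergePolicy {
    static final int MAX_LEVEL = 2;
    static final double[] TRIGGER = { 1.0 / 14, 1.0 / 7 }; // level 0->1, level 1->2

    final long ramBufferBytes;
    final List<List<Long>> levels = new ArrayList<>(); // segment sizes per level

    TieredRamMergePolicy(long ramBufferBytes) {
        this.ramBufferBytes = ramBufferBytes;
        for (int i = 0; i <= MAX_LEVEL; i++) levels.add(new ArrayList<>());
    }

    /** Index one document as its own level-0 segment, then merge upward. */
    void addDocument(long docBytes) {
        levels.get(0).add(docBytes);
        for (int level = 0; level < MAX_LEVEL; level++) {
            long used = levels.get(level).stream().mapToLong(Long::longValue).sum();
            if (used > ramBufferBytes * TRIGGER[level]) {
                levels.get(level + 1).add(used); // merge all into one bigger segment
                levels.get(level).clear();
            }
        }
    }

    long totalBytes() {
        return levels.stream().flatMap(List::stream).mapToLong(Long::longValue).sum();
    }
}
```

When totalBytes() approaches ramBufferBytes, everything would be merged and flushed to a single on-disk segment, matching the flush step described above.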

Mike




[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486385
 ] 

Michael McCandless commented on LUCENE-843:
---


>> The actual HEAP RAM usage is quite a bit more
>> stable with the patch, especially with term vectors
>> & stored fields enabled. I think this is because the
>> patch creates far less garbage for GC to periodically
>> reclaim. I think this also means you could push your
>> RAM buffer size even higher to get better performance.
>
> For KinoSearch, the sweet spot seems to be a buffer of around 16 MB
> when benchmarking with the Reuters corpus on my G4 laptop. Larger
> than that and things actually slow down, unless the buffer is large
> enough that it never needs flushing. My hypothesis is that RAM
> fragmentation is slowing down malloc/free. I'll be interested as to
> whether you see the same effect.

Interesting.  OK I will run the benchmark across increasing RAM sizes
to see where the sweet spot seems to be!





Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Yonik Seeley

Wow, very nice results Mike!

-Yonik




Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Ning Li

On 4/3/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:


 * With term vectors and/or stored fields, the new patch has
   substantially better RAM efficiency.


Impressive numbers! The new patch improves RAM efficiency quite a bit
even with no term vectors nor stored fields, because of the periodic
in-RAM merges of posting lists & term infos etc. The frequency of the
in-RAM merges is controlled by flushedMergeFactor, which measures in
doc count, right? How sensitive is performance to the value of
flushedMergeFactor?

Cheers,
Ning




[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486373
 ] 

Marvin Humphrey commented on LUCENE-843:


> The actual HEAP RAM usage is quite a bit more 
> stable with the  patch, especially with term vectors 
> & stored fields enabled. I think this is because the 
> patch creates far less garbage for GC to periodically 
> reclaim. I think this also means you could push your 
> RAM buffer size even higher to get better performance. 

For KinoSearch, the sweet spot seems to be a buffer of around 16 MB when 
benchmarking with the Reuters corpus on my G4 laptop. Larger than that and 
things actually slow down, unless the buffer is large enough that it never 
needs flushing. My hypothesis is that RAM fragmentation is slowing down 
malloc/free.  I'll be interested as to whether you see the same effect.




[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486339
 ] 

Michael McCandless commented on LUCENE-843:
---

A few notes from these results:

  * A real Lucene app won't see these gains because frequently the
    retrieval of docs from the content source, and the tokenization,
    take substantial amounts of time.  For this test I've
    intentionally minimized the cost of those steps: I'm 1) pulling
    one line at a time from a big text file, and 2) using my
    simplistic SimpleSpaceAnalyzer which just breaks tokens at the
    space character.

  * Best speedup is ~4.3X faster, for tiny docs (~550 bytes) with term
vectors and stored fields enabled and using autoCommit=false.

  * Least speedup is still ~1.6X faster, for large docs (~5,500
    bytes) with autoCommit=true.

  * The autoCommit=false cases are a little unfair to the new patch
because with the new patch, you get a single-segment (optimized)
index in the end, but with existing Lucene trunk, you don't.

  * With term vectors and/or stored fields, autoCommit=false is quite
a bit faster with the patch, because we never pay the price to
merge them since they are written once.

  * With term vectors and/or stored fields, the new patch has
substantially better RAM efficiency.

  * The patch is especially faster and has better RAM efficiency with
smaller documents.

  * The actual HEAP RAM usage is quite a bit more stable with the
patch, especially with term vectors & stored fields enabled.  I
think this is because the patch creates far less garbage for GC to
periodically reclaim.  I think this also means you could push your
RAM buffer size even higher to get better performance.





[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486335
 ] 

Michael McCandless commented on LUCENE-843:
---


Last is the results for small docs (100 tokens = ~550 bytes plain text each):

  2000000 DOCS @ ~550 bytes plain text
  RAM = 32 MB
  NUM THREADS = 1
  MERGE FACTOR = 10


No term vectors nor stored fields

  AUTOCOMMIT = true (commit whenever RAM is full)

old
  2000000 docs in 886.7 secs
  index size = 438M

new
  2000000 docs in 230.5 secs
  index size = 435M

Total Docs/sec:             old  2255.6; new  8676.4 [284.7% faster]
Docs/MB @ flush:            old   128.0; new  4194.6 [3176.2% more]
Avg RAM used (MB) @ flush:  old   107.3; new    37.7 [64.9% less]


  AUTOCOMMIT = false (commit only once at the end)

old
  2000000 docs in 888.7 secs
  index size = 438M

new
  2000000 docs in 239.6 secs
  index size = 432M

Total Docs/sec:             old  2250.5; new  8348.7 [271.0% faster]
Docs/MB @ flush:            old   128.0; new  4146.8 [3138.9% more]
Avg RAM used (MB) @ flush:  old   108.1; new    38.9 [64.0% less]



With term vectors (positions + offsets) and 2 small stored fields

  AUTOCOMMIT = true (commit whenever RAM is full)

old
  2,000,000 docs in 1480.1 secs
  index size = 2.1G

new
  2,000,000 docs in 462.0 secs
  index size = 2.1G

Total Docs/sec:             old  1351.2; new  4329.3 [  220.4% faster]
Docs/MB @ flush:            old    93.1; new  4194.6 [ 4405.7% more]
Avg RAM used (MB) @ flush:  old   296.4; new    38.3 [   87.1% less]


  AUTOCOMMIT = false (commit only once at the end)

old
  2,000,000 docs in 1489.4 secs
  index size = 2.1G

new
  2,000,000 docs in 347.9 secs
  index size = 2.1G

Total Docs/sec:             old  1342.8; new  5749.4 [  328.2% faster]
Docs/MB @ flush:            old    93.1; new  4146.8 [ 4354.5% more]
Avg RAM used (MB) @ flush:  old   297.1; new    38.6 [   87.0% less]








[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486334
 ] 

Michael McCandless commented on LUCENE-843:
---

Here are the results for "normal" sized docs (1K tokens = ~5,500 bytes plain 
text each):

  200,000 DOCS @ ~5,500 bytes plain text
  RAM = 32 MB
  NUM THREADS = 1
  MERGE FACTOR = 10


No term vectors nor stored fields

  AUTOCOMMIT = true (commit whenever RAM is full)

old
  200,000 docs in 397.6 secs
  index size = 415M

new
  200,000 docs in 167.5 secs
  index size = 411M

Total Docs/sec:             old   503.1; new  1194.1 [  137.3% faster]
Docs/MB @ flush:            old    81.6; new   406.2 [  397.6% more]
Avg RAM used (MB) @ flush:  old    87.3; new    35.2 [   59.7% less]


  AUTOCOMMIT = false (commit only once at the end)

old
  200,000 docs in 394.6 secs
  index size = 415M

new
  200,000 docs in 168.4 secs
  index size = 408M

Total Docs/sec:             old   506.9; new  1187.7 [  134.3% faster]
Docs/MB @ flush:            old    81.6; new   432.2 [  429.4% more]
Avg RAM used (MB) @ flush:  old   126.6; new    36.9 [   70.8% less]



With term vectors (positions + offsets) and 2 small stored fields

  AUTOCOMMIT = true (commit whenever RAM is full)

old
  200,000 docs in 754.2 secs
  index size = 1.7G

new
  200,000 docs in 304.9 secs
  index size = 1.7G

Total Docs/sec:             old   265.2; new   656.0 [  147.4% faster]
Docs/MB @ flush:            old    46.7; new   406.2 [  769.6% more]
Avg RAM used (MB) @ flush:  old    92.9; new    35.2 [   62.1% less]


  AUTOCOMMIT = false (commit only once at the end)

old
  200,000 docs in 743.9 secs
  index size = 1.7G

new
  200,000 docs in 244.3 secs
  index size = 1.7G

Total Docs/sec:             old   268.9; new   818.7 [  204.5% faster]
Docs/MB @ flush:            old    46.7; new   432.2 [  825.2% more]
Avg RAM used (MB) @ flush:  old    93.0; new    36.6 [   60.6% less]









[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486332
 ] 

Michael McCandless commented on LUCENE-843:
---

A couple more details on the testing: I run java -server to get all
optimizations in the JVM, and the IO system is a local OS X RAID 0 of
4 SATA drives.

Using the above tool I ran an initial set of benchmarks comparing old
(= Lucene trunk) vs new (= this patch), varying document size (~550
bytes to ~5,500 bytes to ~55,000 bytes of plain text from Europarl
"en").

For each document size I run 4 combinations of whether term vectors
and stored fields are on or off and whether autoCommit is true or
false.  I measure net docs/sec (= total # docs indexed divided by
total time taken), RAM efficiency (= avg # docs flushed with each
flush divided by RAM buffer size), and avg HEAP RAM usage before each
flush.
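As a sanity check on how the derived figures in these tables relate to the raw ones, they can be recomputed directly; in the sketch below the 2,000,000 total doc count is an assumption inferred from the reported docs/sec times elapsed seconds, and the class name is made up:

```java
// Recompute net docs/sec and the "% faster" figure for one run (the
// ~550-byte, autoCommit=true case). The 2,000,000 doc count is an
// assumption inferred from docs/sec * secs; not part of the benchmark tool.
public class BenchMath {
    public static void main(String[] args) {
        double docs = 2_000_000;          // assumed total docs indexed
        double oldSecs = 886.7, newSecs = 230.5;
        double oldRate = docs / oldSecs;  // net docs/sec, old (trunk)
        double newRate = docs / newSecs;  // net docs/sec, new (patch)
        double pctFaster = 100.0 * (newRate / oldRate - 1.0);
        System.out.printf("old %.1f docs/sec; new %.1f docs/sec; %.1f%% faster%n",
                oldRate, newRate, pctFaster);
    }
}
```

Note that the "% faster" figure depends only on the ratio of the two elapsed times, so it is independent of the assumed doc count.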

Here are the results for the 10K tokens (= ~55,000 bytes plain text)
per document:

  20,000 DOCS @ ~55,000 bytes plain text
  RAM = 32 MB
  NUM THREADS = 1
  MERGE FACTOR = 10


No term vectors nor stored fields

  AUTOCOMMIT = true (commit whenever RAM is full)

old
  20,000 docs in 200.3 secs
  index size = 358M

new
  20,000 docs in 126.0 secs
  index size = 356M

Total Docs/sec:             old    99.8; new   158.7 [   59.0% faster]
Docs/MB @ flush:            old    24.2; new    49.1 [  102.5% more]
Avg RAM used (MB) @ flush:  old    74.5; new    36.2 [   51.4% less]


  AUTOCOMMIT = false (commit only once at the end)

old
  20,000 docs in 202.7 secs
  index size = 358M

new
  20,000 docs in 120.0 secs
  index size = 354M

Total Docs/sec:             old    98.7; new   166.7 [   69.0% faster]
Docs/MB @ flush:            old    24.2; new    48.9 [  101.7% more]
Avg RAM used (MB) @ flush:  old    74.3; new    37.0 [   50.2% less]



With term vectors (positions + offsets) and 2 small stored fields

  AUTOCOMMIT = true (commit whenever RAM is full)

old
  20,000 docs in 374.7 secs
  index size = 1.4G

new
  20,000 docs in 236.1 secs
  index size = 1.4G

Total Docs/sec:             old    53.4; new    84.7 [   58.7% faster]
Docs/MB @ flush:            old    10.2; new    49.1 [  382.8% more]
Avg RAM used (MB) @ flush:  old   129.3; new    36.6 [   71.7% less]


  AUTOCOMMIT = false (commit only once at the end)

old
  20,000 docs in 385.7 secs
  index size = 1.4G

new
  20,000 docs in 182.8 secs
  index size = 1.4G

Total Docs/sec:             old    51.9; new   109.4 [  111.0% faster]
Docs/MB @ flush:            old    10.2; new    48.9 [  380.9% more]
Avg RAM used (MB) @ flush:  old    76.0; new    37.3 [   50.9% less]





[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486292
 ] 

Michael McCandless commented on LUCENE-843:
---

Some details on how I measure RAM usage: both the baseline (current
lucene trunk) and my patch have two general classes of RAM usage.

The first class, "document processing RAM", is RAM used while
processing a single doc. This RAM is re-used for each document (in the
trunk, it's GC'd and new RAM is allocated; in my patch, I explicitly
re-use these objects) and how large it gets is driven by how big each
document is.

The second class, "indexed documents RAM", is the RAM used up by
previously indexed documents.  This RAM grows with each added
document and how large it gets is driven by the number and size of
docs indexed since the last flush.

So when I say the writer is allowed to use 32 MB of RAM, I'm only
measuring the "indexed documents RAM".  With trunk I do this by
calling ramSizeInBytes(), and with my patch I do the analogous thing
by measuring how many RAM buffers are held up storing previously
indexed documents.

I then define "RAM efficiency" (docs/MB) as how many docs we can hold
in "indexed documents RAM" per MB RAM, at the point that we flush to
disk.  I think this is an important metric because it drives how large
your initial (level 0) segments are.  The larger these segments are
then generally the less merging you need to do, for a given # docs in
the index.

I also measure overall RAM used in the JVM (using
MemoryMXBean.getHeapMemoryUsage().getUsed()) just prior to each flush
except the last, to also capture the "document processing RAM", object
overhead, etc.
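The heap measurement itself takes only a few lines; here is a minimal sketch of just that measurement (class and method names are invented; this is not the benchmark harness):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Minimal sketch of sampling total used heap, as described above:
// the benchmark samples this just prior to each flush except the last.
public class HeapSample {
    static long usedHeapBytes() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        return bean.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();
        byte[] ballast = new byte[8 << 20]; // simulate 8 MB of buffered state
        ballast[0] = 1;                     // keep the allocation live
        long after = usedHeapBytes();
        System.out.println("used before = " + before + " bytes, after = " + after);
    }
}
```

Because a GC can run at any point, a single sample is noisy; averaging across all flushes, as done here, smooths that out.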






[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486293
 ] 

Michael McCandless commented on LUCENE-843:
---

To do the benchmarking I created a simple standalone tool
(demo/IndexLineFiles, in the last patch) that indexes one line at a
time from a large previously created file, optionally using multiple
threads.  I do it this way to minimize IO cost of pulling the document
source because I want to measure just indexing time as much as possible.

Each line is read and a doc is created with field "contents" that is
not stored, is tokenized, and optionally has term vectors with
position+offsets.  I also optionally add two small only-stored fields
("path" and "modified").  I think these are fairly trivial documents
compared to typical usage of Lucene.

For the corpus, I took Europarl's "en" content, stripped tags, and
processed it into 3 files: one with 100 tokens per line (= ~550 bytes),
one with 1,000 tokens per line (= ~5,500 bytes), and one with 10,000
tokens per line (= ~55,000 bytes of plain text).

All settings (mergeFactor, compound file, etc.) are left at defaults.
I don't optimize the index in the end.  I'm using my new
SimpleSpaceAnalyzer (it just splits tokens on the space character and
exposes token text as a slice into a char[] array instead of a new
String(...)) to minimize the cost of tokenization.
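The slice idea can be sketched as follows; the names here are hypothetical and this is not the actual SimpleSpaceAnalyzer or Token API, just an illustration of returning (offset, length) pairs instead of allocating a String per token:

```java
// Hypothetical sketch of space-splitting that yields (offset, length)
// slices into a shared char[] rather than a new String per token.
public class SliceTokens {
    // returns {offset, length} of the next token at/after 'from', or null
    static int[] next(char[] buf, int from) {
        int start = from;
        while (start < buf.length && buf[start] == ' ') start++;
        if (start == buf.length) return null;
        int end = start;
        while (end < buf.length && buf[end] != ' ') end++;
        return new int[] { start, end - start };
    }

    public static void main(String[] args) {
        char[] buf = "the quick  brown".toCharArray();
        int pos = 0;
        int[] tok;
        while ((tok = next(buf, pos)) != null) {
            // materialize a String only to print; indexing code would
            // consume the slice directly and allocate nothing per token
            System.out.println(new String(buf, tok[0], tok[1]));
            pos = tok[0] + tok[1];
        }
    }
}
```

The point is that the tokenizer's inner loop allocates no objects at all, which matters when the goal is to measure indexing cost rather than tokenization and GC cost.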

I ran the tests with Java 1.5 on a Mac Pro quad (2 Intel CPUs, each
dual core) OS X box with 2 GB RAM.  I give java 1 GB heap (-Xmx1024m).






[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-04-02 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-843:
--

Attachment: LUCENE-843.take4.patch

Another rev of the patch.  All tests pass except disk full tests.  The
code is still rather "dirty" and not well commented.

I think I'm close to finishing optimizing and now I will focus on
error handling (eg disk full), adding some deeper unit tests, more
testing on corner cases like massive docs or docs with massive terms,
etc., flushing pending norms to disk, cleaning up / commenting the
code and various other smaller items.

Here are the changes in this rev:

  * A proposed backwards compatible change to the Token API to also
allow the term text to be delivered as a slice (offset & length)
into a char[] array instead of String.  With an analyzer/tokenizer
that takes advantage of this, this was a decent performance gain
in my local testing.  I've created a SimpleSpaceAnalyzer that only
splits words at the space character to test this.

  * Added more asserts (run java -ea to enable asserts).  The asserts
are quite useful and now often catch a bug I've introduced before
the unit tests do.

  * Changed to custom int[] block buffering for postings to store
freq, prox's and offsets.  With this buffering we no longer have
to double the size of int[] arrays while adding positions, nor do
we have to copy ints whenever we need more space for these
arrays.  Instead I allocate larger slices out of the shared int[]
arrays.  This reduces memory and improves performance.

  * Changed to custom char[] block buffering for postings to store
term text.  This also reduces memory and improves performance.

  * Changed to single file for RAM & flushed partial segments (was 3
separate files before)

  * Changed how I merge flushed partial segments to match what's
described in LUCENE-854

  * Reduced memory usage when indexing large docs (25 MB plain text
each).  I'm still consuming more RAM in this case than the
baseline (trunk) so I'm still working on this one ...

  * Fixed a slow memory leak when building large (20+ GB) indices
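The int[] block-buffering idea above can be pictured with a toy pool; the sizes and names below are invented for the sketch and Lucene's real code differs:

```java
// Toy version of slice-based buffering: a posting list grows as chained
// slices inside one shared int[], so ints already written are never
// copied the way array-doubling would copy them.
public class IntSlicePool {
    static final int[] LEVEL_SIZE = {4, 8, 16, 32}; // growing slice sizes
    final int[] pool = new int[4096];
    int free = 0;  // next unused offset in the shared pool

    // hand out a slice of the given level; returns its start offset
    int allocSlice(int level) {
        int size = LEVEL_SIZE[Math.min(level, LEVEL_SIZE.length - 1)];
        int start = free;
        free += size;
        return start;
    }

    public static void main(String[] args) {
        IntSlicePool p = new IntSlicePool();
        int first = p.allocSlice(0);                      // small initial slice
        p.pool[first] = 7; p.pool[first + 1] = 9; p.pool[first + 2] = 11;
        int second = p.allocSlice(1);                     // bigger continuation slice
        p.pool[first + 3] = second;                       // last int links to next slice
        System.out.println("first@" + first + " links to second@" + second);
    }
}
```

With many terms, each term's posting list is such a chain; the shared array amortizes allocation and keeps the GC out of the inner indexing loop.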







[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-28 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-843:
--

Attachment: LUCENE-843.take3.patch


Another rev of the patch:

  * Got thread concurrency working: removed "synchronized" from the
entire call to MultiDocWriter.addDocument and instead synchronize
two quick steps (init/finish) of addDocument, leaving the real
work (processDocument) unsynchronized.

  * Fixed bug that was failing to delete temp files from index

  * Reduced memory usage of Posting by inlining positions, start
offset, end offset into a single int array.

  * Enabled IndexLineFiles.java (tool I use for local benchmarking) to
run multiple threads

  * Other small optimizations

BTW, one of the nice side effects of this patch is that it cleans up
the mergeSegments method of IndexWriter by separating the "flush" of
added docs & deletions (which is no longer a merge) from the "true"
mergeSegments, whose purpose is then to merge disk segments.
Previously mergeSegments was getting rather confusing with the
different cases/combinations of added docs or not, deleted docs or
not, any merges or not.







[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-25 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-843:
--

Attachment: LUCENE-843.take2.patch

New rev of the patch:

  * Fixed at least one data corruption case

  * Added more asserts (run with "java -ea" so asserts run)

  * Some more small optimizations

  * Updated to current trunk so patch applies cleanly







Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Michael McCandless

"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:
> I've only been loosely following this...
> 
> Do you think it is possible to separate the stored/term vector  
> handling into a separate patch against the current trunk?  This seems  
> like a quick win and I know it has been speculated about before.

This is definitely possible, but I'd rather just do this as part of
LUCENE-843 (I don't think I'm too far from iterating it to a good
point).

Mike




Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Grant Ingersoll

I've only been loosely following this...

Do you think it is possible to separate the stored/term vector  
handling into a separate patch against the current trunk?  This seems  
like a quick win and I know it has been speculated about before.


On Mar 23, 2007, at 12:00 PM, Michael McCandless wrote:



"Yonik Seeley" <[EMAIL PROTECTED]> wrote:

On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:

Merging is costly because you read all data in and then write all
data out, so you want to minimize, for each byte of data in the
index, how many times it will be "serviced" (read in, written out) as
part of a merge.


Avoiding the re-writing of stored fields might be nice:
http://www.nabble.com/Re%3A--jira--Commented%3A-%28LUCENE-565%29-Supporting-deleteDocuments-in-IndexWriter-%28Code-and-Performance-Results-Provided%29-p6177280.html


That's exactly the approach I'm taking in LUCENE-843: stored fields
and term vectors are immediately written to disk.  Only frq, prx and
tis use up memory.  This greatly extends how many docs you can buffer
before having to flush (assuming your docs have stored fields and
term vectors).

When memory is full, I either flush a segment to disk (when writer is
in autoCommit=true mode), else I flush the data to tmp files which are
finally merged into a segment when the writer is closed.  This merging
is less costly because the bytes in/out are just frq, prx and tis, so
this improves performance of autoCommit=false mode vs autoCommit=true
mode.

But, this is only for the segment created from buffered docs (ie the
segment created by a "flush").  Subsequent merges still must copy
bytes in/out and in LUCENE-843 I haven't changed anything about how
segments are merged.

Mike




--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ







Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Ning Li

On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:

Yes the code re-computes the level of a given segment from the current
values of maxBufferedDocs & mergeFactor.  But when these values have
changed (or, segments were flushed by RAM not by maxBufferedDocs) then
the way it computes level no longer results in the logarithmic policy
that it's trying to implement, I think.


The algorithm gradually re-adjusts toward the latest maxBufferedDocs &
mergeFactor - see case 3 of the "Overview of merge policy" comment in
the code.

With the modification that uses RAM or file size as the segment
size, the algorithm would work by maxBufferedSize & mergeFactor.
Let's say maxBufferedDocs or maxBufferedSize is the base size.
LUCENE-845 complains that the merge behaviour for segments <= base
size is in some cases not logarithmic.  It's a tradeoff: we always
keep small segments in check.  The algorithm reflects the tradeoff
made for segments <= base size.



Exactly, when the logarithmic policy works "correctly" (you don't
change mergeFactor/maxBufferedDocs and your docs are all uniform in
size), it does achieve merging roughly equal-sized (in bytes)
segments (yes, those two numbers are roughly equal).  Though now I
have to go ponder KS's Fibonacci series approach!


It doesn't have to be a Fibonacci series.  Logarithmic would work well
too.  The main difference is KS can choose any segments to merge, not
just adjacent segments.  Thus it may find better candidates for merge.



Basically, this would keep the same logarithmic approach now, but
derive levels somehow from the net size in bytes.


Exactly! Levels defined in size in bytes.

Cheers,
Ning




Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Ning Li

On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:

Right I'm calling a newly created segment (ie flushed from RAM) level
0 and then a level 1 segment is created when you merge 10 level 0
segments, level 2 is created when you merge 10 level 1 segments, etc.


That is not how the current merge policy works. There are two
orthogonal aspects to this problem:
 1 the measurement of a segment size
 2 the merge behaviour given a measurement

In the current code:
 1 The measurement of a segment size is the document count in the
segment, not the actual RAM or file size. Levels are defined according
to this measurement.
 2 The behaviour is given by two invariants that hold when mergeFactor (M)
does not change and segment doc count does not reach maxMergeDocs. With
B = maxBufferedDocs and f(n) defined as ceil(log_M(ceil(n/B))):
1) If i (left*) and i+1 (right*) are two consecutive segments of
doc counts x and y, then f(x) >= f(y).
2) The number of committed segments on the same level (f(n)) <= M.
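
As a sketch, f(n) can be computed with integer arithmetic (B = 1000 and
M = 10 here are illustrative defaults, not Lucene's shipped values):

```python
def level(doc_count, B=1000, M=10):
    """f(n) = ceil(log_M(ceil(n/B))), the level of a segment of n docs."""
    units = -(-doc_count // B)   # ceil(n / B) without floating point
    lvl = 0
    cap = 1                      # M ** lvl
    while cap < units:           # smallest lvl with M**lvl >= units
        cap *= M
        lvl += 1
    return lvl

# Segments of up to B docs are level 0, up to M*B docs level 1, and so on;
# invariant 2 says at most M committed segments may share a level.
```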

Document counts are an approximation of segment sizes, and thus of
merge cost. Sometimes, however, they do not correctly
reflect segment sizes. So it is probably a good idea to use RAM or
file size as measurement of a segment size as Mike suggested. But the
behaviour does not have to change: the two invariants can still be
guaranteed, with the definition of sizes and levels modified according
to the new measurement.

Ning




Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Michael McCandless

"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > Merging is costly because you read all data in then write all data
> > out, so, you want to minimize, for each byte of data in the index,
> > how many times it will be "serviced" (read in, written out) as
> > part of a merge.
> 
> Avoiding the re-writing of stored fields might be nice:
> http://www.nabble.com/Re%3A--jira--Commented%3A-%28LUCENE-565%29-Supporting-deleteDocuments-in-IndexWriter-%28Code-and-Performance-Results-Provided%29-p6177280.html

That's exactly the approach I'm taking in LUCENE-843: stored fields and term
vectors are immediately written to disk.  Only frq, prx and tis use up
memory.  This greatly extends how many docs you can buffer before
having to flush (assuming your docs have stored fields and term
vectors).

When memory is full, I either flush a segment to disk (when the writer is
in autoCommit=true mode) or flush the data to tmp files which are
finally merged into a segment when the writer is closed.  This merging
is less costly because the bytes in/out are just frq, prx and tis, so
this improves performance of autoCommit=false mode vs autoCommit=true
mode.

But, this is only for the segment created from buffered docs (ie the
segment created by a "flush").  Subsequent merges still must copy
bytes in/out and in LUCENE-843 I haven't changed anything about how
segments are merged.

Mike




Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Yonik Seeley

On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:

Merging is costly because you read all data in then write all data
out, so, you want to minimize, for each byte of data in the index,
how many times it will be "serviced" (read in, written out) as
part of a merge.


Avoiding the re-writing of stored fields might be nice:
http://www.nabble.com/Re%3A--jira--Commented%3A-%28LUCENE-565%29-Supporting-deleteDocuments-in-IndexWriter-%28Code-and-Performance-Results-Provided%29-p6177280.html

-Yonik




Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Michael McCandless

"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> > We say that
> > developers should not rely on docIDs but people still seem to rely on
> > their monotonic ordering (even though they change).
> 
> Yes.  If the benefits of removing that guarantee are large enough, we
> could consider dumping it... but not in Lucene 2.x IMO.  We should see
> how people are using this, and if there are acceptable workarounds.
> Solr's current dependence on document ordering could be removed.

I don't think we should change this "feature" of the current merge
policy.  I think it may break too many people in hard-to-figure-out
ways.  It's also not clear that we'd have much to gain even if we were
allowed to break this.  (The current "logarithmic" merge policy I
think works quite well.)

And as Steve said, once merge policy is decoupled from the writer then
apps have the freedom to pick or build a different merge policy
(leaving current one as the default).

However, I do think we should fix the bug with the current merge
policy when you "flush by RAM" (LUCENE-845).  Since the recommended
way (I think?) to maximize indexing performance is to "flush by RAM
usage" I expect people will start hitting this bug fairly often.

Mike




Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Yonik Seeley

On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:

We say that
developers should not rely on docIDs but people still seem to rely on
their monotonic ordering (even though they change).


Yes.  If the benefits of removing that guarantee are large enough, we
could consider dumping it... but not in Lucene 2.x IMO.  We should see
how people are using this, and if there are acceptable workarounds.
Solr's current dependence on document ordering could be removed.

-Yonik




RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Steven Parkes
Given history, perhaps the default merge policy should conserve this,
but with pluggable merge policies, I don't see that all merge policies
need to.

-Original Message-
From: Michael McCandless [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 23, 2007 1:53 AM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses
RAM to buffer added documents


"Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> : > Actually is #2 a hard requirement?
> :
> : A lot of Lucene users depend on having document number correspond to
> : age, I think.  ISTR Hatcher at least recommending techniques that
> : require it.
> 
> "Correspond to age" may be misleading as it implies that the actual
> docid has meaning ... it's more that the relative order of addition is
> preserved regardless of deletions/merging
> 
> A trivial example of using this is getting the N newest documents
> matching a search using a HitCollector; it's just a bounded queue that
> only remembers the last N things you put in it.
> 
> A more complicated example is duplicate unique field detection:
> iterating over a TermDoc and knowing that the doc with the highest
> docId is the last one added, so the earlier ones can be
> ignored/deleted.  (As I recall, Solr takes advantage of this.)

Got it, so we need to preserve this invariant.  So this puts the
general restriction on the Lucene merge policy that only adjacent
segments (ie, when ordered by segment number) can be merged.

Mike




Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Erik Hatcher


On Mar 22, 2007, at 8:13 PM, Marvin Humphrey wrote:

On Mar 22, 2007, at 3:18 PM, Michael McCandless wrote:


Actually is #2 a hard requirement?


A lot of Lucene users depend on having document number correspond  
to age, I think.  ISTR Hatcher at least recommending techniques  
that require it.


I may have recommended it only as "if you can guarantee you index in
age order then you can ...", but given FunctionQuery's ability to
rank based on age given a date field per document, it's not needed.
Guaranteeing any kind of order of document insertion (given updates
that delete and re-add) is not really possible in most cases anyway.


Erik





Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-23 Thread Michael McCandless

"Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> : > Actually is #2 a hard requirement?
> :
> : A lot of Lucene users depend on having document number correspond to
> : age, I think.  ISTR Hatcher at least recommending techniques that
> : require it.
> 
> "Correspond to age" may be misleading as it implies that the actual
> docid has meaning ... it's more that the relative order of addition is
> preserved regardless of deletions/merging
> 
> A trivial example of using this is getting the N newest documents
> matching a search using a HitCollector; it's just a bounded queue that
> only remembers the last N things you put in it.
> 
> A more complicated example is duplicate unique field detection:
> iterating over a TermDoc and knowing that the doc with the highest
> docId is the last one added, so the earlier ones can be
> ignored/deleted.  (As I recall, Solr takes advantage of this.)

Got it, so we need to preserve this invariant.  So this puts the
general restriction on the Lucene merge policy that only adjacent
segments (ie, when ordered by segment number) can be merged.

Mike




Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Chris Hostetter
: > Actually is #2 a hard requirement?
:
: A lot of Lucene users depend on having document number correspond to
: age, I think.  ISTR Hatcher at least recommending techniques that
: require it.

"Correspond to age" may be misleading as it implies that the actual
docid has meaning ... it's more that the relative order of addition is
preserved regardless of deletions/merging

A trivial example of using this is getting the N newest documents matching
a search using a HitCollector; it's just a bounded queue that only
remembers the last N things you put in it.

A more complicated example is duplicate unique field detection: iterating
over a TermDoc and knowing that the doc with the highest docId is the
last one added, so the earlier ones can be ignored/deleted.  (As I recall,
Solr takes advantage of this.)



-Hoss





RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Steven Parkes
> But when these values have
> changed (or, segments were flushed by RAM not by maxBufferedDocs) then
> the way it computes level no longer results in the logarithmic policy
> that it's trying to implement, I think.

That's right. Parts of the implementation assume that the segments are
logarithmically related and if they aren't, the result is kind of
undefined, in the sense that it does whatever it does. It doesn't try to
restore the logarithmic invariant.

In some cases, where we know we're combining multiple indexes, we try to
restore the invariant, but it's a little weakly defined what that
invariant is, so the results aren't canonically defined.

> Though now I have to go ponder
> KS's Fibonacci series approach!

Yeah, me too. I roughed out a version using the factored merge policy
... not that that means I completely understand the implications. I
still have to teach IndexWriter how to deal with non-contiguous merge
specs, which I didn't originally anticipate.

But a very interesting exercise.

> I think for high turnover content the way old segments will be mostly
> deleted but at some point the merge will fully cascade (basically
> an optimize) and they will be cleaned up?

Well, as it stands, the only way to get rid of deleted docs at level n
is either to merge them into a level n+1 segment or to optimize. As n gets
big, the chances of getting to a level n+1 segment go down. So those
level n segments live a long time, a manual optimize notwithstanding;
hence my question: do we see big segments that are mostly sparse? And a
merge there is one of those kinda bad things: if you have two level n
segments, one pretty sparse and the next fairly full, you have to merge
left to right, copying all of the full segment just to recover the
deleted space in the sparse segment.

> Hmmm.  What cases lead to this?

Well, it's kinda artificial, but combining indexes will do this. They're
combined in series, so the little segments of one index are followed by
the big segments of the next. So if you need to do some merging, and
given the current ordering constraint, you can merge a bunch of little
segments with one big segment to create only a slightly bigger segment.
Similarly with lots of deleted documents ...

But I don't know if these cases are observed in the wild ...

> The only thing I wonder
> is whether reading from say 20 segments at once is slower (more
> than 2X slower) than reading from 10?  Or it could actually be
> faster (since OS/disk heads can better schedule IO access to a
> larger number of spots)?  I don't know.

Me neither. I can imagine that order of magnitude differences (binary or
otherwise) would show changes but that small changes like +/- one
wouldn't, so that we might want to give ourselves that freedom.

Should be pretty easy to experiment with this stuff with the factored
merge, which I'll post Real Soon Now.

I haven't tried any performance testing. I can't remember: I need to
look and see if the performance framework measures indexing as well as
querying.




RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Michael McCandless
Steven Parkes wrote:

>> Right I'm calling a newly created segment (ie flushed from RAM)
>> level 0 and then a level 1 segment is created when you merge 10
>> level 0 segments, level 2 is created when you merge 10 level 1 segments,
>> etc.
>
> This isn't the way the current code treats things. I'm not saying
> it's the only way to look at things, just making sure I/we are clear
> on the current state. The current code effectively computes level
> as a function of size of the segment, independent of how the
> segment was created and of the size of its neighbors.

Yes the code re-computes the level of a given segment from the current
values of maxBufferedDocs & mergeFactor.  But when these values have
changed (or, segments were flushed by RAM not by maxBufferedDocs) then
the way it computes level no longer results in the logarithmic policy
that it's trying to implement, I think.

>> Backing up a bit ... I think the lowest amortized cost merge policy
>> should always try to merge roughly equal sized segments
>
> Well, logarithmic does do this, it just clumps logarithmically,
> with, in this case, the boundary condition (size of the lowest
> level) being defined by maxBufferedDocs. Do you consider 9 and
> 10001 to be roughly equal sized?

Exactly, when logarithmic works "correctly" (you don't change
mergeFactor/maxBufferedDocs and your docs are all uniform in size), it
does achieve this "merge roughly equal sizes in bytes" behaviour (yes,
those two numbers are roughly equal).  Though now I have to go ponder
KS's Fibonacci series approach!

> Of course, without deletes and not worrying about edge cases caused
> by flushes/closes, everything will be of the form
> maxBufferedDocs*mergeFactor^n. I'm curious to know if they stay that
> way: in particular what happens to the sizes of big old
> segments. Can everybody afford an optimize?

I think for high turnover content the way old segments will be mostly
deleted but at some point the merge will fully cascade (basically
an optimize) and they will be cleaned up?

> I think it'd be interesting to play with merge policies and see what
> we can do. One thing I'm interested in is tweaking the behavior on
> the big end of the scale. Exponentials get big fast and I think in
> some cases, treating the big segments differently from the smaller
> segments makes sense, for example, self-merging an old segment as
> deletes build up, or merging small subsequences of segments for a
> similar reason. This is mostly what motivated me to work on factoring
> out the merge policy from the rest of IndexWriter. I've got
> something reasonably stable (though by no means perfect) that'll
> I'll post, hopefully later today.

Agreed!  There are lots of interesting things to explore here.

> It's really interesting to think about this. I'm not an algorithms
> expert, but I still find it interesting. It's appealing to think of
> a "best" policy but is it possible to have a best policy without
> making assumptions on the operations sequence (adds, deletes)?

I think there is no global "best"; it does depend on adds/deletes and
also what's important to you (search performance vs index
performance).

> Also, "best" has to trade off multiple constraints. You mentioned
> copying bytes as rarely as possible. But the number of segments has
> both hard and soft impacts on lookups, too, right? The hard limit is
> the number of file descriptors: every reader is opening every
> segment: that's the limiting case for file descriptors (mergers only
> need to open mergeFactor+1). The soft impact is doing the join when
> walking posting lists from each segment.

You're right, it's searching that has to open all the segments and
that's a hard limit once you start bumping up to max # descriptors and
soft, in slowing down your search.

>> Merging is costly because you read all data in then write all data
>> out, so, you want to minimize, for each byte of data in the index,
>> how many times it will be "serviced" (read in, written out) as
>> part of a merge.

> I've thought about this, as I looked through the existing merge
> code. I think you can get cases where you'll get a perfect storm of
> bad merges, copying a large segment multiple times to merge it with
> small segments.  Ick. But an edge condition? I guess the question is
> what is the expected value of the cost, i.e., the absolute cost but
> mitigated by the unlikelihood of it occurring.

Hmmm.  What cases lead to this?

>> I think instead of calling segments "level N" we should just measure
>> their net sizes and merge on that basis?

> I think this is a great candidate, given that you factor in the number
> of segments and sometimes merge just to keep the total number down.

Basically, this would keep the same logarithmic approach now, but
derive levels somehow from the net size in bytes.

>> Yes, at no time should you merge more than mergeFactor segments at
>> once.
>
> Why is this? File descriptors? Like I said before, readers are more
> an issue

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Marvin Humphrey


On Mar 22, 2007, at 3:18 PM, Michael McCandless wrote:


Actually is #2 a hard requirement?


A lot of Lucene users depend on having document number correspond to  
age, I think.  ISTR Hatcher at least recommending techniques that  
require it.



Do the loose ports of Lucene
(KinoSearch, Ferret, etc.) also follow this restriction?


KS: Nope.  So you can't use those tricks.


I think instead of calling segments "level N" we should just measure
their net sizes and merge on that basis?


Here's the fibonacci-series-based algorithm used in KinoSearch, taken  
from MultiReader:


sub segreaders_to_merge {
    my ( $self, $all ) = @_;
    return unless @{ $self->{sub_readers} };
    return @{ $self->{sub_readers} } if $all;

    # sort by ascending size in docs
    my @sorted_sub_readers
        = sort { $a->num_docs <=> $b->num_docs } @{ $self->{sub_readers} };

    # find sparsely populated segments
    my $total_docs = 0;
    my $threshold  = -1;
    for my $i ( 0 .. $#sorted_sub_readers ) {
        $total_docs += $sorted_sub_readers[$i]->num_docs;
        if ( $total_docs < fibonacci( $i + 5 ) ) {
            $threshold = $i;
        }
    }

    # if any of the segments are sparse, return their readers
    if ( $threshold > -1 ) {
        return @sorted_sub_readers[ 0 .. $threshold ];
    }
    else {
        return;
    }
}

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Steven Parkes
> Right I'm calling a newly created segment (ie flushed from RAM) level
> 0 and then a level 1 segment is created when you merge 10 level 0
> segments, level 2 is created when you merge 10 level 1 segments, etc.

This isn't the way the current code treats things. I'm not saying it's
the only way to look at things, just making sure I/we are clear on the
current state. The current code effectively computes level as a function
of size of the segment, independent of how the segment was created and
of the size of its neighbors.

Hmmm ... that's not completely true. Let's say it's mostly true. Okay,
kinda true.

> Backing up a bit ... I think the lowest amortized cost merge policy
> should always try to merge roughly equal sized segments

Well, logarithmic does do this, it just clumps logarithmically, with, in
this case, the boundary condition (size of the lowest level) being
defined by maxBufferedDocs. Do you consider 9 and 10001 to be
roughly equal sized?

Of course, without deletes and not worrying about edge cases caused by
flushes/closes, everything will be of the form
maxBufferedDocs*mergeFactor^n. I'm curious to know if they stay that
way: in particular what happens to the sizes of big old segments. Can
everybody afford an optimize? 

I think it'd be interesting to play with merge policies and see what we
can do. One thing I'm interested in is tweaking the behavior on the big
end of the scale. Exponentials get big fast and I think in some cases,
treating the big segments differently from the smaller segments makes
sense, for example, self-merging an old segment as deletes build up, or
merging small subsequences of segments for a similar reason. This is
mostly what motivated me to work on factoring out the merge policy from
the rest of IndexWriter. I've got something reasonably stable (though by
no means perfect) that'll I'll post, hopefully later today.

Anyway, back to some of your comments:

It's really interesting to think about this. I'm not an algorithms
expert, but I still find it interesting. It's appealing to think of a
"best" policy but is it possible to have a best policy without making
assumptions on the operations sequence (adds, deletes)?

Also, "best" has to trade off multiple constraints. You mentioned
copying bytes as rarely as possible. But the number of segments has both
hard and soft impacts on lookups, too, right? The hard limit is the
number of file descriptors: every reader is opening every segment:
that's the limiting case for file descriptors (mergers only need to open
mergeFactor+1). The soft impact is doing the join when walking posting
lists from each segment.

My impression is that the logarithmic policy (in size), was chosen as a
nice closed form that does an "okay" job in all these areas.

> Merging is costly because you read all data in then write all data
> out, so, you want to minimize, for each byte of data in the index,
> how many times it will be "serviced" (read in, written out) as
> part of a merge.

I've thought about this, as I looked through the existing merge code. I
think you can get cases where you'll get a perfect storm of bad merges,
copying a large segment multiple times to merge it with small segments.
Ick. But an edge condition? I guess the question is what is the expected
value of the cost, i.e., the absolute cost but mitigated by the
unlikelihood of it occurring.

Interesting.

> I think instead of calling segments "level N" we should just measure
> their net sizes and merge on that basis?

I think this is a great candidate, given that you factor in the number
of segments and sometimes merge just to keep the total number down.

> Yes, at no time should you merge more than mergeFactor segments at
once.

Why is this? File descriptors? Like I said before, readers are more an
issue there, right? I don't see that there's a huge difference between
mergeFactor and mergeFactor+1 if for other reasons, one might think
mergeFactor+1 was better than mergeFactor for a particular merge.

Hmmm ... I guess I think of mergeFactor as more setting the size
boundaries of segments than I do the n-way-ness of a merge, though the
word itself lends itself to the latter interpretation.




RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Michael McCandless

On Thu, 22 Mar 2007 13:34:39 -0700, "Steven Parkes" <[EMAIL PROTECTED]> said:
> > EG if you set maxBufferedDocs to say 1 but then it turns out based
> > on RAM usage you actually flush every 300 docs then the merge policy
> > will incorrectly merge a level 1 segment (with 3000 docs) in with the
> > level 0 segments (with 300 docs).  This is because the merge policy
> > looks at the current value of maxBufferedDocs to compute the levels
> > so a 3000 doc segment and a 300 doc segment all look like "level 0".
> 
> Are you calling the 3K segment a level 1 segment because it was created
> from level 0 segments? Because based on size, it is a level 0 segment,
> right? With the current merge policy, you can merge level n segments and
> get a level n segment. Deletes will do this, plus other things like
> changing merge policy parameters and combining indexes.

Right I'm calling a newly created segment (ie flushed from RAM) level
0 and then a level 1 segment is created when you merge 10 level 0
segments, level 2 is created when you merge 10 level 1 segments, etc.

> Because based on size, it is a level 0 segment, right?

Well, I don't think it's right to call something level 0 just because
it's under the current maxBufferedDocs.

Backing up a bit ... I think the lowest amortized cost merge policy
should always try to merge roughly equal sized segments subject to
restrictions of 1) max # segments that can be merged at once
(mergeFactor) presumably due to file descriptor limits and/or
substantial degradation in merge performance as mergeFactor increases
eg due to lack of concurrency in IO system (??) and 2) that you must
merge adjacent segments I think (so docIDs, though changing, remain
"monotonic").

Actually is #2 a hard requirement?  Do the loose ports of Lucene
(KinoSearch, Ferret, etc.) also follow this restriction?  We say that
developers should not rely on docIDs but people still seem to rely on
their monotonic ordering (even though they change).

Merging is costly because you read all data in then write all data
out, so, you want to minimize, for each byte of data in the index,
how many times it will be "serviced" (read in, written out) as
part of a merge.  I think if N equal sized segments are always merged
then the # copies for each byte of data will be minimized?
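
That claim can be checked with a toy simulation (the 300-doc flush size
and 10-way merges are illustrative assumptions, not values from the
patch): merging equal-sized adjacent segments M at a time means each doc
is re-copied only about log_M(numFlushes) times.

```python
def total_docs_copied(num_flushes, flush_size=300, M=10):
    """Simulate M-way merging of equal-sized adjacent segments and return
    the total number of docs serviced (read in / written out) by merges."""
    segments = [flush_size] * num_flushes
    copied = 0
    while True:
        merged = False
        i = 0
        while i + M <= len(segments):
            window = segments[i:i + M]
            if len(set(window)) == 1:          # M equal-sized neighbors
                copied += sum(window)          # every doc re-copied once
                segments[i:i + M] = [sum(window)]
                merged = True
            else:
                i += 1
        if not merged:
            return copied

# 100 flushes of 300 docs cascade twice (levels 0 -> 1 -> 2), so each doc
# is copied twice: 2 * 100 * 300 = 60000 docs serviced in total.
```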

So, the fact that due to this bug we will merge a 3000 doc segment
with 9 300 doc segments is not efficient (amortized) because those
3000 docs in the first segment will, net-net, have to get merged again
far sooner than they would have had they been merged with 9 3000 doc
segments.

I think instead of calling segments "level N" we should just measure
their net sizes and merge on that basis?

> Leads to the question of what is "over merging". The current merge
> policy doesn't consider the size of the result, it simply counts the
> number of segments at a level. Do you think this qualifies as over
> merging? It still should only merge when there are mergeFactor segments
> at a level, so you shouldn't be doing too terribly much merging.  And
> you have to be careful not to do less, right? By bounding the number of
> segments at each level, you ensure that your file descriptor usage only
> grows logarithmically.

Yes, at no time should you merge more than mergeFactor segments at once.

Mike




RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Steven Parkes
> EG if you set maxBufferedDocs to say 1 but then it turns out based
> on RAM usage you actually flush every 300 docs then the merge policy
> will incorrectly merge a level 1 segment (with 3000 docs) in with the
> level 0 segments (with 300 docs).  This is because the merge policy
> looks at the current value of maxBufferedDocs to compute the levels
> so a 3000 doc segment and a 300 doc segment all look like "level 0".

Are you calling the 3K segment a level 1 segment because it was created
from level 0 segments? Because based on size, it is a level 0 segment,
right? With the current merge policy, you can merge level n segments and
get a level n segment. Deletes will do this, plus other things like
changing merge policy parameters and combining indexes.

Leads to the question of what is "over merging". The current merge
policy doesn't consider the size of the result, it simply counts the
number of segments at a level. Do you think this qualifies as over
merging? It still should only merge when there are mergeFactor segments
at a level, so you shouldn't be doing too terribly much merging.  And
you have to be careful not to do less, right? By bounding the number of
segments at each level, you ensure that your file descriptor usage only
grows logarithmically.




RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Michael McCandless

"Steven Parkes" <[EMAIL PROTECTED]> wrote:
>   * Merge policy has problems when you "flush by RAM" (this is true
> even before my patch).  Not sure how to fix yet.
> 
> Do you mean where one would be trying to use RAM usage to determine when
> to do a flush? 

Right, if you have your indexer measure RAM usage
(writer.ramSizeInBytes()) after each added doc and flush whenever that
crosses X MB then, depending on the value of maxBufferedDocs, you may
over-merge.

EG if you set maxBufferedDocs to say 1 but then it turns out based
on RAM usage you actually flush every 300 docs then the merge policy
will incorrectly merge a level 1 segment (with 3000 docs) in with the
level 0 segments (with 300 docs).  This is because the merge policy
looks at the current value of maxBufferedDocs to compute the levels
so a 3000 doc segment and a 300 doc segment all look like "level 0".
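
A quick sketch makes the bug concrete (the configured maxBufferedDocs
value of 10000 is an assumed illustration; the level formula is the
ceil-log one discussed earlier in the thread):

```python
def level(doc_count, max_buffered_docs, merge_factor=10):
    """The policy derives a segment's level from the *configured*
    maxBufferedDocs, not from how large flushed segments actually are."""
    units = -(-doc_count // max_buffered_docs)  # ceil division
    lvl = 0
    cap = 1
    while cap < units:
        cap *= merge_factor
        lvl += 1
    return lvl

# Flushing by RAM actually produces ~300-doc segments, and merging ten of
# them produces a 3000-doc segment; with a large configured maxBufferedDocs
# both still compute to level 0, so they get merged together again.
assert level(300, 10000) == 0
assert level(3000, 10000) == 0
```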

(I'm doing this to try to do apples to apples performance comparison
of current Lucene trunk vs my patch and "flushing by RAM" seems like
the fair comparison but then I have to carefully pick maxBufferedDocs
to make sure I don't hit this).

I will open a separate issue for this.

Mike




RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Steven Parkes
>   * Merge policy has problems when you "flush by RAM" (this is true
> even before my patch).  Not sure how to fix yet.

Do you mean where one would be trying to use RAM usage to determine when
to do a flush? 

-Original Message-
From: Michael McCandless (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 22, 2007 10:09 AM
To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM
to buffer added documents


 [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-843:
--

Attachment: LUCENE-843.patch

I'm attaching a patch with my current state.  NOTE: this is very rough
and very much a work in progress and nowhere near ready to commit!  I
wanted to get it out there sooner rather than later to get feedback,
maybe entice some daring early adopters, iterate, etc.

It passes all unit tests except the disk-full tests.

There are some big issues yet to resolve:

  * Merge policy has problems when you "flush by RAM" (this is true
even before my patch).  Not sure how to fix yet.

  * Thread safety and thread concurrency aren't there yet.

  * Norms are not flushed (just use up RAM until you close the
writer).

  * Many other things on my TODO list :)




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-843:
--

Attachment: LUCENE-843.patch

I'm attaching a patch with my current state.  NOTE: this is very rough
and very much a work in progress and nowhere near ready to commit!  I
wanted to get it out there sooner rather than later to get feedback,
maybe entice some daring early adopters, iterate, etc.

It passes all unit tests except the disk-full tests.

There are some big issues yet to resolve:

  * Merge policy has problems when you "flush by RAM" (this is true
even before my patch).  Not sure how to fix yet.

  * Thread safety and thread concurrency aren't there yet.

  * Norms are not flushed (just use up RAM until you close the
writer).

  * Many other things on my TODO list :)




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-03-22 Thread Michael McCandless (JIRA)
improve how IndexWriter uses RAM to buffer added documents
--

 Key: LUCENE-843
 URL: https://issues.apache.org/jira/browse/LUCENE-843
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.2
Reporter: Michael McCandless
 Assigned To: Michael McCandless
Priority: Minor


I'm working on a new class (MultiDocumentWriter) that writes more than
one document directly into a single Lucene segment, more efficiently
than the current approach.

This only affects the creation of an initial segment from added
documents.  I haven't changed anything after that, eg how segments are
merged.

The basic ideas are:

  * Write stored fields and term vectors directly to disk (don't
use up RAM for these).

  * Gather posting lists & term infos in RAM, but periodically do
in-RAM merges.  Once RAM is full, flush buffers to disk (and
merge them later when it's time to make a real segment).

  * Recycle objects/buffers to reduce time/stress in GC.

  * Other various optimizations.
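The posting-buffering idea in the bullets above can be sketched as follows. `BufferingWriter` is a toy illustration of the buffer-in-RAM, flush-runs-to-disk, merge-later scheme, not the actual MultiDocumentWriter code:

```python
from collections import defaultdict
import heapq

class BufferingWriter:
    """Toy sketch: postings accumulate in RAM and are periodically flushed
    as sorted runs; runs are only merged when a real segment is made."""
    def __init__(self, ram_doc_limit=2):
        self.postings = defaultdict(list)  # term -> [doc ids]
        self.runs = []                     # flushed runs: sorted (term, ids)
        self.doc_id = 0
        self.ram_doc_limit = ram_doc_limit

    def add_document(self, terms):
        for t in set(terms):
            self.postings[t].append(self.doc_id)
        self.doc_id += 1
        if self.doc_id % self.ram_doc_limit == 0:
            self._flush_run()  # RAM "full": spill current postings

    def _flush_run(self):
        if self.postings:
            self.runs.append(sorted(self.postings.items()))
            self.postings = defaultdict(list)

    def make_segment(self):
        """Merge all flushed runs (plus the RAM buffer) into one term dict,
        the way runs would be merged into a real segment."""
        self._flush_run()
        merged = defaultdict(list)
        for term, doc_ids in heapq.merge(*self.runs):
            merged[term].extend(doc_ids)
        return dict(merged)
```

Because each run is already sorted by term, the final merge is a cheap streaming k-way merge rather than a re-sort of everything.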

Some of these changes are similar to how KinoSearch builds a segment.
But, I haven't made any changes to Lucene's file format nor added
requirements for a global fields schema.

So far the only externally visible change is a new method
"setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
deprecated) so that it flushes according to RAM usage and not a fixed
number of documents added.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]