[ https://issues.apache.org/jira/browse/LUCENE-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-1301:
---------------------------------------

    Attachment: LUCENE-1301.patch

New rev of the patch attached. I've fixed all nocommits, and all tests pass. I believe this version is ready to commit! I'll wait a few more days before committing...

I ran some indexing throughput tests, indexing Wikipedia docs from a line file using StandardAnalyzer. Each result is the best of 4 runs. Here's the alg:

{code}
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
docs.file=/Volumes/External/lucene/wiki.txt
doc.stored = true
doc.term.vector = true
doc.add.log.step=2000
directory=FSDirectory
autocommit=false
compound=false
work.dir=/lucene/work
ram.flush.mb=64

{ "Rounds"
  ResetSystemErase
  { "BuildIndex"
    - CreateIndex
    { "AddDocs" AddDoc > : 200000
    - CloseIndex
  }
  NewRound
} : 4

RepSumByPrefRound BuildIndex
{code}

This gives the following results with term vectors & stored fields:

{code}
      Operation  round runCnt recsPerRun     rec/s elapsedSec    avgUsedMem   avgTotalMem
patch BuildIndex     1      1     200000     900.4     222.12   410,938,688 1,029,046,272
trunk BuildIndex     1      1     200000     969.0     206.39   400,372,256 1,029,046,272
2.3   BuildIndex     2      1     200002     905.4     220.89   391,630,016 1,029,046,272
{code}

And without term vectors & stored fields:

{code}
      Operation  round runCnt recsPerRun     rec/s elapsedSec    avgUsedMem   avgTotalMem
patch BuildIndex     3      1     200000   1,297.5     154.15   399,966,592 1,029,046,272
trunk BuildIndex     1      1     200000   1,372.5     145.72   390,581,376 1,029,046,272
2.3   BuildIndex     1      1     200002   1,308.5     152.85   389,224,640 1,029,046,272
{code}

So, the bad news is that the refactoring has made things a bit (~5-7%) slower than current trunk. The good news is that trunk was already 6-7% faster than 2.3, so the two nearly cancel out.

If I repeat these tests using tiny docs (~100 bytes per body) instead, indexing the first 10 million docs, the slowdown is worse (~13-15% vs trunk, ~11-13% vs 2.3)... I think that's because the additional method calls introduced by the refactoring become a bigger part of the total time. With term vectors & stored fields:

{code}
      Operation  round runCnt recsPerRun     rec/s elapsedSec    avgUsedMem   avgTotalMem
patch BuildIndex     3      1   10000000  38,320.1     260.96   313,980,832 1,029,046,272
trunk BuildIndex     2      1   10000000  45,194.1     221.27   414,987,072 1,029,046,272
2.3   BuildIndex     1      1   10000002  42,861.4     233.31   182,957,440 1,029,046,272
{code}

Without term vectors & stored fields:

{code}
      Operation  round runCnt recsPerRun     rec/s elapsedSec    avgUsedMem   avgTotalMem
patch BuildIndex     1      1   10000000  60,778.4     164.53   341,611,456 1,029,046,272
trunk BuildIndex     2      1   10000000  68,387.8     146.23   405,388,960 1,029,046,272
2.3   BuildIndex     0      1   10000002  68,052.7     146.95   330,334,912 1,029,046,272
{code}

I think these small slowdowns are worth the improvement in code clarity.
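For anyone who wants to reproduce runs like the above, here is a minimal sketch of driving an .alg file through contrib/benchmark's byTask framework. The driver class and the .alg path are placeholders of mine, and contrib/benchmark plus its dependencies are assumed to be on the classpath:

{code}
// Minimal sketch (not part of the patch): run a benchmark algorithm file
// through contrib/benchmark. The .alg path below is a placeholder.
import java.io.FileReader;

import org.apache.lucene.benchmark.byTask.Benchmark;

public class RunIndexingAlg {
  public static void main(String[] args) throws Exception {
    // Parse the algorithm file and execute it, printing the report at the end.
    Benchmark benchmark = new Benchmark(new FileReader("conf/wiki-indexing.alg"));
    benchmark.execute();
  }
}
{code}

Running contrib/benchmark's ant run-task target with the same alg file should do the equivalent from the command line.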
> Refactor DocumentsWriter
> ------------------------
>
>                 Key: LUCENE-1301
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1301
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.3, 2.3.1, 2.3.2, 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1301.patch, LUCENE-1301.patch,
> LUCENE-1301.take2.patch, LUCENE-1301.take3.patch
>
>
> I've been working on refactoring DocumentsWriter to make it more
> modular, so that adding new indexing functionality (like column-stride
> stored fields, LUCENE-1231) is just a matter of adding a plugin into
> the indexing chain.
>
> This is an initial step towards flexible indexing (but there is still
> a lot more to do!).
>
> And it's very much still a work in progress -- there are intermittent
> thread-safety issues, I need to add test cases and test/iterate on
> performance, there are many "nocommits", etc. This is a snapshot of my
> current state...
>
> The approach introduces "consumers" (abstract classes defining the
> interface) at different levels during indexing. E.g., DocConsumer
> consumes the whole document. DocFieldConsumer consumes separate
> fields, one at a time. InvertedDocConsumer consumes tokens produced
> by running each field through the analyzer. TermsHashConsumer writes
> its own bytes into in-memory posting lists stored in byte slices,
> indexed by term, etc.
>
> DocumentsWriter*.java is then much simpler: it only interacts with a
> DocConsumer and has no idea what that consumer is doing. Under that
> DocConsumer there is a whole "indexing chain" that does the real work:
>
> * NormsWriter holds norms in memory and then flushes them to _X.nrm.
> * FreqProxTermsWriter holds postings data in memory and then flushes
>   to _X.frq/prx.
> * StoredFieldsWriter flushes immediately to _X.fdx/fdt.
> * TermVectorsTermsWriter flushes immediately to _X.tvx/tvf/tvd.
>
> DocumentsWriter still manages things like flushing a segment, closing
> doc stores, buffering & applying deletes, freeing memory, aborting
> when necessary, etc.
>
> In this first step everything is package-private, and the indexing
> chain is hardwired (instantiated in DocumentsWriter) to the chain
> currently matching Lucene trunk. Over time we can open this up.
>
> There are no changes to the index file format.
>
> For the most part this is just a [large] refactoring, except for these
> two small actual changes:
>
> * Improved concurrency with mixed large/small docs: previously the
>   thread state would be tied up when docs finished indexing
>   out of order. Now it's not: instead I use a separate class to hold
>   any pending state to flush to the doc stores, and immediately free
>   up the thread state to index other docs.
> * Buffered norms in memory now remain sparse until flushed to the
>   _X.nrm file. Previously we would "fill holes" in the in-memory norms
>   as we went, which could easily use way too much memory. Really this
>   isn't a solution to the problem of sparse norms (LUCENE-830); it
>   just prevents that issue from causing a memory blowup during
>   indexing; memory use will still blow up during searching.
>
> I expect performance (indexing throughput) will be worse with this
> change. I'll profile & iterate to minimize this, but I think we can
> accept some loss. I also plan to measure the benefit of manually
> recycling RawPostingList instances from our own pool vs. letting GC
> recycle them.
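To make the consumer levels described above a bit more concrete, here is a rough, illustrative sketch of the shape these abstractions take. The class names are suffixed with "Sketch" and the method signatures are simplified stand-ins of mine, not the actual package-private abstract classes from the patch:

{code}
// Illustrative sketch only -- not the actual signatures from the patch.
import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.document.Fieldable;

/** Consumes a whole document: the top of the indexing chain. */
abstract class DocConsumerSketch {
  abstract void processDocument() throws IOException;
  /** Write any buffered state into the files of the segment being flushed. */
  abstract void flush() throws IOException;
  /** Throw away all buffered state, e.g. after an aborting exception. */
  abstract void abort();
}

/** Consumes the document's fields, one at a time. */
abstract class DocFieldConsumerSketch {
  abstract void processField(Fieldable field) throws IOException;
}

/** Consumes the tokens produced by running a field through its analyzer. */
abstract class InvertedDocConsumerSketch {
  abstract void addToken(Token token) throws IOException;
}

/** Writes its own bytes into the in-memory, byte-slice posting list for a term. */
abstract class TermsHashConsumerSketch {
  abstract void writeTermBytes(char[] termText, byte[] bytes) throws IOException;
}
{code}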
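And as a minimal sketch of the second change (buffered norms staying sparse until flush), assuming a made-up SparseNormsSketch class rather than the patch's actual NormsWriter: norms are buffered only for docs that actually had the field, and the default norm fills the holes only when the bytes destined for _X.nrm are written:

{code}
// Minimal sketch of buffering norms sparsely and filling holes only at flush
// time. Illustrative only; the patch's real NormsWriter will differ.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Map;
import java.util.TreeMap;

class SparseNormsSketch {
  private final Map<Integer,Byte> buffered = new TreeMap<Integer,Byte>();
  private final byte defaultNorm;

  SparseNormsSketch(byte defaultNorm) {
    this.defaultNorm = defaultNorm;
  }

  /** Record the norm for a document that actually had this field. */
  void setNorm(int docID, byte norm) {
    buffered.put(docID, norm);
  }

  /** Only at flush time are the "holes" (docs without the field) filled in. */
  void flush(OutputStream out, int numDocs) throws IOException {
    for (int docID = 0; docID < numDocs; docID++) {
      Byte norm = buffered.get(docID);
      out.write(norm == null ? defaultNorm : norm.byteValue());
    }
  }

  public static void main(String[] args) throws IOException {
    SparseNormsSketch norms = new SparseNormsSketch((byte) 0x7f);
    norms.setNorm(0, (byte) 0x78);
    norms.setNorm(5, (byte) 0x6b);     // docs 1-4 stay un-buffered until flush
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    norms.flush(out, 8);               // holes filled with the default norm here
    System.out.println(out.size() + " norm bytes written");
  }
}
{code}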