[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792625#action_12792625 ]
Marvin Humphrey commented on LUCENE-2026:
-----------------------------------------

> Well, autoCommit just means "periodically call commit". So, if you
> decide to offer a commit() operation, then autoCommit would just wrap
> that? But, I don't think autoCommit should be offered... app should
> decide.

Agreed, autoCommit had benefits under legacy Lucene, but wouldn't be
important now. If we did add some sort of "automatic commit" feature, it
would mean something else: commit every change instantly. But that's easy
to implement via a wrapper, so there's no point cluttering the primary
index writer class to support such a feature.

> Again: NRT is not a "specialized reader". It's a normal read-only
> DirectoryReader, just like you'd get from IndexReader.open, with the
> only difference being that it consulted IW to find which segments to
> open. Plus, it's pooled, so that if IW already has a given segment
> reader open (say because deletes were applied or merges are running),
> it's reused.

Well, it seems to me that those two features make it special -- particularly
the pooling of SegmentReaders. You can't take advantage of that outside the
context of IndexWriter:

> Yes, Lucene's approach must be in the same JVM. But we get important
> gains from this - reusing a single reader (the pool), carrying over
> merged deletions directly in RAM (and eventually field cache & norms
> too - LUCENE-1785).

Exactly. In my view, that's what makes that reader "special": unlike
ordinary Lucene IndexReaders, this one springs into being with its caches
already primed rather than in need of lazy loading.

But to achieve those benefits, you have to mod the index writing process.
Those modifications are not necessary under the Lucy model, because the
mere act of writing the index stores our data in the system IO cache.
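As an aside, the "easy to implement via a wrapper" point about instant
commits can be sketched in a few lines. This is a hypothetical
illustration only -- the Writer interface and AutoCommitWriter name are
stand-ins, not real Lucene or Lucy APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal writer interface, standing in for the real
// index writer's add/commit operations.
interface Writer {
    void addDocument(String doc);
    void commit();
}

// Test double that records every call, so we can see the commit pattern.
class LoggingWriter implements Writer {
    final List<String> log = new ArrayList<>();
    public void addDocument(String doc) { log.add("add:" + doc); }
    public void commit() { log.add("commit"); }
}

// "Automatic commit" as a thin wrapper: every change is committed
// instantly, with no support needed inside the primary writer class.
class AutoCommitWriter implements Writer {
    private final Writer inner;
    AutoCommitWriter(Writer inner) { this.inner = inner; }
    public void addDocument(String doc) {
        inner.addDocument(doc);
        inner.commit();  // commit immediately after each change
    }
    public void commit() { inner.commit(); }
}
```

Since the wrapper delegates everything, apps that want per-change
durability opt in, and everyone else pays nothing.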
> Instead, Lucy (by design) must do all sharing & access all index data
> through the filesystem (a decision, I think, could be dangerous),
> which will necessarily increase your reopen time.

Dangerous in what sense? Going through the file system is a tradeoff,
sure -- but it's pretty nice to design your low-latency search app free
from any concern about whether indexing and search need to be coordinated
within a single process. Furthermore, if separate processes are your
primary concurrency model, going through the file system is actually
mandatory for best performance on a multi-core box. Lucy won't always be
used with multi-threaded hosts.

I actually think going through the file system is dangerous in a different
sense: it puts pressure on the file format spec. The easy way to achieve
IPC between writers and readers will be to dump stuff into one of the JSON
files to support the killer-feature-du-jour -- such as what I'm proposing
with this "fsync" key in the snapshot file. But then we wind up with a
bunch of crap cluttering up our index metadata files. I'm determined that
Lucy will have a more coherent file format than Lucene, but with this IPC
requirement we're setting our community up to push us in the wrong
direction. If we're not careful, we could end up with a file format that's
an unmaintainable jumble.

But you're talking performance, not complexity costs, right?

> Maybe in practice that cost is small though... the OS write cache should
> keep everything fresh... but you still must serialize.

Anecdotally, at Eventful one of our indexes is 5 GB with 16 million
records and 900 MB worth of sort cache data; opening a fresh searcher and
loading all sort caches takes circa 21 ms. There's room to improve that
further -- we haven't yet implemented IndexReader.reopen() -- but it was
fast enough to achieve what we wanted.
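To make the "metadata creep" worry above concrete, here is a purely
hypothetical snapshot file after a few rounds of IPC-through-JSON. The
"fsync" key is the one proposed in this thread; every other key beyond
"entries" is an invented example of the kind of per-feature clutter that
could accumulate if we're not careful:

```json
{
  "entries": ["seg_1", "seg_2", "seg_3"],
  "fsync": true,
  "last_merge_pid": 4481,
  "reader_hints": { "warm_sort_caches": ["timestamp"] },
  "experimental_flag_du_jour": "on"
}
```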
> Refactoring of IndexWriter
> --------------------------
>
>                 Key: LUCENE-2026
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2026
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
>
> One could be called a SegmentWriter and as the name says its job would
> be to write one particular index segment. The default one, just as
> today, will provide methods to add documents and flushes when its
> buffer is full. Other SegmentWriter implementations would do things
> like e.g. appending or copying external segments [what addIndexes*()
> currently does].
>
> The second component's job would be to manage writing the segments
> file and merging/deleting segments. It would know about
> DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
> provide hooks that allow users to manage external data structures and
> keep them in sync with Lucene's data during segment merges.
>
> API-wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
>
> Of course these should be lower-level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy-to-use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.
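The two-component split proposed in the issue above might be sketched
roughly as follows. Every name here (SegmentWriter, SimpleSegmentWriter,
SegmentManager) is hypothetical -- the issue never specifies the API at
this level of detail -- but the sketch shows the division of labor: one
component writes a single segment, the other manages the set of segments:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical first component: writes one particular index segment.
interface SegmentWriter {
    void addDocument(String doc);   // buffer a document
    String flush();                 // write the segment, return its name
}

// Default implementation: buffers docs, then "writes" a segment.
class SimpleSegmentWriter implements SegmentWriter {
    private final List<String> buffer = new ArrayList<>();
    private int counter = 0;
    public void addDocument(String doc) { buffer.add(doc); }
    public String flush() {
        String name = "_" + (counter++);
        buffer.clear();  // pretend the segment was written to disk
        return name;
    }
}

// Hypothetical second component: owns the segments file. A real version
// would also consult DeletionPolicy, MergePolicy, and MergeScheduler,
// and expose hooks for keeping external data in sync during merges.
class SegmentManager {
    private final List<String> segments = new ArrayList<>();
    public void register(String segmentName) { segments.add(segmentName); }
    public List<String> segments() {
        return Collections.unmodifiableList(segments);
    }
}
```

Under this split, the current IndexWriter would become a facade that
delegates to both components, so existing callers see no change.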