[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791936#action_12791936 ]

Michael McCandless commented on LUCENE-2026:
--------------------------------------------

{quote}
FWIW, autoCommit doesn't really have a place in Lucy's
one-segment-per-indexing-session model.
{quote}

Well, autoCommit just means "periodically call commit".  So, if you
decide to offer a commit() operation, then autoCommit would just wrap
that?  But I don't think autoCommit should be offered... the app
should decide.
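
Ie, something like this on the app side (just a sketch -- the
AutoCommitter class and its interval handling are made up here, not a
proposed API):

{code:java}
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.index.IndexWriter;

// Hypothetical app-side autoCommit: nothing more than commit() on a timer.
class AutoCommitter {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  AutoCommitter(final IndexWriter writer, long intervalSec) {
    scheduler.scheduleWithFixedDelay(new Runnable() {
      public void run() {
        try {
          writer.commit();  // the app picks this durability cadence
        } catch (IOException e) {
          // real code would report this back to the app
        }
      }
    }, intervalSec, intervalSec, TimeUnit.SECONDS);
  }

  void close() {
    scheduler.shutdown();
  }
}
{code}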

{quote}
Revisiting the LUCENE-1044 threads, one passage stood out:

http://www.gossamer-threads.com/lists/lucene/java-dev/54321#54321

This is why in a db system, the only file that is sync'd is the log
file - all other files can be made "in sync" from the log file - and
this file is normally striped for optimum write performance. Some
systems have special "log file drives" (some even solid state, or
battery backed ram) to aid the performance.

The fact that we have to sync all files instead of just one seems sub-optimal.
{quote}

Yes, but, that cost is not on the reopen path, so it's much less
important.  Ie, the app can freely choose how frequently it wants to
commit, completely independent from how often it needs to reopen.

{quote}
Yet Lucene is not well set up to maintain a transaction log. The very act of
adding a document to Lucene is inherently lossy even if all fields are stored,
because doc boost is not preserved.
{quote}

I don't see that those two statements are related.

One can "easily" (meaning, it's easily decoupled from core) make a
transaction log on top of Lucene -- just serialize your docs/analyzer
selection/etc. to the log & sync it periodically.
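
Eg, a bare-bones sketch of such an app-side log (DocLog and its
length-prefixed record format are hypothetical -- the serialization
itself is entirely up to the app):

{code:java}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical app-side transaction log: append each doc's serialized
// form to one file, then fsync that single file on the app's cadence.
class DocLog {
  private final FileChannel channel;

  DocLog(File path) throws IOException {
    channel = new FileOutputStream(path, true).getChannel();  // append mode
  }

  // 'bytes' is the app's own serialization of the doc, analyzer choice, etc.
  synchronized void append(byte[] bytes) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(4 + bytes.length);
    buf.putInt(bytes.length);
    buf.put(bytes);
    buf.flip();
    channel.write(buf);
  }

  // one fsync makes everything logged so far durable
  synchronized void sync() throws IOException {
    channel.force(true);
  }
}
{code}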

But, that's orthogonal to what Lucene does & doesn't preserve in its
index (and, yes, Lucene doesn't precisely preserve boosts).

{quote}
bq. Also, having the app explicitly decouple these two notions keeps the door 
open for future improvements. If we force absolutely all sharing to go through 
the filesystem then that limits the improvements we can make to NRT.

However, Lucy has much more to gain going through the file system than Lucene
does, because we don't necessarily incur JVM startup costs when launching a
new process. The Lucene approach to NRT - specialized reader hanging off of
writer - is constrained to a single process. The Lucy approach - fast index
opens enabled by mmap-friendly index formats - is not.

The two approaches aren't mutually exclusive. It will be possible to augment
Lucy with a specialized index reader within a single process. However, A)
there seems to be a lot of disagreement about just how to integrate that
reader, and B) there seem to be ways to bolt that functionality on top of the
existing classes. Under those circumstances, I think it makes more sense to
keep that feature external for now.
{quote}

Again: NRT is not a "specialized reader".  It's a normal read-only
DirectoryReader, just like you'd get from IndexReader.open, with the
only difference being that it consulted IW to find which segments to
open.  Plus, it's pooled, so that if IW already has a given segment
reader open (say because deletes were applied or merges are running),
it's reused.
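
Eg, in 2.9 API terms the whole flow is just this (sketch only -- the
refresh helper is made up, but getReader/reopen are the actual APIs):

{code:java}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

class NRTRefresh {
  // Returns a reader that sees the writer's uncommitted changes.
  static IndexReader refresh(IndexWriter writer, IndexReader current)
      throws IOException {
    if (current == null) {
      return writer.getReader();  // initial NRT open: a plain read-only reader
    }
    IndexReader newReader = current.reopen();  // consults IW for new segments
    if (newReader != current) {
      current.close();  // segment readers IW already has open are reused
    }
    return newReader;
  }
}
{code}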

We've discussed making it specialized (eg, directly searching DW's RAM
buffer, caching recently flushed segments in RAM, special
incremental-copy-on-write data structures for deleted docs, etc.), but
so far these changes don't seem worthwhile.

The current approach to NRT is simple... I haven't yet seen
performance gains strong enough to justify moving to "specialized
readers".

Yes, Lucene's approach must be in the same JVM.  But we get important
gains from this -- reusing a single reader (the pool), carrying over
merged deletions directly in RAM (and eventually field cache & norms
too -- LUCENE-1785).

Instead, Lucy (by design) must do all sharing & access all index data
through the filesystem (a decision that, I think, could be dangerous),
which will necessarily increase your reopen time.  Maybe in practice
that cost is small, though... the OS write cache should keep everything
fresh... but you still must serialize.

{quote}
bq. Alternatively, you could keep the notion "flush" (an unsafe commit) alive? 
You write the segments file, but make no effort to ensure its durability (and 
also preserve the last "true" commit). Then a normal IR.reopen suffices...

That sounds promising. The semantics would differ from those of Lucene's
flush(), which doesn't make changes visible.

We could implement this by somehow marking a "committed" snapshot and a
"flushed" snapshot differently, either by adding an "fsync" property to the
snapshot file that would be false after a flush() but true after a commit(),
or by encoding the property within the snapshot filename. The file purger
would have to ensure that all index files referenced by either the last
committed snapshot or the last flushed snapshot were off limits. A rollback()
would zap all changes since the last commit().

Such a scheme allows the top-level app to avoid the costs of fsync while
maintaining its own transaction log - perhaps with the optimizations
suggested above (separate disk, SSD, etc.).
{quote}

In fact, this would make Lucy's approach to NRT nearly identical to
Lucene NRT.

The only difference is, instead of getting the current uncommitted
segments_N via RAM, Lucy uses the filesystem.  And, of course,
Lucy doesn't pool readers.  So this is really a Lucy-ification of
Lucene's approach to NRT.

So it has the same benefits as Lucene's NRT, ie, it lets Lucy apps
decouple decisions about safety (commit) and freshness (reopen
turnaround time).
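
Ie, in Lucene terms, something like this (a sketch; the 30-second
commit interval is arbitrary, and the reader is assumed to have come
from writer.getReader()):

{code:java}
import java.io.IOException;
import java.util.Iterator;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

class DecoupledCadence {
  // Sketch: freshness (reopen) & safety (commit) on independent cadences.
  static void indexAll(IndexWriter writer, IndexReader reader,
                       Iterator<Document> docs) throws IOException {
    long lastCommit = System.currentTimeMillis();
    while (docs.hasNext()) {
      writer.addDocument(docs.next());

      // freshness: cheap reopen, no fsync on this path
      IndexReader r = reader.reopen();
      if (r != reader) {
        reader.close();
        reader = r;
      }

      // safety: fsync via commit(), on a much slower (arbitrary) cadence
      if (System.currentTimeMillis() - lastCommit > 30000) {
        writer.commit();
        lastCommit = System.currentTimeMillis();
      }
    }
  }
}
{code}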


> Refactoring of IndexWriter
> --------------------------
>
>                 Key: LUCENE-2026
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2026
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
> One could be called a SegmentWriter, and as the
> name says, its job would be to write one particular index segment. The
> default one, just as today, would provide methods to add documents and
> would flush when its buffer is full.
> Other SegmentWriter implementations would do things like appending or
> copying external segments [what addIndexes*() currently does].
> The second component's job would be to manage writing the segments
> file and merging/deleting segments. It would know about
> DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
> provide hooks that allow users to manage external data structures and
> keep them in sync with Lucene's data during segment merges.
> API-wise, there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
> Of course these should be lower level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy to use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
