[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

Earwin Burrfoot (JIRA) Fri, 11 Dec 2009 15:18:42 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789604#action_12789604 ]


Earwin Burrfoot commented on LUCENE-2026:
-----------------------------------------

bq. Until you need to spillover to disk because your RAM buffer is full?
No, the buffer is there only to decouple indexing from writing. It can be 
spilled to disk asynchronously, without waiting for it to fill up.
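
A minimal sketch of that decoupling, assuming nothing about Lucene's 
internals: SpillingBuffer, index() and spilled() are hypothetical names, and 
an in-memory list stands in for the disk. The indexing thread only enqueues, 
while a background thread drains the buffer continuously instead of waiting 
for it to fill up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch, not Lucene code: a buffer that decouples indexing
// from writing and is spilled asynchronously by a background thread.
class SpillingBuffer implements AutoCloseable {
    // Sentinel compared by identity; tells the spiller thread to stop.
    private static final String POISON = new String("<<stop>>");

    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>();
    // Stand-in for on-disk storage; only the spiller touches it until close().
    private final List<String> disk = new ArrayList<>();
    private final Thread spiller;

    SpillingBuffer() {
        spiller = new Thread(() -> {
            try {
                // Drain as documents arrive, not when the buffer is full.
                for (String doc = buffer.take(); doc != POISON; doc = buffer.take()) {
                    disk.add(doc); // stand-in for a real write
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "spiller");
        spiller.start();
    }

    // Called by the indexing thread; returns without waiting for any write.
    void index(String doc) {
        buffer.add(doc);
    }

    // Stops the spiller after it has written everything queued so far.
    @Override
    public void close() throws InterruptedException {
        buffer.add(POISON);
        spiller.join(); // join() makes the spiller's writes visible to the caller
    }

    // Only meaningful after close() has returned.
    List<String> spilled() {
        return disk;
    }
}
```

The unbounded queue is a simplification; a real buffer would presumably cap 
its RAM usage and apply back-pressure when the writer falls behind.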

Okay, we agree on a zillion things, except the simplicity of the current NRT 
and the approach to commit().

Good commit() behaviour consists of two parts:
1. Everything commit()ed is guaranteed to be on disk.
2. Until commit() is called, reading threads don't see new/updated records.

Now we want more speed, and are ready to sacrifice something if needed.
You decide to sacrifice new record (in)visibility. That leaves no choice but 
to hack into IW to let readers see its hot, fresh innards.

I say it's better to sacrifice the write guarantee. In the rare case that the 
process/machine crashes, you can reindex the last few minutes' worth of docs. 
Now you don't have to hack into IW and write specialized readers. Hence, 
simplicity. You have only one straightforward writer, and you have only one 
straightforward reader (which is nicely immutable and doesn't need any 
synchronization code).

In fact you don't even need to sacrifice the write guarantee. What was the 
reason for sacrificing it? The only one I can come up with is that the thread 
that does the writes and sync() is different from the thread that calls 
commit(). But commit() can return a Future. 
So the process goes as follows:
- You index docs, nobody sees them, nor deletions.
- You call commit(), the docs/deletes are written down to memory (NRT 
case)/disk (non-NRT case). Right after calling commit() every newly reopened 
Reader is guaranteed to see your docs/deletes.
- A background thread does write-to-disk+sync (NRT case)/just sync (non-NRT 
case), and completes the Future returned from commit(). At this point all data 
is guaranteed to be written and braced for a crash, RAM cache or not, OS/RAID 
controller cache or not.

For back-compat purposes we can use another name for that Future-returning 
commit(), and the current commit() will just call this new method and wait on 
the returned future.
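
A rough sketch of the Future-returning commit described above. It is not the 
IndexWriter API: FutureCommitWriter, commitAsync(), reopen() and 
durableState() are hypothetical names, immutable lists stand in for segments, 
and a plain field assignment stands in for write+sync(). It assumes Java 10+ 
for List.copyOf.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch, not the IndexWriter API.
class FutureCommitWriter implements AutoCloseable {
    private final List<String> pending = new ArrayList<>(); // indexed, not yet visible
    private volatile List<String> visible = List.of();      // what a reopened reader sees
    private volatile List<String> durable = List.of();      // stand-in for synced state
    // One thread, so background syncs complete in commit order.
    private final ExecutorService syncer = Executors.newSingleThreadExecutor();

    // Step 1: index docs; nobody sees them yet.
    synchronized void addDocument(String doc) {
        pending.add(doc);
    }

    // Steps 2 and 3: publish an immutable snapshot immediately, then let the
    // background thread make it durable and complete the returned future.
    synchronized CompletableFuture<Void> commitAsync() {
        List<String> snapshot = new ArrayList<>(visible);
        snapshot.addAll(pending);
        pending.clear();
        List<String> published = List.copyOf(snapshot);
        visible = published;
        // Stand-in for write-to-disk + sync().
        return CompletableFuture.runAsync(() -> durable = published, syncer);
    }

    // Back-compat: the old blocking commit() calls the new method and waits.
    void commit() throws ExecutionException, InterruptedException {
        commitAsync().get();
    }

    // Readers get an immutable snapshot and need no synchronization.
    List<String> reopen() {
        return visible;
    }

    List<String> durableState() {
        return durable;
    }

    @Override
    public void close() {
        syncer.shutdown();
    }
}
```

A caller that only cares about visibility can ignore the returned future, 
while one that needs the durability guarantee calls get() on it, which is all 
the blocking commit() does here.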

Okay, with that I'm probably shutting up on the topic until I can back myself 
up with code. Sadly, my current employer is happy with update lag in tens of 
seconds :)

> Refactoring of IndexWriter
> --------------------------
>
>                 Key: LUCENE-2026
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2026
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
> One could be called a SegmentWriter and as the
> name says its job would be to write one particular index segment. The
> default one just as today will provide methods to add documents and
> flushes when its buffer is full.
> Other SegmentWriter implementations would do things like e.g. appending or
> copying external segments [what addIndexes*() currently does].
> The second component's job would be to manage writing the segments
> file and merging/deleting segments. It would know about
> DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
> provide hooks that allow users to manage external data structures and
> keep them in sync with Lucene's data during segment merges.
> API wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
> Of course these should be lower level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy to use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.
