[ 
https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789708#action_12789708
 ] 

Michael McCandless commented on LUCENE-2026:
--------------------------------------------

{quote}
bq. Until you need to spillover to disk because your RAM buffer is full?

No, buffer is there only to decouple indexing from writing. Can be spilt over 
asynchronously without waiting for it to be filled up.
{quote}

But this is where things start to get complex... the devil is in the
details here.  How do you carry over your deletes?  This spillover
will take time -- do you block all indexing while that's happening
(not great)?  Do you do it gradually (start spillover when half full,
but still accept indexing)?  Do you throttle things if index rate
exceeds flush rate?  How do you recover on exception?

NRT today lets the OS's write cache decide how to use RAM to speed up
writing of these small files, which keeps things a lot simpler for us.
I don't see why we should add complexity to Lucene to replicate what
the OS is doing for us (NOTE: I don't really trust the OS in the
reverse case... I do think Lucene should read into RAM the data
structures that are important).

bq. You decide to sacrifice new record (in)visibility. No choice, but to hack 
into IW to allow readers see its hot, fresh innards.

bq. Now you don't have to hack into IW and write specialized readers.

Probably we'll just have to disagree here... NRT isn't a hack ;)

IW is already hanging onto completely normal segments.  Ie, the index
has been updated with these segments, just not yet published so
outside readers can see it.  All NRT does is let a reader see this
private view.

The readers that an NRT reader exposes are normal SegmentReaders --
it's just that rather than consulting a segments_N on disk to get the
segment metadata, they pull it from IW's uncommitted in-memory
SegmentInfos instance.

Yes we've talked about the "hot innards" solution -- an IndexReader
impl that can directly search DW's ram buffer -- but that doesn't look
necessary today, because performance of NRT is good with the simple
solution we have now.
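A rough sketch of that "private view" idea (hypothetical classes, not
Lucene's actual API): the writer keeps an uncommitted list of segment
descriptors; commit publishes it, while an NRT-style open simply
snapshots the private list, with no disk round-trip:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not Lucene's real classes: the writer tracks a
// private, uncommitted list of segments.  A normal reader sees only the
// last committed state; an NRT-style reader snapshots the private list.
class SketchWriter {
    private final List<String> uncommitted = new ArrayList<>();
    private List<String> published = Collections.emptyList();

    synchronized void addSegment(String name) { uncommitted.add(name); }

    // commit() publishes the private view, like writing segments_N.
    synchronized void commit() { published = new ArrayList<>(uncommitted); }

    // What IndexReader.open would see: the last committed state.
    synchronized List<String> openCommitted() { return new ArrayList<>(published); }

    // What an NRT reader sees: a snapshot of the writer's private state.
    synchronized List<String> openNRT() { return new ArrayList<>(uncommitted); }
}
```

In both open paths the reader is built from the same kind of metadata
snapshot; only the source of the snapshot differs.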

NRT reader also gains performance by carrying over deletes in RAM.  We
should eventually do the same thing with norms & field cache.  No
reason to write to disk, then right away read again.

{quote}
* You index docs, nobody sees them, nor deletions.
* You call commit(), the docs/deletes are written down to memory (NRT 
case)/disk (non-NRT case). Right after calling commit() every newly reopened 
Reader is guaranteed to see your docs/deletes.
* Background thread does write-to-disk+sync(NRT case)/just sync (non-NRT case), 
and fires up the Future returned from commit(). At this point all data is 
guaranteed to be written and braced for a crash, ram cache or not, OS/raid 
controller cache or not.
{quote}
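As a sketch, the semantics being proposed seem to be something like the
following (hypothetical API, not Lucene's): commit() makes changes
visible to newly reopened readers immediately and returns a Future that
completes only once a background thread has written and fsync'd
everything:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the proposed commit() semantics: visibility is
// immediate, durability is signaled later via the returned Future.
class ProposedCommit {
    private final ExecutorService syncer = Executors.newSingleThreadExecutor();
    private volatile long visibleGeneration = 0;

    Future<Long> commit(long generation, Runnable writeAndSync) {
        visibleGeneration = generation;      // readers see it immediately
        return syncer.submit(() -> {
            writeAndSync.run();              // write-to-disk + fsync
            return generation;               // durable once Future completes
        });
    }

    long visible() { return visibleGeneration; }
    void shutdown() { syncer.shutdown(); }
}
```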

But this is not a commit, if docs/deletes are written down into RAM?
Ie, commit could return, then the machine could crash, and you've lost
changes?  Commit should go through to stable storage before returning?
Maybe I'm just missing the big picture of what you're proposing
here...

Also, you can build all this out on top of Lucene today?  Zoie is a
proof point of this.  (Actually: how does your proposal differ from
Zoie?  Maybe that'd help shed light...).

bq. I say it's better to sacrifice write guarantee. In the rare case the 
process/machine crashes, you can reindex last few minutes' worth of docs. 

It is not that simple -- if you skip the fsync, and OS crashes/you
lose power, your index can easily become corrupt.  The resulting
CheckIndex -fix can easily need to remove large segments.

The OS's write cache makes no guarantees about the order in which the
files you've written find their way to disk.
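In plain Java terms, the distinction is between write() and force():
writing bytes only hands them to the OS write cache, and it's the
explicit force (fsync) that pushes data to stable storage. A minimal
sketch:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Writing bytes only hands them to the OS write cache; force(true) is
// the fsync that actually pushes data (and file metadata) to stable
// storage.  Skip it and a crash/power loss can leave the files
// persisted in an arbitrary order -- hence the index corruption.
class FsyncDemo {
    static void writeDurably(Path path, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);          // may still sit in the OS cache
            }
            ch.force(true);             // fsync: now it is on disk
        }
    }
}
```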

Another option (we've discussed this) would be journal file approach
(ie transaction log, like most DBs use).  You only have one file to
fsync, and you replay to recover.  But that'd be a big change for
Lucene, would add complexity, and can be accomplished outside of
Lucene if an app really wants to...
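For concreteness, a minimal transaction-log sketch (not part of Lucene;
all names hypothetical): records are length-prefixed, a batch becomes
durable with a single fsync on append, and replay stops at the first
torn record, which is the recovery step:

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical journal-file sketch: one file, one fsync per batch,
// replay-to-recover.  A torn (truncated) tail record is simply dropped.
class JournalSketch {
    private final Path log;
    JournalSketch(Path log) { this.log = log; }

    void append(List<byte[]> records) throws IOException {
        try (FileOutputStream out = new FileOutputStream(log.toFile(), true);
             DataOutputStream data = new DataOutputStream(out)) {
            for (byte[] r : records) {
                data.writeInt(r.length);    // length-prefixed record
                data.write(r);
            }
            data.flush();
            out.getFD().sync();             // the one fsync per batch
        }
    }

    List<byte[]> replay() throws IOException {
        List<byte[]> out = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(log)))) {
            while (true) {
                int len;
                try { len = in.readInt(); } catch (EOFException e) { break; }
                byte[] r = new byte[len];
                try { in.readFully(r); } catch (EOFException e) { break; } // torn tail
                out.add(r);
            }
        }
        return out;
    }
}
```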

Let me try turning this around: in your componentization of
SegmentReader, why does it matter who's tracking which components are
needed to make up a given SR?  In the IndexReader.open case, it's a
SegmentInfos instance (obtained by loading the segments_N file from
disk).  In the NRT case, it's also a SegmentInfos instance (the one IW is
privately keeping track of and only publishing on commit).  At the
component level, creating the SegmentReader should be no different?


> Refactoring of IndexWriter
> --------------------------
>
>                 Key: LUCENE-2026
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2026
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
> One could be called a SegmentWriter; as the name says, its job would
> be to write one particular index segment. The default one, just as
> today, would provide methods to add documents and would flush when
> its buffer is full.
> Other SegmentWriter implementations would do things like appending or
> copying external segments [what addIndexes*() currently does].
> The second component's job would be to manage writing the segments
> file and merging/deleting segments. It would know about
> DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
> provide hooks that allow users to manage external data structures and
> keep them in sync with Lucene's data during segment merges.
> API wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
> Of course these should be lower level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy to use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

