[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

Michael McCandless (JIRA) Fri, 11 Dec 2009 13:48:44 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789555#action_12789555
 ]


Michael McCandless commented on LUCENE-2026:
--------------------------------------------

bq. If I understand everything right, with current uberfast reopens (thanks 
per-segment search), the only thing that makes index/commit/reopen cycle slow 
is the 'sync' call.

I agree, per-segment searching was the most important step towards
NRT.  It's a great step forward...

But the fsync call is a killer, so avoiding it in the NRT path is
necessary.  It's also very OS/FS dependent.
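
To get a feel for the fsync cost on a given machine, here is a minimal sketch using only the JDK (it is not Lucene code). It writes a batch of small files with and without `FileChannel.force(true)`. The class name, the 4 KB payload and the file count are assumptions picked for illustration, and the numbers will vary widely by OS and filesystem.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Lucene: times small-file writes with/without fsync.
class FsyncCost {
    /** Writes `count` small files into `dir` and returns the elapsed nanoseconds. */
    static long writeSmallFiles(Path dir, String prefix, int count, boolean sync)
            throws IOException {
        byte[] payload = new byte[4096]; // stand-in for one small index file
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            Path file = dir.resolve(prefix + i + ".bin");
            try (FileChannel ch = FileChannel.open(file,
                    StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE,
                    StandardOpenOption.TRUNCATE_EXISTING)) {
                ByteBuffer buf = ByteBuffer.wrap(payload);
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
                if (sync) {
                    // Ask the OS to push file content and metadata to the storage device.
                    ch.force(true);
                }
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("fsync-cost");
        long cached = writeSmallFiles(dir, "cached-", 200, false);
        long synced = writeSmallFiles(dir, "synced-", 200, true);
        System.out.printf("no sync: %d ms, with sync: %d ms%n",
                cached / 1_000_000, synced / 1_000_000);
    }
}
```

Without the `force` call the writes normally land in the OS write cache and return quickly, which is the path an NRT reopen relies on.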

bq. That sync call on memory-based Directory is noop.

Until you need to spill over to disk because your RAM buffer is full?

Also, if IW.commit() is called, I would expect any changes in RAM
to be committed to the real dir (stable storage)?

And, going through RAM first will necessarily be a hit on indexing
throughput (Jake estimates 10% hit in Zoie's case).  Really, our
current approach goes through RAM as well, in that the OS's write cache
(if the machine has spare RAM) will quickly accept the small index
files & write them in the BG.  It's not clear we can do better than
the OS here...

bq. And no, you really should commit() to be able to see stuff on reopen().  My 
god, seeing changes that aren't yet committed - that violates the meaning of 
'commit'.

Uh, this is an API that clearly states that its purpose is to search
the uncommitted changes.  If you really want to be "pure"
transactional, don't use this API ;)
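
The distinction can be shown with a toy model. This is a sketch, not Lucene code: `ToyWriter` and its methods are hypothetical names, and the "readers" are just immutable snapshots of string lists. In Lucene itself the near-real-time reader is, as far as I know, obtained from the writer (`IndexWriter.getReader()` in 2.9), while a reader opened on the Directory sees only the last commit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model (hypothetical, not the Lucene API): contrasts a commit-point view
// with a near-real-time view that includes uncommitted documents.
class ToyWriter {
    private final List<String> committed = new ArrayList<>(); // stands in for stable storage
    private final List<String> pending = new ArrayList<>();   // indexed but not yet committed

    void addDocument(String doc) {
        pending.add(doc);
    }

    /** Moves all pending documents into the committed set. */
    void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    /** "Pure" transactional view: only what has been committed. */
    List<String> openCommittedReader() {
        return Collections.unmodifiableList(new ArrayList<>(committed));
    }

    /** Near-real-time view: committed documents plus uncommitted ones. */
    List<String> openNearRealTimeReader() {
        List<String> view = new ArrayList<>(committed);
        view.addAll(pending);
        return Collections.unmodifiableList(view);
    }
}
```

A caller that wants strict commit semantics only ever opens the committed view; a caller that wants low latency opens the near-real-time view and accepts that it may see documents a later rollback would discard.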

bq. The original purpose of current NRT code was.. well.. let me remember.. 
NRT search!  With per-segment caches and sync lag defeated you get the delay 
between doc being indexed and becoming searchable under tens of milliseconds. 
Is that not NRT enough to introduce tight coupling between classes that have 
absolutely no other reason to be coupled?? Lucene 4.0. Simplicity is our 
candidate! Vote for Simplicity!

In fact I favor our current approach because of its simplicity.

Have a look at LUCENE-1313 (adds RAMDir as you're discussing), or,
Zoie, which also adds the RAMDir and backgrounds resolving deleted
docs -- they add complexity to Lucene that I don't think is warranted.

My general feeling at this point is that, with per-segment searching
and fsync avoided, NRT performance is excellent.

We've explored a number of possible tweaks to improve it --
writing first to RAMDir (LUCENE-1313), resolving deletes in the
foreground (LUCENE-2047), using paged BitVector for deletions
(LUCENE-1526), Zoie (buffering segments in RAM & resolving deletes in
the background), etc., but, based on testing so far, I don't see the
justification for the added complexity.

bq. *: Okay, there remains an issue of merges that piggyback on commits, so 
writing and committing one smallish segment suddenly becomes a time-consuming 
operation. But that's a completely separate issue. Go, fix your merge policies 
and have a thread that merges asynchronously.

This already runs in the BG by default.  But warming the reader on the
merged segment (before lighting it) is important (IW does this today).


> Refactoring of IndexWriter
> --------------------------
>
>                 Key: LUCENE-2026
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2026
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
> One could be called a SegmentWriter and as the
> name says its job would be to write one particular index segment. The
> default one just as today will provide methods to add documents and
> flushes when its buffer is full.
> Other SegmentWriter implementations would do things like e.g. appending or
> copying external segments [what addIndexes*() currently does].
> The second component's job would be to manage writing the segments
> file and merging/deleting segments. It would know about
> DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
> provide hooks that allow users to manage external data structures and
> keep them in sync with Lucene's data during segment merges.
> API-wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
> Of course these should be lower level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy to use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.
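
One way to picture the proposed split is the rough sketch below. Every name in it (`SegmentManagerSketch`, `BufferingSegmentWriterSketch`, `maxBufferedDocs`) is hypothetical and documents are plain strings; nothing here is taken from an actual patch. It shows only the two roles described above: a writer that fills and flushes one segment at a time, and a manager that owns the segment list and applies deletes across all segments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical "second component": owns the list of flushed segments.
class SegmentManagerSketch {
    private final List<List<String>> segments = new ArrayList<>();

    void register(List<String> segment) {
        segments.add(segment);
    }

    int segmentCount() {
        return segments.size();
    }

    /** A delete touches every flushed segment; returns the number of docs removed. */
    int delete(String doc) {
        int removed = 0;
        for (List<String> segment : segments) {
            while (segment.remove(doc)) {
                removed++;
            }
        }
        return removed;
    }
}

// Hypothetical default segment writer: buffers docs and flushes when the buffer is full.
class BufferingSegmentWriterSketch {
    private final SegmentManagerSketch manager;
    private final int maxBufferedDocs;
    private List<String> buffer = new ArrayList<>();

    BufferingSegmentWriterSketch(SegmentManagerSketch manager, int maxBufferedDocs) {
        this.manager = manager;
        this.maxBufferedDocs = maxBufferedDocs;
    }

    /** An added doc only ever goes into the segment currently being written. */
    void addDocument(String doc) {
        buffer.add(doc);
        if (buffer.size() >= maxBufferedDocs) {
            flush();
        }
    }

    /** Hands the buffered docs to the manager as a new segment; no-op if empty. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        manager.register(buffer);
        buffer = new ArrayList<>();
    }
}
```

The sketch deliberately leaves out updateDocument(): its delete half belongs with the manager (all segments, plus whatever is still buffered, which this sketch ignores) and its add half with the writer, which is exactly the open API question raised in the description.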

