[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791936#action_12791936 ]
Michael McCandless commented on LUCENE-2026:
--------------------------------------------

{quote}
FWIW, autoCommit doesn't really have a place in Lucy's one-segment-per-indexing-session model.
{quote}

Well, autoCommit just means "periodically call commit". So, if you decide to offer a commit() operation, then autoCommit would just wrap that? But I don't think autoCommit should be offered... the app should decide.

{quote}
Revisiting the LUCENE-1044 threads, one passage stood out:

http://www.gossamer-threads.com/lists/lucene/java-dev/54321#54321

This is why in a db system, the only file that is sync'd is the log file - all other files can be made "in sync" from the log file - and this file is normally striped for optimum write performance. Some systems have special "log file drives" (some even solid state, or battery backed ram) to aid the performance. The fact that we have to sync all files instead of just one seems sub-optimal.
{quote}

Yes, but that cost is not on the reopen path, so it's much less important. I.e., the app can freely choose how frequently it wants to commit, completely independently of how often it needs to reopen.

{quote}
Yet Lucene is not well set up to maintain a transaction log. The very act of adding a document to Lucene is inherently lossy even if all fields are stored, because doc boost is not preserved.
{quote}

I don't see how those two statements are related. One can "easily" (meaning, it's easily decoupled from core) build a transaction log on top of Lucene -- just serialize your docs/analyzer selection/etc. to the log and sync it periodically. But that's orthogonal to what Lucene does and doesn't preserve in its index (and, yes, Lucene doesn't precisely preserve boosts).

{quote}
bq. Also, having the app explicitly decouple these two notions keeps the door open for future improvements. If we force absolutely all sharing to go through the filesystem then that limits the improvements we can make to NRT.
However, Lucy has much more to gain going through the file system than Lucene does, because we don't necessarily incur JVM startup costs when launching a new process. The Lucene approach to NRT - specialized reader hanging off of writer - is constrained to a single process. The Lucy approach - fast index opens enabled by mmap-friendly index formats - is not.

The two approaches aren't mutually exclusive. It will be possible to augment Lucy with a specialized index reader within a single process. However, A) there seems to be a lot of disagreement about just how to integrate that reader, and B) there seem to be ways to bolt that functionality on top of the existing classes. Under those circumstances, I think it makes more sense to keep that feature external for now.
{quote}

Again: NRT is not a "specialized reader". It's a normal read-only DirectoryReader, just like you'd get from IndexReader.open, with the only difference being that it consulted the IndexWriter to find which segments to open. Plus, it's pooled, so that if the IndexWriter already has a given segment reader open (say, because deletes were applied or merges are running), it's reused.

We've discussed making it specialized (e.g. directly searching DocumentsWriter's RAM buffer, caching recently flushed segments in RAM, special incremental copy-on-write data structures for deleted docs, etc.), but so far these changes don't seem worthwhile. The current approach to NRT is simple... I haven't yet seen performance gains strong enough to justify moving to "specialized readers".

Yes, Lucene's approach must be in the same JVM. But we get important gains from this -- reusing a single reader (the pool), carrying over merged deletions directly in RAM (and eventually field cache & norms too -- LUCENE-1785). Instead, Lucy (by design) must do all sharing and access all index data through the filesystem (a decision that, I think, could be dangerous), which will necessarily increase your reopen time. Maybe in practice that cost is small, though...
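The pooling point can be illustrated with a minimal, Lucene-free sketch (the names SegmentHandle and ReaderPool are hypothetical stand-ins for illustration, not Lucene's actual classes): a pool keyed by segment name hands back an already-open reader when one exists, so a reopen only pays the open cost for segments it has not seen before.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an open per-segment reader; Lucene's real
// SegmentReader is far more involved.
class SegmentHandle {
    final String segmentName;
    int refCount = 1; // the pool itself holds one reference
    SegmentHandle(String name) { this.segmentName = name; }
}

// Minimal sketch of the pooling idea: a reopen against segments already
// held by the pool reuses their handles instead of opening them again.
class ReaderPool {
    private final Map<String, SegmentHandle> open = new HashMap<>();
    int opens = 0; // counts real (non-pooled) opens, for illustration only

    synchronized SegmentHandle get(String segmentName) {
        SegmentHandle h = open.get(segmentName);
        if (h == null) {
            opens++;                       // simulate an expensive segment open
            h = new SegmentHandle(segmentName);
            open.put(segmentName, h);
        }
        h.refCount++;                      // caller now holds a reference
        return h;
    }

    synchronized void release(SegmentHandle h) {
        if (--h.refCount == 0) {
            open.remove(h.segmentName);    // last reference dropped: close it
        }
    }
}
```

A process that shares only through the filesystem cannot reuse handles this way across reopens in another process, which is the trade-off being discussed here.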
The OS write cache should keep everything fresh... but you still must serialize.

{quote}
bq. Alternatively, you could keep the notion "flush" (an unsafe commit) alive? You write the segments file, but make no effort to ensure its durability (and also preserve the last "true" commit). Then a normal IR.reopen suffices...

That sounds promising. The semantics would differ from those of Lucene's flush(), which doesn't make changes visible. We could implement this by somehow marking a "committed" snapshot and a "flushed" snapshot differently: either by adding an "fsync" property to the snapshot file that would be false after a flush() but true after a commit(), or by encoding the property within the snapshot filename. The file purger would have to ensure that all index files referenced by either the last committed snapshot or the last flushed snapshot were off limits. A rollback() would zap all changes since the last commit(). Such a scheme allows the top-level app to avoid the costs of fsync while maintaining its own transaction log - perhaps with the optimizations suggested above (separate disk, SSD, etc.).
{quote}

In fact, this would make Lucy's approach to NRT nearly identical to Lucene's. The only difference is that, instead of getting the current uncommitted segments_N via RAM, Lucy uses the filesystem. And, of course, Lucy doesn't pool readers. So this is really a Lucy-ification of Lucene's approach to NRT, and it has the same benefits as Lucene's NRT, i.e., it lets Lucy apps decouple decisions about safety (commit) and freshness (reopen turnaround time).

> Refactoring of IndexWriter
> --------------------------
>
>         Key: LUCENE-2026
>         URL: https://issues.apache.org/jira/browse/LUCENE-2026
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>    Reporter: Michael Busch
>    Assignee: Michael Busch
>    Priority: Minor
>     Fix For: 3.1
>
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
> One could be called a SegmentWriter, and as the name says its job would be
> to write one particular index segment. The default one, just as today, will
> provide methods to add documents and flushes when its buffer is full.
> Other SegmentWriter implementations would do things like e.g. appending or
> copying external segments [what addIndexes*() currently does].
>
> The second component's job would be to manage writing the segments file
> and merging/deleting segments. It would know about DeletionPolicy,
> MergePolicy and MergeScheduler. Ideally it would provide hooks that allow
> users to manage external data structures and keep them in sync with
> Lucene's data during segment merges.
>
> API-wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part affects
> all segments, whereas the new document is only being added to the new
> segment.
>
> Of course these should be lower-level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy-to-use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org