[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789905#action_12789905 ]
Marvin Humphrey commented on LUCENE-2026:
-----------------------------------------

> I think that's a poor default (trades safety for performance), unless
> Lucy eg uses a transaction log so you can concretely bound what's lost
> on crash/power loss. Or, if you go back to autocommitting I guess...

Search indexes should not be used for canonical data storage -- they should be built *on top of* canonical data storage. Guarding against power failure induced corruption in a database is an imperative. Guarding against power failure induced corruption in a search index is a feature, not an imperative.

Users have many options for dealing with the potential for such corruption. You can go back to your canonical data store and rebuild your index from scratch when it happens. In a search cluster environment, you can rsync a known-good copy from another node. Potentially, you might enable fsync-before-commit and keep your own transaction log. However, if the time it takes to rebuild or recover an index from scratch would cause you unacceptable downtime, you can't possibly be operating in a single-point-of-failure environment where a power failure could take you down anyway -- so other recovery options are available to you.

Turning on fsync is only one step towards ensuring index integrity; other steps involve making decisions about hard drives, RAID arrays, failover strategies, network and off-site backups, etc., and are outside of our domain as library authors. We cannot meet the needs of users who need guaranteed index integrity on our own. For everybody else, what turning on fsync by default achieves is to make an exceedingly rare event rarer. That's valuable, but not essential.

My argument is that since search indexes should not be used for canonical storage, and since fsync is not testably reliable and not sufficient on its own, it's a good engineering compromise to prioritize performance.

> If we did this in Lucene, you can have unbounded corruption.
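As an aside, the "fsync-before-commit" option mentioned above can be sketched in a few lines of Java NIO. This is a hypothetical illustration, not Lucy or Lucene code; the class and method names are invented for the example:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: write a commit record and force it to persistent
// storage before acknowledging the commit to the caller. The fsync is
// what bounds the data lost on power failure -- and also what costs
// indexing responsiveness.
public class DurableCommit {
    public static void writeAndSync(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            // force(true) flushes file content and metadata, like fsync(2);
            // force(false) is closer to fdatasync(2).
            ch.force(true);
        }
    }
}
```

Skipping the `force()` call is exactly the performance-over-safety default under discussion: the write still lands in the OS page cache and survives a process crash, just not necessarily a power failure.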
> It's not just the last few minutes of updates...

Wasn't that a possibility under autocommit as well? All it takes is for the OS to finish flushing the new snapshot file to persistent storage before it finishes flushing a segment data file needed by that snapshot, and for the power failure to squeeze in between. In practice, locality of reference is going to make the window very, very small, since those two pieces of data will usually get written very close to each other on the persistent media.

I've seen a lot more messages to our user lists over the years about data corruption caused by bugs and misconfigurations than by power failures. But really, that's as it should be. Ensuring data integrity to the degree required by a database is costly -- it requires far more rigorous testing and far more conservative development practices. If we accept that our indexes must *never* go corrupt, it will retard innovation.

Of course we should work very hard to prevent index corruption. However, I'm much more concerned about stuff like silent omission of search results due to overzealous, overly complex optimizations than I am about problems arising from power failures. When a power failure occurs, you know it -- so you get the opportunity to fsck the disk, run checkIndex(), perform data integrity reconciliation tests against canonical storage, and if anything fails, take whatever recovery actions you deem necessary.

> You don't need to turn off sync for NRT - that's the whole point. It
> gives you a reader without syncing the files.

I suppose this is where Lucy and Lucene differ. Thanks to mmap and the near-instantaneous reader opens it has enabled, we don't need to keep a special reader alive. Since there's no special reader, the only way to get data to a search process is to go through a commit. But if we fsync on every commit, we'll drag down indexing responsiveness.
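The "near-instantaneous reader open" that mmap enables can be sketched with Java's `FileChannel.map()`. This is an illustrative toy, not the actual Lucy (C) implementation; the class name is invented for the example:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: "opening a reader" over an index file via mmap.
// No data is copied up front -- pages fault in lazily from the kernel's
// page cache on first access, which is why the open itself is cheap and
// why a long-lived special reader object isn't needed.
public class MmapIndexView {
    public static MappedByteBuffer open(Path indexFile) throws IOException {
        try (FileChannel ch = FileChannel.open(indexFile, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }
}
```

Because the page cache is shared across processes, a freshly committed segment that is still warm in memory is immediately cheap for a separate search process to map and read.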
Finishing the commit and returning control to client code as quickly as possible is a high priority for us.

Furthermore, I don't want us to have to write the code to support a near-real-time reader hanging off of IndexWriter a la Lucene. The architectural discussions have made for very interesting reading, but the design seems to be tricky to pull off, and implementation simplicity in core search code is a high priority for Lucy. It's better for Lucy to kill two birds with one stone and concentrate on making *all* index opens fast.

> Really, this is your safety tradeoff - it means you can commit less
> frequently, since the NRT reader can search the latest updates. But, your
> app has complete control over how it wants to trade safety for
> performance.

So long as fsync is an option, the app always has complete control, regardless of whether the default setting is fsync or no fsync. If a Lucene app wanted to increase NRT responsiveness and throughput, and if absolute index integrity wasn't a concern because it had been addressed through other means (e.g. a multi-node search cluster), would turning off fsync speed things up under any of the proposed designs?

> Refactoring of IndexWriter
> --------------------------
>
>                 Key: LUCENE-2026
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2026
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
>
> One could be called a SegmentWriter and as the name says its job would
> be to write one particular index segment. The default one, just as
> today, will provide methods to add documents and flushes when its
> buffer is full. Other SegmentWriter implementations would do things
> like e.g. appending or copying external segments [what addIndexes*()
> currently does].
> The second component's job would be to manage writing the segments
> file and merging/deleting segments. It would know about
> DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
> provide hooks that allow users to manage external data structures and
> keep them in sync with Lucene's data during segment merges.
>
> API-wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
>
> Of course these should be lower level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy-to-use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.

--
This message is automatically generated by JIRA.
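The two-component split proposed in the issue could be sketched as follows. All names here are illustrative guesses -- the issue deliberately leaves the API open -- and the in-memory "segment" stands in for real on-disk segment writing:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed split: a SegmentWriter that
// writes exactly one segment, and a manager that owns the segments
// file. Names are invented for illustration.
interface SegmentWriter {
    void addDocument(String doc);   // buffer a document
    String flush();                 // write the segment, return its name
}

// Stands in for the default writer that buffers docs and flushes a
// segment when its buffer is full.
class InMemorySegmentWriter implements SegmentWriter {
    private final List<String> buffer = new ArrayList<>();
    private int segmentId = 0;

    public void addDocument(String doc) { buffer.add(doc); }

    public String flush() {
        String name = "_" + (segmentId++) + " (" + buffer.size() + " docs)";
        buffer.clear();
        return name;
    }
}

// The second component: tracks live segments. A real implementation
// would consult DeletionPolicy, MergePolicy and MergeScheduler, and
// expose hooks for keeping external data structures in sync with
// segment merges.
class SegmentManager {
    private final List<String> segments = new ArrayList<>();
    public void register(String segmentName) { segments.add(segmentName); }
    public List<String> segments() { return segments; }
}
```

Other SegmentWriter implementations would slot in beside the default one, e.g. a writer that appends or copies external segments, which is roughly what addIndexes*() does today.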