Re: Corrupt index

Michael McCandless Thu, 14 Jun 2012 17:11:02 -0700

Right: Lucene never autocommits anymore ...

If you create a new index, add a bunch of docs, and things crash
before you have a chance to commit, then there is no index (not even a
0 doc one) in that directory.
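
A minimal sketch of why no commit means no index, assuming a toy on-disk layout rather than Lucene's real file formats or API: segment data only becomes visible once a segments_N file naming it exists, and that file is written only after the segment files have been fsynced. The class name ToyCommit, the "_0.seg" file name, and the decimal generation number are all hypothetical, for illustration only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

/** Toy model of a commit point. This is NOT Lucene's file format or API. */
class ToyCommit {

    /** Creates an empty temporary directory to act as the index directory. */
    static Path newTempDir() {
        try {
            return Files.createTempDirectory("toy-index");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a file without forcing it to stable storage. */
    static void writeFile(Path dir, String name, String content) {
        try {
            Files.write(dir.resolve(name), content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** "Adds documents": the segment file exists on disk but is not yet committed. */
    static void addSegment(Path dir, String name, String docs) {
        writeFile(dir, name, docs);
    }

    /** Forces a file's content and metadata to stable storage. */
    static void fsync(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ch.force(true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Commit: fsync every segment first, then publish the segments_N file. */
    static void commit(Path dir, long generation, List<String> segmentNames) {
        for (String name : segmentNames) {
            fsync(dir.resolve(name));
        }
        String commitName = "segments_" + generation;
        writeFile(dir, commitName, String.join("\n", segmentNames));
        fsync(dir.resolve(commitName));
    }

    /** Returns the highest committed generation, or -1 if no commit point exists. */
    static long latestCommit(Path dir) {
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(p -> p.getFileName().toString())
                        .filter(n -> n.startsWith("segments_"))
                        .mapToLong(n -> Long.parseLong(n.substring("segments_".length())))
                        .max()
                        .orElse(-1L);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = newTempDir();
        addSegment(dir, "_0.seg", "doc1\ndoc2");
        System.out.println(latestCommit(dir)); // -1: files exist, but there is no index yet
        commit(dir, 1, Arrays.asList("_0.seg"));
        System.out.println(latestCommit(dir)); // 1: the segment is now part of the index
    }
}
```

In this model, a crash at any point before commit() finishes writing segments_1 leaves a directory that a reader treats as having no index at all, which matches the behavior described above.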


Mike McCandless

http://blog.mikemccandless.com

On Thu, Jun 14, 2012 at 1:41 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:
> I'm quite certain this shouldn't happen even when Commit wasn't called.
>
> Mike, can you comment on that?
>
> On Thu, Jun 14, 2012 at 8:03 PM, Christopher Currens
> <currens.ch...@gmail.com> wrote:
>>
>> Well, the only thing I see is that writer.Commit() is never called in the
>> delegate assigned to corpusReader.OnDocument.  I know that Lucene is very
>> transactional, and at least in 3.x, the writer will never auto-commit to
>> the index.  You can write millions of documents, but if Commit is never
>> called, those documents aren't actually part of the index.  Committing
>> isn't a cheap operation, so you definitely don't want to do it on every
>> document.
>>
>> You can test it yourself with this (naive) solution.  Right below the
>> writer.SetUseCompoundFile(false) line, add "int numDocsAdded = 0;".  At the
>> end of the corpusReader.OnDocument delegate add:
>>
>> // Example only.  I wouldn't suggest committing this often
>> if(++numDocsAdded % 5 == 0)
>> {
>>    writer.Commit();
>> }
>>
>> I had the application crash for real on this file:
>>
>> http://dumps.wikimedia.org/gawiktionary/20120613/gawiktionary-20120613-pages-meta-history.xml.bz2,
>> about 20% into the operation.  Without the commit, the index is empty.
>> Add it in, and I get 755 files in the index after it crashes.
>>
>>
>> Thanks,
>> Christopher
>>
>> On Wed, Jun 13, 2012 at 6:13 PM, Itamar Syn-Hershko
>> <ita...@code972.com> wrote:
>>
>>
>> > Yes, reproduced on the first try. See the attached program - I built it
>> > against current trunk.
>> >
>> >
>> > On Thu, Jun 14, 2012 at 3:54 AM, Itamar Syn-Hershko
>> > <ita...@code972.com> wrote:
>> >
>> >> Christopher,
>> >>
>> >> I used the IndexBuilder app from here
>> >> https://github.com/synhershko/Talks/tree/master/LuceneNeatThings with a
>> >> 8.5GB wikipedia dump.
>> >>
>> >> After running for 2.5 days I had to forcefully close it (infinite loop in
>> >> the wiki-markdown parser at 92%, go figure), and the 40-something GB index
>> >> I had by then was unusable. I was then able to reproduce this.
>> >>
>> >> Please note I now added a few safe-guards you might want to remove to
>> >> make sure the app really crashes on process kill.
>> >>
>> >> I'll try to come up with a better way to reproduce this - hopefully Mike
>> >> will be able to suggest better ways than manual process kill...
>> >>
>> >> On Thu, Jun 14, 2012 at 1:41 AM, Christopher Currens <
>> >> currens.ch...@gmail.com> wrote:
>> >>
>> >>> Mike, The codebase for lucene.net should be almost identical to java's
>> >>> 3.0.3 release, and LUCENE-1044 is included in that.
>> >>>
>> >>> Itamar, are you committing the index regularly?  I only ask because I
>> >>> can't reproduce it myself by forcibly terminating the process while it's
>> >>> indexing.  I've tried both 3.0.3 and 2.9.4.  If I don't commit at all and
>> >>> terminate the process (even with 10,000 4K documents created), there
>> >>> will be no documents in the index when I open it in Luke, which I expect.
>> >>> If I commit at 10,000 documents, and terminate it a few thousand after
>> >>> that, the index has the first ten thousand that were committed.  I've even
>> >>> terminated it *while* a second commit was taking place, and it still had
>> >>> all of the documents I expected.
>> >>>
>> >>> It may be that I'm not trying to reproduce it correctly.  Do you have a
>> >>> minimal amount of code that can reproduce it?
>> >>>
>> >>>
>> >>> Thanks,
>> >>> Christopher
>> >>>
>> >>> On Wed, Jun 13, 2012 at 9:31 AM, Michael McCandless <
>> >>> luc...@mikemccandless.com> wrote:
>> >>>
>> >>> > Hi Itamar,
>> >>> >
>> >>> > One quick question: does Lucene.Net include the fixes done for
>> >>> > LUCENE-1044 (to fsync files on commit)?  Those are very important for
>> >>> > an index to be intact after OS/JVM crash or power loss.
>> >>> >
>> >>> > More responses below:
>> >>> >
>> >>> > On Tue, Jun 12, 2012 at 8:20 PM, Itamar Syn-Hershko <ita...@code972.com>
>> >>> > wrote:
>> >>> >
>> >>> > > I'm a Lucene.Net committer, and there is a chance we have a bug in
>> >>> > > our FSDirectory implementation that causes indexes to get corrupted
>> >>> > > when indexing is cut while the IW is still open. As it stems from
>> >>> > > some retroactive fixes you made, I'd appreciate your feedback.
>> >>> > >
>> >>> > > Correct me if I'm wrong, but by design Lucene should be able to
>> >>> > > recover rather quickly from power failures or app crashes. Since
>> >>> > > existing segment files are read only, only new segments that are
>> >>> > > still being written can get corrupted. Hence, recovering from
>> >>> > > worst-case scenarios is done by simply removing the write.lock
>> >>> > > file. The worst that could happen then is having the last segment
>> >>> > > damaged, and that can be fixed by removing those files, possibly by
>> >>> > > running CheckIndex on the index.
>> >>> >
>> >>> > You shouldn't even have to run CheckIndex ... because (as of
>> >>> > LUCENE-1044) we now fsync all segment files before writing the new
>> >>> > segments_N file, and then remove old segments_N files (and any
>> >>> > segments that are no longer referenced).
>> >>> >
>> >>> > You do have to remove the write.lock if you aren't using
>> >>> > NativeFSLockFactory (but this has been the default lock impl for a
>> >>> > while now).
>> >>> >
>> >>> > > Last week I was playing with rather large indexes and crashed my
>> >>> > > app while it was indexing. I wasn't able to open the index, and
>> >>> > > Luke was even kind enough to wipe the index folder clean even though
>> >>> > > I opened it in read-only mode. I re-ran this, and after another
>> >>> > > crash running CheckIndex revealed nothing - the index was detected
>> >>> > > to be an empty one. I am not entirely sure what could be the cause
>> >>> > > for this, but I suspect it has been corrupted by the crash.
>> >>> >
>> >>> > Had no commit completed (no segments file written)?
>> >>> >
>> >>> > If you don't fsync then all sorts of crazy things are possible...
>> >>> >
>> >>> > > I've been looking at these:
>> >>> > >
>> >>> > > https://issues.apache.org/jira/browse/LUCENE-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> >>> > > https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> >>> >
>> >>> > (And LUCENE-1044 before that ... it was LUCENE-1044 that LUCENE-2328
>> >>> > broke...).
>> >>> >
>> >>> > > And it seems like this is what I was experiencing. Mike and Mark will
>> >>> > > probably be able to tell if this is what they saw or not, but as far
>> >>> > > as I can tell this is not an expected behavior of a Lucene index.
>> >>> >
>> >>> > Definitely not expected behavior: assuming nothing is flipping bits,
>> >>> > then on OS/JVM crash or power loss your index should be fine, just
>> >>> > reverted to the last successful commit.
>> >>> >
>> >>> > > What I'm looking for at the moment is some advice on what FSDirectory
>> >>> > > implementation to use to make sure no corruption can happen. The 3.4
>> >>> > > version (which is where LUCENE-3418 was committed to) seems to handle
>> >>> > > a lot of things the 3.0 doesn't, but on the other hand LUCENE-3418
>> >>> > > was introduced by changes made to the 3.0 codebase.
>> >>> >
>> >>> > Hopefully it's just that you are missing fsync!
>> >>> >
>> >>> > > Also, is there any test in the suite checking for those scenarios?
>> >>> >
>> >>> > Our test framework has a sneaky MockDirectoryWrapper that, after a
>> >>> > test finishes, goes and corrupts any unsync'd files and then verifies
>> >>> > the index is still OK... it's good because it'll catch any times we
>> >>> > are missing calls to sync, but it's not low-level enough: if FSDir is
>> >>> > failing to actually call fsync (that was the bug in LUCENE-3418) then
>> >>> > it won't catch that...
>> >>> >
>> >>> > Mike McCandless
>> >>> >
>> >>> > http://blog.mikemccandless.com
>> >>> >
>> >>>
>> >>
>> >>
>> >
>
>
