[ https://issues.apache.org/jira/browse/LUCENE-7113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-7113:
---------------------------------------
    Attachment: LUCENE-7113.patch

Patch.

This change forces anyone using {{OfflineSorter}} to write the
checksum at the end of the input, and then pass a
{{ChecksumIndexInput}} when creating the {{ByteSequencesReader}}.

{{OfflineSorter}} then fully validates checksums for its input source,
and all of its temporary sorted partitions, since it always fully
consumes every file.
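
As a rough illustration of the "checksum at the end, validate while fully consuming" idea, here is a minimal sketch. It uses only {{java.util.zip.CRC32}} and byte arrays, not the Lucene classes touched by the patch; the class and method names ({{ChecksumFooterSketch}}, {{withFooter}}, {{readAndVerify}}) and the 8-byte footer layout are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical sketch: NOT the OfflineSorter / ChecksumIndexInput code from the patch.
final class ChecksumFooterSketch {

    /** Returns the payload followed by an 8-byte footer holding its CRC32. */
    static byte[] withFooter(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + Long.BYTES)
                .put(payload)
                .putLong(crc.getValue())
                .array();
    }

    /**
     * Consumes the whole "file", recomputes the CRC32 of the payload and
     * compares it with the stored footer. Returns the payload if it matches.
     */
    static byte[] readAndVerify(byte[] file) {
        if (file.length < Long.BYTES) {
            throw new IllegalStateException("truncated file: no room for a checksum footer");
        }
        int payloadLength = file.length - Long.BYTES;
        CRC32 crc = new CRC32();
        crc.update(file, 0, payloadLength);
        long stored = ByteBuffer.wrap(file, payloadLength, Long.BYTES).getLong();
        if (stored != crc.getValue()) {
            throw new IllegalStateException(
                "checksum mismatch: stored=" + stored + " computed=" + crc.getValue());
        }
        return Arrays.copyOf(file, payloadLength);
    }
}
```

A caller would pass every temp file through {{withFooter}} when writing and through {{readAndVerify}} when reading; a single flipped bit in the payload then surfaces as a checksum mismatch rather than as some unrelated exception later on.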

I also validate if we hit an unexpected exception, adding any
corruption as a suppressed exception, like we do for index files.  I
validate even if we don't hit an unexpected exception, since the
corruption could otherwise easily go unnoticed during the sorting.
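
The "suppressed exception" pattern can be sketched roughly as below. This is only an illustration in plain Java: {{SuppressedCorruptionSketch}}, {{CorruptTempFileException}}, {{process}} and {{verify}} are hypothetical names, and the "temp file" is just a byte array with a separately supplied expected CRC32.

```java
import java.util.zip.CRC32;

// Hypothetical sketch of "verify on failure, attach corruption as suppressed".
final class SuppressedCorruptionSketch {

    /** Stand-in for a corruption exception; the name is an assumption. */
    static final class CorruptTempFileException extends RuntimeException {
        CorruptTempFileException(String message) {
            super(message);
        }
    }

    static long crcOf(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    static void verify(byte[] payload, long expectedCrc) {
        long actual = crcOf(payload);
        if (actual != expectedCrc) {
            throw new CorruptTempFileException(
                "checksum mismatch: expected=" + expectedCrc + " actual=" + actual);
        }
    }

    /** Runs the work; validates the checksum on both the failure and success paths. */
    static void process(byte[] payload, long expectedCrc, Runnable work) {
        try {
            work.run();
        } catch (RuntimeException unexpected) {
            try {
                verify(payload, expectedCrc);
            } catch (CorruptTempFileException corruption) {
                // Keep the original failure, but record that the data was corrupt.
                unexpected.addSuppressed(corruption);
            }
            throw unexpected;
        }
        // No exception: still verify, since corruption may not break anything visibly.
        verify(payload, expectedCrc);
    }
}
```

With this shape, an {{ArrayIndexOutOfBoundsException}} caused by corrupt bytes still propagates, but it carries the checksum mismatch in {{getSuppressed()}}, which helps tell a failing disk apart from a real bug.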

{{BKDWriter}} also does some validation, but unfortunately it's far
from perfect because it doesn't always fully consume its temp files in
one sweep: it often reads only a slice at a time, at each step of its
recursion.

The slices over the full recursion do in fact add up to a single
sweep over a given temporary file, but the code is not currently
organized to take advantage of this: on each recursion it
opens a new {{IndexInput}} to scan just that one slice, then closes
it.  I think reorganizing this recursive writer to actually be a
single sweep is a little too risky at this point (and I'm not yet sure it's
even possible!).
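
To show why a single sweep would matter for checksumming, here is a small sketch in plain Java (again not the {{BKDWriter}} code; {{SliceSweepSketch}} and its methods are hypothetical). It assumes the slices are contiguous and visited in file order, in which case one running CRC32 fed slice by slice ends up equal to the CRC32 of the whole file. A fresh input per slice has no running checksum to carry over.

```java
import java.util.zip.CRC32;

// Hypothetical sketch: one running checksum across contiguous, in-order slices.
final class SliceSweepSketch {

    static long crcOfWhole(byte[] file) {
        CRC32 crc = new CRC32();
        crc.update(file, 0, file.length);
        return crc.getValue();
    }

    /** Feeds each slice, in file order, into the same CRC32 instance. */
    static long crcOfSlices(byte[] file, int[] sliceLengths) {
        CRC32 crc = new CRC32();
        int offset = 0;
        for (int length : sliceLengths) {
            if (length < 0 || offset + length > file.length) {
                throw new IllegalArgumentException("slice runs past the end of the file");
            }
            crc.update(file, offset, length);
            offset += length;
        }
        if (offset != file.length) {
            throw new IllegalArgumentException("slices must cover the whole file");
        }
        return crc.getValue();
    }
}
```

If the slices were read out of order, or each through its own freshly opened input, the per-slice checksums could not simply be combined this way, which is roughly the situation described above.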

I did go and upgrade some asserts to real checks, and on an unexpected
exception I check the checksum of the file we are currently working on,
but even this is not perfect because corruption in an earlier
(different) temp file might not be noticed until later on.

So net/net, the {{BKDWriter}} corruption checks are only best effort,
but they still do something (see the crazy added tests!).  We can
improve this over time, but I think this is an OK compromise for 6.0.


> OfflineSorter and BKD should verify checksums in their temp files
> -----------------------------------------------------------------
>
>                 Key: LUCENE-7113
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7113
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master, 6.0
>
>         Attachments: LUCENE-7113.patch
>
>
> I am trying to index all 3.2 B points from the latest OpenStreetMaps export.
> My SSDs were not up to this, so I added a spinning magnets disk to beast2.
> But then I was hitting scary bug-like exceptions 
> ({{ArrayIndexOutOfBoundsException}}) when indexing the first 2B points, and I 
> finally checked dmesg and saw that my hard drive is dying.
> I think it's important that our temp file usages also validate checksums 
> (like we do for all our index files, either at reader open or at merge or 
> {{CheckIndex}}), so we can hopefully easily differentiate a bit-flipping IO 
> system from a possible Lucene bug, in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
