[ 
https://issues.apache.org/jira/browse/LUCENE-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2446:
--------------------------------

    Attachment: LUCENE-2446.patch

I think this is a pretty important issue: besides the case of distributed 
systems copying files around, there is currently no integrity mechanism to 
detect hardware issues (which can leave developers pulling their hair out 
trying to debug corruption), and some optimized components do bulk merges, 
which can propagate corruption to new segments over a long time.

Also, in recent JVMs computing a checksum is fast: e.g. in Java 8, CRC32 is 
an intrinsic and uses CLMUL hardware instructions on x86, and so on.

I created an initial patch: the last 8 bytes of every file are a zlib-crc32 
checksum. We also write some additional metadata before it (it's done via 
CodecUtil.writeFooter) so we can extend it more in the future if we need to.
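
To make the footer idea concrete, here is a minimal sketch using only the JDK. 
It is not the patch's CodecUtil.writeFooter: the class and method names are 
hypothetical, and it assumes the 8 footer bytes hold a big-endian long 
containing the CRC32 of everything before them. The extra metadata mentioned 
above is left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration only; not Lucene's actual footer layout.
public class ChecksumFooterSketch {
    // The checksum occupies the last 8 bytes of the file.
    static final int FOOTER_LENGTH = Long.BYTES;

    /** Returns the payload followed by an 8-byte big-endian CRC32 of the payload. */
    static byte[] appendChecksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + FOOTER_LENGTH)
                .put(payload)
                .putLong(crc.getValue())
                .array();
    }

    /** Recomputes the CRC32 over everything before the footer and compares it to the stored value. */
    static boolean verify(byte[] file) {
        if (file.length < FOOTER_LENGTH) {
            return false; // too short to contain a footer at all
        }
        int bodyLength = file.length - FOOTER_LENGTH;
        CRC32 crc = new CRC32();
        crc.update(file, 0, bodyLength);
        long stored = ByteBuffer.wrap(file, bodyLength, FOOTER_LENGTH).getLong();
        return stored == crc.getValue();
    }

    public static void main(String[] args) {
        byte[] file = appendChecksum("segment data".getBytes(StandardCharsets.UTF_8));
        System.out.println(verify(file)); // intact file
        file[0] ^= 0x01;                  // simulate a single flipped bit
        System.out.println(verify(file)); // corrupted file
    }
}
```

Small files that are read in full on open can be checked this way at 
essentially no extra I/O cost, since the bytes are already in memory.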

For small metadata files (e.g. .fnm, .si, .dvm, ...) we just verify when we 
open, because we are reading the file anyway. So this provides some extra 
safety.

For larger files this would be expensive: instead the patch adds 
AtomicReader.validate(), which asks the codec (or filter reader, or whatever) 
to ensure everything is valid. This is called by e.g. CheckIndex before 
decoding.
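
As a rough illustration of why validating a large file costs a full pass over 
its bytes, here is a sketch of a streaming check. It is not the patch's 
AtomicReader.validate(): the class and method names are hypothetical, and it 
assumes the same simplified layout of an 8-byte big-endian CRC32 footer with no 
extra metadata.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical illustration only; not the API added by the patch.
public class StreamingValidateSketch {

    /**
     * Streams `length` bytes from `in`, treating the final 8 bytes as a
     * big-endian CRC32 of everything before them. Returns false on a
     * checksum mismatch or if the stream ends early.
     */
    static boolean validate(InputStream in, long length) throws IOException {
        if (length < Long.BYTES) {
            return false; // no room for a footer
        }
        long remaining = length - Long.BYTES;
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[8192];
        // Checksum the body in fixed-size chunks so memory use stays constant.
        while (remaining > 0) {
            int n = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
            if (n < 0) {
                return false; // truncated body
            }
            crc.update(buffer, 0, n);
            remaining -= n;
        }
        // Read the 8-byte footer, tolerating short reads.
        byte[] footer = new byte[Long.BYTES];
        int offset = 0;
        while (offset < footer.length) {
            int n = in.read(footer, offset, footer.length - offset);
            if (n < 0) {
                return false; // truncated footer
            }
            offset += n;
        }
        return ByteBuffer.wrap(footer).getLong() == crc.getValue();
    }
}
```

Every byte of the file has to be read before the result is known, which is why 
a check like this is better run on demand than on every open.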
 
The patch adds an option (off by default) on IndexWriterConfig to call this 
before merging. Ideally we wouldn't need this and would just validate as we 
merge, but that requires some codec/merge API changes...

File format changes are backwards compatible.

> Add checksums to Lucene segment files
> -------------------------------------
>
>                 Key: LUCENE-2446
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2446
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Lance Norskog
>              Labels: checksum
>         Attachments: LUCENE-2446.patch
>
>
> It would be useful for the different files in a Lucene index to include 
> checksums. This would make it easy to spot corruption while copying index 
> files around; the various cloud efforts assume many more data-copying 
> operations than older single-index implementations.
> This feature might be much easier to implement if all index files are created 
> in a sequential fashion. This issue therefore depends on [LUCENE-2373].



--
This message was sent by Atlassian JIRA
(v6.2#6252)
