[ https://issues.apache.org/jira/browse/LUCENE-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-2446:
--------------------------------
Attachment: LUCENE-2446.patch
I think this is a pretty important issue: besides the case of distributed
systems copying files around, there is today no integrity mechanism to detect
hardware issues (which can cause developers to pull their hair out trying to
debug corruptions), and we have some optimized components doing bulk merges,
which can propagate corruption to new segments over a long time.
Also, in recent JVMs computing a checksum is fast: e.g. in Java 8, CRC32 is an
intrinsic and uses CLMUL hardware instructions on x86, and so on.
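
For reference, a minimal sketch of computing a zlib-compatible CRC-32 with the
JDK's java.util.zip.CRC32. The class and method names here (Crc32Demo, crc32)
are made up for illustration and are not part of the patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative only: Crc32Demo is a hypothetical name, not a Lucene class.
final class Crc32Demo {
    /** Returns the CRC-32 of the given bytes as an unsigned 32-bit value held in a long. */
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        // "123456789" is the standard CRC-32 check input; prints cbf43926
        System.out.printf("%08x%n", crc32(data));
    }
}
```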
I created an initial patch: the last 8 bytes of every file are a zlib-crc32
checksum. We also write some additional metadata before it (it's done via
CodecUtil.writeFooter) so we can extend it in the future if we need to.
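
A rough sketch of what such a footer could look like, using only the JDK. The
layout below (a magic int and an algorithm id, followed by the 8-byte checksum)
and all names and constants are assumptions for illustration; the real format
is whatever CodecUtil.writeFooter in the patch writes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative only: names, constants, and layout are hypothetical.
final class FooterSketch {
    static final int FOOTER_MAGIC = 0x464F4F54; // ASCII "FOOT", made up for this sketch
    static final int ALGORITHM_CRC32 = 0;       // made-up id leaving room for other algorithms
    static final int FOOTER_LENGTH = 4 + 4 + 8; // magic + algorithm id + checksum

    /** Returns payload + metadata + CRC-32 of everything before it, stored as the last 8 bytes. */
    static byte[] withFooter(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(payload.length + FOOTER_LENGTH);
        buf.put(payload).putInt(FOOTER_MAGIC).putInt(ALGORITHM_CRC32);
        CRC32 crc = new CRC32();
        crc.update(buf.array(), 0, buf.position()); // covers payload and footer metadata
        buf.putLong(crc.getValue());                // big-endian, the ByteBuffer default
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] file = withFooter("hello".getBytes(StandardCharsets.US_ASCII));
        System.out.println(file.length); // 5 payload bytes + 16 footer bytes
    }
}
```

Keeping the checksum as the very last field means a reader can find it from the
file length alone, while the metadata in front of it leaves room to change the
algorithm later.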
For small metadata files (e.g. .fnm, .si, .dvm, ...) we just verify when we
open, because we are reading the file anyway. So this provides some extra
safety.
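
As an illustration of verify-on-open for a small file that is already fully in
memory, here is a sketch that assumes only what is stated above: the last 8
bytes hold the CRC-32 of everything before them. FooterCheck, isIntact and seal
are hypothetical names, not the patch's API.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Illustrative only: not Lucene code.
final class FooterCheck {
    /** Returns true if the trailing 8-byte value equals the CRC-32 of all preceding bytes. */
    static boolean isIntact(byte[] file) {
        if (file.length < 8) {
            return false; // too short to even hold a checksum
        }
        int bodyLength = file.length - 8;
        CRC32 crc = new CRC32();
        crc.update(file, 0, bodyLength);
        long stored = ByteBuffer.wrap(file, bodyLength, 8).getLong();
        return crc.getValue() == stored;
    }

    /** Helper for the demo: appends the CRC-32 of body as 8 trailing bytes. */
    static byte[] seal(byte[] body) {
        CRC32 crc = new CRC32();
        crc.update(body, 0, body.length);
        return ByteBuffer.allocate(body.length + 8).put(body).putLong(crc.getValue()).array();
    }

    public static void main(String[] args) {
        byte[] file = seal(new byte[] {1, 2, 3, 4});
        System.out.println(isIntact(file)); // true
        file[0] ^= 0x01;                    // simulate a single flipped bit
        System.out.println(isIntact(file)); // false
    }
}
```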
For larger files this would be expensive: instead the patch adds
AtomicReader.validate(), which asks the codec (or filter reader, or whatever)
to ensure everything is valid. This is called by e.g. CheckIndex before
decoding. The patch also adds an option (off by default) on IndexWriterConfig
to call this before merging. Ideally we wouldn't need this and would just
validate as we merge, but that requires some codec/merge API changes...
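
To show why whole-file validation of a large file is a separate, explicit
step, here is a sketch of a streaming check that reads the file once in
fixed-size chunks. It is not the patch's AtomicReader.validate(); the names are
hypothetical and it assumes only that the last 8 bytes hold the CRC-32 of the
preceding bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

// Illustrative only: not Lucene code.
final class StreamingValidate {
    /** Streams length bytes from in and compares the running CRC-32 with the trailing 8 bytes. */
    static boolean validate(InputStream in, long length) {
        if (length < 8) {
            return false;
        }
        try {
            CheckedInputStream checked = new CheckedInputStream(in, new CRC32());
            DataInputStream data = new DataInputStream(checked);
            byte[] buffer = new byte[8192];
            long remaining = length - 8; // everything except the stored checksum
            while (remaining > 0) {
                int n = data.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (n < 0) {
                    return false; // file is shorter than expected
                }
                remaining -= n;
            }
            long actual = checked.getChecksum().getValue(); // captured before reading the footer
            long stored = data.readLong();
            return actual == stored;
        } catch (EOFException e) {
            return false; // truncated checksum
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = {10, 20, 30, 40};
        CRC32 crc = new CRC32();
        crc.update(body, 0, body.length);
        byte[] file = java.nio.ByteBuffer.allocate(body.length + 8)
                .put(body).putLong(crc.getValue()).array();
        System.out.println(validate(new ByteArrayInputStream(file), file.length)); // true
    }
}
```

The cost is one full sequential read of the file, which is why it makes sense
as an opt-in step rather than something done on every open.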
File format changes are backwards compatible.
> Add checksums to Lucene segment files
> -------------------------------------
>
> Key: LUCENE-2446
> URL: https://issues.apache.org/jira/browse/LUCENE-2446
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Lance Norskog
> Labels: checksum
> Attachments: LUCENE-2446.patch
>
>
> It would be useful for the different files in a Lucene index to include
> checksums. This would make it easy to spot corruption while copying index
> files around; the various cloud efforts assume many more data-copying
> operations than older single-index implementations.
> This feature might be much easier to implement if all index files are created
> in a sequential fashion. This issue therefore depends on [LUCENE-2373].