mikemccand commented on issue #15552: URL: https://github.com/apache/lucene/issues/15552#issuecomment-3751525420
> I'm trying to understand the use-case, for s3 wouldn't you use its native checksum functionality instead?

Yeah we do also use S3's checksums ... I thought it was still helpful to also validate Lucene's checksum? We confirm the checksum at each step of a segment file's journey...

Otherwise we'd be vulnerable to bit flips that happen after `IndexWriter` successfully writes the file, but before it is copied out to S3? E.g. if the local storage (where `IndexWriter` writes the index) is a bit-flipper, or the RAM of the indexer box is, then when we read the bytes and send them to S3, if what we sent to S3 has a bit flipped, S3 would think it's fine (its checksum matches what we sent) and we'd have a corrupt index replicated to S3. Downloading would bring the corruption to replicas, and the S3 checksum would still be fine (S3 stores the bit-flipped version).

Lucene might detect the bit flip when a replica lights the segment, if it was in a small/metadata index file (those we check immediately on opening the segment files), but if it's in a large file, it won't be caught right away? Though if a replica hits certain exceptions, Lucene will go validate the checksum I think? (Or is that only when indexing?)

But ... if the indexing box's RAM or local storage is a bit-flipper, chances are `IndexWriter` would eventually detect the corruption (e.g. we `checkIntegrity` all segment files when merging)... the chance of a bit flipping in that window before/during S3 replication and NOT being hit by `IndexWriter` is probably low in general.

So yeah, maybe we should turn off validating Lucene checksums and rely entirely on S3; then we can use S3's chunked uploading ...

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
For additional commands, e-mail: [email protected]
