[ 
https://issues.apache.org/jira/browse/CASSANDRA-19369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814508#comment-17814508
 ] 

Francisco Guerrero edited comment on CASSANDRA-19369 at 2/5/24 9:16 PM:
------------------------------------------------------------------------

[~smiklosovic] Cassandra Analytics creates SSTables during bulk writes. For 
each SSTable component generated, we calculate a digest of the file (this 
includes the crc32 file), and the file is then uploaded along with that 
digest. The purpose of this checksum is to verify the integrity of each of 
the SSTable component files during transmission from the Spark executor to 
the Cassandra Sidecar service, rather than the integrity of the data file.

For data integrity, the bulk writer does the following:
- Computes a checksum of each generated file
- Re-reads the generated SSTable file and ensures that what was written is 
the same as what is read back
- Transfers the file with a checksum header
- (On Sidecar) Validates that the checksum matches the uploaded file
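
The checksum, transfer and validation steps above can be sketched roughly as 
follows. This is only an illustration, not the Cassandra Analytics or Sidecar 
code: the function names are hypothetical, the base64 encoding of the header 
value is an assumption, and MD5 is used because it ships with the Python 
standard library (an XXHash32 digest would need a third-party package). Only 
the {{content-md5}} header name comes from the issue description.

```python
import base64
import hashlib
from pathlib import Path


def md5_digest_b64(path: Path, chunk_size: int = 64 * 1024) -> str:
    """Return the base64-encoded MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read in fixed-size chunks so large component files are never
        # loaded into memory all at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def build_upload_headers(path: Path) -> dict:
    """Writer side (hypothetical helper): attach the digest to the upload."""
    # Assumption: the header value is the base64-encoded digest.
    return {"content-md5": md5_digest_b64(path)}


def validate_upload(received: Path, headers: dict) -> bool:
    """Receiver side (hypothetical helper): recompute and compare digests."""
    return md5_digest_b64(received) == headers["content-md5"]
```

In this sketch the writer would call build_upload_headers() once per SSTable 
component file, and the receiver would reject any upload for which 
validate_upload() returns False, for example after a bit flip in transit.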


was (Author: frankgh):
[~smiklosovic] Cassandra Analytics creates an SSTable during bulk writes. For 
each SSTable component generated we calculate the digest of each file (this 
includes the crc32 file), which is then uploaded. The purpose of this checksum 
is to prevent file integrity of each of the SSTable component files, rather 
than integrity of the data file.

For data integrity, bulk writer does the following:
- Checksums of each file generated
- Re-read the generated SSTable file and ensure that what was written is the 
same as what we read.
- Transfer the file with a checksum header
- (On Sidecar) Validate that the checksum matches the uploaded file

> [Analytics] Use XXHash32 for digest calculation of SSTables
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-19369
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19369
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Analytics Library
>            Reporter: Francisco Guerrero
>            Assignee: Francisco Guerrero
>            Priority: Normal
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> During bulk writes, Cassandra Analytics calculates the MD5 checksum of every 
> SSTable it produces. During SSTable upload to Cassandra Sidecar, Cassandra 
> Analytics includes the {{content-md5}} header as part of the upload request. 
> This information is used by Cassandra Sidecar to validate the integrity of 
> the uploaded SSTable and prevent issues with bit flips and corrupted SSTables.
> Recently, Cassandra Sidecar introduced [support for additional checksum 
> validations|https://issues.apache.org/jira/browse/CASSANDRASC-97] during 
> SSTable upload. Notably, XXHash32 digest support was added, which offers 
> more performant checksum calculations. This support now allows Cassandra 
> Analytics to use a more efficient digest algorithm that is lighter on the 
> CPU usage of Sidecar and Spark resources.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
