[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646656#comment-17646656 ]
ASF GitHub Bot commented on PARQUET-1539:
-----------------------------------------

pitrou commented on PR #126:
URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348674622

   @wgtmac No particular rule, no. AFAIU we only synchronize when we want to get meaningful spec changes.

> Clarify CRC checksum in page header
> -----------------------------------
>
>                 Key: PARQUET-1539
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1539
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Boudewijn Braams
>            Assignee: Boudewijn Braams
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: format-2.7.0
>
>
> Although a page-level CRC field is defined in the Thrift specification,
> currently neither parquet-cpp nor parquet-mr leverages it. Moreover, the
> [comment|https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607]
> in the Thrift specification reads ‘32bit crc for the data below’, which is
> somewhat ambiguous as to what exactly constitutes the ‘data’ that the checksum
> should be calculated on. To ensure backward- and cross-compatibility of
> Parquet readers/writers which do want to leverage the CRC checksums, the
> format should specify exactly how and on what data the checksum should be
> calculated.
> h2. Alternatives
> There are three main choices to be made here:
> # Which variant of CRC32 to use
> # Whether to include the page header itself in the checksum calculation
> # Whether to calculate the checksum on uncompressed or compressed data
> h3. Algorithm
> The CRC field holds a 32-bit value. There are many different variants of the
> original CRC32 algorithm, each producing different values for the same input.
> For ease of implementation we propose to use the standard CRC32 algorithm.
> h3. Including page header
> The page header itself could be included in the checksum calculation using an
> approach similar to what TCP does, whereby the checksum field itself is
> zeroed out before calculating the checksum that will be inserted there.
> Evidently, including the page header is better in the sense that it increases
> the data covered by the checksum. However, from an implementation
> perspective, not including it is likely easier. Furthermore, given the
> relatively small size of the page header compared to the page itself, simply
> not including it will likely be good enough.
> h3. Compressed vs uncompressed
> *Compressed*
> Pros
> * Inherently faster, less data to operate on
> * Potentially better triaging when determining where a corruption may have
> been introduced, as checksum is calculated in a later stage
> Cons
> * We have to trust both the encoding stage and the compression stage
> *Uncompressed*
> Pros
> * We only have to trust the encoding stage
> * Possibly able to detect more corruptions, as data is checksummed at
> earliest possible moment, checksum will be more sensitive to corruption
> introduced further down the line
> Cons
> * Inherently slower, more data to operate on, always need to decompress first
> * Potentially harder triaging, more stages in which corruption could have
> been introduced
> h2. Proposal
> The checksum will be calculated using the *standard CRC32 algorithm*, whereby
> the checksum is to be calculated on the *data only, not including the page
> header* itself (simple implementation) and the checksum will be calculated on
> *compressed data* (inherently faster, likely better triaging).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
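
For illustration only, the proposal quoted above could be sketched roughly as follows. This is not code from parquet-cpp or parquet-mr: the helper names `page_crc` and `verify_page` are hypothetical, and the sketch assumes that "standard CRC32" means the variant implemented by Python's `zlib.crc32`, and that a reader may see the stored value as a signed 32-bit integer.

```python
import zlib


def page_crc(compressed_page_bytes: bytes) -> int:
    """Checksum of the compressed page data only; the page header is excluded.

    Hypothetical helper. Assumes zlib.crc32 is the intended CRC32 variant.
    """
    # Mask to an unsigned 32-bit value so the result has a stable representation.
    return zlib.crc32(compressed_page_bytes) & 0xFFFFFFFF


def verify_page(compressed_page_bytes: bytes, stored_crc: int) -> bool:
    """Compare a stored checksum against one recomputed from the page data.

    Hypothetical helper. stored_crc may arrive as a signed 32-bit integer
    (an assumption about how the field is read), so it is masked as well.
    """
    return page_crc(compressed_page_bytes) == (stored_crc & 0xFFFFFFFF)


# Writer side: compress the encoded page first, then checksum the compressed bytes.
encoded_page = b"example encoded page contents" * 8
compressed_page = zlib.compress(encoded_page)
crc = page_crc(compressed_page)

# Reader side: verification happens before decompression.
page_is_intact = verify_page(compressed_page, crc)
```

Because the checksum covers the compressed bytes, a reader can reject a corrupted page without decompressing it, which is the performance argument made in the issue.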