[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646585#comment-17646585 ]
ASF GitHub Bot commented on PARQUET-1539:
-----------------------------------------

mapleFU commented on PR #126:
URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348324323

> (also cc @mapleFU, who's working on CRC support for Parquet C++)

Hi all, I have a question here. The format says:

```
/** The 32bit CRC for the page, to be calculated as follows:
 * - Using the standard CRC32 algorithm
 * - On the data only, i.e. this header should not be included. 'Data'
 *   hereby refers to the concatenation of the repetition levels, the
 *   definition levels and the column values, in this exact order.
 * - On the encoded versions of the repetition levels, definition levels and
 *   column values
 * - On the compressed versions of the repetition levels, definition levels
 *   and column values where possible;
 * - For v1 data pages, the repetition levels, definition levels and column
 *   values are always compressed together. If a compression scheme is
 *   specified, the CRC shall be calculated on the compressed version of
 *   this concatenation. If no compression scheme is specified, the CRC
 *   shall be calculated on the uncompressed version of this concatenation.
 * - For v2 data pages, the repetition levels and definition levels are
 *   handled separately from the data and are never compressed (only
 *   encoded). If a compression scheme is specified, the CRC shall be
 *   calculated on the concatenation of the uncompressed repetition levels,
 *   uncompressed definition levels and the compressed column values.
 *   If no compression scheme is specified, the CRC shall be calculated on
 *   the uncompressed concatenation.
 * - In encrypted columns, CRC is calculated after page encryption; the
 *   encryption itself is performed after page compression (if compressed)
 * If enabled, this allows for disabling checksumming in HDFS if only a few
 * pages need to be read.
 **/
```

and in the `README`:

```
Data pages can be individually checksummed.
```

But in our code, we have:

```c++
int64_t WriteDictionaryPage(const DictionaryPage& page) override {
  // TODO(PARQUET-594) crc checksum
  ...
}
```

So, could a DICTIONARY_PAGE or even an INDEX_PAGE have a CRC? /cc @pitrou

> Clarify CRC checksum in page header
> -----------------------------------
>
> Key: PARQUET-1539
> URL: https://issues.apache.org/jira/browse/PARQUET-1539
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Boudewijn Braams
> Assignee: Boudewijn Braams
> Priority: Major
> Labels: pull-request-available
> Fix For: format-2.7.0
>
> Although a page-level CRC field is defined in the Thrift specification,
> currently neither parquet-cpp nor parquet-mr leverages it. Moreover, the
> [comment|https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607]
> in the Thrift specification reads ‘32bit crc for the data below’, which is
> somewhat ambiguous as to what exactly constitutes the ‘data’ that the
> checksum should be calculated on. To ensure backward- and cross-compatibility
> of Parquet readers/writers which do want to leverage the CRC checksums, the
> format should specify exactly how and on what data the checksum should be
> calculated.
>
> h2. Alternatives
> There are three main choices to be made here:
> # Which variant of CRC32 to use
> # Whether to include the page header itself in the checksum calculation
> # Whether to calculate the checksum on uncompressed or compressed data
>
> h3. Algorithm
> The CRC field holds a 32-bit value. There are many different variants of the
> original CRC32 algorithm, each producing different values for the same input.
> For ease of implementation we propose to use the standard CRC32 algorithm.
>
> h3. Including page header
> The page header itself could be included in the checksum calculation using an
> approach similar to what TCP does, whereby the checksum field itself is
> zeroed out before calculating the checksum that will be inserted there.
> Including the page header is better in the sense that it increases the data
> covered by the checksum; however, not including it is easier to implement.
> Furthermore, given the relatively small size of the page header compared to
> the page itself, simply not including it will likely be good enough.
>
> h3. Compressed vs uncompressed
> *Compressed*
> Pros
> * Inherently faster, less data to operate on
> * Potentially better triaging when determining where a corruption may have
> been introduced, as the checksum is calculated at a later stage
> Cons
> * We have to trust both the encoding stage and the compression stage
>
> *Uncompressed*
> Pros
> * We only have to trust the encoding stage
> * Possibly able to detect more corruptions: the data is checksummed at the
> earliest possible moment, so the checksum is more sensitive to corruption
> introduced further down the line
> Cons
> * Inherently slower, more data to operate on, always need to decompress first
> * Potentially harder triaging, more stages in which corruption could have
> been introduced
>
> h2. Proposal
> The checksum will be calculated using the *standard CRC32 algorithm*. The
> checksum is to be calculated on the *data only, not including the page
> header* itself (simple implementation), and on the *compressed data*
> (inherently faster, likely better triaging).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)