[ https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646585#comment-17646585 ]

ASF GitHub Bot commented on PARQUET-1539:
-----------------------------------------

mapleFU commented on PR #126:
URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348324323

   > (also cc @mapleFU, who's working on CRC support for Parquet C++)
   
   Hi all, I have a question here. The format says:
   
   ```
     /** The 32bit CRC for the page, to be calculated as follows:
      * - Using the standard CRC32 algorithm
      * - On the data only, i.e. this header should not be included. 'Data'
      *   hereby refers to the concatenation of the repetition levels, the
      *   definition levels and the column value, in this exact order.
      * - On the encoded versions of the repetition levels, definition levels and
      *   column values
      * - On the compressed versions of the repetition levels, definition levels
      *   and column values where possible;
      *   - For v1 data pages, the repetition levels, definition levels and column
      *     values are always compressed together. If a compression scheme is
      *     specified, the CRC shall be calculated on the compressed version of
      *     this concatenation. If no compression scheme is specified, the CRC
      *     shall be calculated on the uncompressed version of this concatenation.
      *   - For v2 data pages, the repetition levels and definition levels are
      *     handled separately from the data and are never compressed (only
      *     encoded). If a compression scheme is specified, the CRC shall be
      *     calculated on the concatenation of the uncompressed repetition levels,
      *     uncompressed definition levels and the compressed column values.
      *     If no compression scheme is specified, the CRC shall be calculated on
      *     the uncompressed concatenation.
      * - In encrypted columns, CRC is calculated after page encryption; the
      *   encryption itself is performed after page compression (if compressed)
      * If enabled, this allows for disabling checksumming in HDFS if only a few
      * pages need to be read.
      **/
   ```
   
   and in `README`:
   
   ```
   Data pages can be individually checksummed. 
   ```
   
   But in our code, we have:
   
   ```c++
   int64_t WriteDictionaryPage(const DictionaryPage& page) override {
       // TODO(PARQUET-594) crc checksum
       ...
   }
   ```
   
   So, could a DICTIONARY_PAGE or even an INDEX_PAGE have a CRC? /cc @pitrou
   
   
   




> Clarify CRC checksum in page header
> -----------------------------------
>
>                 Key: PARQUET-1539
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1539
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Boudewijn Braams
>            Assignee: Boudewijn Braams
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: format-2.7.0
>
>
> Although a page-level CRC field is defined in the Thrift specification, 
> currently neither parquet-cpp nor parquet-mr leverage it. Moreover, the 
> [comment|https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607]
>  in the Thrift specification reads ‘32bit crc for the data below’, which is 
> somewhat ambiguous as to what exactly constitutes the ‘data’ that the checksum 
> should be calculated on. To ensure backward- and cross-compatibility of 
> Parquet readers/writers which do want to leverage the CRC checksums, the 
> format should specify exactly how and on what data the checksum should be 
> calculated.
> h2. Alternatives
> There are three main choices to be made here:
> # Which variant of CRC32 to use
> # Whether to include the page header itself in the checksum calculation
> # Whether to calculate the checksum on uncompressed or compressed data
> h3. Algorithm
> The CRC field holds a 32-bit value. There are many different variants of the 
> original CRC32 algorithm, each producing different values for the same input. 
> For ease of implementation we propose to use the standard CRC32 algorithm.
> h3. Including page header
> The page header itself could be included in the checksum calculation using an 
> approach similar to what TCP does, whereby the checksum field itself is 
> zeroed out before calculating the checksum that will be inserted there. 
> Evidently, including the page header is better in the sense that it increases 
> the data covered by the checksum. However, from an implementation 
> perspective, not including it is likely easier. Furthermore, given the 
> relatively small size of the page header compared to the page itself, simply 
> not including it will likely be good enough.
> h3. Compressed vs uncompressed
> *Compressed*
>  Pros
>  * Inherently faster, less data to operate on
>  * Potentially better triaging when determining where a corruption may have 
> been introduced, as checksum is calculated in a later stage
> Cons
>  * We have to trust both the encoding stage and the compression stage
> *Uncompressed*
>  Pros
>  * We only have to trust the encoding stage
>  * Possibly able to detect more corruptions: the data is checksummed at the 
> earliest possible moment, so the checksum is sensitive to corruption 
> introduced at any later stage
> Cons
>  * Inherently slower, more data to operate on, always need to decompress first
>  * Potentially harder triaging, more stages in which corruption could have 
> been introduced
> h2. Proposal
> The checksum will be calculated using the *standard CRC32 algorithm*, whereby 
> the checksum is to be calculated on the *data only, not including the page 
> header* itself (simple implementation) and the checksum will be calculated on 
> *compressed data* (inherently faster, likely better triaging). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
