mapleFU commented on PR #126:
URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348324323

   > (also cc @mapleFU, who's working on CRC support for Parquet C++)
   
   Hi all, I have a question. The format says:
   
   ```
     /** The 32bit CRC for the page, to be be calculated as follows:
      * - Using the standard CRC32 algorithm
      * - On the data only, i.e. this header should not be included. 'Data'
      *   hereby refers to the concatenation of the repetition levels, the
      *   definition levels and the column value, in this exact order.
      * - On the encoded versions of the repetition levels, definition levels and
      *   column values
      * - On the compressed versions of the repetition levels, definition levels
      *   and column values where possible;
      *   - For v1 data pages, the repetition levels, definition levels and column
      *     values are always compressed together. If a compression scheme is
      *     specified, the CRC shall be calculated on the compressed version of
      *     this concatenation. If no compression scheme is specified, the CRC
      *     shall be calculated on the uncompressed version of this concatenation.
      *   - For v2 data pages, the repetition levels and definition levels are
      *     handled separately from the data and are never compressed (only
      *     encoded). If a compression scheme is specified, the CRC shall be
      *     calculated on the concatenation of the uncompressed repetition levels,
      *     uncompressed definition levels and the compressed column values.
      *     If no compression scheme is specified, the CRC shall be calculated on
      *     the uncompressed concatenation.
      * - In encrypted columns, CRC is calculated after page encryption; the
      *   encryption itself is performed after page compression (if compressed)
      * If enabled, this allows for disabling checksumming in HDFS if only a few
      * pages need to be read.
      **/
   ```
   
   and in `README`:
   
   ```
   Data pages can be individually checksummed. 
   ```
   
   But in our code (Parquet C++), we have:
   
   ```c++
   int64_t WriteDictionaryPage(const DictionaryPage& page) override {
       // TODO(PARQUET-594) crc checksum
       ...
   }
   ```
   
   So the spec comment and the README only mention data pages. Could a DICTIONARY_PAGE, or even an INDEX_PAGE, have a CRC as well? /cc @pitrou
   
   
   

