[ 
https://issues.apache.org/jira/browse/ARROW-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217634#comment-17217634
 ] 

Wes McKinney commented on ARROW-10353:
--------------------------------------

Note that DataPageV2 is not recommended for production use

> [C++] Parquet decompresses DataPageV2 pages even if is_compressed==0
> --------------------------------------------------------------------
>
>                 Key: ARROW-10353
>                 URL: https://issues.apache.org/jira/browse/ARROW-10353
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Jan Finis
>            Priority: Major
>             Fix For: 3.0.0
>
>
> According to the parquet-format specification, DataPageV2 pages have an 
> is_compressed flag. Even if the column chunk has a decompression codec set, 
> the page is only compressed if this flag is true (this likely enables not 
> compressing some pages where the compression wouldn't save memory).
> Here is the relevant excerpt from parquet.thrift describing the semantics of 
> the is_compressed flag in a DataPageV2:
> _/** whether the values are compressed._
>  _Which means the section of the page between_
>  _definition_levels_byte_length + repetition_levels_byte_length + 1 and 
> compressed_page_size (included)_
>  _is compressed with the compression_codec._
>  _If missing it is considered compressed */_
>  _7: optional bool is_compressed = 1;_
>  
> It seems that the apache parquet cpp library (haven't checked other languages 
> but might have the bug as well) totally disregards this flag and decompresses 
> the page in all cases if a decompressor is set for the column chunk.
> The erroneous code is in column_reader.cc: 
> std::shared_ptr<Page> SerializedPageReader::NextPage() 
> This method first decompresses the page if there is a decompressor set and 
> only then does a case distinction on whether this page is a DataPageV2 and 
> has the is_compressed flag. Thus, even if the page would have this flag set 
> to 0, the page would be decompressed anyway.
> The method that should use the is_compressed flag but doesn't is:
> std::shared_ptr<Buffer> SerializedPageReader::DecompressPage
> This method doesn't look at the is_compressed flag at all.
>  
> The reason why this bug probably doesn't show in any unit test is that the 
> write implementation seems to do the same mistake: It always compresses the 
> page, even if the page has its is_compressed flag set to false.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to