[ 
https://issues.apache.org/jira/browse/ARROW-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Finis updated ARROW-10353:
------------------------------
    Description: 
According to the parquet-format specification, DataPageV2 pages have an 
is_compressed flag. Even if the column chunk has a compression codec set, a 
page is compressed only if this flag is true (this likely allows skipping 
compression for pages where it would not save space).

Here is the relevant excerpt from parquet.thrift describing the semantics of 
the is_compressed flag in a DataPageV2:

/** whether the values are compressed.
Which means the section of the page between
definition_levels_byte_length + repetition_levels_byte_length + 1 and compressed_page_size (included)
is compressed with the compression_codec.
If missing it is considered compressed */
7: optional bool is_compressed = 1;

 

It seems that the Apache Parquet C++ library (I haven't checked other language 
implementations, which might have the same bug) completely disregards this flag 
and decompresses the page in all cases if a decompressor is set for the column 
chunk.

The erroneous code is in column_reader.cc: 

std::shared_ptr<Page> SerializedPageReader::NextPage() 

This method first decompresses the page if a decompressor is set and only 
afterwards distinguishes whether the page is a DataPageV2 with the 
is_compressed flag. Thus, even if the page has this flag set to false, it is 
decompressed anyway.

The method that should honor the is_compressed flag but doesn't is:

std::shared_ptr<Buffer> SerializedPageReader::DecompressPage

It does not consult the is_compressed flag at all.

 

The reason this bug probably doesn't show up in any unit test is that the 
write path appears to make the same mistake: it always compresses the page, 
even if the page has its is_compressed flag set to false.

 

> Arrow Parquet Cpp decompresses DataPageV2 pages even if is_compressed==0
> ------------------------------------------------------------------------
>
>                 Key: ARROW-10353
>                 URL: https://issues.apache.org/jira/browse/ARROW-10353
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Jan Finis
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
