[ https://issues.apache.org/jira/browse/ARROW-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou updated ARROW-10353: ----------------------------------- Summary: [C++] Parquet decompresses DataPageV2 pages even if is_compressed==0 (was: Arrow Parquet Cpp decompresses DataPageV2 pages even if is_compressed==0) > [C++] Parquet decompresses DataPageV2 pages even if is_compressed==0 > -------------------------------------------------------------------- > > Key: ARROW-10353 > URL: https://issues.apache.org/jira/browse/ARROW-10353 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Reporter: Jan Finis > Priority: Major > Fix For: 3.0.0 > > > According to the parquet-format specification, DataPageV2 pages have an > is_compressed flag. Even if the column chunk has a decompression codec set, > the page is only compressed if this flag is true (this likely enables not > compressing some pages where the compression wouldn't save memory). > Here is the relevant excerpt from parquet.thrift describing the semantics of > the is_compressed flag in a DataPageV2: > _/** whether the values are compressed._ > _Which means the section of the page between_ > _definition_levels_byte_length + repetition_levels_byte_length + 1 and > compressed_page_size (included)_ > _is compressed with the compression_codec._ > _If missing it is considered compressed */_ > _7: optional bool is_compressed = 1;_ > > It seems that the apache parquet cpp library (haven't checked other languages > but might have the bug as well) totally disregards this flag and decompresses > the page in all cases if a decompressor is set for the column chunk. > The erroneous code is in column_reader.cc: > std::shared_ptr<Page> SerializedPageReader::NextPage() > This method first decompresses the page if there is a decompressor set and > only then does a case distinction on whether this page is a DataPageV2 and > has the is_compressed flag. Thus, even if the page would have this flag set > to 0, the page would be decompressed anyway. > The method that should use the is_compressed flag but doesn't is: > std::shared_ptr<Buffer> SerializedPageReader::DecompressPage > This method doesn't look at the is_compressed flag at all. > > The reason why this bug probably doesn't show in any unit test is that the > write implementation seems to do the same mistake: It always compresses the > page, even if the page has its is_compressed flag set to false. > -- This message was sent by Atlassian Jira (v8.3.4#803005)