[ https://issues.apache.org/jira/browse/PARQUET-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chao Sun reassigned PARQUET-1249:
---------------------------------

    Assignee: Chao Sun

> Clarify encoding schemes for boolean types
> ------------------------------------------
>
>                 Key: PARQUET-1249
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1249
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>            Priority: Major
>
> In the Parquet format specification, under [the section for Plain
> encoding|https://github.com/apache/parquet-format/blob/master/Encodings.md#plain-plain--0],
> boolean is encoded using the deprecated bit-packed encoding. However, [the
> section for bit-packed
> encoding|https://github.com/apache/parquet-format/blob/master/Encodings.md#bit-packed-deprecated-bit_packed--4]
> specifies that it is only used for repetition/definition levels. This seems
> contradictory.
>
> [The section for RLE/bit-packed hybrid
> encoding|https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3]
> says "_Boolean values in data pages, as an alternative to PLAIN encoding_".
> Perhaps we should be specific and indicate that this is only used for data
> page V2?
>
> Also, implementation-wise, I saw that parquet-cpp still encodes booleans as
> plain 1-bit values, while parquet-mr uses the bit-packed encoding as described
> in the specification. Perhaps these should be consolidated.
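
To make concrete why the choice of bit-packing scheme matters for booleans, here is a minimal sketch of two ways to pack 1-bit values into bytes. It assumes, as a reading of Encodings.md rather than a confirmed fact about either implementation, that one scheme fills each byte starting from the least significant bit and the other starting from the most significant bit. The function names `pack_lsb_first` and `pack_msb_first` are hypothetical and do not come from parquet-cpp or parquet-mr.

```python
def pack_lsb_first(values):
    """Pack booleans 8 per byte; the first value goes in the least significant bit.

    Assumption: this is the bit order a "plain 1-bit" boolean writer would use.
    """
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def pack_msb_first(values):
    """Pack booleans 8 per byte; the first value goes in the most significant bit.

    Assumption: this is the bit order of the deprecated bit-packed scheme
    when the bit width is 1.
    """
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


# Nine illustrative values: the last byte is zero-padded in both schemes.
values = [True, False, True, True, False, False, False, False, True]

print(pack_lsb_first(values).hex())  # 0d01
print(pack_msb_first(values).hex())  # b080
```

The same nine booleans produce different bytes under the two orderings, so a reader and a writer that disagree on the scheme would silently decode wrong values.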