[ 
https://issues.apache.org/jira/browse/PARQUET-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned PARQUET-1249:
---------------------------------

    Assignee: Chao Sun

> Clarify encoding schemes for boolean types
> ------------------------------------------
>
>                 Key: PARQUET-1249
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1249
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>            Priority: Major
>
> In the Parquet format specification, under [the section for Plain 
> encoding|https://github.com/apache/parquet-format/blob/master/Encodings.md#plain-plain--0],
>  boolean is encoded using the deprecated bit-packed encoding. However, [the 
> section for bit-packed 
> encoding|https://github.com/apache/parquet-format/blob/master/Encodings.md#bit-packed-deprecated-bit_packed--4]
>  specifies that it is only used for repetition/definition levels. This seems 
> contradictory. 
> [The section for RLE/bit-packed hybrid 
> encoding|https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3]
>  says "_Boolean values in data pages, as an alternative to PLAIN encoding_". 
> Perhaps we should be more specific and state that this applies only to data page V2?
> Also, implementation-wise, parquet-cpp appears to still encode booleans as plain 
> 1-bit values, while parquet-mr uses the bit-packed encoding described in the 
> specification. Perhaps the two implementations should be made consistent.
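
For reference, the PLAIN section linked above describes boolean values as bit packed, LSB first. The following is a minimal sketch of that layout only. The function names are hypothetical, and the code is not taken from parquet-mr or parquet-cpp.

```python
def pack_booleans_lsb_first(values):
    """Pack booleans eight per byte, least significant bit first.

    Hypothetical helper for illustration; not part of any Parquet library.
    """
    out = bytearray((len(values) + 7) // 8)  # round up to whole bytes
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)  # value i lands in bit (i mod 8)
    return bytes(out)


def unpack_booleans_lsb_first(data, count):
    """Inverse of pack_booleans_lsb_first; count is the number of values."""
    return [bool((data[i // 8] >> (i % 8)) & 1) for i in range(count)]


if __name__ == "__main__":
    flags = [True, False, True, True, False, False, False, False, True]
    packed = pack_booleans_lsb_first(flags)
    print(packed.hex())
    print(unpack_booleans_lsb_first(packed, len(flags)) == flags)
```

The number of values has to be known separately when decoding, because the final byte is zero-padded.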



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
