Re: Inconsistencies between parquet.thrift and parquet-mr/Hive

Lars Volker Thu, 05 Jan 2017 11:10:00 -0800

I created PARQUET-826 <https://issues.apache.org/jira/browse/PARQUET-826>
to track this and submitted PR #48
<https://github.com/apache/parquet-format/pull/48> to address it.


On Fri, Dec 16, 2016 at 8:06 PM, Lars Volker <[email protected]> wrote:

> Hi All,
>
> I'm currently working on adding support for writing min/max statistics to
> Parquet files to Impala (IMPALA-3909
> <https://issues.cloudera.org/browse/IMPALA-3909>). I noticed, that the
> comments in parquet.thrift#L201
> <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L201>
>  don't
> seem to match the implementations in parquet-mr and Hive.
>
> The comments ask for min/max statistics to be "*encoded in PLAIN encoding*".
> For strings (BYTE_ARRAY), this should be "*4 byte length stored as little
> endian, followed by bytes*".
>
> Looking at BinaryStatistics.java#L61
> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L61>,
> it seems to return the bytes without a length-prefix. Writing a parquet
> file with Hive also shows this behavior.
>
> Is this the intended behavior? If so, we might want to add a description
> to the Statistics struct in parquet.thrift to elaborate on the intrinsics
> of storing string values there.
>
> Similarly, but less ambiguous, PLAIN encoding for booleans uses
> bit-packing. It seems to be implied that for a single bit (min/max of a
> boolean column) it means setting the least significant bit of a single
> byte. This could be made more clear in the parquet.thrift file, too.
>
> I'm curious to hear your feedback. Let me know if you think we should
> change the parquet.thrift file and I'll happily send a PR.
>
> Cheers, Lars
>

Re: Inconsistencies between parquet.thrift and parquet-mr/Hive

Reply via email to