Hi All,

I'm currently working on adding support to Impala for writing min/max
statistics to Parquet files (IMPALA-3909
<https://issues.cloudera.org/browse/IMPALA-3909>). I noticed that the
comments in parquet.thrift#L201
<https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L201>
don't seem to match the implementations in parquet-mr and Hive.

The comments ask for min/max statistics to be "*encoded in PLAIN encoding*".
For strings (BYTE_ARRAY), this should be "*4 byte length stored as little
endian, followed by bytes*".

Looking at BinaryStatistics.java#L61
<https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L61>,
it seems to return the bytes without a length-prefix. Writing a parquet
file with Hive also shows this behavior.
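
To make the difference concrete, here is a minimal Python sketch of the two
byte layouts for a string min/max value. The function names are hypothetical
and are not taken from parquet-mr, Hive, or Impala; the first follows the
PLAIN encoding as described in the parquet.thrift comment, the second is what
the existing writers appear to produce.

```python
import struct


def plain_encode_byte_array(value: bytes) -> bytes:
    """PLAIN encoding of a BYTE_ARRAY as the parquet.thrift comment describes it:
    a 4-byte little-endian length, followed by the bytes themselves."""
    return struct.pack("<I", len(value)) + value


def raw_bytes(value: bytes) -> bytes:
    """Layout that parquet-mr and Hive appear to write: the bytes only,
    with no length prefix (hypothetical helper, for comparison)."""
    return value


value = "abc".encode("utf-8")
print(plain_encode_byte_array(value).hex())  # 03000000616263
print(raw_bytes(value).hex())                # 616263
```

A reader expecting the first layout would interpret the leading bytes of the
second as a length, so the two are not interchangeable.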

Is this the intended behavior? If so, we might want to add a description to
the Statistics struct in parquet.thrift that spells out exactly how string
values are stored there.

Similarly, though less ambiguously, PLAIN encoding for booleans uses
bit-packing. For a single value (the min or max of a boolean column), this
seems to imply setting the least significant bit of a single byte. This
could be made clearer in the parquet.thrift file, too.
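
As a sketch of that reading, assuming the LSB-first bit-packing that the
format uses for PLAIN booleans, a single min or max value would occupy one
byte. The helper name below is hypothetical and only illustrates the layout.

```python
def plain_encode_booleans(values):
    """Bit-pack booleans LSB first: value i goes into bit (i % 8) of byte (i // 8)."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


# A single statistic value ends up in the least significant bit of one byte.
print(plain_encode_booleans([True]).hex())   # 01
print(plain_encode_booleans([False]).hex())  # 00
```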

I'm curious to hear your feedback. Let me know if you think we should
change the parquet.thrift file and I'll happily send a PR.

Cheers, Lars
