I created PARQUET-826 <https://issues.apache.org/jira/browse/PARQUET-826> to track this and submitted PR #48 <https://github.com/apache/parquet-format/pull/48> to address it.
On Fri, Dec 16, 2016 at 8:06 PM, Lars Volker <[email protected]> wrote:

> Hi All,
>
> I'm currently working on adding support for writing min/max statistics to
> Parquet files to Impala (IMPALA-3909
> <https://issues.cloudera.org/browse/IMPALA-3909>). I noticed that the
> comments in parquet.thrift#L201
> <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L201>
> don't seem to match the implementations in parquet-mr and Hive.
>
> The comments ask for min/max statistics to be "*encoded in PLAIN encoding*".
> For strings (BYTE_ARRAY), this should be "*4 byte length stored as little
> endian, followed by bytes*".
>
> Looking at BinaryStatistics.java#L61
> <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L61>,
> it seems to return the bytes without a length prefix. Writing a Parquet
> file with Hive also shows this behavior.
>
> Is this the intended behavior? If so, we might want to add a description
> to the Statistics struct in parquet.thrift to elaborate on the intrinsics
> of storing string values there.
>
> Similarly, but less ambiguously, PLAIN encoding for booleans uses
> bit-packing. It seems to be implied that for a single bit (min/max of a
> boolean column) it means setting the least significant bit of a single
> byte. This could be made clearer in the parquet.thrift file, too.
>
> I'm curious to hear your feedback. Let me know if you think we should
> change the parquet.thrift file and I'll happily send a PR.
>
> Cheers, Lars
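
To make the discrepancy described in the quoted message concrete, here is a minimal Python sketch of the two candidate byte layouts for a BYTE_ARRAY min/max value, plus the single-byte boolean reading the message calls implied. The function names are hypothetical and are not taken from parquet-mr, Hive, or Impala; this only illustrates the layouts as the message describes them.

```python
import struct


def plain_encode_byte_array(value: bytes) -> bytes:
    """Layout quoted from the parquet.thrift comment (hypothetical helper):
    4-byte little-endian length, followed by the bytes themselves."""
    return struct.pack("<I", len(value)) + value


def raw_byte_array(value: bytes) -> bytes:
    """Layout the message observes in parquet-mr and Hive (hypothetical helper):
    the bytes alone, with no length prefix."""
    return bytes(value)


def encode_boolean_stat(value: bool) -> bytes:
    """Assumed reading of bit-packed PLAIN for one boolean (hypothetical helper):
    a single byte whose least significant bit carries the value."""
    return bytes([1 if value else 0])


if __name__ == "__main__":
    sample = b"abc"
    print(plain_encode_byte_array(sample).hex())  # length-prefixed form
    print(raw_byte_array(sample).hex())           # unprefixed form
    print(encode_boolean_stat(True).hex())
```

A reader that expects one layout and receives the other would misinterpret the first four bytes of the value, which is why the comment and the implementations need to agree.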
