Hi Micah,

Does the FileMetaData.version [1] provide any information about
the writer? What about the num_values in each page header? Is
the actual number of values consistent with num_values in the
ColumnMetaData?
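
To make the check concrete, here is a minimal sketch of the comparison being discussed, assuming the per-row-group num_rows and the per-column num_values have already been read from the footer by whatever reader is at hand. The names RowGroupCounts, find_mismatches and classify are hypothetical and not part of any Parquet library. It also assumes the columns are non-repeated, so that num_values (which counts nulls) would be expected to equal num_rows.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RowGroupCounts:
    """Counts taken from one row group's footer metadata (hypothetical container)."""
    num_rows: int                      # RowGroup.num_rows
    column_num_values: Dict[str, int]  # column path -> ColumnMetaData.num_values


def find_mismatches(rg: RowGroupCounts) -> Dict[str, int]:
    """Return the columns whose num_values differs from the row group's num_rows."""
    return {
        name: n
        for name, n in rg.column_num_values.items()
        if n != rg.num_rows
    }


def classify(rg: RowGroupCounts, strict: bool = True) -> str:
    """Classify a row group under one of the two interpretations.

    strict=True  : any mismatch is treated as corruption.
    strict=False : num_rows is treated as authoritative, so surplus values
                   may be truncated; a column with fewer values than rows
                   cannot be repaired and is still reported as corrupt.
    """
    mismatches = find_mismatches(rg)
    if not mismatches:
        return "consistent"
    if strict or any(n < rg.num_rows for n in mismatches.values()):
        return "corrupt"
    return "truncate"
```

With strict=True this mirrors the "treat as corrupt" choice, and with strict=False it mirrors the "num_rows is authoritative" choice; it says nothing about which of the two is correct.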

[1]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1108

Best,
Gang

On Wed, Nov 29, 2023 at 2:22 AM Micah Kornfield <[email protected]>
wrote:

> We've recently encountered files that have inconsistencies between the
> number of rows specified in the row group [1] and the total number of
> values in a column [2] for non-repeated columns (within a file there is
> inconsistency between columns but all counts appear to be greater than or
> equal to the number of rows).
>
> Two questions:
> 1.  Is anyone aware of Parquet implementations that might generate files
> like this?
> 2.  Does anyone have an opinion on the correct interpretation of these
> files?  Should the files be treated as corrupt, or should the number of
> rows be treated as authoritative and any additional data in a column be
> truncated?
>
> It appears different engines make different choices in this case.  Arrow
> treats this as corruption. Spark seems to allow reading the data.
>
> Thanks,
> Micah
>
>
> [1]
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L895
> [2]
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L786
>
