Hi Micah,

Does the FileMetaData.version [1] provide any information about the writer?

What about the num_values in each page header? Is the actual number of values consistent with num_values in the ColumnMetaData?
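In case it helps with triage, here is a minimal sketch of the footer-level check, assuming pyarrow is installed and can open the file's footer. `find_mismatches` and `footer_counts` are hypothetical helper names, not part of any library. It only compares the row group's num_rows with each column chunk's num_values for columns whose max repetition level is 0. It does not look at page headers, which would need a lower-level reader.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Mismatch(NamedTuple):
    row_group: int
    column: str
    num_rows: int
    num_values: int


def find_mismatches(
    row_groups: Iterable[Tuple[int, Dict[str, int]]]
) -> List[Mismatch]:
    """Return each column chunk whose num_values differs from the row group's num_rows.

    Each input item is (num_rows, {column_path: num_values}) and should
    contain non-repeated columns only.
    """
    found = []
    for index, (num_rows, column_values) in enumerate(row_groups):
        for column, num_values in column_values.items():
            if num_values != num_rows:
                found.append(Mismatch(index, column, num_rows, num_values))
    return found


def footer_counts(path: str):
    """Read writer info and per-chunk counts from the footer (assumes pyarrow)."""
    # Imported lazily so find_mismatches has no third-party dependency.
    import pyarrow.parquet as pq

    metadata = pq.ParquetFile(path).metadata
    # Keep only non-repeated columns, where num_values should equal num_rows.
    flat = [
        j
        for j in range(metadata.num_columns)
        if metadata.schema.column(j).max_repetition_level == 0
    ]
    row_groups = []
    for i in range(metadata.num_row_groups):
        rg = metadata.row_group(i)
        counts = {rg.column(j).path_in_schema: rg.column(j).num_values for j in flat}
        row_groups.append((rg.num_rows, counts))
    return metadata.created_by, metadata.format_version, row_groups
```

Something like `created_by, version, groups = footer_counts("suspect.parquet")` followed by `find_mismatches(groups)` would list the affected column chunks, and created_by may hint at which writer produced the file. The file name here is just a placeholder.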
[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1108

Best,
Gang

On Wed, Nov 29, 2023 at 2:22 AM Micah Kornfield <[email protected]> wrote:

> We've recently encountered files that have inconsistencies between the
> number of rows specified in the row group [1] and the total number of
> values in a column [2] for non-repeated columns (within a file there is
> inconsistency between columns but all counts appear to be greater than or
> equal to the number of rows).
>
> Two questions:
> 1. Is anyone aware of parquet implementations that might generate files
> like this?
> 2. Does anyone have an opinion on the correct interpretation of these
> files? Should the files be treated as corrupt, or should the number of
> rows be treated as authoritative and any additional data in a column be
> truncated?
>
> It appears different engines make different choices in this case. Arrow
> treats this as corruption. Spark seems to allow reading the data.
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L895
> [2]
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L786
