I scanned through the parquet-mr implementation. It only provides a row-wise interface for writing records through the ColumnWriter, so it cannot reproduce the issue described in this thread. I suspect some other implementations maintain their own column-wise writers and only hand finished pages to the parquet-mr layer.
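For anyone who wants to check their own files, here is a minimal sketch against pyarrow's public metadata API (the file name is a placeholder) that compares each row group's num_rows with each column chunk's num_values, i.e. the inconsistency discussed in this thread:

    import pyarrow.parquet as pq

    meta = pq.ParquetFile("suspect.parquet").metadata
    print("created_by:", meta.created_by)  # e.g. "parquet-mr version 1.11.1"

    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            # For non-repeated columns num_values should equal num_rows;
            # repeated (list) columns can legitimately differ, so this
            # check assumes the non-repeated case discussed here.
            if chunk.num_values != row_group.num_rows:
                print(f"row group {rg}, column {chunk.path_in_schema}: "
                      f"num_values={chunk.num_values} vs "
                      f"num_rows={row_group.num_rows}")
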
Best,
Gang

On Wed, Nov 29, 2023 at 2:14 PM Micah Kornfield <[email protected]> wrote:

> Hi Gang,
> For writes I'm seeing "parquet-mr version 1.11.1" and "parquet-mr
> version 1.10.1". I need to look more into the page headers to check for
> consistency. At the column level, in some cases the number of values
> read by pyarrow is consistent with num_rows and in some cases it is
> consistent with num_values. I don't see any discernible pattern based
> on schema or types.
>
> It looks like the parquet files might have been written with avro
> ("parquet.avro.schema" key and a corresponding schema are present in
> their metadata).
>
> Thanks,
> Micah
>
> On Tue, Nov 28, 2023 at 6:30 PM Gang Wu <[email protected]> wrote:
>
> > Hi Micah,
> >
> > Does the FileMetaData.version [1] provide any information about the
> > writer? What about the num_values in each page header? Is the actual
> > number of values consistent with num_values in the ColumnMetaData?
> >
> > [1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1108
> >
> > Best,
> > Gang
> >
> > On Wed, Nov 29, 2023 at 2:22 AM Micah Kornfield <[email protected]> wrote:
> >
> > > We've recently encountered files that have inconsistencies between
> > > the number of rows specified in the row group [1] and the total
> > > number of values in a column [2] for non-repeated columns (within a
> > > file there is inconsistency between columns, but all counts appear
> > > to be greater than or equal to the number of rows).
> > >
> > > Two questions:
> > > 1. Is anyone aware of parquet implementations that might generate
> > > files like this?
> > > 2. Does anyone have an opinion on the correct interpretation of
> > > these files? Should the files be treated as corrupt, or should the
> > > number of rows be treated as authoritative and any additional data
> > > in a column be truncated?
> > >
> > > It appears different engines make different choices in this case.
> > > Arrow treats this as corruption. Spark seems to allow reading the
> > > data.
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L895
> > > [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L786
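P.S. One way to localize the damage with a strict reader is to attempt the read one row group at a time and record which groups fail. A sketch along the same lines (again with a placeholder file name, assuming the reader raises on the mismatch as Micah observed for Arrow):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("suspect.parquet")
    for rg in range(pf.metadata.num_row_groups):
        try:
            pf.read_row_group(rg)  # strict readers error out here
        except Exception as exc:
            print(f"row group {rg} failed to read: {exc}")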
