Thanks for checking.

On Tuesday, December 5, 2023, Gang Wu <[email protected]> wrote:
> I scanned through the parquet-mr implementation. It provides a row-wise
> interface to write records in the ColumnWriter. This cannot reproduce
> the issue in this thread. I suspect some other implementations may have
> their own column-wise column writer implementations and only write pages
> to the parquet-mr layer.
>
> Best,
> Gang
>
> On Wed, Nov 29, 2023 at 2:14 PM Micah Kornfield <[email protected]>
> wrote:
>
> > Hi Gang,
> > For writes I'm seeing "parquet-mr version 1.11.1" and "parquet-mr version
> > 1.10.1". I need to look more into the page headers to check for
> > consistency. At the column level, in some cases the number of values read
> > by pyarrow is consistent with num_rows and in some cases it is consistent
> > with num_values. I don't see any discernable pattern based on schema or
> > types.
> >
> > It looks like the parquet files might have been written with
> > avro ("parquet.avro.schema" key and a corresponding schema are present in
> > their metadata).
> >
> > Thanks,
> > Micah
> >
> > On Tue, Nov 28, 2023 at 6:30 PM Gang Wu <[email protected]> wrote:
> >
> > > Hi Micah,
> > >
> > > Does the FileMetaData.version [1] provide any information about
> > > the writer? What about the num_values in each page header? Is
> > > the actual number of values consistent with num_values in the
> > > ColumnMetaData?
> > >
> > > [1]
> > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1108
> > >
> > > Best,
> > > Gang
> > >
> > > On Wed, Nov 29, 2023 at 2:22 AM Micah Kornfield <[email protected]>
> > > wrote:
> > >
> > > > We've recently encountered files that have inconsistencies between the
> > > > number of rows specified in the row group [1] and the total number of
> > > > values in a column [2] for non-repeated columns (within a file there is
> > > > inconsistency between columns but all counts appear to be greater than
> > > > or equal to the number of rows).
> > > >
> > > > Two questions:
> > > > 1. Is anyone aware of parquet implementations that might generate
> > > > files like this?
> > > > 2. Does anyone have an opinion on the correct interpretation of these
> > > > files? Should the files be treated as corrupt, or should the number of
> > > > rows be treated as authoritative and any additional data in a column be
> > > > truncated?
> > > >
> > > > It appears different engines make different choices in this case.
> > > > Arrow treats this as corruption. Spark seems to allow reading the data.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1]
> > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L895
> > > > [2]
> > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L786
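
For anyone who wants to look for the same mismatch in their own files, here is a minimal sketch of the comparison discussed in the thread: the row group's num_rows against each column chunk's num_values, for non-repeated columns only. It is not taken from any of the messages above. The names classify_chunk and scan_file are hypothetical, the sketch assumes pyarrow is installed, and it assumes that column chunk order in a row group matches leaf column order in the schema. It reads footer metadata only and does not decode pages, so it says nothing about the num_values in the page headers.

```python
def classify_chunk(num_rows, num_values, max_repetition_level):
    """Compare a column chunk's value count with its row group's row count."""
    if max_repetition_level > 0:
        # Repeated columns may legitimately hold more (or fewer) values than rows.
        return "repeated"
    if num_values == num_rows:
        return "consistent"
    return "excess" if num_values > num_rows else "short"


def scan_file(path):
    """Yield (row_group, column_path, num_rows, num_values, status) per mismatch."""
    import pyarrow.parquet as pq  # imported lazily; only needed for real files

    meta = pq.ParquetFile(path).metadata
    print("created_by:", meta.created_by)  # writer string from the footer
    for rg_index in range(meta.num_row_groups):
        rg = meta.row_group(rg_index)
        for col_index in range(rg.num_columns):
            chunk = rg.column(col_index)
            # Assumption: chunk order matches leaf column order in the schema.
            rep = meta.schema.column(col_index).max_repetition_level
            status = classify_chunk(rg.num_rows, chunk.num_values, rep)
            if status in ("excess", "short"):
                yield (rg_index, chunk.path_in_schema,
                       rg.num_rows, chunk.num_values, status)
```

Calling list(scan_file("some_file.parquet")) on a suspect file (the path is a placeholder) would list the column chunks whose metadata counts disagree with the row count. Whether such a file should be rejected or truncated to num_rows is the open question in the thread, and this sketch takes no position on it.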
