I scanned through the parquet-mr implementation. It provides a row-wise
interface for writing records through the ColumnWriter, so per-column
value counts stay in lockstep with the row count and it should not be
able to reproduce the issue in this thread. I suspect some other
implementations have their own column-wise writers and only hand
finished pages to the parquet-mr layer.
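
To make the check concrete, here is a minimal pyarrow sketch (the file
name is hypothetical) that compares each column chunk's num_values
against its row group's num_rows and prints the writer string as well:

    import pyarrow.parquet as pq

    md = pq.read_metadata("suspect.parquet")  # hypothetical path
    print("created_by:", md.created_by)  # e.g. "parquet-mr version 1.11.1"

    for rg_idx in range(md.num_row_groups):
        rg = md.row_group(rg_idx)
        for col_idx in range(rg.num_columns):
            col = rg.column(col_idx)
            # Repeated columns may legitimately have num_values != num_rows,
            # so the comparison is only meaningful for non-repeated columns.
            if col.num_values != rg.num_rows:
                print(f"row group {rg_idx}, column {col.path_in_schema}: "
                      f"num_values={col.num_values}, num_rows={rg.num_rows}")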

Best,
Gang

On Wed, Nov 29, 2023 at 2:14 PM Micah Kornfield <[email protected]>
wrote:

> Hi Gang,
> For writes I'm seeing "parquet-mr version 1.11.1" and "parquet-mr version
> 1.10.1".  I need to look more into the page headers to check for
> consistency.  At the column level, in some cases the number of values read
> by pyarrow is consistent with num_rows and in some cases it is consistent
> with num_values. I don't see any discernible pattern based on schema or
> types.
>
> It looks like the parquet files might have been written with
> avro ("parquet.avro.schema" key and a corresponding schema are present in
> their metadata).
>
> Thanks,
> Micah
>
> On Tue, Nov 28, 2023 at 6:30 PM Gang Wu <[email protected]> wrote:
>
> > Hi Micah,
> >
> > Does the FileMetaData.version [1] provide any information about
> > the writer? What about the num_values in each page header? Is
> > the actual number of values consistent with num_values in the
> > ColumnMetaData?
> >
> > [1]
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1108
> >
> > Best,
> > Gang
> >
> > On Wed, Nov 29, 2023 at 2:22 AM Micah Kornfield <[email protected]>
> > wrote:
> >
> > > We've recently encountered files that have inconsistencies between the
> > > number of rows specified in the row group [1] and the total number of
> > > values in a column [2] for non-repeated columns (within a file there is
> > > inconsistency between columns, but all counts appear to be greater than
> > > or equal to the number of rows).
> > >
> > > Two questions:
> > > 1.  Is anyone aware of parquet implementations that might generate
> > > files like this?
> > > 2.  Does anyone have an opinion on the correct interpretation of these
> > > files?  Should the files be treated as corrupt, or should the number of
> > > rows be treated as authoritative and any additional data in a column be
> > > truncated?
> > >
> > > It appears different engines make different choices in this case.
> > > Arrow treats this as corruption. Spark seems to allow reading the data.
> > >
> > > Thanks,
> > > Micah
> > >
> > >
> > > [1]
> > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L895
> > > [2]
> > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L786
> > >
> >
>
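
P.S. Regarding Micah's note above that the files carry a
"parquet.avro.schema" key: the file-level key-value metadata is also
visible from pyarrow, so it is easy to confirm which files went through
parquet-avro. A small sketch, under the same assumptions as above:

    import pyarrow.parquet as pq

    md = pq.read_metadata("suspect.parquet")  # hypothetical path
    kv = md.metadata or {}  # file-level key-value metadata (bytes -> bytes)
    if b"parquet.avro.schema" in kv:
        print("likely written through parquet-avro:")
        print(kv[b"parquet.avro.schema"].decode("utf-8", "replace"))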
