Thanks for checking.

On Tuesday, December 5, 2023, Gang Wu <[email protected]> wrote:

> I scanned through the parquet-mr implementation. It provides a row-wise
> interface for writing records through the ColumnWriter, so it cannot
> reproduce the issue in this thread. I suspect some other implementations
> have their own column-wise column writers and only pass pages to the
> parquet-mr layer.
>
> Best,
> Gang
>
> On Wed, Nov 29, 2023 at 2:14 PM Micah Kornfield <[email protected]>
> wrote:
>
> > Hi Gang,
> > For writes I'm seeing "parquet-mr version 1.11.1" and "parquet-mr version
> > 1.10.1".  I need to look more into the page headers to check for
> > consistency.  At the column level, in some cases the number of values read
> > by pyarrow is consistent with num_rows and in some cases it is consistent
> > with num_values. I don't see any discernible pattern based on schema or
> > types.
> >
> > It looks like the parquet files might have been written with
> > avro ("parquet.avro.schema" key and a corresponding schema are present in
> > their metadata).
> >
> > Thanks,
> > Micah
> >
> > On Tue, Nov 28, 2023 at 6:30 PM Gang Wu <[email protected]> wrote:
> >
> > > Hi Micah,
> > >
> > > Does the FileMetaData.version [1] provide any information about
> > > the writer? What about the num_values in each page header? Is
> > > the actual number of values consistent with num_values in the
> > > ColumnMetaData?
> > >
> > > [1]
> > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1108
> > >
> > > Best,
> > > Gang
> > >
> > > On Wed, Nov 29, 2023 at 2:22 AM Micah Kornfield <[email protected]>
> > > wrote:
> > >
> > > > We've recently encountered files that have inconsistencies between the
> > > > number of rows specified in the row group [1] and the total number of
> > > > values in a column [2] for non-repeated columns (within a file there is
> > > > inconsistency between columns, but all counts appear to be greater than
> > > > or equal to the number of rows).
> > > >
> > > > Two questions:
> > > > 1.  Is anyone aware of parquet implementations that might generate files
> > > > like this?
> > > > 2.  Does anyone have an opinion on the correct interpretation of these
> > > > files?  Should the files be treated as corrupt, or should the number of
> > > > rows be treated as authoritative and any additional data in a column be
> > > > truncated?
> > > >
> > > > It appears different engines make different choices in this case.
> > > > Arrow treats this as corruption. Spark seems to allow reading the data.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1]
> > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L895
> > > > [2]
> > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L786
> > > >
> > >
> >
>
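
For anyone who wants to run the same comparison on their own files, here is
a rough sketch of the footer check discussed above: for each non-repeated
column chunk, compare its num_values with the row group's num_rows. It is
not the script used in this thread. The function names load_metadata and
find_count_mismatches are made up for illustration, and the sketch assumes
pyarrow's metadata objects (FileMetaData, RowGroupMetaData,
ColumnChunkMetaData) are available.

```python
import sys


def load_metadata(path):
    """Read only the footer metadata of a Parquet file (requires pyarrow)."""
    # Imported lazily so the check below can be exercised without pyarrow.
    import pyarrow.parquet as pq
    return pq.ParquetFile(path).metadata


def find_count_mismatches(metadata):
    """Return (row_group, column_path, num_rows, num_values) tuples for every
    non-repeated column chunk whose num_values differs from num_rows."""
    mismatches = []
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            # Repeated columns may legitimately hold more values than rows.
            if metadata.schema.column(col_index).max_repetition_level > 0:
                continue
            chunk = row_group.column(col_index)
            if chunk.num_values != row_group.num_rows:
                mismatches.append(
                    (rg_index, chunk.path_in_schema,
                     row_group.num_rows, chunk.num_values)
                )
    return mismatches


if __name__ == "__main__" and len(sys.argv) > 1:
    meta = load_metadata(sys.argv[1])
    print("created_by:", meta.created_by)  # writer string, e.g. parquet-mr
    for row in find_count_mismatches(meta):
        print(row)
```

This only reads the footer, so it says nothing about whether the num_values
in the individual page headers agree with the column metadata; that would
need a separate pass over the pages.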
