Based on discussion on some of the issues it sounds like there are no known
production systems that produce this sort of invalid files (it is only a
test harness).

As such, I'd propose keeping the behavior as undefined.  Unless there is a
cheap way of checking for this type of error (also from some of the
issue/PR comments, it seems like Rust might be slower than necessary
because of these checks), I don't think we should mandate an error here.

Cheers,
Micah

On Tue, Feb 4, 2025 at 8:44 PM Parth Chandra <[email protected]> wrote:

> I logged the original issue in arrow-rs and a corresponding issue in
> Parquet [1] while working with Datafusion Comet.
> For background, Comet uses the ExampleParquetWriter to write a Parquet file
> used by its unit tests which compare the results returned by Comet to the
> results returned by Spark.
> @Steve Loughran <[email protected]> asked a relevant question -
> >Is the invalid data detected and reported as such by the different parsers
> > -or does it get returned and the apps are then expected to notice the
> > problem?
> The problem is exactly that the different implementations return different
> results and expect the applications to handle it.
>
> I don't have a binding say in this matter but any decision that results in
> consistent behavior would get my vote.
>
>  - Parth
>
> [1]  https://github.com/apache/parquet-java/issues/3142
>
> On Tue, Feb 4, 2025 at 2:44 AM Antoine Pitrou <[email protected]> wrote:
>
> >
> > If we want to recommend something, then we should certainly recommend
> > returning an error.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Tue, 4 Feb 2025 01:19:14 +0000
> > Ed Seidl <[email protected]> wrote:
> > > So far I’ve tested parquet-rs, pyarrow/arrow-cpp, and parquet-java.
> None
> > of them return an error. Parquet-rs will return either null or 238
> > depending on which API is used, pyarrow returns 238, and
> spark/parquet-java
> > returns -18.
> > >
> > > If we don’t want to mandate that readers must ignore bits outside the
> > specified range, then I agree with Raphael that it’s probably best to
> > return an error here. Then at least users will know there’s a problem
> with
> > how the file has been produced.
> > >
> > > Ed
> > >
> > > On 2/3/25, 11:33 AM, "Steve Loughran" <[email protected]>
> > wrote:
> > >
> > > > Is the invalid data detected and reported as such by the different
> > parsers
> > > > -or does it get returned and the apps are then expected to notice the
> > > > problem?
> > >
> > > > as that reporting of problems is what i care about.
> > >
> > >
> > >
> > >
> > >
> > > > On Mon, 3 Feb 2025 at 16:53, Micah Kornfield <[email protected]>
> > wrote:
> > >
> > > > Could we take a step back? Is this a real bug (and at what layer), it
> > seems
> > > > that at least paquet-java is not validating data being passed to it
> > when
> > > > writing? Are there other known implementations that write this
> > malformed
> > > > data?
> > > >
> > > > On the topic of whether to keep behavior undefined, having gone
> > through a
> > > > very long discussion on possible bugs for variant shredding, I see
> very
> > > > similar themes discussed  here.  It would be good to come to a
> > consensus as
> > > > a community on the general topic of undefined behavior and how we
> > intend to
> > > > handle it (maybe we can fork this thread).
> > > >
> > > > My two-sense.  I think retroactively defining behavior is not a great
> > idea
> > > > but in some cases might be warranted.  My preferred approach is
> keeping
> > > > notes someplace on known bugs of writers in a centralized location
> and
> > > > letting readers decide on how to handle it makes sense (we for
> > instance do
> > > > this for some cases when we know stats are incorrect).  I also think
> we
> > > > really need a validation tool  that checks for undefined behavior and
> > > > highlights to help avoid debates like this in the future.
> > > >
> > > > Cheers,
> > > > Micah
> > > >
> > > >
> > > >
> > > > On Mon, Feb 3, 2025 at 6:48 AM Gang Wu <[email protected]> wrote:
> > > >
> > > > > I agree that ambiguity is not good. The spec is clear that writer
> > > > > implementations should not write these values. However, it is
> tricky
> > > > > on the reader side. IMHO, it is similar to the case of schema
> > evolution
> > > > > when we down cast int32 to int8. It is engine-specific to throw an
> > error
> > > > > or return the lower 8 bits, and some engines use wider types to
> > return
> > > > > the exact values.
> > > > >
> > > > > On Mon, Feb 3, 2025 at 6:28 PM Raphael Taylor-Davies
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Right, but the purpose of the specification is to say what
> > > > > > implementations should do, not necessarily what they do
> currently,
> > and
> > > > > > that way over time implementations will converge.
> > > > > >
> > > > > > There will always be non-spec compliant implementations, the
> point
> > of a
> > > > > > specification is to define what correct looks like. If
> > implementations
> > > > > > do something different, that is then a problem for those
> > > > implementations
> > > > > > to fix, and not every other implementation to accommodate.
> > Ambiguity in
> > > > > > the specification just forces every implementation to make value
> > > > > > judgements about what non-standardized behaviour they should
> > choose to
> > > > > > have, which seems unfortunate.
> > > > > >
> > > > > > Kind Regards,
> > > > > >
> > > > > > Raphael
> > > > > >
> > > > > > On 03/02/2025 10:16, Gang Wu wrote:
> > > > > > > The problem is that we cannot control all the implementations
> in
> > the
> > > > > > wild.
> > > > > > >
> > > > > > > Therefore we can only suggest that writers should not write
> such
> > kind
> > > > > of
> > > > > > > data and readers are free to return an error or anything.
> > > > > > >
> > > > > > > On Mon, Feb 3, 2025 at 6:05 PM Raphael Taylor-Davies
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > >> The challenge with undefined behaviour is it inherently leads
> to
> > > > > > >> incompatibilities, which in turn wastes people's time chasing
> > > > ghosts.
> > > > > > >>
> > > > > > >> In this case I would suggest explicitly disallowing such data,
> > and
> > > > > > >> encourage implementations to return an error. This is simple,
> > avoids
> > > > > > adding
> > > > > > >> complexity to the specification, and avoids the potential can
> > of
> > > > worms
> > > > > > as
> > > > > > >> to how to apply this masking logic for statistics, bloom
> > filters,
> > > > > etc...
> > > > > > >>
> > > > > > >> I don't believe any real parquet writer will produce such
> data,
> > and
> > > > > so I
> > > > > > >> think we should just take the simplest option of not allowing
> > it.
> > > > > > >>
> > > > > > >> Kind Regards,
> > > > > > >>
> > > > > > >> Raphael
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> On 3 February 2025 09:25:42 GMT, Gang Wu <[email protected]>
> > wrote:
> > > > > > >>> I'm inclined to leave it as an undefined behavior from the
> > > > > perspective
> > > > > > of
> > > > > > >>> spec to keep the spec simple.
> > > > > > >>>
> > > > > > >>> On Mon, Feb 3, 2025 at 4:57 PM Alkis Evlogimenos
> > > > > > >>> <[email protected]> wrote:
> > > > > > >>>
> > > > > > >>>> Wouldn't it be better to return 238 in this case? What does
> > it
> > > > mean
> > > > > > for
> > > > > > >>>> parquet-java to return -18 when the logical type is UINT8?
> > > > > > >>>>
> > > > > > >>>> Thinking a bit further is this something we perhaps want to
> > > > specify?
> > > > > > >>>> Something like:
> > > > > > >>>> - if the stored value is larger than the maximum allowed by
> > the
> > > > > > >> annotation
> > > > > > >>>> only the lower N bits are taking into account
> > > > > > >>>> - if the stored value is smaller than the maximum allowed by
> > the
> > > > > > >> annotation
> > > > > > >>>> the value is sign-extended if signed otherwise zero-extended
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> On Sat, Feb 1, 2025 at 5:00 PM Andrew Lamb <
> > > > [email protected]>
> > > > > > >> wrote:
> > > > > > >>>>> In my opinion, the more consistently they are handled the
> > better
> > > > > the
> > > > > > >>>>> ecosystem as a whole would be (we would waste less time
> > chasing
> > > > > down
> > > > > > >>>>> seeming incompatibilities)
> > > > > > >>>>>
> > > > > > >>>>> Given its predominance in the ecosystem I would personally
> > > > suggest
> > > > > > >>>> updating
> > > > > > >>>>> the other readers to follow the parquet-vava implementation
> > if
> > > > > > >> practical.
> > > > > > >>>>> Thanks,
> > > > > > >>>>> Andrew
> > > > > > >>>>>
> > > > > > >>>>> On Fri, Jan 31, 2025 at 7:09 PM Ed Seidl <[email protected]
> >
> >
> > > > wrote:
> > > > > > >>>>>
> > > > > > >>>>>> An issue was recently raised [1] in arrow-rs questioning
> > the
> > > > > reading
> > > > > > >>>> of a
> > > > > > >>>>>> file that had improperly encoded UINT_8 and UINT_16
> > columns. For
> > > > > > >>>>> instance,
> > > > > > >>>>>> a UINT_8 value of 238 (0xee) was plain encoded as
> > 0xffffffee.
> > > > When
> > > > > > >> read
> > > > > > >>>>> by
> > > > > > >>>>>> parquet-rs, a value of null was returned. For the same
> > file,
> > > > > > >>>> parquet-java
> > > > > > >>>>>> (well, parquet-cli cat) returned -18, and arrow-cpp
> > returned
> > > > 238.
> > > > > > >>>>>>
> > > > > > >>>>>> The Parquet specification [2] states that behavior in this
> > case
> > > > is
> > > > > > >>>>>> undefined, so all three readers are correct. I'm just
> > wondering
> > > > if
> > > > > > >>>> there
> > > > > > >>>>> is
> > > > > > >>>>>> any desire in the community to suggest handling such
> > malformed
> > > > > data
> > > > > > >> in
> > > > > > >>>> a
> > > > > > >>>>>> more consistent fashion, or just leave UB as UB.
> > > > > > >>>>>>
> > > > > > >>>>>> Thanks,
> > > > > > >>>>>> Ed
> > > > > > >>>>>>
> > > > > > >>>>>> [1] https://github.com/apache/arrow-rs/issues/7040
> > > > > > >>>>>> [2]
> > > > > > >>>>>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> >
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers
> >
> > > > > >
> > > > >
> > > >
> >
> >
> >
> >
>

Reply via email to