Based on discussion on some of the issues it sounds like there are no known production systems that produce this sort of invalid files (it is only a test harness).
As such, I'd propose keeping the behavior as undefined. Unless there is a cheap way of checking for this type of error (also from some of the issue/PR comments, it seems like Rust might be slower than necessary because of these checks), I don't think we should mandate an error here. Cheers, Micah On Tue, Feb 4, 2025 at 8:44 PM Parth Chandra <[email protected]> wrote: > I logged the original issue in arrow-rs and a corresponding issue in > Parquet [1] while working with Datafusion Comet. > For background, Comet uses the ExampleParquetWriter to write a Parquet file > used by its unit tests which compare the results returned by Comet to the > results returned by Spark. > @Steve Loughran <[email protected]> asked a relevant question - > >Is the invalid data detected and reported as such by the different parsers > > -or does it get returned and the apps are then expected to notice the > > problem? > The problem is exactly that the different implementations return different > results and expect the applications to handle it. > > I don't have a binding say in this matter but any decision that results in > consistent behavior would get my vote. > > - Parth > > [1] https://github.com/apache/parquet-java/issues/3142 > > On Tue, Feb 4, 2025 at 2:44 AM Antoine Pitrou <[email protected]> wrote: > > > > > If we want to recommend something, then we should certainly recommend > > returning an error. > > > > Regards > > > > Antoine. > > > > > > On Tue, 4 Feb 2025 01:19:14 +0000 > > Ed Seidl <[email protected]> wrote: > > > So far I’ve tested parquet-rs, pyarrow/arrow-cpp, and parquet-java. > None > > of them return an error. Parquet-rs will return either null or 238 > > depending on which API is used, pyarrow returns 238, and > spark/parquet-java > > returns -18. > > > > > > If we don’t want to mandate that readers must ignore bits outside the > > specified range, then I agree with Raphael that it’s probably best to > > return an error here. Then at least users will know there’s a problem > with > > how the file has been produced. > > > > > > Ed > > > > > > On 2/3/25, 11:33 AM, "Steve Loughran" <[email protected]> > > wrote: > > > > > > > Is the invalid data detected and reported as such by the different > > parsers > > > > -or does it get returned and the apps are then expected to notice the > > > > problem? > > > > > > > as that reporting of problems is what i care about. > > > > > > > > > > > > > > > > > > > On Mon, 3 Feb 2025 at 16:53, Micah Kornfield <[email protected]> > > wrote: > > > > > > > Could we take a step back? Is this a real bug (and at what layer), it > > seems > > > > that at least paquet-java is not validating data being passed to it > > when > > > > writing? Are there other known implementations that write this > > malformed > > > > data? > > > > > > > > On the topic of whether to keep behavior undefined, having gone > > through a > > > > very long discussion on possible bugs for variant shredding, I see > very > > > > similar themes discussed here. It would be good to come to a > > consensus as > > > > a community on the general topic of undefined behavior and how we > > intend to > > > > handle it (maybe we can fork this thread). > > > > > > > > My two-sense. I think retroactively defining behavior is not a great > > idea > > > > but in some cases might be warranted. My preferred approach is > keeping > > > > notes someplace on known bugs of writers in a centralized location > and > > > > letting readers decide on how to handle it makes sense (we for > > instance do > > > > this for some cases when we know stats are incorrect). I also think > we > > > > really need a validation tool that checks for undefined behavior and > > > > highlights to help avoid debates like this in the future. > > > > > > > > Cheers, > > > > Micah > > > > > > > > > > > > > > > > On Mon, Feb 3, 2025 at 6:48 AM Gang Wu <[email protected]> wrote: > > > > > > > > > I agree that ambiguity is not good. The spec is clear that writer > > > > > implementations should not write these values. However, it is > tricky > > > > > on the reader side. IMHO, it is similar to the case of schema > > evolution > > > > > when we down cast int32 to int8. It is engine-specific to throw an > > error > > > > > or return the lower 8 bits, and some engines use wider types to > > return > > > > > the exact values. > > > > > > > > > > On Mon, Feb 3, 2025 at 6:28 PM Raphael Taylor-Davies > > > > > <[email protected]> wrote: > > > > > > > > > > > Right, but the purpose of the specification is to say what > > > > > > implementations should do, not necessarily what they do > currently, > > and > > > > > > that way over time implementations will converge. > > > > > > > > > > > > There will always be non-spec compliant implementations, the > point > > of a > > > > > > specification is to define what correct looks like. If > > implementations > > > > > > do something different, that is then a problem for those > > > > implementations > > > > > > to fix, and not every other implementation to accommodate. > > Ambiguity in > > > > > > the specification just forces every implementation to make value > > > > > > judgements about what non-standardized behaviour they should > > choose to > > > > > > have, which seems unfortunate. > > > > > > > > > > > > Kind Regards, > > > > > > > > > > > > Raphael > > > > > > > > > > > > On 03/02/2025 10:16, Gang Wu wrote: > > > > > > > The problem is that we cannot control all the implementations > in > > the > > > > > > wild. > > > > > > > > > > > > > > Therefore we can only suggest that writers should not write > such > > kind > > > > > of > > > > > > > data and readers are free to return an error or anything. > > > > > > > > > > > > > > On Mon, Feb 3, 2025 at 6:05 PM Raphael Taylor-Davies > > > > > > > <[email protected]> wrote: > > > > > > > > > > > > > >> The challenge with undefined behaviour is it inherently leads > to > > > > > > >> incompatibilities, which in turn wastes people's time chasing > > > > ghosts. > > > > > > >> > > > > > > >> In this case I would suggest explicitly disallowing such data, > > and > > > > > > >> encourage implementations to return an error. This is simple, > > avoids > > > > > > adding > > > > > > >> complexity to the specification, and avoids the potential can > > of > > > > worms > > > > > > as > > > > > > >> to how to apply this masking logic for statistics, bloom > > filters, > > > > > etc... > > > > > > >> > > > > > > >> I don't believe any real parquet writer will produce such > data, > > and > > > > > so I > > > > > > >> think we should just take the simplest option of not allowing > > it. > > > > > > >> > > > > > > >> Kind Regards, > > > > > > >> > > > > > > >> Raphael > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> On 3 February 2025 09:25:42 GMT, Gang Wu <[email protected]> > > wrote: > > > > > > >>> I'm inclined to leave it as an undefined behavior from the > > > > > perspective > > > > > > of > > > > > > >>> spec to keep the spec simple. > > > > > > >>> > > > > > > >>> On Mon, Feb 3, 2025 at 4:57 PM Alkis Evlogimenos > > > > > > >>> <[email protected]> wrote: > > > > > > >>> > > > > > > >>>> Wouldn't it be better to return 238 in this case? What does > > it > > > > mean > > > > > > for > > > > > > >>>> parquet-java to return -18 when the logical type is UINT8? > > > > > > >>>> > > > > > > >>>> Thinking a bit further is this something we perhaps want to > > > > specify? > > > > > > >>>> Something like: > > > > > > >>>> - if the stored value is larger than the maximum allowed by > > the > > > > > > >> annotation > > > > > > >>>> only the lower N bits are taking into account > > > > > > >>>> - if the stored value is smaller than the maximum allowed by > > the > > > > > > >> annotation > > > > > > >>>> the value is sign-extended if signed otherwise zero-extended > > > > > > >>>> > > > > > > >>>> > > > > > > >>>> On Sat, Feb 1, 2025 at 5:00 PM Andrew Lamb < > > > > [email protected]> > > > > > > >> wrote: > > > > > > >>>>> In my opinion, the more consistently they are handled the > > better > > > > > the > > > > > > >>>>> ecosystem as a whole would be (we would waste less time > > chasing > > > > > down > > > > > > >>>>> seeming incompatibilities) > > > > > > >>>>> > > > > > > >>>>> Given its predominance in the ecosystem I would personally > > > > suggest > > > > > > >>>> updating > > > > > > >>>>> the other readers to follow the parquet-vava implementation > > if > > > > > > >> practical. > > > > > > >>>>> Thanks, > > > > > > >>>>> Andrew > > > > > > >>>>> > > > > > > >>>>> On Fri, Jan 31, 2025 at 7:09 PM Ed Seidl <[email protected] > > > > > > > > wrote: > > > > > > >>>>> > > > > > > >>>>>> An issue was recently raised [1] in arrow-rs questioning > > the > > > > > reading > > > > > > >>>> of a > > > > > > >>>>>> file that had improperly encoded UINT_8 and UINT_16 > > columns. For > > > > > > >>>>> instance, > > > > > > >>>>>> a UINT_8 value of 238 (0xee) was plain encoded as > > 0xffffffee. > > > > When > > > > > > >> read > > > > > > >>>>> by > > > > > > >>>>>> parquet-rs, a value of null was returned. For the same > > file, > > > > > > >>>> parquet-java > > > > > > >>>>>> (well, parquet-cli cat) returned -18, and arrow-cpp > > returned > > > > 238. > > > > > > >>>>>> > > > > > > >>>>>> The Parquet specification [2] states that behavior in this > > case > > > > is > > > > > > >>>>>> undefined, so all three readers are correct. I'm just > > wondering > > > > if > > > > > > >>>> there > > > > > > >>>>> is > > > > > > >>>>>> any desire in the community to suggest handling such > > malformed > > > > > data > > > > > > >> in > > > > > > >>>> a > > > > > > >>>>>> more consistent fashion, or just leave UB as UB. > > > > > > >>>>>> > > > > > > >>>>>> Thanks, > > > > > > >>>>>> Ed > > > > > > >>>>>> > > > > > > >>>>>> [1] https://github.com/apache/arrow-rs/issues/7040 > > > > > > >>>>>> [2] > > > > > > >>>>>> > > > > > > >> > > > > > > > > > > > > > > > > > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unsigned-integers > > > > > > > > > > > > > > > > > > > > > > > > > >
