+1 (non-binding :-)) on the idea of having a shortlist of "accredited" implementations.
I would suggest adding a third implementation such as parquet-rs, since
its authors are active here; especially as the Parquet Java and C++ teams
seem to have some overlap historically, and a third implementation helps
bring different perspectives.

Regards

Antoine.


On Thu, 16 May 2024 17:37:35 -0700
Julien Le Dem <[email protected]> wrote:
> I would support it as long as we maintain a list of the implementations
> that we consider "accredited" to be reference implementations (we being a
> PMC vote here).
> Not all implementations are created equal from an adoption point of view.
> Originally the Impala implementation was the second implementation for
> interrop. Later on the parquet-cpp implementation was added as a standalone
> implementation in the Parquet project. This is the implementation that
> lives in the arrow repository.
> The parquet java implementation and the parquet cpp implementation in the
> arrow repo are on top of that list IMO.
>
>
> On Thu, May 16, 2024 at 6:17 AM Rok Mihevc <[email protected]> wrote:
>
> > I would support a "two interoperable open source implementations"
> > requirement.
> >
> > Rok
> >
> > On Thu, May 16, 2024 at 10:06 AM Antoine Pitrou <[email protected]>
> > wrote:
> >
> > >
> > > I'm in (non-binding) agreement with Ed here. I would just add that the
> > > requirement for two interoperable implementations should mandate that
> > > these are open source implementations.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On Tue, 14 May 2024 14:48:09 -0700
> > > Ed Seidl <[email protected]> wrote:
> > > > Given the breadth of the parquet community at this point, I don't think
> > > > we should be singling out one or two "reference" implementations. Even
> > > > parquet-mr, AFAIK, still doesn't implement DELTA_LENGTH_BYTE_ARRAY
> > > > encoding in a user-accessible way (it's only available as part of the
> > > > DELTA_BYTE_ARRAY writer).
> > > > There are many situations in which the
> > > > former would be the superior choice, and in fact the specification
> > > > documentation still lists DLBA as "always preferred over PLAIN for byte
> > > > array columns" [1]. Similarly, DELTA_BYTE_ARRAY encoding was only added
> > > > to parquet-cpp in the last year [2], and column indexes a few months
> > > > before that [3].
> > > >
> > > > Instead, I think we should leave out any mention of a reference
> > > > implementation,
> > > > and continue to require two, independent, interoperable implementations
> > > > before adopting a change to the spec. This, IMO, would go a long way
> > > > towards increasing excitement for Parquet outside the parquet-mr/arrow
> > > > world.
> > > >
> > > > Just my (non-binding) two cents.
> > > >
> > > > Cheers,
> > > > Ed
> > > >
> > > > [1] https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> > > > [2] https://github.com/apache/arrow/pull/14341
> > > > [3] https://github.com/apache/arrow/pull/34054
> > > >
> > > > On 5/14/24 9:44 AM, Julien Le Dem wrote:
> > > > > I agree that parquet-mr implementation is a requirement to evolve the
> > > > > spec.
> > > > > It makes sense to me that we call parquet-mr the reference
> > > > > implementation and make it a requirement to evolve the spec.
> > > > > I would add the requirement to implement it in the parquet cpp
> > > > > implementation that lives in apache Arrow:
> > > > > https://github.com/apache/arrow/tree/main/cpp/src/parquet
> > > > > This code used to live in the parquet-cpp repo in the Parquet project.
> > > > > Being language agnostic is an important feature of the format.
> > > > > Interoperability tests should also be included.
> > > > >
> > > > > On Tue, May 14, 2024 at 9:31 AM Antoine Pitrou wrote:
> > > > >
> > > > >> AFAIK, the only Parquet implementation under the Apache Parquet
> > > > >> project is parquet-mr :-)
> > > > >>
> > > > >>
> > > > >> On Tue, 14 May 2024 10:58:58 +0200
> > > > >> Rok Mihevc <[email protected]> wrote:
> > > > >>> Second Raphael's point.
> > > > >>> Would it be reasonable to say specification change requires
> > > > >>> implementation
> > > > >>> in two parquet implementations within Apache Parquet project?
> > > > >>>
> > > > >>> Rok
> > > > >>>
> > > > >>> On Tue, May 14, 2024 at 10:50 AM Gang Wu wrote:
> > > > >>>> IMHO, it looks more reasonable if a reference implementation is
> > > > >>>> required to support most (not all) elements from the specification.
> > > > >>>>
> > > > >>>> Another question is: should we discuss (and vote for) each candidate
> > > > >>>> one by one? We can start with parquet-mr which is most well-known
> > > > >>>> implementation.
> > > > >>>>
> > > > >>>> Best,
> > > > >>>> Gang
> > > > >>>>
> > > > >>>> On Tue, May 14, 2024 at 4:41 PM Raphael Taylor-Davies wrote:
> > > > >>>>
> > > > >>>>> Potentially it would be helpful to flip the question around. As
> > > > >>>>> Andrew articulates, a reference implementation is required to
> > > > >>>>> implement all elements from the specification, and therefore the
> > > > >>>>> major consequence of labeling parquet-mr thusly would be that any
> > > > >>>>> specification change would have to be implemented within
> > > > >>>>> parquet-mr as part of the standardisation process.
> > > > >>>>> It would be insufficient for it to be implemented in, for
> > > > >>>>> example, two of the parquet implementations maintained by the arrow
> > > > >>>>> project. I personally think that would be a shame and likely exclude
> > > > >>>>> many people who would otherwise be interested in evolving the parquet
> > > > >>>>> specification, but think that is at the core of this question.
> > > > >>>>>
> > > > >>>>> Kind Regards,
> > > > >>>>>
> > > > >>>>> Raphael
> > > > >>>>>
> > > > >>>>> On 13/05/2024 20:55, Andrew Lamb wrote:
> > > > >>>>>> Question: Should we label parquet-mr or any other parquet
> > > > >>>>>> implementations "reference" implications"?
> > > > >>>>>>
> > > > >>>>>> This came up as part of Vinoo's great PR to list different parquet
> > > > >>>>>> reference implementations[1][2].
> > > > >>>>>>
> > > > >>>>>> The term "reference implementation" often has an official
> > > > >>>>>> connotation. For example the wikipedia definition is "a program
> > > > >>>>>> that implements all requirements from a corresponding
> > > > >>>>>> specification. The reference implementation ... should be
> > > > >>>>>> considered the "correct" behavior of any other implementation of
> > > > >>>>>> it."[3]
> > > > >>>>>>
> > > > >>>>>> Given the close association of parquet-mr to the parquet standard,
> > > > >>>>>> it is a natural candidate to label as "reference implementation."
> > > > >>>>>> However, it is not clear to me if there is consensus that it should
> > > > >>>>>> be thusly labeled.
> > > > >>>>>> I have a strong opinion that a consensus on this question would be
> > > > >>>>>> very helpful.
> > > > >>>>>> I don't actually have a strong opinion about the answer
> > > > >>>>>>
> > > > >>>>>> Andrew
> > > > >>>>>>
> > > > >>>>>> [1]: https://github.com/apache/parquet-site/pull/53#discussion_r1582882267
> > > > >>>>>> [2]: https://github.com/apache/parquet-site/pull/53#discussion_r1598283465
> > > > >>>>>> [3]: https://en.wikipedia.org/wiki/Reference_implementation
