Compatibility is not just a question of whether the data is readable.
It is also a question of whether the format allows implementations to
reach the expected performance characteristics.
*A* reference implementation should be developed at the same time as the
format change to confirm that we reach the stated goals.
This is needed whether we consider it *the* reference implementation or
just *a* reference implementation for this particular change.


On Fri, May 17, 2024 at 2:51 AM Steve Loughran <[email protected]>
wrote:

> I'd argue the compatibility across implementations is "can they correctly
> read the data generated by the others?", so there's less of an RI than
> compliance testing, the way closed source stuff often works.
>
> Specification
>
>    1. Files generated by the implementation which are believed to match the
>    specification
>    2. Assertions about the contents of these files (this is
>    something which needs to be declared in a way that can be used by test
>    runners of the different implementations, so tricky).
>    3. Tests which validate those assertions on the parsed contents
>
>
> I've never done anything like this before. Maybe anyone who has tried to
> implement an SQL standard has some suggestions. Indeed, SQL might be the
> language for those assertions, which would then have to go through
> spark/hive/impala/etc for validation. Which is ultimately what you want,
> just a lot harder to build, test, debug and identify what is broken
>
> On Fri, 17 May 2024 at 09:40, Antoine Pitrou <[email protected]> wrote:
>
> >
> > +1 (non-binding :-)) on the idea of having a shortlist of "accredited"
> > implementations.
> >
> > I would suggest to add a third implementation such as parquet-rs, since
> > its authors are active here; especially as the Parquet Java and C++
> > teams seem to have some overlap historically, and a third
> > implementation helps bring different perspectives.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Thu, 16 May 2024 17:37:35 -0700
> > Julien Le Dem <[email protected]> wrote:
> > > I would support it as long as we maintain a list of the implementations
> > > that we consider "accredited" to be reference implementations (we being a
> > > PMC vote here).
> > > Not all implementations are created equal from an adoption point of view.
> > > Originally the Impala implementation was the second implementation for
> > > interop. Later on the parquet-cpp implementation was added as a standalone
> > > implementation in the Parquet project. This is the implementation that
> > > lives in the arrow repository.
> > > The parquet java implementation and the parquet cpp implementation in the
> > > arrow repo are on top of that list IMO.
> > >
> > >
> > > On Thu, May 16, 2024 at 6:17 AM Rok Mihevc <[email protected]> wrote:
> > >
> > > > I would support a "two interoperable open source implementations"
> > > > requirement.
> > > >
> > > > Rok
> > > >
> > > > On Thu, May 16, 2024 at 10:06 AM Antoine Pitrou <[email protected]>
> > > > wrote:
> > > >
> > > > >
> > > > > I'm in (non-binding) agreement with Ed here. I would just add that the
> > > > > requirement for two interoperable implementations should mandate that
> > > > > these are open source implementations.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > On Tue, 14 May 2024 14:48:09 -0700
> > > > > Ed Seidl <[email protected]> wrote:
> > > > > > Given the breadth of the parquet community at this point, I don't think
> > > > > > we should be singling out one or two "reference" implementations. Even
> > > > > > parquet-mr, AFAIK, still doesn't implement DELTA_LENGTH_BYTE_ARRAY
> > > > > > encoding in a user-accessible way (it's only available as part of the
> > > > > > DELTA_BYTE_ARRAY writer). There are many situations in which the
> > > > > > former would be the superior choice, and in fact the specification
> > > > > > documentation still lists DLBA as "always preferred over PLAIN for byte
> > > > > > array columns" [1]. Similarly, DELTA_BYTE_ARRAY encoding was only added
> > > > > > to parquet-cpp in the last year [2], and column indexes a few months
> > > > > > before that [3].
> > > > > >
> > > > > > Instead, I think we should leave out any mention of a reference
> > > > > > implementation, and continue to require two, independent, interoperable
> > > > > > implementations before adopting a change to the spec. This, IMO, would go
> > > > > > a long way towards increasing excitement for Parquet outside the
> > > > > > parquet-mr/arrow world.
> > > > > >
> > > > > > Just my (non-binding) two cents.
> > > > > >
> > > > > > Cheers,
> > > > > > Ed
> > > > > >
> > > > > > [1] https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> > > > > > [2] https://github.com/apache/arrow/pull/14341
> > > > > > [3] https://github.com/apache/arrow/pull/34054
> > > > > >
> > > > > > On 5/14/24 9:44 AM, Julien Le Dem wrote:
> > > > > > > I agree that parquet-mr implementation is a requirement to evolve the spec.
> > > > > > > It makes sense to me that we call parquet-mr the reference implementation
> > > > > > > and make it a requirement to evolve the spec.
> > > > > > > I would add the requirement to implement it in the parquet cpp
> > > > > > > implementation that lives in apache Arrow:
> > > > > > > https://github.com/apache/arrow/tree/main/cpp/src/parquet
> > > > > > > This code used to live in the parquet-cpp repo in the Parquet project.
> > > > > > > Being language agnostic is an important feature of the format.
> > > > > > > Interoperability tests should also be included.
> > > > > > >
> > > > > > > On Tue, May 14, 2024 at 9:31 AM Antoine Pitrou <antoine-+zn9apsxkcednm+yrofe0a-xmd5yjdbdmrexy1tmh2...@public.gmane.org> wrote:
> > > > > > >
> > > > > > >> AFAIK, the only Parquet implementation under the Apache Parquet project
> > > > > > >> is parquet-mr :-)
> > > > > > >>
> > > > > > >>
> > > > > > >> On Tue, 14 May 2024 10:58:58 +0200
> > > > > > >> Rok Mihevc <[email protected]> wrote:
> > > > > > >>> Second Raphael's point.
> > > > > > >>> Would it be reasonable to say specification change requires implementation
> > > > > > >>> in two parquet implementations within Apache Parquet project?
> > > > > > >>>
> > > > > > >>> Rok
> > > > > > >>>
> > > > > > >>> On Tue, May 14, 2024 at 10:50 AM Gang Wu <ustcwg-re5jqeeqqe8avxtiumwx3w-xmd5yjdbdmrexy1tmh2...@public.gmane.org> wrote:
> > > > > > >>>> IMHO, it looks more reasonable if a reference implementation is required
> > > > > > >>>> to support most (not all) elements from the specification.
> > > > > > >>>>
> > > > > > >>>> Another question is: should we discuss (and vote for) each candidate
> > > > > > >>>> one by one? We can start with parquet-mr, which is the most well-known
> > > > > > >>>> implementation.
> > > > > > >>>>
> > > > > > >>>> Best,
> > > > > > >>>> Gang
> > > > > > >>>>
> > > > > > >>>> On Tue, May 14, 2024 at 4:41 PM Raphael Taylor-Davies
> > > > > > >>>> <r.taylordavies-gM/Ye1E23mxENrl/
> > [email protected]> wrote:
> > > > > > >>>>
> > > > > > >>>>> Potentially it would be helpful to flip the question around. As Andrew
> > > > > > >>>>> articulates, a reference implementation is required to implement all
> > > > > > >>>>> elements from the specification, and therefore the major consequence of
> > > > > > >>>>> labeling parquet-mr thusly would be that any specification change would
> > > > > > >>>>> have to be implemented within parquet-mr as part of the standardisation
> > > > > > >>>>> process. It would be insufficient for it to be implemented in, for
> > > > > > >>>>> example, two of the parquet implementations maintained by the arrow
> > > > > > >>>>> project. I personally think that would be a shame and likely exclude
> > > > > > >>>>> many people who would otherwise be interested in evolving the parquet
> > > > > > >>>>> specification, but think that is at the core of this question.
> > > > > > >>>>>
> > > > > > >>>>> Kind Regards,
> > > > > > >>>>>
> > > > > > >>>>> Raphael
> > > > > > >>>>>
> > > > > > >>>>> On 13/05/2024 20:55, Andrew Lamb wrote:
> > > > > > >>>>>> Question: Should we label parquet-mr or any other parquet implementations
> > > > > > >>>>>> "reference implementations"?
> > > > > > >>>>>>
> > > > > > >>>>>> This came up as part of Vinoo's great PR to list different parquet
> > > > > > >>>>>> reference implementations[1][2].
> > > > > > >>>>>>
> > > > > > >>>>>> The term "reference implementation" often has an official connotation. For
> > > > > > >>>>>> example the wikipedia definition is "a program that implements all
> > > > > > >>>>>> requirements from a corresponding specification. The reference
> > > > > > >>>>>> implementation ... should be considered the "correct" behavior of any other
> > > > > > >>>>>> implementation of it."[3]
> > > > > > >>>>>>
> > > > > > >>>>>> Given the close association of parquet-mr to the parquet standard, it is a
> > > > > > >>>>>> natural candidate to label as "reference implementation." However, it is
> > > > > > >>>>>> not clear to me if there is consensus that it should be thusly labeled.
> > > > > > >>>>>> I have a strong opinion that a consensus on this question would be very
> > > > > > >>>>>> helpful. I don't actually have a strong opinion about the answer
> > > > > > >>>>>>
> > > > > > >>>>>> Andrew
> > > > > > >>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>> [1]: https://github.com/apache/parquet-site/pull/53#discussion_r1582882267
> > > > > > >>>>>> [2]: https://github.com/apache/parquet-site/pull/53#discussion_r1598283465
> > > > > > >>>>>> [3]: https://en.wikipedia.org/wiki/Reference_implementation
> > > > > > >>>>>>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> >
>
