>
> Agreed, but how is this not a problem for "pluggable" encodings as well?


Did you mean non-pluggable encodings?  If so, it is a problem, but given
the slow growth of canonical encodings (and the even slower adoption of new
ones as defaults), incompatible or unreadable encodings are much less
likely to show up in the wild.

I also have some general concerns about the implications of unofficial
encodings proliferating in the ecosystem. Suppose a producer decides to
use a pluggable encoding and starts writing files with it.  Then an end
user wants to use a different reader, and asks for support in that reader.
In the best case the encoding is open source and other readers can port it
if desired.  If adoption grows among readers, it either becomes a de facto
fork of Parquet or we are forced to add it to the spec.  Again, in the
best-case scenario this isn't a problem because the encoding adds
substantial value.  In the non-best-case scenarios, Parquet effectively
gets forked N times, or the Parquet spec becomes unwieldy with a lot of
encodings that provide little differentiation from one another.


> I'm not sure where that idea comes from. I did *not* suggest that
> implementations load arbitrary code from third-party Github repositories
> :-)

I did not say you did :)  I interpreted pluggability as some form of
dynamic code loading, which I have concerns about whatever the source.


> The vendor's reader will read the new encoding from a different location
> in the file, while other readers will read the old.

I'm OK with the idea of private pluggability as proposed by Alkis, as long
as the files stay readable by public implementations.  I worry a little
about potential "benchmark wars" that could ensue based entirely on private
encodings, but I tend to be a worrier, I guess, so we can cross that bridge
when we come to it.
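For concreteness, the dual-location scheme discussed in this thread could
look roughly like the following sketch. All field names here are
hypothetical, not actual parquet-format metadata: a vendor-aware reader
tries its private decoder for the alternative representation first, and any
other reader simply follows the standard path, since the canonical encoding
is always still present.

```python
def read_column(chunk, vendor_decoders, file_bytes):
    """Decode one column chunk, preferring a known vendor extension.

    `chunk` is a dict standing in for the column-chunk metadata;
    "vendor_extension" and "data_offset" are illustrative names only.
    """
    ext = chunk.get("vendor_extension")
    if ext and ext["encoding"] in vendor_decoders:
        # Vendor reader path: the alternative representation lives at a
        # separate location in the file; the standard data is untouched.
        decode = vendor_decoders[ext["encoding"]]
        return decode(file_bytes[ext["offset"]:ext["offset"] + ext["length"]])
    # Standard reader path: every implementation can read this.
    return decode_plain(file_bytes[chunk["data_offset"]:])


def decode_plain(buf):
    # Stand-in for a standard Parquet decoder.
    return list(buf)
```

A reader that has never heard of the extension takes the second branch and
sees exactly the file it would have seen anyway, which is the compatibility
property being asked for here.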

Cheers,
Micah




On Fri, May 31, 2024 at 5:47 PM Julien Le Dem <jul...@apache.org> wrote:

> I think it would be a good idea to have an extension mechanism that allows
> embedding extra information in the format.
> Something akin to what Alkis is suggesting having a reserved extension
> point.
> - The file can still be read by a standard parquet implementation without
> extra libraries
> - Vendors can embed custom indices, duplicate data in a proprietary
> encoding, add extra metadata while remaining compatible.
> There are probably a few implementations that already add metadata this
> way, using unused thrift ids (and hoping they won't be used).
>
> It respects the "fully specified" nature of Parquet and you won't have
> weird files you can't read without an opaque library. However it codifies
> how you add extra information in place.
>
> On Thu, May 30, 2024 at 7:21 AM Gang Wu <ust...@gmail.com> wrote:
>
> > This is similar to what we do internally to provide non-standard
> > encodings by duplicating data in customized index pages. It is the
> > vendor's choice to pay the extra storage cost for better encoding
> > support, so I like this idea of supporting encoding extensions.
> >
> > Best,
> > Gang
> >
> > On Thu, May 30, 2024 at 8:09 PM Alkis Evlogimenos
> > <alkis.evlogime...@databricks.com.invalid> wrote:
> >
> > > With the extension point described here:
> > > https://github.com/apache/parquet-format/pull/254
> > >
> > > We can have vendor encodings without drawbacks.
> > >
> > > For example a vendor wants to add another encoding for integers. It
> > > extends ColumnChunk, and embeds an additional location in the file
> > > where the alternative representation lives. The old encoding is
> > > preserved. The vendor's reader will read the new encoding from a
> > > different location in the file, while other readers will read the
> > > old. If and when this new encoding is accepted as standard, the dual
> > > encoding of the column chunk can stop.
> > >
> > > On Thu, May 30, 2024 at 10:28 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > >
> > > > On Thu, 30 May 2024 00:07:35 -0700
> > > > Micah Kornfield <emkornfi...@gmail.com>
> > > > wrote:
> > > > > > A "vendor" encoding would also allow candidate encodings to be
> > > > > > shared across the ecosystem before they are eventually
> > > > > > christened as regular encodings in the Thrift metadata.
> > > > >
> > > > >
> > > > > I'm not a huge fan of this for two reasons:
> > > > > 1.  I think it makes it much more complicated for end-users to
> > > > > get support if they happen to have a file with a custom encoding.
> > > > > There are already enough rough edges in compatibility between
> > > > > implementations that this gives another degree of freedom where
> > > > > things could break.
> > > >
> > > > Agreed, but how is this not a problem for "pluggable" encodings as
> > > > well?
> > > >
> > > > > 2.  From a software supply chain perspective I think this makes
> > > > > Parquet a lot riskier if it is going to arbitrarily load/invoke
> > > > > code from potentially unknown sources.
> > > >
> > > > I'm not sure where that idea comes from. I did *not* suggest that
> > > > implementations load arbitrary code from third-party Github
> > > > repositories :-)
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > >
> > >
> >
>
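As a rough illustration of the reserved extension point discussed above,
in the format's own Thrift IDL: the field id, name, and shape here are
purely hypothetical, not what PR #254 or parquet.thrift actually specify.

```thrift
struct ColumnChunk {
  // ... existing standard fields ...

  /** Illustrative reserved extension slot: opaque vendor payloads keyed
   *  by a reverse-domain namespace (e.g. "com.example.engine").
   *  Standard readers ignore unknown keys, so the file stays readable
   *  without any extra libraries. */
  32767: optional map<string, binary> extensions
}
```

The point of reserving a slot like this is that vendors no longer have to
squat on unused thrift ids and hope they never collide with future ones.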
