> I'm OK with the idea of private pluggability as proposed by Alkis as long
> as the files stay readable by public implementations.

I think this approach is a good idea as well.  The main use cases I have
for pluggability are research and micro-optimization for specific
scenarios.  If the research is promising, it should eventually be promoted
to the format proper.  I would expect the micro-optimizations to remain
private and be used only within an organization anyway.
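
To make the kind of extension slot being discussed downthread a bit more
concrete, here is a rough sketch of what it could look like at the Thrift
level.  This is purely illustrative and is not the actual change proposed in
parquet-format PR #254; the struct names, field names, and field id below
are invented for the example.  The point is only that a standard reader
ignores the optional field and keeps decoding the standard encoding, while a
vendor's reader can follow it to an alternative representation stored
elsewhere in the file:

  // Hypothetical sketch only -- names and the field id are invented here;
  // see parquet-format PR #254 for the actual proposal.
  struct VendorEncodedData {
    /** Identifier of the vendor encoding, e.g. "com.example.int-codec-v1" */
    1: required string encoding_name;
    /** File offset where the alternatively-encoded column chunk data lives */
    2: required i64 offset;
    /** Length in bytes of the alternative representation */
    3: required i64 total_compressed_size;
  }

  struct ColumnChunkExt {
    // ...existing ColumnChunk fields stay unchanged, so the standard
    // encoding remains readable by every implementation...

    /** Ignored by readers that do not know about this field */
    32760: optional VendorEncodedData vendor_encoded_data;
  }

If and when such an encoding is promoted to the format proper, the extra
field (and the duplicated data it points to) can simply be dropped, as Alkis
notes below.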






On Sat, Jun 1, 2024 at 12:25 AM Micah Kornfield <[email protected]>
wrote:

> >
> > Agreed, but how is this not a problem for "pluggable" encodings as well?
>
>
> Did you mean non-pluggable encodings?  If so, it is a problem, but given the
> slow growth of canonical encodings (and of actually defaulting to new ones),
> incompatible or unreadable encodings are much less likely in the wild.
>
> I also have some concerns in general about the implications of unofficial
> encodings proliferating in the ecosystem. Suppose a producer decides they
> want to use a pluggable encoding and starts writing files with this
> encoding.  Then the end user wants to use a different reader, and asks for
> support in that reader.  In the best case the encoding is open source and
> other readers can port it if desired.  If adoption grows among readers,
> then it either becomes a de facto fork of Parquet or we are forced into
> adding it to the spec.  Again, in the best-case scenario this isn't a
> problem because the encoding adds substantial value.  In the non-best-case
> scenarios, Parquet effectively gets forked N times, or the Parquet spec
> becomes unwieldy with a lot of encodings that don't provide much
> differentiation from one another.
>
>
> > I'm not sure where that idea comes from. I did *not* suggest that
> > implementations load arbitrary code from third-party Github repositories
> > :-)
>
> I did not say you did :)  I interpreted pluggability as some form of dynamic
> loading of code, which I have concerns about regardless of the source.
>
>
> > The vendor's reader will read the new encoding from a different location
> > in the file, while other readers will read the old
>
>
> I'm OK with the idea of private pluggability as proposed by Alkis as long as
> the files stay readable by public implementations.  I worry a little bit
> about potential "benchmark wars" that could ensue, based entirely on
> private encodings, but I tend to be a worrier, I guess, so we can cross
> that bridge when we come to it.
>
> Cheers,
> Micah
>
>
>
>
> On Fri, May 31, 2024 at 5:47 PM Julien Le Dem <[email protected]> wrote:
>
> > I think it would be a good idea to have an extension mechanism that
> > allows embedding extra information in the format.
> > Something akin to what Alkis is suggesting: a reserved extension point.
> > - The file can still be read by a standard Parquet implementation without
> > extra libraries
> > - Vendors can embed custom indices, duplicate data in a proprietary
> > encoding, add extra metadata while remaining compatible.
> > There are probably a few implementations that already add metadata this
> > way, using unused thrift ids (and hoping those ids won't be used by the
> > format later).
> >
> > It respects the "fully specified" nature of Parquet and you won't have
> > weird files you can't read without an opaque library. At the same time, it
> > codifies in one place how you add extra information.
> >
> > On Thu, May 30, 2024 at 7:21 AM Gang Wu <[email protected]> wrote:
> >
> > > This is similar to what we do internally to provide non-standard
> > > encoding by duplicating data in the customized index pages. It is up to
> > > the vendor to choose whether to pay the extra storage cost for better
> > > encoding support. So I like this idea of supporting encoding extensions.
> > >
> > > Best,
> > > Gang
> > >
> > > On Thu, May 30, 2024 at 8:09 PM Alkis Evlogimenos
> > > <[email protected]> wrote:
> > >
> > > > With the extension point described here:
> > > > https://github.com/apache/parquet-format/pull/254
> > > >
> > > > We can have vendor encodings without drawbacks.
> > > >
> > > > For example, a vendor wants to add another encoding for integers. It
> > > > extends ColumnChunk and embeds an additional location in the file where
> > > > the alternative representation lives. The old encoding is preserved. The
> > > > vendor's reader will read the new encoding from a different location in
> > > > the file, while other readers will read the old. If and when this new
> > > > encoding is accepted as standard, the dual encoding of the column chunk
> > > > can stop.
> > > >
> > > > On Thu, May 30, 2024 at 10:28 AM Antoine Pitrou <[email protected]>
> > > > wrote:
> > > >
> > > > > On Thu, 30 May 2024 00:07:35 -0700
> > > > > Micah Kornfield <[email protected]>
> > > > > wrote:
> > > > > > > A "vendor" encoding would also allow candidate encodings to be
> > > > > > > shared across the ecosystem before they are eventually christened
> > > > > > > as regular encodings in the Thrift metadata.
> > > > > >
> > > > > >
> > > > > > I'm not a huge fan of this for two reasons:
> > > > > > 1.  I think it makes it much more complicated for end-users to get
> > > > > > support if they happen to have a file with a custom encoding.  There
> > > > > > are already enough rough edges in compatibility between
> > > > > > implementations that this gives another degree of freedom where
> > > > > > things could break.
> > > > >
> > > > > Agreed, but how is this not a problem for "pluggable" encodings as
> > > > > well?
> > > > >
> > > > > > 2.  From a software supply chain perspective I think this makes
> > > > > > Parquet a lot riskier if it is going to arbitrarily load/invoke code
> > > > > > from potentially unknown sources.
> > > > >
> > > > > I'm not sure where that idea comes from. I did *not* suggest that
> > > > > implementations load arbitrary code from third-party Github
> > > > > repositories :-)
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
