> I'm OK with the idea of private pluggability as proposed by Alkis as long as the files stay readable by public implementations.
I think this approach is a good idea as well. The main use cases I have for pluggability are research and micro-optimization for specific scenarios. If the research is promising, it should eventually be promoted to the format proper. I would expect the micro-optimizations to remain private and only be used within an organization anyway.

On Sat, Jun 1, 2024 at 12:25 AM Micah Kornfield <[email protected]> wrote:
>
> > Agreed, but how is this not a problem for "pluggable" encodings as well?
>
> Did you mean non-pluggable encodings? If so, it is a problem, but given the slow growth of canonical encodings (and actually defaulting to new ones) it is much less likely to lead to incompatible/unreadable encodings in the wild.
>
> I also have some concerns in general about the implications of unofficial encodings proliferating in the ecosystem. Suppose a producer decides they want to use a pluggable encoding and starts writing files with this encoding. Then the end user wants to use a different reader and asks for support in that reader. In the best case the encoding is open source and other readers can port it if desired. If adoption grows among readers, then it either becomes a de facto fork of Parquet or we are forced into adding it to the spec. Again, in the best-case scenario this isn't a problem because the encoding adds substantial value. In the non-best-case scenarios, Parquet can effectively get forked N times, or the Parquet spec becomes unwieldy with a lot of encodings that don't provide much differentiation from one another.
>
> > I'm not sure where that idea comes from. I did *not* suggest that implementations load arbitrary code from third-party Github repositories :-)
>
> I did not say you did :) I interpreted pluggability as somehow dynamically loading code, which, whatever the source, I have concerns about.
>
> > The vendor's reader will read the new encoding from a different location in the file, while other readers will read the old
>
> I'm OK with the idea of private pluggability as proposed by Alkis as long as the files stay readable by public implementations. I worry a little bit about potential "benchmark wars" that could ensue based entirely on private encodings, but I tend to be a worrier I guess, so we can cross that bridge when we come to it.
>
> Cheers,
> Micah
>
> On Fri, May 31, 2024 at 5:47 PM Julien Le Dem <[email protected]> wrote:
> > I think it would be a good idea to have an extension mechanism that allows embedding extra information in the format. Something akin to what Alkis is suggesting: having a reserved extension point.
> > - The file can still be read by a standard Parquet implementation without extra libraries.
> > - Vendors can embed custom indices, duplicate data in a proprietary encoding, or add extra metadata while remaining compatible.
> > There are probably a few implementations that add metadata in this way, adding unused Thrift ids (and hoping they won't be used).
> >
> > It respects the "fully specified" nature of Parquet, and you won't have weird files you can't read without an opaque library. However, it codifies how you add extra information in place.
> >
> > On Thu, May 30, 2024 at 7:21 AM Gang Wu <[email protected]> wrote:
> > > This is similar to what we do internally to provide a non-standard encoding by duplicating data in the customized index pages. It is the vendor's choice to pay the extra storage cost for better encoding support. So I like this idea to support encoding extensions.
> > >
> > > Best,
> > > Gang
> > >
> > > On Thu, May 30, 2024 at 8:09 PM Alkis Evlogimenos <[email protected]> wrote:
> > > > With the extension point described here: https://github.com/apache/parquet-format/pull/254
> > > >
> > > > We can have vendor encodings without drawbacks.
> > > >
> > > > For example, a vendor wants to add another encoding for integers. It extends ColumnChunk and embeds an additional location in the file where the alternative representation lives. The old encoding is preserved. The vendor's reader will read the new encoding from a different location in the file, while other readers will read the old. If and when this new encoding is accepted as standard, the dual encoding of the column chunk can stop.
> > > >
> > > > On Thu, May 30, 2024 at 10:28 AM Antoine Pitrou <[email protected]> wrote:
> > > > > On Thu, 30 May 2024 00:07:35 -0700 Micah Kornfield <[email protected]> wrote:
> > > > > > > A "vendor" encoding would also allow candidate encodings to be shared across the ecosystem before they are eventually christened as regular encodings in the Thrift metadata.
> > > > > >
> > > > > > I'm not a huge fan of this for two reasons:
> > > > > > 1. I think it makes it much more complicated for end-users to get support if they happen to have a file with a custom encoding. There are already enough rough edges in compatibility between implementations that this gives another degree of freedom where things could break.
> > > > >
> > > > > Agreed, but how is this not a problem for "pluggable" encodings as well?
> > > > >
> > > > > > 2. From a software supply chain perspective I think this makes Parquet a lot riskier if it is going to arbitrarily load/invoke code from potentially unknown sources.
> > > > >
> > > > > I'm not sure where that idea comes from. I did *not* suggest that implementations load arbitrary code from third-party Github repositories :-)
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
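As a concrete illustration of the dual-representation idea Alkis describes above, here is a minimal Python sketch of the read-side fallback. Everything in it is hypothetical (VendorExtension, ColumnChunkMeta, and the decode helpers are placeholders, not parquet-format structures or any existing reader's API); the point is only that a vendor-aware reader follows the extension to the alternative data, while every other reader ignores it and decodes the standard pages, so the file stays readable by public implementations.

# Hypothetical sketch only: none of these names exist in parquet-format
# or in any Parquet reader; they illustrate the fallback behaviour.
from dataclasses import dataclass
from io import BytesIO
from typing import Optional


@dataclass
class VendorExtension:
    encoding_name: str   # e.g. "example.int64-delta-v2" (made up)
    file_offset: int     # where the alternative representation lives
    length: int


@dataclass
class ColumnChunkMeta:
    standard_offset: int                       # standard Parquet data pages
    standard_length: int
    vendor: Optional[VendorExtension] = None   # reserved extension point


# Encodings this particular (vendor-aware) reader knows how to decode.
KNOWN_VENDOR_ENCODINGS = {"example.int64-delta-v2"}


def decode_standard(raw: bytes) -> bytes:
    return raw  # placeholder for the normal Parquet decoding path


def decode_vendor(raw: bytes, encoding_name: str) -> bytes:
    return raw  # placeholder for the vendor-specific decoding path


def read_column_chunk(meta: ColumnChunkMeta, f) -> bytes:
    ext = meta.vendor
    if ext is not None and ext.encoding_name in KNOWN_VENDOR_ENCODINGS:
        # Vendor-aware reader: use the alternative representation.
        f.seek(ext.file_offset)
        return decode_vendor(f.read(ext.length), ext.encoding_name)
    # Any other reader ignores the extension and reads the standard
    # encoding, so the file stays readable by public implementations.
    f.seek(meta.standard_offset)
    return decode_standard(f.read(meta.standard_length))


if __name__ == "__main__":
    # Toy "file": standard bytes first, vendor bytes appended after them.
    data = BytesIO(b"STANDARD-ENCODED" + b"VENDOR-ENCODED")
    meta = ColumnChunkMeta(
        standard_offset=0,
        standard_length=16,
        vendor=VendorExtension("example.int64-delta-v2", 16, 14),
    )
    print(read_column_chunk(meta, data))   # vendor-aware path
    meta.vendor = None
    print(read_column_chunk(meta, data))   # standard fallback path

The trade-off, as Gang notes, is paying the extra storage for the duplicated column data until the vendor encoding is either standardized or dropped.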
