With the extension point described here:
https://github.com/apache/parquet-format/pull/254
we can have vendor encodings without drawbacks.

For example, suppose a vendor wants to add another encoding for integers. It extends ColumnChunk and embeds an additional location in the file where the alternative representation lives. The old encoding is preserved. The vendor's reader will read the new encoding from that other location in the file, while other readers will read the old one. If and when the new encoding is accepted as standard, the dual encoding of the column chunk can stop.

On Thu, May 30, 2024 at 10:28 AM Antoine Pitrou <[email protected]> wrote:

> On Thu, 30 May 2024 00:07:35 -0700
> Micah Kornfield <[email protected]> wrote:
> > > A "vendor" encoding would also allow candidate encodings to be shared
> > > across the ecosystem before they are eventually enchristened as regular
> > > encodings in the Thrift metadata.
> >
> > I'm not a huge fan of this for two reasons:
> > 1. I think it makes it much more complicated for end-users to get support
> > if they happen to have a file with a custom encoding. There are already
> > enough rough edges in compatibility between implementations that this gives
> > another degree of freedom where things could break.
>
> Agreed, but how is this not a problem for "pluggable" encodings as well?
>
> > 2. From a software supply chain perspective I think this makes Parquet a
> > lot riskier if it is going to arbitrarily load/invoke code from potentially
> > unknown sources.
>
> I'm not sure where that idea comes from. I did *not* suggest that
> implementations load arbitrary code from third-party Github repositories
> :-)
>
> Regards
>
> Antoine.
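
To make the dual-encoding idea at the top of this message concrete, here is a rough Python sketch. It is only an illustration under assumptions: the `ColumnChunk` class is a simplified stand-in for the Thrift struct, and the `alternatives` field, the `AltEncoding` record, the `select_location` helper, the vendor name "acme" and the encoding name "ACME_INT_V1" are all hypothetical. None of them come from the PR or from parquet-format.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class AltEncoding:
    # Hypothetical record: where a vendor's alternative representation lives.
    vendor: str
    encoding: str
    file_offset: int


@dataclass
class ColumnChunk:
    # Simplified stand-in for the ColumnChunk metadata (not the real struct).
    encoding: str      # standard encoding, always written
    file_offset: int   # location of the standard representation
    # Hypothetical extension slot; readers that do not know it ignore it.
    alternatives: Dict[str, AltEncoding] = field(default_factory=dict)


def select_location(chunk: ColumnChunk,
                    known_vendor: Optional[str] = None) -> Tuple[str, int]:
    """Return (encoding, offset) that a given reader should use."""
    alt = chunk.alternatives.get(known_vendor) if known_vendor else None
    if alt is not None:
        # The vendor's reader takes the alternative representation.
        return alt.encoding, alt.file_offset
    # Every other reader falls back to the preserved standard encoding.
    return chunk.encoding, chunk.file_offset


chunk = ColumnChunk(
    encoding="DELTA_BINARY_PACKED",
    file_offset=4,
    alternatives={"acme": AltEncoding("acme", "ACME_INT_V1", 4096)},
)

print(select_location(chunk))          # a standard reader
print(select_location(chunk, "acme"))  # the vendor's reader
```

Once the new encoding is standardized, a writer would simply stop populating the extra slot and write the new encoding in the standard location.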
