It's great to see this thread. Thank you, Micah, for facilitating
the discussion.

My 2 cents:
1. I like the idea of having feature checks rather than an absolute version
number (a rough sketch of what such a check could look like follows at the
end of this point). I am sorry for the confusion created by the V2 moniker.
Those were indeed incremental and backwards-compatible additions to the v1
spec, not a rewrite of the format.

a. It would be great to have a formal release cadence, but someone needs to
dedicate time to drive the process.
b. IMO we need an implementer of a query engine to "sponsor" adding a new
feature to the format. They would implement support for it at the same time,
so it can be validated that the addition to the spec achieves the expected
performance improvement in the context of a query engine. For example, some
years ago, Impala implemented support for new indexes at the same time they
were specified.
Tracking which engines and versions support a new feature would be useful.
Enough adoption would make it the default. This requirement is very
different for a new encoding than for an additional index or statistic.
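
To make the feature-check idea in point 1 concrete, here is a minimal,
purely hypothetical sketch: Parquet metadata has no "required features" list
today, and the feature names below are invented for illustration. The point
is that a reader rejects only the features it does not understand, instead
of rejecting every file stamped with an unknown version number.

  import java.util.Set;

  class FeatureCheck {
    // Features this (hypothetical) reader knows how to handle.
    static final Set<String> READER_SUPPORTS =
        Set.of("encoding.rle_dictionary", "stats.column_index");

    // Fail only on features required by the file that this reader
    // does not understand.
    static void checkReadable(Set<String> featuresRequiredByFile) {
      for (String feature : featuresRequiredByFile) {
        if (!READER_SUPPORTS.contains(feature)) {
          throw new UnsupportedOperationException(
              "File requires unsupported Parquet feature: " + feature);
        }
      }
    }
  }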

2. I also think "encoding plugins" are not aligned with the philosophy of
Parquet, as the strength of the format is that it is fully specified across
languages rather than being just the output of a library.
I do think new encodings and a new metadata representation would be
welcome. FlatBuffers did not exist when I picked Thrift for the footer. The
current metadata representation is a pain to read partially or efficiently.
That said, big changes like this need a clear path for adoption and a plan
for the transition period. The file does have a magic number "PAR1" at the
beginning and the end that might be used to signal such incompatible changes
at the metadata layer.
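
To illustrate how that trailing magic could gate an incompatible metadata
change, here is a rough sketch of a reader dispatching on the last bytes of
the file. The layout it relies on (footer, 4-byte little-endian footer
length, then the 4-byte magic) is the current one; the alternative "PARX"
magic and both helper methods are purely hypothetical.

  import java.io.IOException;
  import java.io.RandomAccessFile;
  import java.nio.ByteBuffer;
  import java.nio.ByteOrder;
  import java.nio.charset.StandardCharsets;

  class FooterDispatch {
    static void readFooter(RandomAccessFile file) throws IOException {
      // Last 8 bytes: 4-byte footer length followed by the 4-byte magic.
      byte[] tail = new byte[8];
      file.seek(file.length() - 8);
      file.readFully(tail);
      int footerLength =
          ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
      String magic = new String(tail, 4, 4, StandardCharsets.US_ASCII);

      if (magic.equals("PAR1")) {
        readThriftFooter(file, footerLength);  // existing Thrift FileMetaData
      } else if (magic.equals("PARX")) {
        readNewFooter(file, footerLength);     // hypothetical new metadata
      } else {
        throw new IOException("Not a Parquet file: unknown magic " + magic);
      }
    }

    static void readThriftFooter(RandomAccessFile f, int len) { /* ... */ }
    static void readNewFooter(RandomAccessFile f, int len) { /* ... */ }
  }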

I do think it is easier to integrate new encodings (say, btrblocks) into the
ecosystem by adding them to Parquet than by creating a new file format that
would need to build adoption from scratch.

3. Agreed, it is a substantial effort and requires collaboration from the
key open-source and proprietary engines implementing Parquet
readers/writers. One way to facilitate the transition, IMO, would be to make
sure native parquet-arrow implementations are included, which is currently a
bit lacking in the Java implementation.

Best
Julien

On Mon, May 13, 2024 at 10:45 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Thanks everybody for the input.  I'll try to summarize some main points and
> my thoughts below.
>
> 1.  "V3" branding is problematic and getting adoption is difficult
> with V2.  I agree, we should not lump all potential improvements into a
> single V3 milestone (I used V3 to indicate that at least some changes might
> be backward incompatible with existing format revisions).   In my mind, I
> think the way to make it more likely that new features are used would be
> starting to think about a more formal release process for them.  For
> example:
>     a.  A clear cadence of major version library releases (e.g. maybe once
> per year).
>     b.  A clear policy for when a new feature becomes the default in a
> library release (e.g. as a strawman once the feature lands in reference
> implementation, it is eligible to become default in the next major release
> that occurs >1 year later).
>     c.  For reference implementations that are effectively doing major
> version releases on each release, I think following parquet-mr for flipping
> defaults would make sense.
>
> 2.  How much of the improvements can be a clean slate vs
> evolutionary/implementation optimizations?  I really think this depends on
> which aspects we are tackling. For metadata issues, I think it might pay to
> rethink things from the ground up, but any proposals along these lines
> should obviously have clear rationales and benchmarks to clarify how the
> decisions are made.  For better encodings, most likely work can be added to
> the existing format.  I don't think allowing for arbitrary plugin encodings
> would be a good thing.  I believe one of the reasons that Parquet has been
> successful has been its specification which allows for guaranteed
> compatibility.
>
> 3.  Amount of effort required/Sustainability of effort.  I agree this is a
> big risk. It will take a lot of work to cover the major parquet bindings,
> which is why I started the thread. Personally, I am fairly time constrained
> and unless my employer is willing to approve devoting work hours to the
> project I likely won't be able to contribute much.  However, it seems like
> there might be enough interest from the community that I can potentially
> make the case for doing so.
>
> Thanks,
> Micah
>
> On Mon, May 13, 2024 at 10:41 AM Ed Seidl <etse...@live.com> wrote:
>
> > I think the whole "V1" vs "V2" mess is unfortunate. IMO there is only one
> > version of the Parquet file format. At its core, the data layout (row
> > groups
> > composed of column chunks composed of Dremel encoded pages) has
> > never changed. Encodings/codecs/structures have been added to that core,
> > but always in a backwards compatible way.
> >
> > I agree that many of the perceived shortcomings might be addressed
> without
> > breaking changes to the file format. I myself would be interested in
> > exploring
> > ways to address the point lookup and wide tables issues while maintaining
> > backwards compatibility. But that said, if there are ways to gain large
> > performance gains that would necessitate an actual new file format
> version
> > (such as replacing thrift, new metadata organization, some alternative to
> > Dremel), I'd be open to exploring those options as well.
> >
> > Thanks,
> > Ed
> >
> > On 5/11/24 3:58 PM, Micah Kornfield wrote:
> > > Hi Parquet Dev,
> > > I wanted to start a conversation within the community about working on
> a
> > > new revision of Parquet.  For context there have been a bunch of new
> > > formats [1][2][3] that show there is decent room for improvement across
> > > data encodings and how metadata is organized.
> > >
> > > Specifically, in a new format revision I think we should be thinking
> > about
> > > the following areas for improvements:
> > > 1.  More efficient encodings that allow for data skipping and SIMD
> > > optimizations.
> > > 2.  More efficient metadata handling for deserialization and projection
> > to
> > > address areas when metadata deserialization time is not trivial [4].
> > > 3.  Possibly thinking about different encodings instead of
> > > repetition/definition for repeated and nested fields.
> > > 4.  Support for optimizing semi-structured data (e.g. JSON or Variant
> > type)
> > > that can shred elements into individual columns (a recent thread in
> > Iceberg
> > > mentions doing this at the metadata level [5])
> > >
> > > I think the goals of V3 would be to provide existing API compatibility
> as
> > > broadly as possible (possibly with some performance loss) and expose
> new
> > > API surface areas where appropriate to make use of new elements.  New
> > > encodings could be backported so they can be made use of without
> metadata
> > > changes.  I think unfortunately that for points 2 and 3 we would want
> to
> > > break file level compatibility.  More thought would be needed to
> consider
> > > whether 4 could be backported effectively.
> > >
> > > This is a non-trivial amount of work to get good coverage across
> > > implementations, so before putting together more formal proposal it
> would
> > > be nice to know if:
> > >
> > > 1.  If there is an appetite in the general community to consider these
> > > changes
> > > 2.  If anybody from the community is interested in collaborating on
> > > proposals/implementation in this area.
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1] https://github.com/maxi-k/btrblocks
> > > [2] https://github.com/facebookincubator/nimble
> > > [3] https://blog.lancedb.com/lance-v2/
> > > [4] https://github.com/apache/arrow/issues/39676
> > > [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> > >
> >
> >
>
