Re: Interest in Parquet V3

Vinoo Ganesh Sun, 12 May 2024 14:07:48 -0700

I don't have strong feelings about this one way or the other, but would
gladly put my hand up to help collaborate on proposals/implementation as we
figure this out.



On Sun, May 12, 2024 at 5:31 AM Andrew Lamb <[email protected]> wrote:

> My opinion is that most (if not all) of the proposed benefits from these
> new formats can be achieved using the current parquet format and improved
> implementations (possibly with some minor extensions such as user defined
> encoding schemes)[1]
>
> Another reason people propose replacing parquet, I think, is the "what is V2
> and what supports it" confusion, along with a perception that the Apache
> Parquet community mostly focuses on parquet-mr and not the format or the
> myriad of other implementations. Thankfully this is starting to change[2]
>
> Thus, I think the best response for the Parquet community to these new
> format proposals is to clarify the current implementation situation (which
> will indirectly lead to more investment in current implementations)
>
> Note this doesn't preclude "v3" of parquet, but I think in order to
> drive V3 adoption we first need to get the existing communication in better
> working order
>
> Andrew
>
> [1] I realize I need some more data to back up that assertion, and I am
> working on it.
> [2] https://github.com/apache/parquet-site/pull/53
>
>
>
> On Sun, May 12, 2024 at 4:48 AM Gang Wu <[email protected]> wrote:
>
> > Hi Micah,
> >
> > I have also noticed the emergence of these new file formats which are
> > challenging the popularity of Apache Parquet. It would always be good
> > to evolve Parquet to be competitive. Personally I'm +1 on this. I'm also
> > proposing adding a new geometry type to the specs: [1]. This seems
> > to align with the goal of V3 to some extent.
> >
> > On the other hand, I'm also concerned with some aspects:
> > 1. Are there sufficient developers to work on this? As a committer to both
> > parquet-cpp and parquet-mr, I can take part in the V3 but I'm not sure if
> > there are enough active contributors. It would be good if some companies
> > could have dedicated people to work on this and move things forward.
> > 2. Users may not be willing to adopt new formats if their current businesses
> > do not have any issues, especially users from large enterprises. Think
> > about the current issues of V2 [2].
> >
> > All in all, I feel excited about V3.
> >
> > [1] https://lists.apache.org/thread/q20b8kjvs27ly0w2zzxld029nwkc5fhx
> > [2] https://lists.apache.org/thread/r8djjov7wyy8646qm2xzwn9p2olsk9wn
> >
> > Best,
> > Gang
> >
> > On Sun, May 12, 2024 at 6:59 AM Micah Kornfield <[email protected]>
> > wrote:
> >
> > > Hi Parquet Dev,
> > > I wanted to start a conversation within the community about working on a
> > > new revision of Parquet.  For context there have been a bunch of new
> > > formats [1][2][3] that show there is decent room for improvement across
> > > data encodings and how metadata is organized.
> > >
> > > Specifically, in a new format revision I think we should be thinking about
> > > the following areas for improvement:
> > > 1.  More efficient encodings that allow for data skipping and SIMD
> > > optimizations.
> > > 2.  More efficient metadata handling for deserialization and projection to
> > > address cases where metadata deserialization time is not trivial [4].
> > > 3.  Possibly thinking about different encodings instead of
> > > repetition/definition for repeated and nested fields
> > > 4.  Support for optimizing semi-structured data (e.g. JSON or Variant type)
> > > that can shred elements into individual columns (a recent thread in Iceberg
> > > mentions doing this at the metadata level [5])
> > >
> > > I think the goals of V3 would be to provide existing API compatibility as
> > > broadly as possible (possibly with some performance loss) and expose new
> > > API surface areas where appropriate to make use of new elements.  New
> > > encodings could be backported so they can be made use of without metadata
> > > changes.  I think unfortunately that for points 2 and 3 we would want to
> > > break file level compatibility.  More thought would be needed to consider
> > > whether 4 could be backported effectively.
> > >
> > > This is a non-trivial amount of work to get good coverage across
> > > implementations, so before putting together a more formal proposal it would
> > > be nice to know:
> > >
> > > 1.  If there is an appetite in the general community to consider these
> > > changes
> > > 2.  If anybody from the community is interested in collaborating on
> > > proposals/implementation in this area.
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1] https://github.com/maxi-k/btrblocks
> > > [2] https://github.com/facebookincubator/nimble
> > > [3] https://blog.lancedb.com/lance-v2/
> > > [4] https://github.com/apache/arrow/issues/39676
> > > [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> > >
> >
>
