My opinion is that most (if not all) of the proposed benefits from these
new formats can be achieved using the current Parquet format and improved
implementations (possibly with some minor extensions such as user-defined
encoding schemes) [1].

I think another reason people propose replacing Parquet is the "what is V2
and what supports it" confusion, along with a perception that the Apache
Parquet community mostly focuses on parquet-mr and not the format or the
myriad other implementations. Thankfully this is starting to change [2].

Thus, I think the best response for the Parquet community to these new
format proposals is to clarify the current implementation situation (which
will indirectly lead to more investment in current implementations).

Note this doesn't preclude a V3 of Parquet, but I think in order to
drive V3 adoption we first need to get the existing communication in better
working order.

Andrew

[1] I realize I need some more data to back up that assertion, and I am
working on it.
[2] https://github.com/apache/parquet-site/pull/53



On Sun, May 12, 2024 at 4:48 AM Gang Wu <[email protected]> wrote:

> Hi Micah,
>
> I have also noticed the emergence of these new file formats which are
> challenging the popularity of Apache Parquet. It would always be good
> to evolve Parquet to be competitive. Personally I'm +1 on this. I'm also
> proposing adding a new geometry type to the specs: [1]. This seems
> to align with the goal of V3 to some extent.
>
> On the other hand, I'm also concerned with some aspects:
> 1. Are there sufficient developers to work on this? As a committer to both
> parquet-cpp and parquet-mr, I can take part in the V3 but I'm not sure if
> there are enough active contributors. It would be good if some companies
> could have dedicated people to work on this and move things forward.
> 2. Users may not be willing to adopt a new format if their current
> business has no issues, especially users from large enterprises. Think
> about the current issues of V2 [2].
>
> All in all, I feel excited about V3.
>
> [1] https://lists.apache.org/thread/q20b8kjvs27ly0w2zzxld029nwkc5fhx
> [2] https://lists.apache.org/thread/r8djjov7wyy8646qm2xzwn9p2olsk9wn
>
> Best,
> Gang
>
> On Sun, May 12, 2024 at 6:59 AM Micah Kornfield <[email protected]>
> wrote:
>
> > Hi Parquet Dev,
> > I wanted to start a conversation within the community about working on a
> > new revision of Parquet.  For context there have been a bunch of new
> > formats [1][2][3] that show there is decent room for improvement across
> > data encodings and how metadata is organized.
> >
> > Specifically, in a new format revision I think we should be thinking
> > about the following areas for improvements:
> > 1.  More efficient encodings that allow for data skipping and SIMD
> > optimizations.
> > 2.  More efficient metadata handling for deserialization and projection
> > to address areas when metadata deserialization time is not trivial [4].
> > 3.  Possibly thinking about different encodings instead of
> > repetition/definition levels for repeated and nested fields.
> > 4.  Support for optimizing semi-structured data (e.g. JSON or Variant
> > type) that can shred elements into individual columns (a recent thread
> > in Iceberg mentions doing this at the metadata level [5])
> >
> > I think the goals of V3 would be to provide existing API compatibility as
> > broadly as possible (possibly with some performance loss) and expose new
> > API surface areas where appropriate to make use of new elements.  New
> > encodings could be backported so they can be made use of without metadata
> > changes.  I think unfortunately that for points 2 and 3 we would want to
> > break file level compatibility.  More thought would be needed to consider
> > whether 4 could be backported effectively.
> >
> > This is a non-trivial amount of work to get good coverage across
> > implementations, so before putting together a more formal proposal it
> > would be nice to know:
> >
> > 1.  If there is an appetite in the general community to consider these
> > changes
> > 2.  If anybody from the community is interested in collaborating on
> > proposals/implementation in this area.
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/maxi-k/btrblocks
> > [2] https://github.com/facebookincubator/nimble
> > [3] https://blog.lancedb.com/lance-v2/
> > [4] https://github.com/apache/arrow/issues/39676
> > [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> >
>
