Re: Interest in Parquet V3

Gang Wu Sun, 12 May 2024 01:48:23 -0700

Hi Micah,

I have also noticed the emergence of these new file formats which are
challenging the popularity of Apache Parquet. It is always good to
evolve Parquet to stay competitive. Personally, I'm +1 on this. I'm also
proposing to add a new geometry type to the spec [1], which seems
to align with the goals of V3 to some extent.


On the other hand, I have some concerns:
1. Are there sufficient developers to work on this? As a committer to both
parquet-cpp and parquet-mr, I can take part in V3, but I'm not sure
there are enough active contributors. It would be good if some companies
could dedicate people to work on this and move things forward.
2. Users may not be willing to adopt a new format if their current
workloads have no issues, especially users at large enterprises. Consider
the current issues with V2 [2].

All in all, I'm excited about V3.

[1] https://lists.apache.org/thread/q20b8kjvs27ly0w2zzxld029nwkc5fhx
[2] https://lists.apache.org/thread/r8djjov7wyy8646qm2xzwn9p2olsk9wn

Best,
Gang

On Sun, May 12, 2024 at 6:59 AM Micah Kornfield <[email protected]>
wrote:

> Hi Parquet Dev,
> I wanted to start a conversation within the community about working on a
> new revision of Parquet.  For context there have been a bunch of new
> formats [1][2][3] that show there is decent room for improvement across
> data encodings and how metadata is organized.
>
> Specifically, in a new format revision I think we should be thinking about
> the following areas for improvements:
> 1.  More efficient encodings that allow for data skipping and SIMD
> optimizations.
> 2.  More efficient metadata handling for deserialization and projection to
> address cases where metadata deserialization time is not trivial [4].
> 3.  Possibly thinking about different encodings instead of
> repetition/definition for repeated and nested fields
> 4.  Support for optimizing semi-structured data (e.g. JSON or Variant type)
> that can shred elements into individual columns (a recent thread in Iceberg
> mentions doing this at the metadata level [5])
>
> I think the goals of V3 would be to provide existing API compatibility as
> broadly as possible (possibly with some performance loss) and expose new
> API surface areas where appropriate to make use of new elements.  New
> encodings could be backported so they can be made use of without metadata
> changes.  I think unfortunately that for points 2 and 3 we would want to
> break file level compatibility.  More thought would be needed to consider
> whether 4 could be backported effectively.
>
> This is a non-trivial amount of work to get good coverage across
> implementations, so before putting together a more formal proposal it would
> be nice to know:
>
> 1.  If there is an appetite in the general community to consider these
> changes
> 2.  If anybody from the community is interested in collaborating on
> proposals/implementation in this area.
>
> Thanks,
> Micah
>
> [1] https://github.com/maxi-k/btrblocks
> [2] https://github.com/facebookincubator/nimble
> [3] https://blog.lancedb.com/lance-v2/
> [4] https://github.com/apache/arrow/issues/39676
> [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
>
