Further to what has already been said, I have likewise found the v2
branding quite hard to follow, but more fundamentally I have struggled
to understand its purpose. As far as I understand it, version 2 groups
together a number of disjoint features from new data pages to different
encodings, that practically speaking implementations can and do support
independently. Adding further confusion, there are also a number of
features, such as page indexes, bloom filters, and statistics
improvements, that appear to sit outside of this versioning altogether.
I guess I wonder whether, rather than having a parquet format version 2,
or even a parquet format version 3, we could just document which
features a given parquet implementation actually supports. I believe
Andrew intends to pick up where previous efforts here left off. Not only
would this allow for quicker ecosystem adoption of smaller / less
controversial changes, for example version 2 data pages, but it could
also be used to highlight higher-level functionality, such as late
materialization, that is more a function of the reader implementation
than the format itself.
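To make that concrete, something along the lines of the sketch below
could be published per implementation instead of a version label. This
is purely illustrative: the type, field names, and support values are
made up for the example and do not correspond to any existing API or to
what any real implementation supports.

    // Hypothetical capability listing an implementation could publish;
    // names and values are placeholders, not real support claims.
    struct FeatureSupport {
        feature: &'static str, // e.g. "DATA_PAGE_V2", "DELTA_BINARY_PACKED"
        read: bool,            // can read files that use the feature
        write: bool,           // can produce files that use the feature
    }

    const FEATURES: &[FeatureSupport] = &[
        FeatureSupport { feature: "DATA_PAGE_V2", read: true, write: true },
        FeatureSupport { feature: "PAGE_INDEX", read: true, write: false },
        FeatureSupport { feature: "BLOOM_FILTER", read: false, write: false },
    ];

    fn main() {
        for f in FEATURES {
            println!("{}: read={}, write={}", f.feature, f.read, f.write);
        }
    }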
I can't claim to have closely followed every proposed parquet
replacement, but I have not yet seen anything that couldn't be done in an
additive fashion on top of parquet, by extending the format and/or the
implementations. I personally would be very interested in delta
encodings that are more amenable to record skipping and SIMD, as I have
struggled to make the Rust version of the existing parquet DELTA
encodings perform as well as the PLAIN encodings.
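To illustrate what I mean, here is a very much simplified sketch (not
the actual parquet-rs code, and ignoring the block / miniblock
bit-packing of the real DELTA_BINARY_PACKED layout): PLAIN decoding of
int32 values is essentially a bulk copy that the compiler can vectorize
freely, while reconstructing values from deltas is a prefix sum, so
every output depends on the previous one and you cannot skip to row N
without summing the deltas before it.

    fn decode_plain(src: &[u8], out: &mut Vec<i32>) {
        // A straight little-endian copy; trivially auto-vectorized.
        out.extend(
            src.chunks_exact(4)
                .map(|b| i32::from_le_bytes(b.try_into().unwrap())),
        );
    }

    fn decode_delta(first: i32, deltas: &[i32], out: &mut Vec<i32>) {
        // A prefix sum: the loop-carried dependency on `acc` defeats
        // straightforward SIMD and record skipping.
        let mut acc = first;
        out.push(acc);
        for &d in deltas {
            acc = acc.wrapping_add(d);
            out.push(acc);
        }
    }

    fn main() {
        let mut plain = Vec::new();
        decode_plain(&[1, 0, 0, 0, 2, 0, 0, 0], &mut plain);
        let mut delta = Vec::new();
        decode_delta(1, &[1, 1, 4], &mut delta);
        assert_eq!(plain, vec![1, 2]);
        assert_eq!(delta, vec![1, 2, 3, 7]);
    }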
Kind Regards,
Raphael
On 13/05/2024 13:55, Antoine Pitrou wrote:
Same as Andrew.
1) the "v3" messaging is intuitively a turn-off as it's already not
obvious whether Parquet "v2" is usable with implementations currently
found in the wild. Concretely, the "v2" branding is commonly confused
with the Parquet format version, and it's almost impossible to explain
how they relate and differ without diving into implementation minutiae.
2) the "v3" messaging doesn't say anything about compatibility or
features: is "v3" a functional superset of "v2"? is it a clean slate
redesign of the Parquet format? does it use different technologies (for
example Flatbuffers instead of Thrift)?
While I would be curious to see a list of proposed changes, I'm also not
very convinced that launching such an initiative is desirable or
sustainable for the Parquet development community.
Regards
Antoine.
On Sun, 12 May 2024 05:30:57 -0400
Andrew Lamb <[email protected]>
wrote:
My opinion is that most (if not all) of the proposed benefits from these
new formats can be achieved using the current parquet format and improved
implementations (possibly with some minor extensions such as user defined
encoding schemes) [1].
Another reason people propose replacing parquet, I think, is the "what is
V2 and what supports it" confusion, along with a perception that the
Apache Parquet community mostly focuses on parquet-mr and not the format
or the myriad of other implementations. Thankfully this is starting to
change [2].
Thus, I think the best response for the Parquet community to these new
format proposals is to clarify the current implementation situation
(which will indirectly lead to more investment in current
implementations).
Note this doesn't preclude a "v3" of parquet, but I think in order to
drive v3 adoption we first need to get the existing communication in
better working order.
Andrew
[1] I realize I need some more data to back up that assertion, and I am
working on it.
[2] https://github.com/apache/parquet-site/pull/53
On Sun, May 12, 2024 at 4:48 AM Gang Wu
<[email protected]> wrote:
Hi Micah,
I have also noticed the emergence of these new file formats, which are
challenging the popularity of Apache Parquet. It would always be good to
evolve Parquet to stay competitive. Personally I'm +1 on this. I'm also
proposing adding a new geometry type to the spec [1]. This seems to
align with the goal of V3 to some extent.
On the other hand, I'm also concerned about some aspects:
1. Are there sufficient developers to work on this? As a committer to
both parquet-cpp and parquet-mr, I can take part in V3, but I'm not sure
there are enough active contributors. It would be good if some companies
could have dedicated people to work on this and move things forward.
2. Users may not be willing to adopt a new format if their current
business use cases do not have any issues, especially users from large
enterprises. Think about the current issues with V2 [2].
All in all, I feel excited about V3.
[1] https://lists.apache.org/thread/q20b8kjvs27ly0w2zzxld029nwkc5fhx
[2] https://lists.apache.org/thread/r8djjov7wyy8646qm2xzwn9p2olsk9wn
Best,
Gang
On Sun, May 12, 2024 at 6:59 AM Micah Kornfield <[email protected]>
wrote:
Hi Parquet Dev,
I wanted to start a conversation within the community about working on a
new revision of Parquet. For context, there have been a bunch of new
formats [1][2][3] that show there is decent room for improvement across
data encodings and how metadata is organized.
Specifically, in a new format revision I think we should be thinking
about the following areas for improvement:
1. More efficient encodings that allow for data skipping and SIMD
optimizations.
2. More efficient metadata handling for deserialization and projection,
to address cases where metadata deserialization time is not trivial [4].
3. Possibly thinking about different encodings instead of
repetition/definition levels for repeated and nested fields (a small
sketch contrasting the two appears after this list).
4. Support for optimizing semi-structured data (e.g. JSON or Variant
type) that can shred elements into individual columns (a recent thread
in Iceberg mentions doing this at the metadata level [5]).
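To make point 3 a bit more concrete, here is a minimal sketch of the
status quo versus one commonly discussed offsets-based alternative,
assuming an optional list of required int32 values holding the records
[[1, 2], [], [3]]. This only illustrates the trade-off; it is not a
proposal for what a new revision would actually adopt.

    fn main() {
        // Parquet today: Dremel-style levels, one (repetition, definition)
        // pair per leaf slot; only defined leaf values are stored.
        // Max definition level is 2, max repetition level is 1.
        let rep_levels = [0, 1, 0, 0]; // 0 = new record, 1 = same list
        let def_levels = [2, 2, 1, 2]; // 2 = value present, 1 = empty list
        let values = [1, 2, 3];

        // Offsets-based alternative: list i spans
        // values[offsets[i]..offsets[i + 1]], so skipping to record N is
        // a single offset lookup rather than a scan over levels.
        let offsets = [0, 2, 2, 3];
        let list_values = [1, 2, 3];

        assert_eq!(rep_levels.len(), def_levels.len());
        assert_eq!(&values[offsets[2]..offsets[3]], &list_values[2..3]);
    }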
I think the goals of V3 would be to provide existing API compatibility
as broadly as possible (possibly with some performance loss) and expose
new API surface areas where appropriate to make use of new elements. New
encodings could be backported so they can be used without metadata
changes. Unfortunately, I think points 2 and 3 would require breaking
file-level compatibility. More thought would be needed to consider
whether point 4 could be backported effectively.
This is a non-trivial amount of work to get good coverage across
implementations, so before putting together a more formal proposal it
would be nice to know:
1. Whether there is an appetite in the general community to consider
these changes
2. Whether anybody from the community is interested in collaborating on
proposals/implementation in this area.
Thanks,
Micah
[1] https://github.com/maxi-k/btrblocks
[2] https://github.com/facebookincubator/nimble
[3] https://blog.lancedb.com/lance-v2/
[4] https://github.com/apache/arrow/issues/39676
[5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34