Further to what has already been said, I have likewise found the v2
branding quite hard to follow, but more fundamentally I have struggled
to understand its purpose. As far as I understand it, version 2 groups
together a number of disjoint features from new data pages to different
encodings, that practically speaking implementations can and do support
independently. Adding further confusion, there are also a number of
features, such as page indexes, bloom filters, and statistics
improvements, that appear to sit outside of this versioning altogether.
I guess I wonder whether, rather than having a parquet format version 2,
or even a parquet format version 3, we could just document which
features a given parquet implementation actually supports. I believe
Andrew intends to pick up where previous efforts here left off. Not only
would this allow for quicker ecosystem adoption of smaller / less
controversial changes, for example version 2 data pages, but it could
also be used to highlight higher-level functionality, such as late
materialization, that is more a function of the reader implementation
than the format itself.
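To make that concrete, something along the lines of the sketch below
could be published per implementation instead of a version label. This
is purely illustrative: the type, field names, and support values are
made up for the example and do not correspond to any existing API or to
what any real implementation supports.

    // Hypothetical capability listing an implementation could publish;
    // names and values are placeholders, not real support claims.
    struct FeatureSupport {
        feature: &'static str, // e.g. "DATA_PAGE_V2", "DELTA_BINARY_PACKED"
        read: bool,            // can read files that use the feature
        write: bool,           // can produce files that use the feature
    }

    const FEATURES: &[FeatureSupport] = &[
        FeatureSupport { feature: "DATA_PAGE_V2", read: true, write: true },
        FeatureSupport { feature: "PAGE_INDEX", read: true, write: false },
        FeatureSupport { feature: "BLOOM_FILTER", read: false, write: false },
    ];

    fn main() {
        for f in FEATURES {
            println!("{}: read={}, write={}", f.feature, f.read, f.write);
        }
    }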
I can't claim to have closely followed every proposed parquet
replacement, but I have not yet seen anything that couldn't be done in an
additive fashion on top of parquet, by extending the format and/or the
implementations. I personally would be very interested in delta
encodings that are more amenable to record skipping and SIMD, as I have
struggled to make the Rust version of the existing parquet DELTA
encodings perform as well as the PLAIN encodings.
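To illustrate what I mean, here is a very much simplified sketch (not
the actual parquet-rs code, and ignoring the block / miniblock
bit-packing of the real DELTA_BINARY_PACKED layout): PLAIN decoding of
int32 values is essentially a bulk copy that the compiler can vectorize
freely, while reconstructing values from deltas is a prefix sum, so
every output depends on the previous one and you cannot skip to row N
without summing the deltas before it.

    fn decode_plain(src: &[u8], out: &mut Vec<i32>) {
        // A straight little-endian copy; trivially auto-vectorized.
        out.extend(
            src.chunks_exact(4)
                .map(|b| i32::from_le_bytes(b.try_into().unwrap())),
        );
    }

    fn decode_delta(first: i32, deltas: &[i32], out: &mut Vec<i32>) {
        // A prefix sum: the loop-carried dependency on `acc` defeats
        // straightforward SIMD and record skipping.
        let mut acc = first;
        out.push(acc);
        for &d in deltas {
            acc = acc.wrapping_add(d);
            out.push(acc);
        }
    }

    fn main() {
        let mut plain = Vec::new();
        decode_plain(&[1, 0, 0, 0, 2, 0, 0, 0], &mut plain);
        let mut delta = Vec::new();
        decode_delta(1, &[1, 1, 4], &mut delta);
        assert_eq!(plain, vec![1, 2]);
        assert_eq!(delta, vec![1, 2, 3, 7]);
    }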
Kind Regards,
Raphael
On 13/05/2024 13:55, Antoine Pitrou wrote:
Same as Andrew.
1) the "v3" messaging is intuitively a turn-off as it's already not
obvious whether Parquet "v2" is usable with implementations currently
found in the wild. Concretely, the "v2" branding is commonly confused
with the Parquet format version, and it's almost impossible to explain
how they relate and differ without diving into implementation minutiae.
2) the "v3" messaging doesn't say anything about compatibility or
features: is "v3" a functional superset of "v2"? is it a clean slate
redesign of the Parquet format? does it use different technologies (for
example Flatbuffers instead of Thrift)?
While I would be curious to see a list of proposed changes, I'm also not
very convinced that launching such an initiative is desirable or
sustainable for the Parquet development community.
Regards
Antoine.
On Sun, 12 May 2024 05:30:57 -0400
Andrew Lamb <[email protected]>
wrote:
My opinion is that most (if not all) of the proposed benefits from these
new formats can be achieved using the current parquet format and improved
implementations (possibly with some minor extensions such as user defined
encoding schemes) [1].
Another reason people propose replacing parquet, I think, is the "what is
V2 and what supports it" confusion, along with a perception that the
Apache Parquet community mostly focuses on parquet-mr and not the format
or the myriad of other implementations. Thankfully this is starting to
change [2].
Thus, I think the best response for the Parquet community to these new
format proposals is to clarify the current implementation situation
(which will indirectly lead to more investment in current
implementations).
Note this doesn't preclude a "v3" of parquet, but I think in order to
drive v3 adoption we first need to get the existing communication in
better working order.
Andrew
[1] I realize I need some more data to back up that assertion, and I am
working on it.
[2] https://github.com/apache/parquet-site/pull/53
On Sun, May 12, 2024 at 4:48 AM Gang Wu
<[email protected]> wrote:
Hi Micah,
I have also noticed the emergence of these new file formats, which are
challenging the popularity of Apache Parquet. It would always be good to
evolve Parquet to stay competitive. Personally I'm +1 on this. I'm also
proposing adding a new geometry type to the spec [1]. This seems to
align with the goal of V3 to some extent.
On the other hand, I'm also concerned about some aspects:
1. Are there sufficient developers to work on this? As a committer to
both parquet-cpp and parquet-mr, I can take part in V3, but I'm not sure
there are enough active contributors. It would be good if some companies
could have dedicated people to work on this and move things forward.
2. Users may not be willing to adopt a new format if their current
business use cases do not have any issues, especially users from large
enterprises. Think about the current issues with V2 [2].
All in all, I feel excited about V3.
[1] https://lists.apache.org/thread/q20b8kjvs27ly0w2zzxld029nwkc5fhx
[2] https://lists.apache.org/thread/r8djjov7wyy8646qm2xzwn9p2olsk9wn
Best,
Gang
On Sun, May 12, 2024 at 6:59 AM Micah Kornfield <[email protected]>
wrote:
Hi Parquet Dev,
I wanted to start a conversation within the community about working on a
new revision of Parquet. For context, there have been a bunch of new
formats [1][2][3] that show there is decent room for improvement across
data encodings and how metadata is organized.
Specifically, in a new format revision I think we should be thinking
about the following areas for improvement:
1. More efficient encodings that allow for data skipping and SIMD
optimizations.
2. More efficient metadata handling for deserialization and projection,
to address cases where metadata deserialization time is not trivial [4].
3. Possibly thinking about different encodings instead of
repetition/definition levels for repeated and nested fields (a small
sketch contrasting the two appears after this list).
4. Support for optimizing semi-structured data (e.g. JSON or Variant
type) that can shred elements into individual columns (a recent thread
in Iceberg mentions doing this at the metadata level [5]).
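To make point 3 a bit more concrete, here is a minimal sketch of the
status quo versus one commonly discussed offsets-based alternative,
assuming an optional list of required int32 values holding the records
[[1, 2], [], [3]]. This only illustrates the trade-off; it is not a
proposal for what a new revision would actually adopt.

    fn main() {
        // Parquet today: Dremel-style levels, one (repetition, definition)
        // pair per leaf slot; only defined leaf values are stored.
        // Max definition level is 2, max repetition level is 1.
        let rep_levels = [0, 1, 0, 0]; // 0 = new record, 1 = same list
        let def_levels = [2, 2, 1, 2]; // 2 = value present, 1 = empty list
        let values = [1, 2, 3];

        // Offsets-based alternative: list i spans
        // values[offsets[i]..offsets[i + 1]], so skipping to record N is
        // a single offset lookup rather than a scan over levels.
        let offsets = [0, 2, 2, 3];
        let list_values = [1, 2, 3];

        assert_eq!(rep_levels.len(), def_levels.len());
        assert_eq!(&values[offsets[2]..offsets[3]], &list_values[2..3]);
    }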
I think the goals of V3 would be to provide existing API compatibility
as broadly as possible (possibly with some performance loss) and expose
new API surface areas where appropriate to make use of new elements. New
encodings could be backported so they can be used without metadata
changes. Unfortunately, I think points 2 and 3 would require breaking
file-level compatibility. More thought would be needed to consider
whether point 4 could be backported effectively.
This is a non-trivial amount of work to get good coverage across
implementations, so before putting together a more formal proposal it
would be nice to know:
1. Whether there is an appetite in the general community to consider
these changes
2. Whether anybody from the community is interested in collaborating on
proposals/implementation in this area.
Thanks,
Micah
[1] https://github.com/maxi-k/btrblocks
[2] https://github.com/facebookincubator/nimble
[3] https://blog.lancedb.com/lance-v2/
[4] https://github.com/apache/arrow/issues/39676
[5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34