I think the whole "V1" vs "V2" mess is unfortunate. IMO there is only one
version of the Parquet file format. At its core, the data layout (row groups
composed of column chunks composed of Dremel-encoded pages) has
never changed. Encodings/codecs/structures have been added to that core,
but always in a backwards-compatible way.
I agree that many of the perceived shortcomings might be addressed without
breaking changes to the file format. I myself would be interested in
exploring ways to address the point-lookup and wide-table issues while
maintaining backwards compatibility. That said, if there are ways to achieve
large performance gains that would necessitate an actual new file format
version (such as replacing Thrift, a new metadata organization, or some
alternative to Dremel), I'd be open to exploring those options as well.
Thanks,
Ed
On 5/11/24 3:58 PM, Micah Kornfield wrote:
Hi Parquet Dev,
I wanted to start a conversation within the community about working on a
new revision of Parquet. For context, there have been a bunch of new
formats [1][2][3] that show there is decent room for improvement across
data encodings and how metadata is organized.
Specifically, in a new format revision I think we should be thinking about
the following areas for improvement:
1. More efficient encodings that allow for data skipping and SIMD
optimizations.
2. More efficient metadata handling for deserialization and projection to
address cases where metadata deserialization time is not trivial [4].
3. Possibly thinking about different encodings instead of
repetition/definition levels for repeated and nested fields.
4. Support for optimizing semi-structured data (e.g. JSON or Variant type)
that can shred elements into individual columns (a recent thread in Iceberg
mentions doing this at the metadata level [5]).
I think the goals of V3 would be to provide existing API compatibility as
broadly as possible (possibly with some performance loss) and expose new
API surface areas where appropriate to make use of new elements. New
encodings could be backported so they can be made use of without metadata
changes. Unfortunately, I think that for points 2 and 3 we would want to
break file level compatibility. More thought would be needed to consider
whether 4 could be backported effectively.
This is a non-trivial amount of work to get good coverage across
implementations, so before putting together a more formal proposal it would
be nice to know if:
1. There is an appetite in the general community to consider these
changes.
2. Anybody from the community is interested in collaborating on
proposals/implementation in this area.
Thanks,
Micah
[1] https://github.com/maxi-k/btrblocks
[2] https://github.com/facebookincubator/nimble
[3] https://blog.lancedb.com/lance-v2/
[4] https://github.com/apache/arrow/issues/39676
[5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34