One thing the Delta table format has done that I think is smart is to
stop using version numbers and instead identify the specific features a
table uses in a generic fashion. So instead of
checking an opaque version number, a reader looks at the list of features
and can say "I don't recognize the feature identified as 'deletionVectors'
and therefore I can't read this table."
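
To make that concrete, here is a rough sketch of what such a reader-side
check could look like. It is not Delta's actual API: the function names,
the exception type and the contents of the supported set are all
assumptions made for illustration. Only the 'deletionVectors' identifier
comes from the example above.

```python
# Hypothetical reader-side feature check; not taken from any real library.
# The supported set below is a placeholder chosen for illustration.
SUPPORTED_FEATURES = frozenset({"columnMapping", "timestampNtz"})


class UnsupportedFeatureError(Exception):
    """Raised when a table requires a feature this reader does not know."""


def check_readable(table_features, supported=SUPPORTED_FEATURES):
    """Raise UnsupportedFeatureError if any required feature is unknown."""
    unknown = sorted(set(table_features) - set(supported))
    if unknown:
        raise UnsupportedFeatureError(
            "cannot read table, unrecognized features: " + ", ".join(unknown)
        )


def can_read(table_features, supported=SUPPORTED_FEATURES):
    """Return True if every feature the table lists is recognized."""
    try:
        check_readable(table_features, supported)
    except UnsupportedFeatureError:
        return False
    return True
```

The point of the design is that the reader refuses based on a named
feature it can report back to the user, rather than on a version number
that bundles unrelated changes together.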

On Mon, May 13, 2024 at 8:10 AM Raphael Taylor-Davies
<[email protected]> wrote:

> Further to what has already been said, I have likewise found the v2
> branding quite hard to follow, but more fundamentally I have struggled
> to understand its purpose. As far as I understand it, version 2 groups
> together a number of disjoint features from new data pages to different
> encodings, that practically speaking implementations can and do support
> independently. Adding further confusion to this situation is that there
> are also a number of features such as page indexes, bloom filters,
> statistics improvements, etc... that appear to sit outside of this
> versioning?
>
> I guess I wonder if rather than having a parquet format version 2, or
> even a parquet format version 3, we could just document what features a
> given parquet implementation actually supports. I believe Andrew intends
> to pick up on where previous efforts here left off. Not only would this
> allow for quicker ecosystem adoption of smaller / less controversial
> changes, for example version 2 data pages, but could also be used to
> highlight higher-level functionality such as late materialization that
> are more a function of the reader implementation than the format itself.
>
> I can't confess to having closely followed every proposed parquet
> replacement but I have not yet seen anything that couldn't be done in an
> additive fashion on top of parquet, by extending the format and/or the
> implementations. I personally would be very interested in delta
> encodings that are more amenable to record skipping and SIMD, as I have
> struggled to make the Rust version of the existing parquet DELTA
> encodings perform as well as the PLAIN encodings.
>
> Kind Regards,
>
> Raphael
>
> On 13/05/2024 13:55, Antoine Pitrou wrote:
> > Same as Andrew.
> >
> > 1) the "v3" messaging is intuitively a turn-off as it's already not
> > obvious whether Parquet "v2" is usable with implementations currently
> > found in the wild. Concretely, the "v2" branding is commonly confused
> > with the Parquet format version, and it's almost impossible to explain
> > how they relate and differ without diving into implementation minutiae.
> >
> > 2) the "v3" messaging doesn't say anything about compatibility or
> > features: is "v3" a functional superset of "v2"? is it a clean slate
> > redesign of the Parquet format? does it use different technologies (for
> > example Flatbuffers instead of Thrift)?
> >
> > While I would be curious to see a list of proposed changes, I'm also not
> > very convinced that launching such an initiative is desirable nor
> > sustainable for the Parquet development community.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Sun, 12 May 2024 05:30:57 -0400
> > Andrew Lamb <[email protected]>
> > wrote:
> >> My opinion is that most (if not all) of the proposed benefits from these
> >> new formats can be achieved using the current parquet format and improved
> >> implementations (possibly with some minor extensions such as user defined
> >> encoding schemes)[1]
> >>
> >> Another reason people propose replacing parquet I think is the "what is V2
> >> and what supports it" confusion, along with a perception that the Apache
> >> Parquet community mostly focuses on parquet-mr and not the format or the
> >> myriad of other implementations. Thankfully this is starting to change[2]
> >>
> >> Thus, I think the best response for the Parquet community to these new
> >> format proposals is to clarify the current implementation situation (which
> >> will indirectly lead to more investment in current implementations)
> >>
> >> Note this doesn't preclude "v3" of parquet, but I think in order to
> >> drive V3 adoption we first need to get the existing communication in better
> >> working order
> >>
> >> Andrew
> >>
> >> [1] I realize I need some more data to back up that assertion, and I am
> >> working on it.
> >> [2] https://github.com/apache/parquet-site/pull/53
> >>
> >>
> >>
> >> On Sun, May 12, 2024 at 4:48 AM Gang Wu <[email protected]> wrote:
> >>
> >>> Hi Micah,
> >>>
> >>> I have also noticed the emergence of these new file formats which are
> >>> challenging the popularity of Apache Parquet. It would always be good
> >>> to evolve Parquet to be competitive. Personally I'm +1 on this. I'm also
> >>> proposing adding a new geometry type to the specs: [1]. This seems
> >>> to align with the goal of V3 to some extent.
> >>>
> >>> On the other hand, I'm also concerned with some aspects:
> >>> 1. Are there sufficient developers to work on this? As a committer to both
> >>> parquet-cpp and parquet-mr, I can take part in the V3 but I'm not sure if
> >>> there are enough active contributors. It would be good if some companies
> >>> could have dedicated people to work on this and move things forward.
> >>> 2. Users may not be willing to adopt new formats if current businesses
> >>> do not have any issue. Especially for users from large enterprises. Think
> >>> about the current issues of V2 [2].
> >>>
> >>> All in all, I feel excited about V3.
> >>>
> >>> [1] https://lists.apache.org/thread/q20b8kjvs27ly0w2zzxld029nwkc5fhx
> >>> [2] https://lists.apache.org/thread/r8djjov7wyy8646qm2xzwn9p2olsk9wn
> >>>
> >>> Best,
> >>> Gang
> >>>
> >>> On Sun, May 12, 2024 at 6:59 AM Micah Kornfield <[email protected]>
> >>> wrote:
> >>>
> >>>> Hi Parquet Dev,
> >>>> I wanted to start a conversation within the community about working on a
> >>>> new revision of Parquet.  For context there have been a bunch of new
> >>>> formats [1][2][3] that show there is decent room for improvement across
> >>>> data encodings and how metadata is organized.
> >>>>
> >>>> Specifically, in a new format revision I think we should be thinking about
> >>>> the following areas for improvements:
> >>>> 1.  More efficient encodings that allow for data skipping and SIMD
> >>>> optimizations.
> >>>> 2.  More efficient metadata handling for deserialization and projection to
> >>>> address areas when metadata deserialization time is not trivial [4].
> >>>> 3.  Possibly thinking about different encodings instead of
> >>>> repetition/definition for repeated and nested field
> >>>> 4.  Support for optimizing semi-structured data (e.g. JSON or Variant type)
> >>>> that can shred elements into individual columns (a recent thread in Iceberg
> >>>> mentions doing this at the metadata level [5])
> >>>>
> >>>> I think the goals of V3 would be to provide existing API compatibility as
> >>>> broadly as possible (possibly with some performance loss) and expose new
> >>>> API surface areas where appropriate to make use of new elements.  New
> >>>> encodings could be backported so they can be made use of without metadata
> >>>> changes.  I think unfortunately that for points 2 and 3 we would want to
> >>>> break file level compatibility.  More thought would be needed to consider
> >>>> whether 4 could be backported effectively.
> >>>>
> >>>> This is a non-trivial amount of work to get good coverage across
> >>>> implementations, so before putting together more formal proposal it would
> >>>> be nice to know if:
> >>>>
> >>>> 1.  If there is an appetite in the general community to consider these
> >>>> changes
> >>>> 2.  If anybody from the community is interested in collaborating on
> >>>> proposals/implementation in this area.
> >>>>
> >>>> Thanks,
> >>>> Micah
> >>>>
> >>>> [1] https://github.com/maxi-k/btrblocks
> >>>> [2] https://github.com/facebookincubator/nimble
> >>>> [3] https://blog.lancedb.com/lance-v2/
> >>>> [4] https://github.com/apache/arrow/issues/39676
> >>>> [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> >>>>
> >>>
> >
> >
>
