Re: Interest in Parquet V3

Curt Hagenlocher Mon, 13 May 2024 10:07:01 -0700

There must be something in the water: Nimble and Lance: The Parquet Killers
- by Chris Riccomini (materializedview.io)
<https://materializedview.io/p/nimble-and-lance-parquet-killers?utm_campaign=email-half-post&r=3pf5se&utm_source=substack&utm_medium=email>


On Mon, May 13, 2024 at 10:01 AM Rok Mihevc <rok.mih...@gmail.com> wrote:

> I would be quite interested in working on data skipping and metadata
> bottlenecks (points 1. and 2.).
>
> On Mon, May 13, 2024 at 5:28 PM Curt Hagenlocher <c...@hagenlocher.org>
> wrote:
>
> > One of the things they've done in the Delta table format which I think is
> > smart is to stop using version numbers and instead start identifying
> > specific features used by the table in a generic fashion. So instead of
> > checking an opaque version number, a reader looks at the list of features
> > and can say "I don't recognize the feature identified as
> 'deletionVectors'
> > and therefore I can't read this table."
> >
> > On Mon, May 13, 2024 at 8:10 AM Raphael Taylor-Davies
> > <r.taylordav...@googlemail.com.invalid> wrote:
> >
> > > Further to what has already been said, I have likewise found the v2
> > > branding quite hard to follow, but more fundamentally I have struggled
> > > to understand its purpose. As far as I understand it, version 2 groups
> > > together a number of disjoint features from new data pages to different
> > > encodings, that practically speaking implementations can and do support
> > > independently. Adding further confusion to this situation is that there
> > > are also a number of features such as page indexes, bloom filters,
> > > statistics improvements, etc... that appear to sit outside of this
> > > versioning?
> > >
> > > I guess I wonder if rather than having a parquet format version 2, or
> > > even a parquet format version 3, we could just document what features a
> > > given parquet implementation actually supports. I believe Andrew
> intends
> > > to pick up on where previous efforts here left off. Not only would this
> > > allow for quicker ecosystem adoption of smaller / less controversial
> > > changes, for example version 2 data pages, but could also be used to
> > > highlight higher-level functionality such as late materialization that
> > > is more a function of the reader implementation than the format
> itself.
> > >
> > > I can't confess to having closely followed every proposed parquet
> > > replacement but I have not yet seen anything that couldn't be done in
> an
> > > additive fashion on top of parquet, by extending the format and/or the
> > > implementations. I personally would be very interested in delta
> > > encodings that are more amenable to record skipping and SIMD, as I have
> > > struggled to make the Rust version of the existing parquet DELTA
> > > encodings perform as well as the PLAIN encodings.
> > >
> > > Kind Regards,
> > >
> > > Raphael
> > >
> > > On 13/05/2024 13:55, Antoine Pitrou wrote:
> > > > Same as Andrew.
> > > >
> > > > 1) the "v3" messaging is intuitively a turn-off as it's already not
> > > > obvious whether Parquet "v2" is usable with implementations currently
> > > > found in the wild. Concretely, the "v2" branding is commonly confused
> > > > with the Parquet format version, and it's almost impossible to
> explain
> > > > how they relate and differ without diving into implementation
> minutiae.
> > > >
> > > > 2) the "v3" messaging doesn't say anything about compatibility or
> > > > features: is "v3" a functional superset of "v2"? is it a clean slate
> > > > redesign of the Parquet format? does it use different technologies
> (for
> > > > example Flatbuffers instead of Thrift)?
> > > >
> > > > While I would be curious to see a list of proposed changes, I'm also
> > not
> > > > very convinced that launching such an initiative is desirable or
> > > > sustainable for the Parquet development community.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On Sun, 12 May 2024 05:30:57 -0400
> > > > Andrew Lamb <andrewlam...@gmail.com>
> > > > wrote:
> > > >> My opinion is that most (if not all) of the proposed benefits from
> > these
> > > >> new formats can be achieved using the current parquet format and
> > > improved
> > > >> implementations (possibly with some minor extensions such as user
> > > defined
> > > >> encoding schemes)[1]
> > > >>
> > > >> Another reason people propose replacing parquet I think is the "what
> > is
> > > V2
> > > >> and what supports it" confusion, along with a perception that the
> > Apache
> > > >> Parquet community mostly focuses on parquet-mr and not the format or
> > the
> > > >> myriad of other implementations. Thankfully this is starting to
> > > change[2]
> > > >>
> > > >> Thus, I think the best response for the Parquet community to these
> new
> > > >> format proposals is to clarify the current implementation situation
> > > (which
> > > >> will indirectly lead to more investment in current implementations)
> > > >>
> > > >> Note this doesn't preclude "v3" of parquet, but I think in order to
> > > >> drive V3 adoption we first need to get the existing communication in
> > > better
> > > >> working order
> > > >>
> > > >> Andrew
> > > >>
> > > >> [1] I realize I need some more data to back up that assertion, and I
> > am
> > > >> working on it.
> > > >> [2] https://github.com/apache/parquet-site/pull/53
> > > >>
> > > >>
> > > >>
> > > >> On Sun, May 12, 2024 at 4:48 AM Gang Wu <
> > > ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > > >>
> > > >>> Hi Micah,
> > > >>>
> > > >>> I have also noticed the emergence of these new file formats which
> are
> > > >>> challenging the popularity of Apache Parquet. It would always be
> good
> > > >>> to evolve Parquet to be competitive. Personally I'm +1 on this. I'm
> > > also
> > > >>> proposing adding a new geometry type to the specs: [1]. This seems
> > > >>> to align with the goal of V3 to some extent.
> > > >>>
> > > >>> On the other hand, I'm also concerned with some aspects:
> > > >>> 1. Are there sufficient developers to work on this? As a committer
> to
> > > both
> > > >>> parquet-cpp and parquet-mr, I can take part in the V3 but I'm not
> > sure
> > > if
> > > >>> there are enough active contributors. It would be good if some
> > > companies
> > > >>> could have dedicated people to work on this and move things
> forward.
> > > >>> 2. Users may not be willing to adopt new formats if current
> > businesses
> > > >>> do not have any issue. Especially for users from large enterprises.
> > > Think
> > > >>> about the current issues of V2 [2].
> > > >>>
> > > >>> All in all, I feel excited about V3.
> > > >>>
> > > >>> [1]
> https://lists.apache.org/thread/q20b8kjvs27ly0w2zzxld029nwkc5fhx
> > > >>> [2]
> https://lists.apache.org/thread/r8djjov7wyy8646qm2xzwn9p2olsk9wn
> > > >>>
> > > >>> Best,
> > > >>> Gang
> > > >>>
> > > >>> On Sun, May 12, 2024 at 6:59 AM Micah Kornfield <
> > emkornfi...@gmail.com
> > > >
> > > >>> wrote:
> > > >>>
> > > >>>> Hi Parquet Dev,
> > > >>>> I wanted to start a conversation within the community about
> working
> > > on a
> > > >>>> new revision of Parquet.  For context there have been a bunch of
> new
> > > >>>> formats [1][2][3] that show there is decent room for improvement
> > > across
> > > >>>> data encodings and how metadata is organized.
> > > >>>>
> > > >>>> Specifically, in a new format revision I think we should be
> thinking
> > > >>> about
> > > >>>> the following areas for improvements:
> > > >>>> 1.  More efficient encodings that allow for data skipping and SIMD
> > > >>>> optimizations.
> > > >>>> 2.  More efficient metadata handling for deserialization and
> > > projection
> > > >>> to
> > > >>>> address areas when metadata deserialization time is not trivial
> [4].
> > > >>>> 3.  Possibly thinking about different encodings instead of
> > > >>>> repetition/definition for repeated and nested fields
> > > >>>> 4.  Support for optimizing semi-structured data (e.g. JSON or
> > Variant
> > > >>> type)
> > > >>>> that can shred elements into individual columns (a recent thread
> in
> > > >>> Iceberg
> > > >>>> mentions doing this at the metadata level [5])
> > > >>>>
> > > >>>> I think the goals of V3 would be to provide existing API
> > > compatibility as
> > > >>>> broadly as possible (possibly with some performance loss) and
> expose
> > > new
> > > >>>> API surface areas where appropriate to make use of new elements.
> > New
> > > >>>> encodings could be backported so they can be made use of without
> > > metadata
> > > >>>> changes.  I think unfortunately that for points 2 and 3 we would
> > want
> > > to
> > > >>>> break file level compatibility.  More thought would be needed to
> > > consider
> > > >>>> whether 4 could be backported effectively.
> > > >>>>
> > > >>>> This is a non-trivial amount of work to get good coverage across
> > > >>>> implementations, so before putting together a more formal proposal
> it
> > > would
> > > >>>> be nice to know if:
> > > >>>>
> > > >>>> 1.  If there is an appetite in the general community to consider
> > these
> > > >>>> changes
> > > >>>> 2.  If anybody from the community is interested in collaborating
> on
> > > >>>> proposals/implementation in this area.
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Micah
> > > >>>>
> > > >>>> [1] https://github.com/maxi-k/btrblocks
> > > >>>> [2] https://github.com/facebookincubator/nimble
> > > >>>> [3] https://blog.lancedb.com/lance-v2/
> > > >>>> [4] https://github.com/apache/arrow/issues/39676
> > > >>>> [5]
> > https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> > > >>>>
> > > >>>
> > > >
> > > >
> > >
> >
>
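
To make the feature-list idea quoted above concrete, here is a minimal
sketch of a reader gating on named features rather than on a version
number. It is an illustration only, not the Delta protocol or any Parquet
API: the function names and the set of supported features are hypothetical,
and only the 'deletionVectors' name comes from the thread.

```python
# Hypothetical sketch: a reader declares the named features it understands
# and refuses any table that requires a feature outside that set.
# The feature names in this set are illustrative placeholders.
SUPPORTED_READER_FEATURES = frozenset({"columnMapping", "timestampNtz"})


def unsupported_features(required, supported=SUPPORTED_READER_FEATURES):
    """Return the required features this reader does not recognize, sorted."""
    return sorted(set(required) - set(supported))


def check_readable(required, supported=SUPPORTED_READER_FEATURES):
    """Raise ValueError naming every unrecognized feature, else return None."""
    missing = unsupported_features(required, supported)
    if missing:
        names = ", ".join(repr(name) for name in missing)
        raise ValueError(
            f"cannot read this table: unrecognized feature(s) {names}"
        )
```

The point of the design is that the error names the exact feature that
blocks the read, which an opaque version number cannot do.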

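On the point quoted above about delta encodings and record skipping, a
small sketch may help show where the cost comes from. This is a deliberately
simplified delta scheme (first value plus successive differences), not the
layout of Parquet's DELTA encodings, and all function names are
hypothetical. With plain values a reader can jump straight to a row; with
deltas it has to accumulate every difference that precedes that row.

```python
def delta_encode(values):
    """Encode integers as (first value, list of successive differences)."""
    if not values:
        return None, []
    return values[0], [b - a for a, b in zip(values, values[1:])]


def plain_decode_at(values, index):
    """Plain values: skipping to a row is a direct index."""
    return values[index]


def delta_decode_at(first, deltas, index):
    """Simplified deltas: reaching row `index` sums all earlier differences,
    so the work grows with the number of rows skipped."""
    return first + sum(deltas[:index])
```

Under these assumptions, skipping to a row costs a constant amount of work
for plain values but work proportional to the row's position for deltas,
unless the format also stores intermediate absolute values to restart from.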