I would be quite interested in working on data skipping and metadata
bottlenecks (points 1. and 2.).

On Mon, May 13, 2024 at 5:28 PM Curt Hagenlocher <c...@hagenlocher.org>
wrote:

> One of the things they've done in the Delta table format which I think is
> smart is to stop using version numbers and instead start identifying
> specific features used by the table in a generic fashion. So instead of
> checking an opaque version number, a reader looks at the list of features
> and can say "I don't recognize the feature identified as 'deletionVectors'
> and therefore I can't read this table."
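
The feature-list check described above might look roughly like the sketch below. All names here (`SUPPORTED_FEATURES`, `unsupported_features`, `check_readable`) are hypothetical and chosen for illustration; they are not the API of any Delta or Parquet implementation. The feature strings other than 'deletionVectors' are placeholders.

```python
# Minimal sketch of feature-based compatibility checking.
# Assumption: a table advertises the features it uses as a list of strings,
# and a reader knows the set of features it has implemented.

# Hypothetical set of features this reader implements (placeholder names).
SUPPORTED_FEATURES = frozenset({"columnMapping", "timestampNtz"})


def unsupported_features(table_features, supported=SUPPORTED_FEATURES):
    """Return the sorted list of features the table uses but the reader lacks."""
    return sorted(set(table_features) - set(supported))


def check_readable(table_features, supported=SUPPORTED_FEATURES):
    """Raise ValueError naming every unrecognized feature; return None otherwise."""
    missing = unsupported_features(table_features, supported)
    if missing:
        raise ValueError(
            "cannot read table; unrecognized features: " + ", ".join(missing)
        )


if __name__ == "__main__":
    check_readable(["columnMapping"])  # readable, no exception
    try:
        check_readable(["columnMapping", "deletionVectors"])
    except ValueError as err:
        print(err)
```

The point of the design is that the failure names the exact feature that is missing, which an opaque version number cannot do.
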
>
> On Mon, May 13, 2024 at 8:10 AM Raphael Taylor-Davies
> <r.taylordav...@googlemail.com.invalid> wrote:
>
> > Further to what has already been said, I have likewise found the v2
> > branding quite hard to follow, but more fundamentally I have struggled
> > to understand its purpose. As far as I understand it, version 2 groups
> > together a number of disjoint features from new data pages to different
> > encodings, that practically speaking implementations can and do support
> > independently. Adding further confusion to this situation is that there
> > are also a number of features such as page indexes, bloom filters,
> > statistics improvements, etc... that appear to sit outside of this
> > versioning?
> >
> > I guess I wonder if rather than having a parquet format version 2, or
> > even a parquet format version 3, we could just document what features a
> > given parquet implementation actually supports. I believe Andrew intends
> > to pick up on where previous efforts here left off. Not only would this
> > allow for quicker ecosystem adoption of smaller / less controversial
> > > changes, for example version 2 data pages, but it could also be used to
> > > highlight higher-level functionality such as late materialization that
> > > is more a function of the reader implementation than the format itself.
> >
> > > I can't claim to have closely followed every proposed parquet
> > replacement but I have not yet seen anything that couldn't be done in an
> > additive fashion on top of parquet, by extending the format and/or the
> > implementations. I personally would be very interested in delta
> > encodings that are more amenable to record skipping and SIMD, as I have
> > struggled to make the Rust version of the existing parquet DELTA
> > encodings perform as well as the PLAIN encodings.
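
One reason delta-style encodings tend to be awkward for record skipping can be seen in a deliberately simplified sketch. This is plain first-value-plus-differences delta coding, not Parquet's actual DELTA_BINARY_PACKED layout, and the function names are made up for illustration.

```python
# Simplified delta coding sketch (assumption: not the Parquet on-disk layout).
from itertools import accumulate


def delta_encode(values):
    """Store the first value followed by successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded):
    """Recover the original values with a running (prefix) sum."""
    return list(accumulate(encoded))


def value_at(encoded, index):
    """Random access still has to sum every delta up to `index`."""
    return sum(encoded[: index + 1])


if __name__ == "__main__":
    original = [10, 12, 15, 15]
    encoded = delta_encode(original)
    print(encoded)
    print(delta_decode(encoded) == original)
    print(value_at(encoded, 2))
```

With a plain fixed-width layout the value at a given index can be read directly, whereas here skipping to a record means accumulating everything before it; the running sum is also a serial dependency, which is part of what makes vectorizing it harder.
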
> >
> > Kind Regards,
> >
> > Raphael
> >
> > On 13/05/2024 13:55, Antoine Pitrou wrote:
> > > Same as Andrew.
> > >
> > > 1) the "v3" messaging is intuitively a turn-off as it's already not
> > > > obvious whether Parquet "v2" is usable with implementations currently
> > > found in the wild. Concretely, the "v2" branding is commonly confused
> > > with the Parquet format version, and it's almost impossible to explain
> > > how they relate and differ without diving into implementation minutiae.
> > >
> > > 2) the "v3" messaging doesn't say anything about compatibility or
> > > features: is "v3" a functional superset of "v2"? is it a clean slate
> > > redesign of the Parquet format? does it use different technologies (for
> > > example Flatbuffers instead of Thrift)?
> > >
> > > > While I would be curious to see a list of proposed changes, I'm also not
> > > > very convinced that launching such an initiative is desirable or
> > > > sustainable for the Parquet development community.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On Sun, 12 May 2024 05:30:57 -0400
> > > Andrew Lamb <andrewlam...@gmail.com>
> > > wrote:
> > > >> My opinion is that most (if not all) of the proposed benefits from these
> > > >> new formats can be achieved using the current parquet format and improved
> > > >> implementations (possibly with some minor extensions such as user-defined
> > > >> encoding schemes)[1]
> > >>
> > >> Another reason people propose replacing parquet I think is the "what
> is
> > V2
> > >> and what supports it" confusion, along with a perception that the
> Apache
> > >> Parquet community mostly focuses on parquet-mr and not the format or
> the
> > >> myriad of other implementations. Thankfully this is starting to
> > change[2]
> > >>
> > >> Thus, I think the best response for the Parquet community to these new
> > >> format proposals is to clarify the current implementation situation
> > (which
> > >> will indirectly lead to more investment in current implementations)
> > >>
> > >> Note this doesn't preclude "v3" of parquet, but I think in order to
> > >> drive V3 adoption we first need to get the existing communication in
> > better
> > >> working order
> > >>
> > >> Andrew
> > >>
> > >> [1] I realize I need some more data to back up that assertion, and I
> am
> > >> working on it.
> > >> [2] https://github.com/apache/parquet-site/pull/53
> > >>
> > >>
> > >>
> > >> On Sun, May 12, 2024 at 4:48 AM Gang Wu <
> > ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > >>
> > >>> Hi Micah,
> > >>>
> > >>> I have also noticed the emergence of these new file formats which are
> > >>> challenging the popularity of Apache Parquet. It would always be good
> > >>> to evolve Parquet to be competitive. Personally I'm +1 on this. I'm
> > also
> > >>> proposing adding a new geometry type to the specs: [1]. This seems
> > >>> to align with the goal of V3 to some extent.
> > >>>
> > >>> On the other hand, I'm also concerned with some aspects:
> > >>> 1. Are there sufficient developers to work on this? As a committer to
> > both
> > >>> parquet-cpp and parquet-mr, I can take part in the V3 but I'm not
> sure
> > if
> > >>> there are enough active contributors. It would be good if some
> > companies
> > >>> could have dedicated people to work on this and move things forward.
> > >>> 2. Users may not be willing to adopt new formats if current
> businesses
> > >>> do not have any issue. Especially for users from large enterprises.
> > Think
> > >>> about the current issues of V2 [2].
> > >>>
> > >>> All in all, I feel excited about V3.
> > >>>
> > >>> [1] https://lists.apache.org/thread/q20b8kjvs27ly0w2zzxld029nwkc5fhx
> > >>> [2] https://lists.apache.org/thread/r8djjov7wyy8646qm2xzwn9p2olsk9wn
> > >>>
> > >>> Best,
> > >>> Gang
> > >>>
> > >>> On Sun, May 12, 2024 at 6:59 AM Micah Kornfield <
> emkornfi...@gmail.com
> > >
> > >>> wrote:
> > >>>
> > >>>> Hi Parquet Dev,
> > >>>> I wanted to start a conversation within the community about working
> > on a
> > >>>> new revision of Parquet.  For context there have been a bunch of new
> > >>>> formats [1][2][3] that show there is decent room for improvement
> > across
> > >>>> data encodings and how metadata is organized.
> > >>>>
> > >>>> Specifically, in a new format revision I think we should be thinking
> > >>> about
> > >>>> the following areas for improvements:
> > >>>> 1.  More efficient encodings that allow for data skipping and SIMD
> > >>>> optimizations.
> > > >>>> 2.  More efficient metadata handling for deserialization and projection, to
> > > >>>> address cases where metadata deserialization time is not trivial [4].
> > >>>> 3.  Possibly thinking about different encodings instead of
> > > >>>> repetition/definition for repeated and nested fields.
> > >>>> 4.  Support for optimizing semi-structured data (e.g. JSON or
> Variant
> > >>> type)
> > >>>> that can shred elements into individual columns (a recent thread in
> > >>> Iceberg
> > >>>> mentions doing this at the metadata level [5])
> > >>>>
> > >>>> I think the goals of V3 would be to provide existing API
> > compatibility as
> > >>>> broadly as possible (possibly with some performance loss) and expose
> > new
> > >>>> API surface areas where appropriate to make use of new elements.
> New
> > >>>> encodings could be backported so they can be made use of without
> > metadata
> > >>>> changes.  I think unfortunately that for points 2 and 3 we would
> want
> > to
> > >>>> break file level compatibility.  More thought would be needed to
> > consider
> > >>>> whether 4 could be backported effectively.
> > >>>>
> > >>>> This is a non-trivial amount of work to get good coverage across
> > > >>>> implementations, so before putting together a more formal proposal it would
> > > >>>> be nice to know:
> > >>>>
> > >>>> 1.  If there is an appetite in the general community to consider
> these
> > >>>> changes
> > >>>> 2.  If anybody from the community is interested in collaborating on
> > >>>> proposals/implementation in this area.
> > >>>>
> > >>>> Thanks,
> > >>>> Micah
> > >>>>
> > >>>> [1] https://github.com/maxi-k/btrblocks
> > >>>> [2] https://github.com/facebookincubator/nimble
> > >>>> [3] https://blog.lancedb.com/lance-v2/
> > >>>> [4] https://github.com/apache/arrow/issues/39676
> > >>>> [5]
> https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> > >>>>
> > >>>
> > >
> > >
> >
>
