I would be quite interested in working on data skipping and metadata bottlenecks (points 1 and 2).

On Mon, May 13, 2024 at 5:28 PM Curt Hagenlocher <c...@hagenlocher.org> wrote:

> One of the things they've done in the Delta table format which I think is
> smart is to stop using version numbers and instead start identifying
> specific features used by the table in a generic fashion. So instead of
> checking an opaque version number, a reader looks at the list of features
> and can say "I don't recognize the feature identified as 'deletionVectors'
> and therefore I can't read this table."
>
> On Mon, May 13, 2024 at 8:10 AM Raphael Taylor-Davies
> <r.taylordav...@googlemail.com.invalid> wrote:
>
> > Further to what has already been said, I have likewise found the v2
> > branding quite hard to follow, but more fundamentally I have struggled
> > to understand its purpose. As far as I understand it, version 2 groups
> > together a number of disjoint features from new data pages to different
> > encodings, that practically speaking implementations can and do support
> > independently. Adding further confusion to this situation is that there
> > are also a number of features such as page indexes, bloom filters,
> > statistics improvements, etc... that appear to sit outside of this
> > versioning?
> >
> > I guess I wonder if rather than having a parquet format version 2, or
> > even a parquet format version 3, we could just document what features a
> > given parquet implementation actually supports. I believe Andrew intends
> > to pick up on where previous efforts here left off. Not only would this
> > allow for quicker ecosystem adoption of smaller / less controversial
> > changes, for example version 2 data pages, but could also be used to
> > highlight higher-level functionality such as late materialization that
> > are more a function of the reader implementation than the format itself.
> >
> > I can't confess to having closely followed every proposed parquet
> > replacement but I have not yet seen anything that couldn't be done in an
> > additive fashion on top of parquet, by extending the format and/or the
> > implementations. I personally would be very interested in delta
> > encodings that are more amenable to record skipping and SIMD, as I have
> > struggled to make the Rust version of the existing parquet DELTA
> > encodings perform as well as the PLAIN encodings.
> >
> > Kind Regards,
> >
> > Raphael
> >
> > On 13/05/2024 13:55, Antoine Pitrou wrote:
> > > Same as Andrew.
> > >
> > > 1) the "v3" messaging is intuitively a turn-off as it's already not
> > > obvious whether Parquet "v2" is usable with implementations currently
> > > found in the wild. Concretely, the "v2" branding is commonly confused
> > > with the Parquet format version, and it's almost impossible to explain
> > > how they relate and differ without diving into implementation minutiae.
> > >
> > > 2) the "v3" messaging doesn't say anything about compatibility or
> > > features: is "v3" a functional superset of "v2"? is it a clean slate
> > > redesign of the Parquet format? does it use different technologies (for
> > > example Flatbuffers instead of Thrift)?
> > >
> > > While I would be curious to see a list of proposed changes, I'm also
> > > not very convinced that launching such an initiative is desirable nor
> > > sustainable for the Parquet development community.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On Sun, 12 May 2024 05:30:57 -0400
> > > Andrew Lamb <andrewlam...@gmail.com>
> > > wrote:
> > >> My opinion is that most (if not all) of the proposed benefits from
> > >> these new formats can be achieved using the current parquet format
> > >> and improved implementations (possibly with some minor extensions
> > >> such as user defined encoding schemes)[1]
> > >>
> > >> Another reason people propose replacing parquet I think is the "what
> > >> is V2 and what supports it" confusion, along with a perception that
> > >> the Apache Parquet community mostly focuses on parquet-mr and not the
> > >> format or the myriad of other implementations. Thankfully this is
> > >> starting to change[2]
> > >>
> > >> Thus, I think the best response for the Parquet community to these
> > >> new format proposals is to clarify the current implementation
> > >> situation (which will indirectly lead to more investment in current
> > >> implementations)
> > >>
> > >> Note this doesn't preclude "v3" of parquet, but I think in order to
> > >> drive V3 adoption we first need to get the existing communication in
> > >> better working order
> > >>
> > >> Andrew
> > >>
> > >> [1] I realize I need some more data to back up that assertion, and I
> > >> am working on it.
> > >> [2] https://github.com/apache/parquet-site/pull/53
> > >>
> > >>
> > >> On Sun, May 12, 2024 at 4:48 AM Gang Wu <
> > >> ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > >>
> > >>> Hi Micah,
> > >>>
> > >>> I have also noticed the emergence of these new file formats which
> > >>> are challenging the popularity of Apache Parquet. It would always be
> > >>> good to evolve Parquet to be competitive. Personally I'm +1 on this.
> > >>> I'm also proposing adding a new geometry type to the specs: [1].
> > >>> This seems to align with the goal of V3 to some extent.
> > >>>
> > >>> On the other hand, I'm also concerned with some aspects:
> > >>> 1. Are there sufficient developers to work on this? As a committer
> > >>> to both parquet-cpp and parquet-mr, I can take part in the V3 but
> > >>> I'm not sure if there are enough active contributors. It would be
> > >>> good if some companies could have dedicated people to work on this
> > >>> and move things forward.
> > >>> 2. Users may not be willing to adopt new formats if current
> > >>> businesses do not have any issue. Especially for users from large
> > >>> enterprises. Think about the current issues of V2 [2].
> > >>>
> > >>> All in all, I feel excited about V3.
> > >>>
> > >>> [1] https://lists.apache.org/thread/q20b8kjvs27ly0w2zzxld029nwkc5fhx
> > >>> [2] https://lists.apache.org/thread/r8djjov7wyy8646qm2xzwn9p2olsk9wn
> > >>>
> > >>> Best,
> > >>> Gang
> > >>>
> > >>> On Sun, May 12, 2024 at 6:59 AM Micah Kornfield
> > >>> <emkornfi...@gmail.com> wrote:
> > >>>
> > >>>> Hi Parquet Dev,
> > >>>> I wanted to start a conversation within the community about working
> > >>>> on a new revision of Parquet. For context there have been a bunch
> > >>>> of new formats [1][2][3] that show there is decent room for
> > >>>> improvement across data encodings and how metadata is organized.
> > >>>>
> > >>>> Specifically, in a new format revision I think we should be
> > >>>> thinking about the following areas for improvements:
> > >>>> 1. More efficient encodings that allow for data skipping and SIMD
> > >>>> optimizations.
> > >>>> 2. More efficient metadata handling for deserialization and
> > >>>> projection to address areas when metadata deserialization time is
> > >>>> not trivial [4].
> > >>>> 3. Possibly thinking about different encodings instead of
> > >>>> repetition/definition for repeated and nested field
> > >>>> 4. Support for optimizing semi-structured data (e.g. JSON or
> > >>>> Variant type) that can shred elements into individual columns (a
> > >>>> recent thread in Iceberg mentions doing this at the metadata
> > >>>> level [5])
> > >>>>
> > >>>> I think the goals of V3 would be to provide existing API
> > >>>> compatibility as broadly as possible (possibly with some
> > >>>> performance loss) and expose new API surface areas where
> > >>>> appropriate to make use of new elements. New encodings could be
> > >>>> backported so they can be made use of without metadata changes. I
> > >>>> think unfortunately that for points 2 and 3 we would want to break
> > >>>> file level compatibility. More thought would be needed to consider
> > >>>> whether 4 could be backported effectively.
> > >>>>
> > >>>> This is a non-trivial amount of work to get good coverage across
> > >>>> implementations, so before putting together more formal proposal it
> > >>>> would be nice to know if:
> > >>>>
> > >>>> 1. If there is an appetite in the general community to consider
> > >>>> these changes
> > >>>> 2. If anybody from the community is interested in collaborating on
> > >>>> proposals/implementation in this area.
> > >>>>
> > >>>> Thanks,
> > >>>> Micah
> > >>>>
> > >>>> [1] https://github.com/maxi-k/btrblocks
> > >>>> [2] https://github.com/facebookincubator/nimble
> > >>>> [3] https://blog.lancedb.com/lance-v2/
> > >>>> [4] https://github.com/apache/arrow/issues/39676
> > >>>> [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34