One thing the Delta table format has done that I think is smart is to stop relying on version numbers and instead identify the specific features a table uses in a generic fashion. Instead of checking an opaque version number, a reader looks at the list of features and can say "I don't recognize the feature identified as 'deletionVectors' and therefore I can't read this table."
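To make that concrete, here is a minimal sketch in Python of what such a feature check could look like. It is not Delta's (or any library's) actual API: `TableMetadata`, `check_readable`, `UnsupportedFeatureError` and the contents of `SUPPORTED_READER_FEATURES` are names I made up for illustration. Only the 'deletionVectors' feature name comes from the example above.

```python
from dataclasses import dataclass, field


class UnsupportedFeatureError(Exception):
    """Raised when a table requires features this reader does not implement."""


@dataclass(frozen=True)
class TableMetadata:
    # Hypothetical metadata record: names of the features a reader must
    # understand in order to read the table correctly.
    reader_features: frozenset = field(default_factory=frozenset)


# Hypothetical: the set of features this particular reader implements.
SUPPORTED_READER_FEATURES = frozenset({"columnMapping", "timestampNtz"})


def check_readable(metadata, supported=SUPPORTED_READER_FEATURES):
    """Return normally if every required feature is supported, else raise."""
    # Set difference: features the table requires that this reader lacks.
    unknown = sorted(metadata.reader_features - supported)
    if unknown:
        raise UnsupportedFeatureError(
            "cannot read table, unrecognized features: " + ", ".join(unknown)
        )


if __name__ == "__main__":
    table = TableMetadata(frozenset({"columnMapping", "deletionVectors"}))
    try:
        check_readable(table)
    except UnsupportedFeatureError as err:
        print(err)
```

The point of the design is that the reader never compares version numbers: a table written with a feature the reader has never heard of is rejected by name, and a table using only a subset of known features is accepted regardless of how new the writer is.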

On Mon, May 13, 2024 at 8:10 AM Raphael Taylor-Davies <[email protected]> wrote:

> Further to what has already been said, I have likewise found the v2
> branding quite hard to follow, but more fundamentally I have struggled
> to understand its purpose. As far as I understand it, version 2 groups
> together a number of disjoint features from new data pages to different
> encodings, that practically speaking implementations can and do support
> independently. Adding further confusion to this situation is that there
> are also a number of features such as page indexes, bloom filters,
> statistics improvements, etc... that appear to sit outside of this
> versioning?
>
> I guess I wonder if rather than having a parquet format version 2, or
> even a parquet format version 3, we could just document what features a
> given parquet implementation actually supports. I believe Andrew intends
> to pick up on where previous efforts here left off. Not only would this
> allow for quicker ecosystem adoption of smaller / less controversial
> changes, for example version 2 data pages, but could also be used to
> highlight higher-level functionality such as late materialization that
> are more a function of the reader implementation than the format itself.
>
> I can't confess to having closely followed every proposed parquet
> replacement but I have not yet seen anything that couldn't be done in an
> additive fashion on top of parquet, by extending the format and/or the
> implementations. I personally would be very interested in delta
> encodings that are more amenable to record skipping and SIMD, as I have
> struggled to make the Rust version of the existing parquet DELTA
> encodings perform as well as the PLAIN encodings.
>
> Kind Regards,
>
> Raphael
>
> On 13/05/2024 13:55, Antoine Pitrou wrote:
> > Same as Andrew.
> >
> > 1) the "v3" messaging is intuitively a turn-off as it's already not
> > obvious whether Parquet "v2" is usable with implementations currently
> > found in the wild. Concretely, the "v2" branding is commonly confused
> > with the Parquet format version, and it's almost impossible to explain
> > how they relate and differ without diving into implementation minutiae.
> >
> > 2) the "v3" messaging doesn't say anything about compatibility or
> > features: is "v3" a functional superset of "v2"? is it a clean slate
> > redesign of the Parquet format? does it use different technologies (for
> > example Flatbuffers instead of Thrift)?
> >
> > While I would be curious to see a list of proposed changes, I'm also not
> > very convinced that launching such an initiative is desirable nor
> > sustainable for the Parquet development community.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Sun, 12 May 2024 05:30:57 -0400
> > Andrew Lamb <[email protected]>
> > wrote:
> >> My opinion is that most (if not all) of the proposed benefits from
> >> these new formats can be achieved using the current parquet format
> >> and improved implementations (possibly with some minor extensions
> >> such as user defined encoding schemes)[1]
> >>
> >> Another reason people propose replacing parquet I think is the "what
> >> is V2 and what supports it" confusion, along with a perception that
> >> the Apache Parquet community mostly focuses on parquet-mr and not the
> >> format or the myriad of other implementations. Thankfully this is
> >> starting to change[2]
> >>
> >> Thus, I think the best response for the Parquet community to these new
> >> format proposals is to clarify the current implementation situation
> >> (which will indirectly lead to more investment in current
> >> implementations)
> >>
> >> Note this doesn't preclude "v3" of parquet, but I think in order to
> >> drive V3 adoption we first need to get the existing communication in
> >> better working order
> >>
> >> Andrew
> >>
> >> [1] I realize I need some more data to back up that assertion, and I am
> >> working on it.
> >> [2] https://github.com/apache/parquet-site/pull/53
> >>
> >>
> >>
> >> On Sun, May 12, 2024 at 4:48 AM Gang Wu <[email protected]> wrote:
> >>
> >>> Hi Micah,
> >>>
> >>> I have also noticed the emergence of these new file formats which are
> >>> challenging the popularity of Apache Parquet. It would always be good
> >>> to evolve Parquet to be competitive. Personally I'm +1 on this. I'm
> >>> also proposing adding a new geometry type to the specs: [1]. This
> >>> seems to align with the goal of V3 to some extent.
> >>>
> >>> On the other hand, I'm also concerned with some aspects:
> >>> 1. Are there sufficient developers to work on this? As a committer to
> >>> both parquet-cpp and parquet-mr, I can take part in the V3 but I'm not
> >>> sure if there are enough active contributors. It would be good if some
> >>> companies could have dedicated people to work on this and move things
> >>> forward.
> >>> 2. Users may not be willing to adopt new formats if current businesses
> >>> do not have any issue. Especially for users from large enterprises.
> >>> Think about the current issues of V2 [2].
> >>>
> >>> All in all, I feel excited about V3.
> >>>
> >>> [1] https://lists.apache.org/thread/q20b8kjvs27ly0w2zzxld029nwkc5fhx
> >>> [2] https://lists.apache.org/thread/r8djjov7wyy8646qm2xzwn9p2olsk9wn
> >>>
> >>> Best,
> >>> Gang
> >>>
> >>> On Sun, May 12, 2024 at 6:59 AM Micah Kornfield <[email protected]>
> >>> wrote:
> >>>
> >>>> Hi Parquet Dev,
> >>>> I wanted to start a conversation within the community about working
> >>>> on a new revision of Parquet. For context there have been a bunch of
> >>>> new formats [1][2][3] that show there is decent room for improvement
> >>>> across data encodings and how metadata is organized.
> >>>>
> >>>> Specifically, in a new format revision I think we should be thinking
> >>>> about the following areas for improvements:
> >>>> 1. More efficient encodings that allow for data skipping and SIMD
> >>>> optimizations.
> >>>> 2. More efficient metadata handling for deserialization and
> >>>> projection to address areas when metadata deserialization time is not
> >>>> trivial [4].
> >>>> 3. Possibly thinking about different encodings instead of
> >>>> repetition/definition for repeated and nested field
> >>>> 4. Support for optimizing semi-structured data (e.g. JSON or Variant
> >>>> type) that can shred elements into individual columns (a recent
> >>>> thread in Iceberg mentions doing this at the metadata level [5])
> >>>>
> >>>> I think the goals of V3 would be to provide existing API
> >>>> compatibility as broadly as possible (possibly with some performance
> >>>> loss) and expose new API surface areas where appropriate to make use
> >>>> of new elements. New encodings could be backported so they can be
> >>>> made use of without metadata changes. I think unfortunately that for
> >>>> points 2 and 3 we would want to break file level compatibility. More
> >>>> thought would be needed to consider whether 4 could be backported
> >>>> effectively.
> >>>>
> >>>> This is a non-trivial amount of work to get good coverage across
> >>>> implementations, so before putting together more formal proposal it
> >>>> would be nice to know if:
> >>>>
> >>>> 1. If there is an appetite in the general community to consider these
> >>>> changes
> >>>> 2. If anybody from the community is interested in collaborating on
> >>>> proposals/implementation in this area.
> >>>>
> >>>> Thanks,
> >>>> Micah
> >>>>
> >>>> [1] https://github.com/maxi-k/btrblocks
> >>>> [2] https://github.com/facebookincubator/nimble
> >>>> [3] https://blog.lancedb.com/lance-v2/
> >>>> [4] https://github.com/apache/arrow/issues/39676
> >>>> [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
