There must be something in the water: Nimble and Lance: The Parquet Killers - by Chris Riccomini (materializedview.io) <https://materializedview.io/p/nimble-and-lance-parquet-killers?utm_campaign=email-half-post&r=3pf5se&utm_source=substack&utm_medium=email>
On Mon, May 13, 2024 at 10:01 AM Rok Mihevc <rok.mih...@gmail.com> wrote:
> I would be quite interested in working on data skipping and metadata
> bottlenecks (points 1. and 2.).
>
> On Mon, May 13, 2024 at 5:28 PM Curt Hagenlocher <c...@hagenlocher.org>
> wrote:
>
> > One of the things they've done in the Delta table format which I think is
> > smart is to stop using version numbers and instead start identifying
> > specific features used by the table in a generic fashion. So instead of
> > checking an opaque version number, a reader looks at the list of features
> > and can say "I don't recognize the feature identified as 'deletionVectors'
> > and therefore I can't read this table."
> >
> > On Mon, May 13, 2024 at 8:10 AM Raphael Taylor-Davies
> > <r.taylordav...@googlemail.com.invalid> wrote:
> >
> > > Further to what has already been said, I have likewise found the v2
> > > branding quite hard to follow, but more fundamentally I have struggled
> > > to understand its purpose. As far as I understand it, version 2 groups
> > > together a number of disjoint features from new data pages to different
> > > encodings, that practically speaking implementations can and do support
> > > independently. Adding further confusion to this situation is that there
> > > are also a number of features such as page indexes, bloom filters,
> > > statistics improvements, etc... that appear to sit outside of this
> > > versioning?
> > >
> > > I guess I wonder if rather than having a parquet format version 2, or
> > > even a parquet format version 3, we could just document what features a
> > > given parquet implementation actually supports. I believe Andrew intends
> > > to pick up on where previous efforts here left off.
> > > Not only would this allow for quicker ecosystem adoption of
> > > smaller / less controversial changes, for example version 2 data
> > > pages, but could also be used to highlight higher-level functionality
> > > such as late materialization that are more a function of the reader
> > > implementation than the format itself.
> > >
> > > I can't confess to having closely followed every proposed parquet
> > > replacement but I have not yet seen anything that couldn't be done in an
> > > additive fashion on top of parquet, by extending the format and/or the
> > > implementations. I personally would be very interested in delta
> > > encodings that are more amenable to record skipping and SIMD, as I have
> > > struggled to make the Rust version of the existing parquet DELTA
> > > encodings perform as well as the PLAIN encodings.
> > >
> > > Kind Regards,
> > >
> > > Raphael
> > >
> > > On 13/05/2024 13:55, Antoine Pitrou wrote:
> > > > Same as Andrew.
> > > >
> > > > 1) the "v3" messaging is intuitively a turn-off as it's already not
> > > > obvious whether Parquet "v2" is usable with implementations currently
> > > > found in the wild. Concretely, the "v2" branding is commonly confused
> > > > with the Parquet format version, and it's almost impossible to explain
> > > > how they relate and differ without diving into implementation minutiae.
> > > >
> > > > 2) the "v3" messaging doesn't say anything about compatibility or
> > > > features: is "v3" a functional superset of "v2"? is it a clean slate
> > > > redesign of the Parquet format? does it use different technologies (for
> > > > example Flatbuffers instead of Thrift)?
> > > >
> > > > While I would be curious to see a list of proposed changes, I'm also not
> > > > very convinced that launching such an initiative is desirable nor
> > > > sustainable for the Parquet development community.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On Sun, 12 May 2024 05:30:57 -0400
> > > > Andrew Lamb <andrewlam...@gmail.com>
> > > > wrote:
> > > >> My opinion is that most (if not all) of the proposed benefits from these
> > > >> new formats can be achieved using the current parquet format and improved
> > > >> implementations (possibly with some minor extensions such as user defined
> > > >> encoding schemes)[1]
> > > >>
> > > >> Another reason people propose replacing parquet I think is the "what is V2
> > > >> and what supports it" confusion, along with a perception that the Apache
> > > >> Parquet community mostly focuses on parquet-mr and not the format or the
> > > >> myriad of other implementations. Thankfully this is starting to change[2]
> > > >>
> > > >> Thus, I think the best response for the Parquet community to these new
> > > >> format proposals is to clarify the current implementation situation (which
> > > >> will indirectly lead to more investment in current implementations)
> > > >>
> > > >> Note this doesn't preclude "v3" of parquet, but I think in order to
> > > >> drive V3 adoption we first need to get the existing communication in better
> > > >> working order
> > > >>
> > > >> Andrew
> > > >>
> > > >> [1] I realize I need some more data to back up that assertion, and I am
> > > >> working on it.
> > > >> [2] https://github.com/apache/parquet-site/pull/53
> > > >>
> > > >>
> > > >>
> > > >> On Sun, May 12, 2024 at 4:48 AM Gang Wu <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > > >>
> > > >>> Hi Micah,
> > > >>>
> > > >>> I have also noticed the emergence of these new file formats which are
> > > >>> challenging the popularity of Apache Parquet. It would always be good
> > > >>> to evolve Parquet to be competitive. Personally I'm +1 on this. I'm also
> > > >>> proposing adding a new geometry type to the specs: [1].
> > > >>> This seems to align with the goal of V3 to some extent.
> > > >>>
> > > >>> On the other hand, I'm also concerned with some aspects:
> > > >>> 1. Are there sufficient developers to work on this? As a committer to both
> > > >>> parquet-cpp and parquet-mr, I can take part in the V3 but I'm not sure if
> > > >>> there are enough active contributors. It would be good if some companies
> > > >>> could have dedicated people to work on this and move things forward.
> > > >>> 2. Users may not be willing to adopt new formats if current businesses
> > > >>> do not have any issue. Especially for users from large enterprises. Think
> > > >>> about the current issues of V2 [2].
> > > >>>
> > > >>> All in all, I feel excited about V3.
> > > >>>
> > > >>> [1] https://lists.apache.org/thread/q20b8kjvs27ly0w2zzxld029nwkc5fhx
> > > >>> [2] https://lists.apache.org/thread/r8djjov7wyy8646qm2xzwn9p2olsk9wn
> > > >>>
> > > >>> Best,
> > > >>> Gang
> > > >>>
> > > >>> On Sun, May 12, 2024 at 6:59 AM Micah Kornfield <emkornfi...@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>>> Hi Parquet Dev,
> > > >>>> I wanted to start a conversation within the community about working on a
> > > >>>> new revision of Parquet. For context there have been a bunch of new
> > > >>>> formats [1][2][3] that show there is decent room for improvement across
> > > >>>> data encodings and how metadata is organized.
> > > >>>>
> > > >>>> Specifically, in a new format revision I think we should be thinking about
> > > >>>> the following areas for improvements:
> > > >>>> 1. More efficient encodings that allow for data skipping and SIMD
> > > >>>> optimizations.
> > > >>>> 2. More efficient metadata handling for deserialization and projection to
> > > >>>> address areas when metadata deserialization time is not trivial [4].
> > > >>>> 3. Possibly thinking about different encodings instead of
> > > >>>> repetition/definition for repeated and nested fields
> > > >>>> 4. Support for optimizing semi-structured data (e.g. JSON or Variant type)
> > > >>>> that can shred elements into individual columns (a recent thread in Iceberg
> > > >>>> mentions doing this at the metadata level [5])
> > > >>>>
> > > >>>> I think the goals of V3 would be to provide existing API compatibility as
> > > >>>> broadly as possible (possibly with some performance loss) and expose new
> > > >>>> API surface areas where appropriate to make use of new elements. New
> > > >>>> encodings could be backported so they can be made use of without metadata
> > > >>>> changes. I think unfortunately that for points 2 and 3 we would want to
> > > >>>> break file level compatibility. More thought would be needed to consider
> > > >>>> whether 4 could be backported effectively.
> > > >>>>
> > > >>>> This is a non-trivial amount of work to get good coverage across
> > > >>>> implementations, so before putting together a more formal proposal it would
> > > >>>> be nice to know if:
> > > >>>>
> > > >>>> 1. If there is an appetite in the general community to consider these
> > > >>>> changes
> > > >>>> 2. If anybody from the community is interested in collaborating on
> > > >>>> proposals/implementation in this area.
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Micah
> > > >>>>
> > > >>>> [1] https://github.com/maxi-k/btrblocks
> > > >>>> [2] https://github.com/facebookincubator/nimble
> > > >>>> [3] https://blog.lancedb.com/lance-v2/
> > > >>>> [4] https://github.com/apache/arrow/issues/39676
> > > >>>> [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34