I don't have strong feelings about this one way or the other, but would gladly put my hand up to help collaborate on proposals/implementation as we figure this out.
<[email protected]>

On Sun, May 12, 2024 at 5:31 AM Andrew Lamb <[email protected]> wrote:

> My opinion is that most (if not all) of the proposed benefits from these
> new formats can be achieved using the current parquet format and improved
> implementations (possibly with some minor extensions such as user-defined
> encoding schemes)[1]
>
> Another reason people propose replacing parquet I think is the "what is V2
> and what supports it" confusion, along with a perception that the Apache
> Parquet community mostly focuses on parquet-mr and not the format or the
> myriad of other implementations. Thankfully this is starting to change[2]
>
> Thus, I think the best response for the Parquet community to these new
> format proposals is to clarify the current implementation situation (which
> will indirectly lead to more investment in current implementations)
>
> Note this doesn't preclude "v3" of parquet, but I think in order to
> drive V3 adoption we first need to get the existing communication in better
> working order
>
> Andrew
>
> [1] I realize I need some more data to back up that assertion, and I am
> working on it.
> [2] https://github.com/apache/parquet-site/pull/53
>
> On Sun, May 12, 2024 at 4:48 AM Gang Wu <[email protected]> wrote:
>
> > Hi Micah,
> >
> > I have also noticed the emergence of these new file formats which are
> > challenging the popularity of Apache Parquet. It would always be good
> > to evolve Parquet to be competitive. Personally I'm +1 on this. I'm also
> > proposing adding a new geometry type to the specs: [1]. This seems
> > to align with the goal of V3 to some extent.
> >
> > On the other hand, I'm also concerned with some aspects:
> > 1. Are there sufficient developers to work on this? As a committer to
> > both parquet-cpp and parquet-mr, I can take part in the V3 but I'm not
> > sure if there are enough active contributors. It would be good if some
> > companies could have dedicated people to work on this and move things
> > forward.
> > 2. Users may not be willing to adopt new formats if current businesses
> > do not have any issue, especially users from large enterprises. Think
> > about the current issues of V2 [2].
> >
> > All in all, I feel excited about V3.
> >
> > [1] https://lists.apache.org/thread/q20b8kjvs27ly0w2zzxld029nwkc5fhx
> > [2] https://lists.apache.org/thread/r8djjov7wyy8646qm2xzwn9p2olsk9wn
> >
> > Best,
> > Gang
> >
> > On Sun, May 12, 2024 at 6:59 AM Micah Kornfield <[email protected]>
> > wrote:
> >
> > > Hi Parquet Dev,
> > > I wanted to start a conversation within the community about working
> > > on a new revision of Parquet. For context there have been a bunch of
> > > new formats [1][2][3] that show there is decent room for improvement
> > > across data encodings and how metadata is organized.
> > >
> > > Specifically, in a new format revision I think we should be thinking
> > > about the following areas for improvements:
> > > 1. More efficient encodings that allow for data skipping and SIMD
> > > optimizations.
> > > 2. More efficient metadata handling for deserialization and
> > > projection to address areas when metadata deserialization time is not
> > > trivial [4].
> > > 3. Possibly thinking about different encodings instead of
> > > repetition/definition for repeated and nested fields
> > > 4. Support for optimizing semi-structured data (e.g. JSON or Variant
> > > type) that can shred elements into individual columns (a recent
> > > thread in Iceberg mentions doing this at the metadata level [5])
> > >
> > > I think the goals of V3 would be to provide existing API
> > > compatibility as broadly as possible (possibly with some performance
> > > loss) and expose new API surface areas where appropriate to make use
> > > of new elements. New encodings could be backported so they can be
> > > made use of without metadata changes. I think unfortunately that for
> > > points 2 and 3 we would want to break file level compatibility. More
> > > thought would be needed to consider whether 4 could be backported
> > > effectively.
> > >
> > > This is a non-trivial amount of work to get good coverage across
> > > implementations, so before putting together a more formal proposal it
> > > would be nice to know:
> > >
> > > 1. If there is an appetite in the general community to consider these
> > > changes
> > > 2. If anybody from the community is interested in collaborating on
> > > proposals/implementation in this area.
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1] https://github.com/maxi-k/btrblocks
> > > [2] https://github.com/facebookincubator/nimble
> > > [3] https://blog.lancedb.com/lance-v2/
> > > [4] https://github.com/apache/arrow/issues/39676
> > > [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
