I am also in. I would focus on making Parquet implementations more compatible with one another – we have had this issue from the beginning. There shouldn't be a reason for different tools to generate different flavors of the format.
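On the metadata points Parth and Rok raise below, here is a rough sketch of the sidecar-index idea, just to make the discussion concrete. This is not PalletJack's actual API; the file and column names are hypothetical, and note that it only skips row groups after a normal footer parse, which is exactly the cost a format-level change would remove:

import json
import pyarrow.parquet as pq

def build_sidecar_index(parquet_path, index_path, column):
    # Record (row_group, min, max) for one column in a small JSON sidecar file.
    md = pq.read_metadata(parquet_path)
    entries = []
    for rg in range(md.num_row_groups):
        row_group = md.row_group(rg)
        for ci in range(row_group.num_columns):
            col = row_group.column(ci)
            if (col.path_in_schema == column and col.statistics is not None
                    and col.statistics.has_min_max):
                entries.append({"row_group": rg,
                                "min": col.statistics.min,
                                "max": col.statistics.max})
    with open(index_path, "w") as f:
        json.dump(entries, f)  # assumes min/max are JSON-serializable (numbers, strings)

def read_overlapping_row_groups(parquet_path, index_path, lo, hi, columns=None):
    # Consult the sidecar and read only row groups whose [min, max] overlaps [lo, hi].
    with open(index_path) as f:
        entries = json.load(f)
    wanted = [e["row_group"] for e in entries
              if not (e["max"] < lo or e["min"] > hi)]
    return pq.ParquetFile(parquet_path).read_row_groups(wanted, columns=columns)

# Hypothetical usage:
# build_sidecar_index("events.parquet", "events.index.json", "event_time")
# table = read_overlapping_row_groups("events.parquet", "events.index.json",
#                                     lo=1715000000, hi=1715086400,
#                                     columns=["event_time", "user_id"])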
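And on Micah's point 4 below, a similarly rough sketch of what shredding a JSON column into typed columns could look like when done by the writer today (the table, column, and key names are made up for illustration). The promoted columns get ordinary typed encodings, statistics, and column pruning, while the raw JSON stays available for everything else; doing this in the format itself would mean every reader benefits without application-level conventions:

import json
import pyarrow as pa
import pyarrow.parquet as pq

def shred_json_column(table, json_column, fields):
    # fields maps JSON key -> Arrow type for the keys we want to promote.
    records = [json.loads(v) if v is not None else {}
               for v in table.column(json_column).to_pylist()]
    for name, typ in fields.items():
        table = table.append_column(
            name, pa.array([r.get(name) for r in records], type=typ))
    return table

events = pa.table({"payload": [
    '{"user_id": 1, "event": "click", "extra": {"x": 1}}',
    '{"user_id": 2, "event": "view"}',
]})
shredded = shred_json_column(events, "payload",
                             {"user_id": pa.int64(), "event": pa.string()})
pq.write_table(shredded, "events_shredded.parquet")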
Lukas

On Mon, 20 May 2024 at 20:06, Parth Chandra <[email protected]> wrote:

> Hi Parquet team,
>
> It is very exciting to see this effort. Thanks, Micah, for starting this.
>
> For most use cases that our team sees, the broad areas for improvement
> appear to be:
> 1) Optimizing for cloud storage (latency is high, seeks are expensive)
> 2) Optimized metadata reading - we've seen 30% (sometimes more) of
> Spark's scan operator time spent in reading footers.
> 3) Anything that improves support for data lakes.
>
> Also, I'll be happy to help wherever I can.
>
> Parth
>
> On Sun, May 19, 2024 at 10:59 AM Xinli shang <[email protected]> wrote:
>
> > Sorry I am late to the party! It's great to see this discussion! Thank you
> > everyone for the many good points, and thank you, Micah, for starting the
> > discussion and putting it together into a document, which is very helpful!
> > I agree with most of the points we discussed above, and we need to improve
> > Parquet and sometimes even speed up to catch up with industry changes.
> >
> > With that said, we need people to work on it, as Julien mentioned. The
> > document
> > <https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit>
> > that Micah created covers pretty much everything we discussed here. I
> > encourage all of us to contribute by raising questions, providing
> > suggestions, adding missing functionality, etc. Once we reach a consensus
> > on each topic, we can create different tracks and work streams to kick
> > off the implementations.
> >
> > I believe continuously improving Parquet would benefit the industry more
> > than creating a new format, which could add friction. These improvement
> > ideas are exciting opportunities. If you, your team members, or friends
> > have time and interest, please encourage them to contribute.
> >
> > Our Parquet community meeting is next week, on May 28, 2024. We can have
> > discussions there if you can join. Currently, it is scheduled for 7:00 am
> > PDT, but I can change it according to the majority's availability.
> >
> > On Fri, May 17, 2024 at 3:58 PM Rok Mihevc <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I've discussed with my colleagues and we would dedicate two engineers
> > > for 4-6 months to tasks related to implementing the format changes. We're
> > > already active in design discussions and can help with the C++, Rust,
> > > and C# implementations. I thought it'd be good to state this explicitly FWIW.
> > >
> > > Our main areas of interest are efficient reads for tables with wide
> > > schemas and faster random row group access [1].
> > >
> > > To work around the wide-schema issue, we implemented an internal
> > > tool [2] for storing index information in a separate file, which allows
> > > for reading only the necessary subset of metadata. We would offer this
> > > approach for consideration as a possible way to solve the wide-schema
> > > problem.
> > >
> > > [1] https://github.com/apache/arrow/issues/39676
> > > [2] https://github.com/G-Research/PalletJack
> > >
> > > Rok
> > >
> > > On Sun, May 12, 2024 at 12:59 AM Micah Kornfield <[email protected]> wrote:
> > >
> > > > Hi Parquet Dev,
> > > > I wanted to start a conversation within the community about working on a
> > > > new revision of Parquet. For context, there have been a bunch of new
> > > > formats [1][2][3] that show there is decent room for improvement across
> > > > data encodings and how metadata is organized.
> > > >
> > > > Specifically, in a new format revision I think we should be thinking about
> > > > the following areas for improvement:
> > > > 1. More efficient encodings that allow for data skipping and SIMD
> > > > optimizations.
> > > > 2. More efficient metadata handling for deserialization and projection, to
> > > > address cases where metadata deserialization time is not trivial [4].
> > > > 3. Possibly considering different encodings in place of
> > > > repetition/definition levels for repeated and nested fields.
> > > > 4. Support for optimizing semi-structured data (e.g. JSON or a Variant type)
> > > > that can shred elements into individual columns (a recent thread in Iceberg
> > > > mentions doing this at the metadata level [5]).
> > > >
> > > > I think the goals of V3 would be to provide existing API compatibility as
> > > > broadly as possible (possibly with some performance loss) and to expose new
> > > > API surface areas where appropriate to make use of the new elements. New
> > > > encodings could be backported so they can be used without metadata
> > > > changes. Unfortunately, I think that for points 2 and 3 we would need to
> > > > break file-level compatibility. More thought would be needed to consider
> > > > whether 4 could be backported effectively.
> > > >
> > > > This is a non-trivial amount of work to get good coverage across
> > > > implementations, so before putting together a more formal proposal it would
> > > > be nice to know:
> > > >
> > > > 1. If there is an appetite in the general community to consider these
> > > > changes
> > > > 2. If anybody from the community is interested in collaborating on
> > > > proposals/implementation in this area.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1] https://github.com/maxi-k/btrblocks
> > > > [2] https://github.com/facebookincubator/nimble
> > > > [3] https://blog.lancedb.com/lance-v2/
> > > > [4] https://github.com/apache/arrow/issues/39676
> > > > [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> >
> > --
> > Xinli Shang
