I am also in.

I would focus on making Parquet implementations more compatible with one
another – we have had this issue from the beginning. There shouldn't be a
reason for different tools to generate different flavors of the format.

Lukas


On Mon, May 20, 2024 at 8:06 PM Parth Chandra <par...@apache.org> wrote:

> Hi Parquet team,
>
>  It is very exciting to see this effort. Thanks, Micah, for starting this.
>
>  For most use cases that our team sees, the broad areas for improvement
> appear to be:
>    1) Optimizing for cloud storage (latency is high, seeks are expensive)
>    2) Optimized metadata reading - we've seen 30% (sometimes more) of
> Spark's scan operator time spent reading footers (a rough way to observe
> this is sketched after this list)
>    3) Anything that improves support for data lakes.
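>
>  Regarding (2), here is a rough way to observe the footer cost on a
> single file (a minimal sketch assuming pyarrow and a hypothetical local
> test file "wide.parquet"):
>
>    import time
>    import pyarrow.parquet as pq
>
>    t0 = time.perf_counter()
>    md = pq.ParquetFile("wide.parquet").metadata   # Thrift footer parse
>    t1 = time.perf_counter()
>    pq.read_table("wide.parquet")                  # full scan, for scale
>    t2 = time.perf_counter()
>    print(f"footer: {t1 - t0:.4f}s  scan: {t2 - t1:.4f}s  "
>          f"({md.num_columns} columns, {md.num_row_groups} row groups)")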
>
>   Also I'll be happy to help wherever I can.
>
> Parth
>
> On Sun, May 19, 2024 at 10:59 AM Xinli shang <sha...@uber.com.invalid>
> wrote:
>
> > Sorry I am late to the party! It's great to see this discussion! Thank
> > you, everyone, for the many good points, and thank you, Micah, for
> > starting the discussion and putting it together into a document, which
> > is very helpful! I agree with most of the points discussed above; we
> > need to keep improving Parquet, and sometimes even speed up to keep
> > pace with industry changes.
> >
> > With that said, we need people to work on it, as Julien mentioned. The
> > document that Micah created
> > <https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit>
> > covers pretty much everything we discussed here. I encourage all of us
> > to contribute by raising questions, providing suggestions, adding
> > missing functionality, etc. Once we reach a consensus on each topic,
> > we can create different tracks and work streams to kick off the
> > implementations.
> >
> > I believe continuously improving Parquet would benefit the industry
> > more than creating a new format, which could add friction. These
> > improvement ideas are exciting opportunities. If you, your team
> > members, or your friends have time and interest, please contribute or
> > encourage them to.
> >
> > Our Parquet community meeting is next week, on May 28, 2024. We can have
> > discussions there if you can join. Currently, it is scheduled for 7:00 am
> > PDT, but I can change it according to the majority's availability.
> >
> > On Fri, May 17, 2024 at 3:58 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I've discussed this with my colleagues, and we can dedicate two
> > > engineers for 4-6 months to tasks related to implementing the format
> > > changes. We're already active in the design discussions and can help
> > > with the C++, Rust, and C# implementations. I thought it'd be good
> > > to state this explicitly, FWIW.
> > >
> > > Our main areas of interest are efficient reads for tables with wide
> > > schemas and faster random row group access [1].
> > >
> > > To work around the wide-schema issue, we implemented an internal
> > > tool, PalletJack [2], which stores index information in a separate
> > > file and allows reading only the necessary subset of the metadata.
> > > We offer it for consideration as a possible approach to the
> > > wide-schema problem; a sketch of the idea follows.
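> > >
> > > To illustrate (a minimal sketch of a sidecar index, assuming
> > > pyarrow; this is not PalletJack's actual API):
> > >
> > >    import json
> > >    import pyarrow.parquet as pq
> > >
> > >    def build_sidecar(parquet_path, sidecar_path):
> > >        # Parse the full Thrift footer once, then cache a tiny
> > >        # per-row-group summary for later readers.
> > >        md = pq.ParquetFile(parquet_path).metadata
> > >        index = [{"row_group": i,
> > >                  "num_rows": md.row_group(i).num_rows,
> > >                  "byte_size": md.row_group(i).total_byte_size}
> > >                 for i in range(md.num_row_groups)]
> > >        with open(sidecar_path, "w") as f:
> > >            json.dump(index, f)
> > >
> > > A reader can then consult the small sidecar to pick row groups up
> > > front; PalletJack itself goes further and avoids re-parsing the full
> > > footer, which plain pyarrow cannot express from Python.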
> > >
> > > [1] https://github.com/apache/arrow/issues/39676
> > > [2] https://github.com/G-Research/PalletJack
> > >
> > > Rok
> > >
> > > On Sun, May 12, 2024 at 12:59 AM Micah Kornfield
> > > <emkornfi...@gmail.com> wrote:
> > >
> > > > Hi Parquet Dev,
> > > > I wanted to start a conversation within the community about
> > > > working on a new revision of Parquet.  For context, there have
> > > > been a number of new formats [1][2][3] that show there is decent
> > > > room for improvement across data encodings and how metadata is
> > > > organized.
> > > >
> > > > Specifically, in a new format revision I think we should be
> > > > thinking about the following areas for improvement:
> > > > 1.  More efficient encodings that allow for data skipping and SIMD
> > > > optimizations.
> > > > 2.  More efficient metadata handling for deserialization and
> > > > projection, to address cases where metadata deserialization time
> > > > is not trivial [4].
> > > > 3.  Possibly using different encodings instead of
> > > > repetition/definition levels for repeated and nested fields.
> > > > 4.  Support for optimizing semi-structured data (e.g. JSON or the
> > > > Variant type) that can shred elements into individual columns (a
> > > > recent thread in Iceberg mentions doing this at the metadata level
> > > > [5]); a sketch of the idea follows this list.
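> > > >
> > > > To make (4) concrete, a minimal sketch of shredding (assuming
> > > > pyarrow; the column names are hypothetical, and this is not the
> > > > Variant/Iceberg proposal itself):
> > > >
> > > >    import json
> > > >    import pyarrow as pa
> > > >    import pyarrow.parquet as pq
> > > >
> > > >    rows = [{"id": 1, "payload": {"user": "a", "score": 9.5}},
> > > >            {"id": 2, "payload": {"user": "b", "extra": [1, 2]}}]
> > > >    table = pa.table({
> > > >        "id": [r["id"] for r in rows],
> > > >        # hot paths become real typed columns (stats, pruning)
> > > >        "payload.user": [r["payload"].get("user") for r in rows],
> > > >        "payload.score": [r["payload"].get("score") for r in rows],
> > > >        # everything else stays in a catch-all JSON string column
> > > >        "payload.rest": [json.dumps(
> > > >            {k: v for k, v in r["payload"].items()
> > > >             if k not in ("user", "score")}) for r in rows],
> > > >    })
> > > >    pq.write_table(table, "shredded.parquet")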
> > > >
> > > > I think the goals of V3 would be to preserve existing API
> > > > compatibility as broadly as possible (possibly with some
> > > > performance loss) and to expose new API surface area where
> > > > appropriate to make use of the new elements.  New encodings could
> > > > be backported so they can be used without metadata changes.
> > > > Unfortunately, I think points 2 and 3 would require breaking
> > > > file-level compatibility.  More thought would be needed to
> > > > determine whether 4 could be backported effectively.
> > > >
> > > > This is a non-trivial amount of work to get good coverage across
> > > > implementations, so before putting together a more formal proposal
> > > > it would be nice to know:
> > > >
> > > > 1.  Whether there is an appetite in the general community to
> > > > consider these changes.
> > > > 2.  Whether anybody from the community is interested in
> > > > collaborating on proposals/implementation in this area.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1] https://github.com/maxi-k/btrblocks
> > > > [2] https://github.com/facebookincubator/nimble
> > > > [3] https://blog.lancedb.com/lance-v2/
> > > > [4] https://github.com/apache/arrow/issues/39676
> > > > [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> > > >
> > >
> >
> >
> > --
> > Xinli Shang
> >
>
