Call it parquet.ml, then.

Which is what I've had in my head as I was thinking about this last week,
as the datatypes and the library's uses (GPUs, ...) would be targeted at
this. I'd also like a design optimised for high-latency cloud storage, where
seeks are expensive but parallel reads are easy, and we can look at what can
be done with reads too. Oh, and low-latency SSDs too. Consider this: it is
possible to make changes in the filesystem layer to suit, with the vector IO
API being a key example.
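To make the vector IO idea concrete, here is a minimal sketch of the access pattern that suits high-latency object stores: coalesce small byte ranges into fewer, larger requests, then issue them as parallel positional reads. This is an illustration only, not the actual Hadoop or Parquet API; `coalesce_ranges` and `read_vectored` are hypothetical helper names, and the gap threshold is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def coalesce_ranges(ranges, max_gap=1024):
    """Merge (offset, length) ranges whose gaps are at most max_gap bytes,
    so one larger request replaces several small seeks (the win on
    high-latency storage where each request round-trip is expensive)."""
    merged = []
    for off, length in sorted(ranges):
        if merged and off - (merged[-1][0] + merged[-1][1]) <= max_gap:
            last_off, last_len = merged[-1]
            merged[-1] = (last_off,
                          max(last_off + last_len, off + length) - last_off)
        else:
            merged.append((off, length))
    return merged

def read_vectored(path, ranges, max_workers=4):
    """Issue the coalesced ranges as parallel positional reads and return
    a dict mapping each merged offset to the bytes read there."""
    def read_one(rng):
        off, length = rng
        # A separate handle per worker avoids sharing a seek position.
        with open(path, "rb") as f:
            f.seek(off)
            return off, f.read(length)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(read_one, coalesce_ranges(ranges)))
```

On a local file the parallelism buys little, but against an object store each range would become an independent ranged GET, which is where "seek sucks but parallel reads are easy" pays off.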


I might sketch out some of my thoughts here, but first I'm curious about
what other people see as their needs, and what could be done as a form of
evolution.

On Mon, 13 May 2024 at 18:41, Ed Seidl <etse...@live.com> wrote:

> I think the whole "V1" vs "V2" mess is unfortunate. IMO there is only one
> version of the Parquet file format. At its core, the data layout (row
> groups
> composed of column chunks composed of Dremel encoded pages) has
> never changed. Encodings/codecs/structures have been added to that core,
> but always in a backwards compatible way.
>
> I agree that many of the perceived shortcomings might be addressed without
> breaking changes to the file format. I myself would be interested in
> exploring
> ways to address the point lookup and wide tables issues while maintaining
> backwards compatibility. But that said, if there are ways to achieve large
> performance gains that would necessitate an actual new file format version
> (such as replacing thrift, new metadata organization, some alternative to
> Dremel), I'd be open to exploring those options as well.
>
> Thanks,
> Ed
>
> On 5/11/24 3:58 PM, Micah Kornfield wrote:
> > Hi Parquet Dev,
> > I wanted to start a conversation within the community about working on a
> > new revision of Parquet.  For context there have been a bunch of new
> > formats [1][2][3] that show there is decent room for improvement across
> > data encodings and how metadata is organized.
> >
> > Specifically, in a new format revision I think we should be thinking
> about
> > the following areas for improvements:
> > 1.  More efficient encodings that allow for data skipping and SIMD
> > optimizations.
> > 2.  More efficient metadata handling for deserialization and projection
> to
> > address areas when metadata deserialization time is not trivial [4].
> > 3.  Possibly thinking about different encodings instead of
> > repetition/definition for repeated and nested fields
> > 4.  Support for optimizing semi-structured data (e.g. JSON or Variant
> type)
> > that can shred elements into individual columns (a recent thread in
> Iceberg
> > mentions doing this at the metadata level [5])
> >
> > I think the goals of V3 would be to provide existing API compatibility as
> > broadly as possible (possibly with some performance loss) and expose new
> > API surface areas where appropriate to make use of new elements.  New
> > encodings could be backported so they can be made use of without metadata
> > changes.  I think unfortunately that for points 2 and 3 we would want to
> > break file level compatibility.  More thought would be needed to consider
> > whether 4 could be backported effectively.
> >
> > This is a non-trivial amount of work to get good coverage across
> > implementations, so before putting together a more formal proposal it
> > would be nice to know:
> >
> > 1.  Whether there is an appetite in the general community to consider
> > these changes
> > 2.  Whether anybody from the community is interested in collaborating on
> > proposals/implementation in this area.
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/maxi-k/btrblocks
> > [2] https://github.com/facebookincubator/nimble
> > [3] https://blog.lancedb.com/lance-v2/
> > [4] https://github.com/apache/arrow/issues/39676
> > [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> >
>
>
