BTW, has everyone read "An Empirical Evaluation of Columnar Storage
Formats"?

https://arxiv.org/abs/2304.05028

A good review of how things could be better, backed by real numbers. It also
highlights, based on the ORC experience, that encoding plugins can be
inefficient.

w.r.t. metadata:


   1. Could the old and the new footers coexist, so that new code reads the
   new footer while old code keeps reading the old one?
   2. I want Iceberg to store the file length and footer offset in its
   indices, so that a HEAD and a GET can be saved on every open. It all adds
   up (see the sketch below).
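
To make that saving concrete, here is a minimal sketch of opening a Parquet
file when the table format already knows the file length and footer offset.
It assumes Hadoop's openFile() builder API; the option names and class names
here are illustrative, not definitive.

import java.nio.ByteBuffer;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class KnownFooterReader {
  // fileLength and footerOffset come from the catalog/manifest, so the
  // client needs no HEAD to find the length and no probe GET for the tail.
  static ByteBuffer readFooter(FileSystem fs, Path file,
      long fileLength, long footerOffset) throws Exception {
    FSDataInputStream in = fs.openFile(file)
        .opt("fs.option.openfile.length", Long.toString(fileLength))
        .opt("fs.option.openfile.read.policy", "parquet")
        .build()
        .get();
    // the footer runs from footerOffset to end of file (length word + magic)
    int footerLen = (int) (fileLength - footerOffset);
    byte[] footer = new byte[footerLen];
    in.readFully(footerOffset, footer);   // a single ranged GET
    in.close();
    return ByteBuffer.wrap(footer);
  }
}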

I would love to get deeply involved in this as it is *interesting*, but I
cannot make any commitments, and if I just turn up and hand the other teams
(Hive, Spark, Impala) code, they'll either not pick it up or expect me to
maintain it.

What I can offer is support further down the stack, where we can improve the
storage APIs to expose more aspects of cloud storage so that Parquet can
optimise its integration with post-POSIX storage:

   1. Optimise for cloud storage reads, where seek() is very expensive and
   every filesystem call has a literal cost, but simultaneous parallel HTTP
   range requests work well (see the vectored-read sketch below).
   2. SSD support, again where parallel DMA fetches are fast.
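
As an illustration of the read side, here is a minimal sketch of fetching
several column chunks as one vectored read, assuming Hadoop's vectored IO API
(PositionedReadable.readVectored); the offsets and lengths are placeholders.

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;

final class VectoredColumnRead {
  static void readChunks(FSDataInputStream in) throws Exception {
    // only the column chunks the scan needs; the client/store may coalesce
    // adjacent ranges and fetch them as parallel HTTP range requests
    List<FileRange> ranges = Arrays.asList(
        FileRange.createFileRange(4L, 2_000_000),            // column chunk A
        FileRange.createFileRange(10_000_000L, 3_000_000));   // column chunk B
    in.readVectored(ranges, ByteBuffer::allocate);
    for (FileRange r : ranges) {
      ByteBuffer data = r.getData().get();  // completes when that range lands
      // ...hand the buffer off to the page decoder
    }
  }
}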


The vector IO API is a key part of this on the read side, but we could do
something similar for writes: cloud stores with support for partial/queued
writes would allow parallel block uploads, while native IO would use
java.nio's WritableByteChannel.write(ByteBuffer src). Something like:

import java.nio.ByteBuffer;
import java.util.concurrent.Future;

interface BlockWrite {
  // store constraints, e.g. S3 multipart: 5 MiB minimum part size, 10,000 parts
  int minimumBlockSizeForAllButLastBlock();
  int maximumBlockCount();
  int maximumBlockSize();
  // queue a block for upload; blocks may be uploaded in parallel
  Future<Integer> writeBlock(ByteBuffer buffer, boolean isLastBlock);
  // await outstanding uploads and finalize the object, returning some Result
  Future<Result> close();
}
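
And a hypothetical writer loop over that interface, just to show the intent:
blocks are queued for upload as they are sealed, the store uploads them in
parallel, and close() commits the object. Everything here other than the
BlockWrite sketch above is illustrative.

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;

final class ParallelBlockUploader {
  static void upload(BlockWrite out, List<ByteBuffer> blocks) throws Exception {
    List<Future<Integer>> pending = new ArrayList<>();
    for (int i = 0; i < blocks.size(); i++) {
      boolean last = (i == blocks.size() - 1);
      // each call may start an upload immediately; the results are only
      // needed before the file is finalized
      pending.add(out.writeBlock(blocks.get(i), last));
    }
    for (Future<Integer> f : pending) {
      f.get();            // surface any per-block upload failure
    }
    out.close().get();    // e.g. complete the multipart upload
  }
}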

There's also ongoing work on prefetching/footer caching, which really the
application/library should be controlling, rather than having the filesystem
clients guess what a good footer length is.
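
For context on why the guess matters: the true footer length only lives in
the last 8 bytes of the file (a 4-byte little-endian length followed by the
"PAR1" magic), so a client that does not know it must either probe the tail
first or speculatively over-read. A rough sketch, with an arbitrary 1 MiB
guess:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import org.apache.hadoop.fs.FSDataInputStream;

final class FooterProbe {
  static final int TAIL_GUESS = 1 << 20;   // speculative tail read size

  static ByteBuffer readFooter(FSDataInputStream in, long fileLength)
      throws Exception {
    int tailLen = (int) Math.min(TAIL_GUESS, fileLength);
    byte[] tail = new byte[tailLen];
    in.readFully(fileLength - tailLen, tail);               // GET #1
    // last 8 bytes: 4-byte little-endian footer length, then "PAR1"
    int footerLen = ByteBuffer.wrap(tail, tailLen - 8, 4)
        .order(ByteOrder.LITTLE_ENDIAN).getInt();
    if (footerLen + 8 <= tailLen) {
      // the guess was big enough: footer already in memory, no second GET
      return ByteBuffer.wrap(tail, tailLen - 8 - footerLen, footerLen);
    }
    byte[] footer = new byte[footerLen];
    in.readFully(fileLength - 8 - footerLen, footer);       // GET #2
    return ByteBuffer.wrap(footer);
  }
}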

On Tue, 14 May 2024 at 05:25, Julien Le Dem <jul...@apache.org> wrote:

> It's great to see this thread. Thank you Micah for facilitating
> the discussion.
>
> my 2cts:
> 1. I like the idea of having feature checks rather than an absolute version
> number. I am sorry for the confusion created by the V2 moniker. Those were
> indeed incremental and backwards compatible additions to the v1 spec and
> not a rewrite of the format.
>
> a. It would be great to have a formal release cadence but someone needs to
> dedicate time to drive the process.
> b. IMO we need an implementer of a query engine to "sponsor" adding a new
> feature to the format. They would implement usage at the same time so it
> can be validated that additions to the spec achieve the expected perf
> improvement in the context of a query engine. For example, some years ago,
> Impala was implementing usage of new indexes at the same time they were
> specified.
> Tracking what engines and versions support the new feature would be useful.
> Enough adoption would make it default. This requirement is very different
> for a new encoding vs a new additional index or stat.
>
> 2. I also think "encoding plugins" are not aligned with the philosophy of
> Parquet, as the strength of the format is that it is fully specified across
> languages and not just the output of a library.
> I do think new encodings and a new metadata representation would be
> welcome. FlatBuffers did not exist when I picked Thrift for the footer. The
> current metadata representation is a pain to read partially or efficiently.
> That said, big changes like this need a clear path for adoption and a plan
> for the transition period. The file does have a magic number "PAR1" at the
> beginning and the end that might be used for such incompatible changes at
> the metadata layer.
>
> I do think it is easier to integrate more encodings in the ecosystem (say
> btrblocks) by adding them to Parquet than by creating a new file format
> that would need to build adoption from scratch.
>
> 3. Agreed, it is an effort and requires collaboration from key open-source
> and proprietary engines implementing Parquet readers/writers. One way to
> facilitate the transition IMO would be to make sure there are native
> parquet-arrow implementations included, which is a bit lacking in the Java
> implementation.
>
> Best
> Julien
>
> On Mon, May 13, 2024 at 10:45 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Thanks everybody for the input.  I'll try to summarize some main points
> > and my thoughts below.
> >
> > 1.  "V3" branding is problematic and getting adoption is difficult
> > with V2.  I agree, we should not lump all potential improvements into a
> > single V3 milestone (I used V3 to indicate that at least some changes
> > might be backward incompatible with existing format revisions).  In my
> > mind, I think the way to make it more likely that new features are used
> > would be starting to think about a more formal release process for them.
> > For example:
> >     a.  A clear cadence of major version library releases (e.g. maybe
> > once per year).
> >     b.  A clear policy for when a new feature becomes the default in a
> > library release (e.g. as a strawman, once the feature lands in the
> > reference implementation, it is eligible to become the default in the
> > next major release that occurs >1 year later).
> >     c.  For reference implementations that are effectively doing major
> > version releases on each release, I think following parquet-mr for
> > flipping defaults would make sense.
> >
> > 2.  How much of the improvements can be a clean slate vs
> > evolutionary/implementation optimizations?  I really think this depends
> > on which aspects we are tackling. For metadata issues, I think it might
> > pay to rethink things from the ground up, but any proposals along these
> > lines should obviously have clear rationales and benchmarks to clarify
> > how the decisions are made.  For better encodings, most likely work can
> > be added to the existing format.  I don't think allowing for arbitrary
> > plugin encodings would be a good thing.  I believe one of the reasons
> > that Parquet has been successful has been its specification, which
> > allows for guaranteed compatibility.
> >
> > 3.  Amount of effort required/sustainability of effort.  I agree this is
> > a big risk. It will take a lot of work to cover the major parquet
> > bindings, which is why I started the thread. Personally, I am fairly
> > time constrained, and unless my employer is willing to approve devoting
> > work hours to the project I likely won't be able to contribute much.
> > However, it seems like there might be enough interest from the community
> > that I can potentially make the case for doing so.
> >
> > Thanks,
> > Micah
> >
> > On Mon, May 13, 2024 at 10:41 AM Ed Seidl <etse...@live.com> wrote:
> >
> > > I think the whole "V1" vs "V2" mess is unfortunate. IMO there is only
> > > one version of the Parquet file format. At its core, the data layout
> > > (row groups composed of column chunks composed of Dremel-encoded pages)
> > > has never changed. Encodings/codecs/structures have been added to that
> > > core, but always in a backwards compatible way.
> > >
> > > I agree that many of the perceived shortcomings might be addressed
> > > without breaking changes to the file format. I myself would be
> > > interested in exploring ways to address the point lookup and wide
> > > tables issues while maintaining backwards compatibility. But that said,
> > > if there are ways to achieve large performance gains that would
> > > necessitate an actual new file format version (such as replacing
> > > Thrift, new metadata organization, some alternative to Dremel), I'd be
> > > open to exploring those options as well.
> > >
> > > Thanks,
> > > Ed
> > >
> > > On 5/11/24 3:58 PM, Micah Kornfield wrote:
> > > > Hi Parquet Dev,
> > > > I wanted to start a conversation within the community about working
> > > > on a new revision of Parquet.  For context, there have been a bunch
> > > > of new formats [1][2][3] that show there is decent room for
> > > > improvement across data encodings and how metadata is organized.
> > > >
> > > > Specifically, in a new format revision I think we should be thinking
> > > > about the following areas for improvements:
> > > > 1.  More efficient encodings that allow for data skipping and SIMD
> > > > optimizations.
> > > > 2.  More efficient metadata handling for deserialization and
> > > > projection, to address areas where metadata deserialization time is
> > > > not trivial [4].
> > > > 3.  Possibly thinking about different encodings instead of
> > > > repetition/definition levels for repeated and nested fields.
> > > > 4.  Support for optimizing semi-structured data (e.g. JSON or Variant
> > > > type) that can shred elements into individual columns (a recent
> > > > thread in Iceberg mentions doing this at the metadata level [5]).
> > > >
> > > > I think the goals of V3 would be to provide existing API
> > > > compatibility as broadly as possible (possibly with some performance
> > > > loss) and expose new API surface areas where appropriate to make use
> > > > of new elements.  New encodings could be backported so they can be
> > > > made use of without metadata changes.  I think unfortunately that for
> > > > points 2 and 3 we would want to break file-level compatibility.  More
> > > > thought would be needed to consider whether 4 could be backported
> > > > effectively.
> > > >
> > > > This is a non-trivial amount of work to get good coverage across
> > > > implementations, so before putting together a more formal proposal it
> > > > would be nice to know:
> > > >
> > > > 1.  If there is an appetite in the general community to consider
> > > > these changes
> > > > 2.  If anybody from the community is interested in collaborating on
> > > > proposals/implementation in this area.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1] https://github.com/maxi-k/btrblocks
> > > > [2] https://github.com/facebookincubator/nimble
> > > > [3] https://blog.lancedb.com/lance-v2/
> > > > [4] https://github.com/apache/arrow/issues/39676
> > > > [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> > > >
> > >
> > >
> >
>
