Just to double check we're all on the same page w.r.t. metadata, I presume we're referring to FileMetaData [1]? If so, this contains information on the schema and the locations of the column chunks. All statistics information, including that of the column chunks, can be referenced solely by offset rather than stored inline, although most implementations do inline this metadata by default. Whilst the thrift compact encoding may not be the most efficient thing in the world, it isn't majorly different from protobuf or avro, so I would hazard that simply storing statistics separately might be sufficient for the wide-column use cases, without requiring a switch to something like flatbuffers. Provided the column chunk metadata is still written near the footer, it would still be possible to fetch all the necessary metadata in a single IO request; the Rust implementation already does something similar for reading the page index.
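For illustration only, a minimal sketch of that single-request idea in Java (the RangedReader helper and the 1 MiB tail guess are assumptions on my part, not what parquet-rs actually does):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class TailFetch {
  // How much of the file tail to speculatively fetch so that a single ranged
  // GET covers the page index, the column chunk metadata and the footer.
  // 1 MiB is an arbitrary guess; if it turns out too small, only the missing
  // prefix needs a second read.
  static final int TAIL_GUESS = 1 << 20;

  interface RangedReader {
    ByteBuffer read(long offset, int length); // e.g. one HTTP range request
  }

  static ByteBuffer readFooterFromTail(RangedReader reader, long fileLength) {
    long start = Math.max(0, fileLength - TAIL_GUESS);
    ByteBuffer tail = reader.read(start, (int) (fileLength - start));
    // The file ends with a 4-byte little-endian footer length and the "PAR1" magic.
    int footerLen = tail.order(ByteOrder.LITTLE_ENDIAN).getInt(tail.limit() - 8);
    // Slice the footer out of the bytes already in memory; a page index written
    // just before the footer is sitting in this same buffer as well.
    ByteBuffer footer = tail.duplicate();
    footer.position(tail.limit() - 8 - footerLen);
    footer.limit(tail.limit() - 8);
    return footer.slice();
  }
}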

I feel I should also highlight that flatbuffers are not a silver bullet: they add an indirection to every field access (unless using structs, which can't be evolved), and performing offset validation can, depending on the payload, take longer than decoding an equivalent protobuf / avro / thrift message. The major thing that tends to give flatbuffers an edge is that parsing doesn't allocate; however, there is no reason a thrift, protobuf, or avro decoder needs to allocate strings either when decoding from a fixed buffer, and the parquet schema does not contain many nested lists, which are the other major source of allocations.
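To make the allocation point concrete, here is a hedged sketch (the Utf8View type is something I am making up for illustration, not an existing thrift or parquet-java API): once the whole footer sits in a single buffer, a decoder can hand back views into that buffer and only materialise java.lang.String when a caller actually asks for one.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// A zero-copy "string" field: an offset and length into the footer buffer.
final class Utf8View {
  private final ByteBuffer footer; // the fixed buffer holding the decoded footer
  private final int offset;
  private final int length;

  Utf8View(ByteBuffer footer, int offset, int length) {
    this.footer = footer;
    this.offset = offset;
    this.length = length;
  }

  // Compare against a column name without allocating anything.
  boolean equalsAscii(String name) {
    if (name.length() != length) return false;
    for (int i = 0; i < length; i++) {
      if (footer.get(offset + i) != (byte) name.charAt(i)) return false;
    }
    return true;
  }

  @Override
  public String toString() { // allocate only when a real String is wanted
    byte[] bytes = new byte[length];
    ByteBuffer dup = footer.duplicate();
    dup.position(offset);
    dup.get(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }
}

Projection pushdown then becomes a matter of comparing such views against the requested column names, skipping fields whose bytes are never materialised.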

I therefore can't help thinking there is a lot that could be done to improve the wide-column use case within parquet implementations before necessarily needing to reach for format changes.

[1]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1109

On 14/05/2024 14:31, Steve Loughran wrote:
BTW, has everyone read "An Empirical Evaluation of Columnar Storage
Formats"?

https://arxiv.org/abs/2304.05028

A good review of how things could be better, with real numbers. It highlights that encoding plugins may be inefficient, based on the ORC experience.

w.r.t metadata


    1. could the old and the new footers coexist, so new code would read the
    new footer?
    2. I want iceberg to store the file length and footer offset in its indices,
    so that a HEAD and a GET can be saved (see the sketch below). It all adds up.
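To sketch what that could look like (the DataFileEntry fields and the RangedReader interface below are my own assumptions, not Iceberg's actual metadata layout): once the catalog records the file length and footer offset, the reader can skip the HEAD entirely and pull the footer with one ranged GET.

import java.nio.ByteBuffer;

class KnownLengthFooterRead {
  // What table metadata (e.g. an Iceberg manifest) could carry per data file.
  static final class DataFileEntry {
    final String path;
    final long fileLength;   // saves the HEAD request
    final long footerOffset; // saves guessing how much of the tail to fetch
    DataFileEntry(String path, long fileLength, long footerOffset) {
      this.path = path;
      this.fileLength = fileLength;
      this.footerOffset = footerOffset;
    }
  }

  interface RangedReader {
    ByteBuffer read(String path, long offset, int length); // one HTTP range request
  }

  static ByteBuffer readFooter(RangedReader reader, DataFileEntry entry) {
    // One GET covering [footerOffset, fileLength): the footer plus the
    // trailing 4-byte length and magic.
    int len = (int) (entry.fileLength - entry.footerOffset);
    return reader.read(entry.path, entry.footerOffset, len);
  }
}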

I would love to get deeply involved in this as it is *interesting*, but I cannot make any commitments, and if I just turn up and give the other teams (hive, spark, impala) code, they'll either not pick it up or expect me to maintain it.

What I can offer is support down the stack, where we can improve the storage APIs to expose more aspects of cloud storage so that parquet can optimise its integration with post-POSIX storage:

Optimise for cloud storage reads, where seek() is very expensive and every filesystem call has a literal cost, but simultaneous parallel HTTP requests work well.
SSD support, again where parallel DMA fetches are fast.


The vector IO API is a key part of this, but we could do something for writes too: cloud stores with support for partial/queued writes would allow for parallel block uploads, while native IO would use java.nio WritableByteChannel.write(ByteBuffer src). Something like:

interface BlockWrite {
  // smallest size permitted for every block except the final one
  int minimumBlockSizeForAllButLastBlock();
  // limits imposed by the store, e.g. multipart upload restrictions
  int maximumBlockCount();
  int maximumBlockSize();
  // queue a block for upload; blocks may be uploaded in parallel
  Future<Integer> writeBlock(ByteBuffer buffer, boolean isLastBlock);
  // complete the upload once all queued blocks have been written
  Future<Result> close();
}
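As a rough illustration of how a writer might drive such an interface for parallel uploads (BlockWrite and Result are from the sketch above; everything else here is assumed):

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;

class ParallelBlockUploader {
  // Queue each buffered block; the store can upload them concurrently.
  static void upload(BlockWrite out, List<ByteBuffer> blocks) throws Exception {
    List<Future<Integer>> pending = new ArrayList<>();
    for (int i = 0; i < blocks.size(); i++) {
      boolean last = (i == blocks.size() - 1);
      pending.add(out.writeBlock(blocks.get(i), last));
    }
    // Wait for every queued upload before completing the object.
    for (Future<Integer> f : pending) {
      f.get();
    }
    out.close().get();
  }
}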

There's also ongoing work on prefetching/footer caching, which really the app/library should be controlling, rather than having the filesystem clients guess what a good footer length is.
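For illustration, something along these lines, where the reader supplies the hints instead of the client guessing (this builder and its method names are entirely hypothetical, not an existing filesystem API):

import java.nio.ByteBuffer;

// Hypothetical open-with-hints interface: the application, which knows the
// file length and how much footer it wants, passes that knowledge down.
interface HintedFileOpener {
  HintedFileOpener withExpectedFileLength(long length);       // skip the HEAD
  HintedFileOpener withFooterPrefetchBytes(int trailingBytes); // prefetch the tail on open
  SeekableInput open(String path);

  interface SeekableInput {
    ByteBuffer readFully(long offset, int length);
  }
}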




On Tue, 14 May 2024 at 05:25, Julien Le Dem <jul...@apache.org> wrote:

It's great to see this thread. Thank you Micah for facilitating
the discussion.

my 2cts:
1. I like the idea of having feature checks rather than an absolute version
number. I am sorry for the confusion created by the V2 moniker. Those were
indeed incremental and backwards compatible additions to the v1 spec and
not a rewrite of the format.

a. It would be great to have a formal release cadence but someone needs to
dedicate time to drive the process.
b. IMO we need an implementer of a query engine to "sponsor" adding a new
feature to the format. They would implement usage at the same time so it
can be validated that additions to the spec achieve the expected perf
improvement in the context of a query engine. For example, some years ago,
Impala was implementing usage of new indexes at the same time they were
specified.
Tracking what engines and versions support the new feature would be useful.
Enough adoption would make it default. This requirement is very different
for a new encoding vs a new additional index or stat.

2. I also think "encoding plugins" are not aligned with the philosophy of Parquet, as the strength of the format is to be fully specified across languages and not just the output of a library.
I do think new encodings and a new metadata representation would be welcome. Flatbuffers did not exist when I picked Thrift for the footer, and the current metadata representation is a pain to read partially or efficiently. That said, big changes like this need a clear path for adoption and for handling the transition period. The file does have a magic number "PAR1" at the beginning and the end that might be used for such incompatible changes at the metadata layer.
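For what it's worth, that trailing magic is cheap to sniff before parsing anything: the last 8 bytes of the file are the 4-byte footer length followed by the 4-byte magic. A small sketch (the "PAR3" value below is only a placeholder for whatever a new incompatible metadata layout would use, not a real proposal):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class FooterMagic {
  static final byte[] PAR1 = "PAR1".getBytes(StandardCharsets.US_ASCII);
  // Placeholder magic for a hypothetical incompatible metadata layout.
  static final byte[] PAR3 = "PAR3".getBytes(StandardCharsets.US_ASCII);

  // `tail` must hold at least the last 8 bytes of the file:
  // the 4-byte footer length followed by the 4-byte magic.
  static boolean endsWith(ByteBuffer tail, byte[] magic) {
    int base = tail.limit() - 4;
    for (int i = 0; i < 4; i++) {
      if (tail.get(base + i) != magic[i]) return false;
    }
    return true;
  }
}

A reader could check the tail magic first and pick the right parser, while an old reader that only knows PAR1 would at least fail cleanly rather than misparse the new layout.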

I do think it is easier to integrate more encodings in the ecosystem (say
btrblocks) by adding them to Parquet than by creating a new file format
that would need to build adoption from scratch.

3. Agreed, it is an effort and requires collaboration from key open source and proprietary engines implementing parquet readers/writers. One way to facilitate the transition, IMO, would be to make sure native parquet-arrow implementations are included, which is a bit lacking in the java implementation.

Best
Julien

On Mon, May 13, 2024 at 10:45 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

Thanks everybody for the input.  I'll try to summarize some main points and my thoughts below.

1.  "V3" branding is problematic and getting adoption is difficult with V2.  I agree, we should not lump all potential improvements into a single V3 milestone (I used V3 to indicate that at least some changes might be backward incompatible with existing format revisions).  In my mind, the way to make it more likely that new features are used would be to start thinking about a more formal release process for them.  For example:
     a.  A clear cadence of major version library releases (e.g. maybe once per year).
     b.  A clear policy for when a new feature becomes the default in a library release (e.g. as a strawman, once the feature lands in the reference implementation, it is eligible to become default in the next major release that occurs >1 year later).
     c.  For reference implementations that are effectively doing major version releases on each release, I think following parquet-mr for flipping defaults would make sense.

2.  How much of the improvements can be a clean slate vs evolutionary/implementation optimizations?  I really think this depends on which aspects we are tackling. For metadata issues, I think it might pay to rethink things from the ground up, but any proposals along these lines should obviously have clear rationales and benchmarks to clarify how the decisions are made.  For better encodings, most likely work can be added to the existing format.  I don't think allowing for arbitrary plugin encodings would be a good thing.  I believe one of the reasons that Parquet has been successful has been its specification, which allows for guaranteed compatibility.

3.  Amount of effort required/Sustainability of effort.  I agree this is a big risk. It will take a lot of work to cover the major parquet bindings, which is why I started the thread. Personally, I am fairly time constrained and unless my employer is willing to approve devoting work hours to the project I likely won't be able to contribute much.  However, it seems like there might be enough interest from the community that I can potentially make the case for doing so.

Thanks,
Micah

On Mon, May 13, 2024 at 10:41 AM Ed Seidl <etse...@live.com> wrote:

I think the whole "V1" vs "V2" mess is unfortunate. IMO there is only one version of the Parquet file format. At its core, the data layout (row groups composed of column chunks composed of Dremel-encoded pages) has never changed. Encodings/codecs/structures have been added to that core, but always in a backwards compatible way.

I agree that many of the perceived shortcomings might be addressed without breaking changes to the file format. I myself would be interested in exploring ways to address the point lookup and wide tables issues while maintaining backwards compatibility. But that said, if there are ways to achieve large performance gains that would necessitate an actual new file format version (such as replacing thrift, a new metadata organization, or some alternative to Dremel), I'd be open to exploring those options as well.

Thanks,
Ed

On 5/11/24 3:58 PM, Micah Kornfield wrote:
Hi Parquet Dev,
I wanted to start a conversation within the community about working on a new revision of Parquet.  For context, there have been a bunch of new formats [1][2][3] that show there is decent room for improvement across data encodings and how metadata is organized.

Specifically, in a new format revision I think we should be thinking about the following areas for improvements:
1.  More efficient encodings that allow for data skipping and SIMD optimizations.
2.  More efficient metadata handling for deserialization and projection, to address areas where metadata deserialization time is not trivial [4].
3.  Possibly thinking about different encodings instead of repetition/definition for repeated and nested fields.
4.  Support for optimizing semi-structured data (e.g. JSON or Variant type) that can shred elements into individual columns (a recent thread in Iceberg mentions doing this at the metadata level [5]).

I think the goals of V3 would be to provide existing API compatibility as broadly as possible (possibly with some performance loss) and expose new API surface areas where appropriate to make use of new elements.  New encodings could be backported so they can be made use of without metadata changes.  I think unfortunately that for points 2 and 3 we would want to break file level compatibility.  More thought would be needed to consider whether 4 could be backported effectively.

This is a non-trivial amount of work to get good coverage across implementations, so before putting together a more formal proposal it would be nice to know:

1.  If there is an appetite in the general community to consider these changes.
2.  If anybody from the community is interested in collaborating on proposals/implementation in this area.

Thanks,
Micah

[1] https://github.com/maxi-k/btrblocks
[2] https://github.com/facebookincubator/nimble
[3] https://blog.lancedb.com/lance-v2/
[4] https://github.com/apache/arrow/issues/39676
[5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34

