Just to double check we're all on the same page w.r.t. metadata, I presume we're referring to FileMetaData [1]? If so, this contains information on the schema and the locations of the column chunks. All statistics information, including that of the column chunks, can be referenced solely by offset rather than stored inline, although most implementations do inline this metadata by default. Whilst the thrift compact encoding may not be the most efficient thing in the world, it isn't majorly different from protobuf or avro, so I would hazard that simply storing statistics separately might be sufficient for the wide-column use cases, without requiring a switch to something like flatbuffers. Provided the column chunk metadata is still written near the footer, it would still be possible to fetch all the necessary metadata in a single IO request; the Rust implementation already does something similar for reading the page index.
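For illustration only, a minimal sketch of that single-request idea in Java (the RangedReader helper and the 1 MiB tail guess are assumptions on my part, not what parquet-rs actually does):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class TailFetch {
  // How much of the file tail to speculatively fetch so that a single ranged
  // GET covers the page index, the column chunk metadata and the footer.
  // 1 MiB is an arbitrary guess; if it turns out too small, only the missing
  // prefix needs a second read.
  static final int TAIL_GUESS = 1 << 20;

  interface RangedReader {
    ByteBuffer read(long offset, int length); // e.g. one HTTP range request
  }

  static ByteBuffer readFooterFromTail(RangedReader reader, long fileLength) {
    long start = Math.max(0, fileLength - TAIL_GUESS);
    ByteBuffer tail = reader.read(start, (int) (fileLength - start));
    // The file ends with a 4-byte little-endian footer length and the "PAR1" magic.
    int footerLen = tail.order(ByteOrder.LITTLE_ENDIAN).getInt(tail.limit() - 8);
    // Slice the footer out of the bytes already in memory; a page index written
    // just before the footer is sitting in this same buffer as well.
    ByteBuffer footer = tail.duplicate();
    footer.position(tail.limit() - 8 - footerLen);
    footer.limit(tail.limit() - 8);
    return footer.slice();
  }
}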

I feel I should also highlight that flatbuffers are not a silver bullet: they add an indirection to every field access (unless using structs, which can't be evolved), and performing offset validation can, depending on the payload, take longer than decoding an equivalent protobuf / avro / thrift message. The major thing that tends to give flatbuffers an edge is that parsing doesn't allocate; however, there is no reason a thrift, protobuf, or avro decoder needs to allocate strings either when decoding from a fixed buffer, and the parquet schema does not contain many nested lists, which are the other major source of allocations.
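To make the allocation point concrete, here is a hedged sketch (the Utf8View type is something I am making up for illustration, not an existing thrift or parquet-java API): once the whole footer sits in a single buffer, a decoder can hand back views into that buffer and only materialise java.lang.String when a caller actually asks for one.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// A zero-copy "string" field: an offset and length into the footer buffer.
final class Utf8View {
  private final ByteBuffer footer; // the fixed buffer holding the decoded footer
  private final int offset;
  private final int length;

  Utf8View(ByteBuffer footer, int offset, int length) {
    this.footer = footer;
    this.offset = offset;
    this.length = length;
  }

  // Compare against a column name without allocating anything.
  boolean equalsAscii(String name) {
    if (name.length() != length) return false;
    for (int i = 0; i < length; i++) {
      if (footer.get(offset + i) != (byte) name.charAt(i)) return false;
    }
    return true;
  }

  @Override
  public String toString() { // allocate only when a real String is wanted
    byte[] bytes = new byte[length];
    ByteBuffer dup = footer.duplicate();
    dup.position(offset);
    dup.get(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }
}

Projection pushdown then becomes a matter of comparing such views against the requested column names, skipping fields whose bytes are never materialised.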

I therefore can't help thinking there is a lot that could be done to improve the wide-column use case within parquet implementations before necessarily needing to reach for format changes.

[1]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1109

On 14/05/2024 14:31, Steve Loughran wrote:
BTW, has everyone read "An Empirical Evaluation of Columnar Storage
Formats"?

https://arxiv.org/abs/2304.05028

A good review of how things could be better, with real numbers. It highlights that encoding plugins may be inefficient, based on the ORC experience.

w.r.t metadata


    1. could the old and the new footers coexist, so new code would read the
    new footer?
    2. I want iceberg to store the file length and footer offset in its indices,
    so that a HEAD and a GET can be saved (see the sketch below). It all adds up.
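To sketch what that could look like (the DataFileEntry fields and the RangedReader interface below are my own assumptions, not Iceberg's actual metadata layout): once the catalog records the file length and footer offset, the reader can skip the HEAD entirely and pull the footer with one ranged GET.

import java.nio.ByteBuffer;

class KnownLengthFooterRead {
  // What table metadata (e.g. an Iceberg manifest) could carry per data file.
  static final class DataFileEntry {
    final String path;
    final long fileLength;   // saves the HEAD request
    final long footerOffset; // saves guessing how much of the tail to fetch
    DataFileEntry(String path, long fileLength, long footerOffset) {
      this.path = path;
      this.fileLength = fileLength;
      this.footerOffset = footerOffset;
    }
  }

  interface RangedReader {
    ByteBuffer read(String path, long offset, int length); // one HTTP range request
  }

  static ByteBuffer readFooter(RangedReader reader, DataFileEntry entry) {
    // One GET covering [footerOffset, fileLength): the footer plus the
    // trailing 4-byte length and magic.
    int len = (int) (entry.fileLength - entry.footerOffset);
    return reader.read(entry.path, entry.footerOffset, len);
  }
}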

I would love to get deeply involved in this as it is *interesting*, but I cannot make any commitments, and if I just turn up and give the other teams (hive, spark, impala) code, they'll either not pick it up or expect me to maintain it.

What I can offer is support down the stack, where we can improve the storage APIs to expose more aspects of cloud storage so that parquet can optimise its integration with post-POSIX storage:

Optimise for cloud storage reads, where seek() is very expensive and every filesystem call has a literal cost, but simultaneous parallel HTTP requests work well.
SSD support, again where parallel DMA fetches are fast.


The vector IO API is a key part of this, but we could do something for writes too: cloud stores with support for partial/queued writes would allow for parallel block uploads, while native IO would use java.nio WritableByteChannel.write(ByteBuffer src). Something like:

interface BlockWrite {
  // smallest size permitted for every block except the final one
  int minimumBlockSizeForAllButLastBlock();
  // limits imposed by the store, e.g. multipart upload restrictions
  int maximumBlockCount();
  int maximumBlockSize();
  // queue a block for upload; blocks may be uploaded in parallel
  Future<Integer> writeBlock(ByteBuffer buffer, boolean isLastBlock);
  // complete the upload once all queued blocks have been written
  Future<Result> close();
}
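As a rough illustration of how a writer might drive such an interface for parallel uploads (BlockWrite and Result are from the sketch above; everything else here is assumed):

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;

class ParallelBlockUploader {
  // Queue each buffered block; the store can upload them concurrently.
  static void upload(BlockWrite out, List<ByteBuffer> blocks) throws Exception {
    List<Future<Integer>> pending = new ArrayList<>();
    for (int i = 0; i < blocks.size(); i++) {
      boolean last = (i == blocks.size() - 1);
      pending.add(out.writeBlock(blocks.get(i), last));
    }
    // Wait for every queued upload before completing the object.
    for (Future<Integer> f : pending) {
      f.get();
    }
    out.close().get();
  }
}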

There's also ongoing work on prefetching/footer caching, which really the app/library should be controlling, rather than having the filesystem clients guess what a good footer length is.
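For illustration, something along these lines, where the reader supplies the hints instead of the client guessing (this builder and its method names are entirely hypothetical, not an existing filesystem API):

import java.nio.ByteBuffer;

// Hypothetical open-with-hints interface: the application, which knows the
// file length and how much footer it wants, passes that knowledge down.
interface HintedFileOpener {
  HintedFileOpener withExpectedFileLength(long length);       // skip the HEAD
  HintedFileOpener withFooterPrefetchBytes(int trailingBytes); // prefetch the tail on open
  SeekableInput open(String path);

  interface SeekableInput {
    ByteBuffer readFully(long offset, int length);
  }
}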




On Tue, 14 May 2024 at 05:25, Julien Le Dem <jul...@apache.org> wrote:

It's great to see this thread. Thank you Micah for facilitating
the discussion.

my 2cts:
1. I like the idea of having feature checks rather than an absolute version
number. I am sorry for the confusion created by the V2 moniker. Those were
indeed incremental and backwards compatible additions to the v1 spec and
not a rewrite of the format.

a. It would be great to have a formal release cadence but someone needs to
dedicate time to drive the process.
b. IMO we need an implementer of a query engine to "sponsor" adding a new
feature to the format. They would implement usage at the same time so it
can be validated that additions to the spec achieve the expected perf
improvement in the context of a query engine. For example, some years ago,
Impala was implementing usage of new indexes at the same time they were
specified.
Tracking what engines and versions support the new feature would be useful.
Enough adoption would make it default. This requirement is very different
for a new encoding vs a new additional index or stat.

2. I also think "encoding plugins" are not aligned with the philosophy of Parquet, as the strength of the format is to be fully specified across languages and not just the output of a library.
I do think new encodings and a new metadata representation would be welcome. Flatbuffers did not exist when I picked Thrift for the footer, and the current metadata representation is a pain to read partially or efficiently. That said, big changes like this need a clear path for adoption and for handling the transition period. The file does have a magic number "PAR1" at the beginning and the end that might be used for such incompatible changes at the metadata layer.
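For what it's worth, that trailing magic is cheap to sniff before parsing anything: the last 8 bytes of the file are the 4-byte footer length followed by the 4-byte magic. A small sketch (the "PAR3" value below is only a placeholder for whatever a new incompatible metadata layout would use, not a real proposal):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class FooterMagic {
  static final byte[] PAR1 = "PAR1".getBytes(StandardCharsets.US_ASCII);
  // Placeholder magic for a hypothetical incompatible metadata layout.
  static final byte[] PAR3 = "PAR3".getBytes(StandardCharsets.US_ASCII);

  // `tail` must hold at least the last 8 bytes of the file:
  // the 4-byte footer length followed by the 4-byte magic.
  static boolean endsWith(ByteBuffer tail, byte[] magic) {
    int base = tail.limit() - 4;
    for (int i = 0; i < 4; i++) {
      if (tail.get(base + i) != magic[i]) return false;
    }
    return true;
  }
}

A reader could check the tail magic first and pick the right parser, while an old reader that only knows PAR1 would at least fail cleanly rather than misparse the new layout.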

I do think it is easier to integrate more encodings in the ecosystem (say
btrblocks) by adding them to Parquet than by creating a new file format
that would need to build adoption from scratch.

3. Agreed, it is an effort and requires collaboration from key open source and proprietary engines implementing parquet readers/writers. One way to facilitate the transition, IMO, would be to make sure native parquet-arrow implementations are included, which is a bit lacking in the java implementation.

Best
Julien

On Mon, May 13, 2024 at 10:45 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

Thanks everybody for the input.  I'll try to summarize some main points and my thoughts below.

1.  "V3" branding is problematic and getting adoption is difficult with V2.  I agree, we should not lump all potential improvements into a single V3 milestone (I used V3 to indicate that at least some changes might be backward incompatible with existing format revisions).  In my mind, the way to make it more likely that new features are used would be to start thinking about a more formal release process for them.  For example:
     a.  A clear cadence of major version library releases (e.g. maybe once per year).
     b.  A clear policy for when a new feature becomes the default in a library release (e.g. as a strawman, once the feature lands in the reference implementation, it is eligible to become default in the next major release that occurs >1 year later).
     c.  For reference implementations that are effectively doing major version releases on each release, I think following parquet-mr for flipping defaults would make sense.

2.  How much of the improvements can be a clean slate vs evolutionary/implementation optimizations?  I really think this depends on which aspects we are tackling. For metadata issues, I think it might pay to rethink things from the ground up, but any proposals along these lines should obviously have clear rationales and benchmarks to clarify how the decisions are made.  For better encodings, most likely work can be added to the existing format.  I don't think allowing for arbitrary plugin encodings would be a good thing.  I believe one of the reasons that Parquet has been successful has been its specification, which allows for guaranteed compatibility.

3.  Amount of effort required/Sustainability of effort.  I agree this is a big risk. It will take a lot of work to cover the major parquet bindings, which is why I started the thread. Personally, I am fairly time constrained and unless my employer is willing to approve devoting work hours to the project I likely won't be able to contribute much.  However, it seems like there might be enough interest from the community that I can potentially make the case for doing so.

Thanks,
Micah

On Mon, May 13, 2024 at 10:41 AM Ed Seidl <etse...@live.com> wrote:

I think the whole "V1" vs "V2" mess is unfortunate. IMO there is only one version of the Parquet file format. At its core, the data layout (row groups composed of column chunks composed of Dremel-encoded pages) has never changed. Encodings/codecs/structures have been added to that core, but always in a backwards compatible way.

I agree that many of the perceived shortcomings might be addressed without breaking changes to the file format. I myself would be interested in exploring ways to address the point lookup and wide tables issues while maintaining backwards compatibility. But that said, if there are ways to achieve large performance gains that would necessitate an actual new file format version (such as replacing thrift, a new metadata organization, or some alternative to Dremel), I'd be open to exploring those options as well.

Thanks,
Ed

On 5/11/24 3:58 PM, Micah Kornfield wrote:
Hi Parquet Dev,
I wanted to start a conversation within the community about working on a new revision of Parquet.  For context, there have been a bunch of new formats [1][2][3] that show there is decent room for improvement across data encodings and how metadata is organized.

Specifically, in a new format revision I think we should be thinking about the following areas for improvements:
1.  More efficient encodings that allow for data skipping and SIMD optimizations.
2.  More efficient metadata handling for deserialization and projection, to address areas where metadata deserialization time is not trivial [4].
3.  Possibly thinking about different encodings instead of repetition/definition for repeated and nested fields.
4.  Support for optimizing semi-structured data (e.g. JSON or Variant type) that can shred elements into individual columns (a recent thread in Iceberg mentions doing this at the metadata level [5]).

I think the goals of V3 would be to provide existing API compatibility as broadly as possible (possibly with some performance loss) and expose new API surface areas where appropriate to make use of new elements.  New encodings could be backported so they can be made use of without metadata changes.  I think unfortunately that for points 2 and 3 we would want to break file level compatibility.  More thought would be needed to consider whether 4 could be backported effectively.

This is a non-trivial amount of work to get good coverage across implementations, so before putting together a more formal proposal it would be nice to know:

1.  If there is an appetite in the general community to consider these changes.
2.  If anybody from the community is interested in collaborating on proposals/implementation in this area.

Thanks,
Micah

[1] https://github.com/maxi-k/btrblocks
[2] https://github.com/facebookincubator/nimble
[3] https://blog.lancedb.com/lance-v2/
[4] https://github.com/apache/arrow/issues/39676
[5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34

