This is the commit:
https://github.com/apache/arrow/pull/43793/commits/e6e73f2cafaff846e8ed732a5239e5c4e5bc724c

Metadata size is important because it is always on the critical path for
any read/scan. The time to process metadata is the sum of fetching it from
cloud storage and parsing it. More advanced engines cache file metadata; in
such cases, the smaller the metadata, the more files an engine can cache.
In short, the cost (and thus the bar) for adding bytes to the footer is
very high; conversely, we should be highly incentivized to remove bytes
from metadata.

This optimization in particular is one of the larger ones, so I will
explore it. My plan is to implement the double footer in Spark and shadow
roll it out to understand its performance/efficacy.

To drive better design decisions for the new footer, we installed telemetry
across Databricks workloads which, among other things, tracks footer
features. The telemetry is still ramping up, but there is already enough
volume to make some observations:

   - 4M footers per hour (reads)
   - 300k have statistics
   - 200k have column/offset index
   - 4 have some row group larger than 2^31 uncompressed bytes
   - none have some row group with more than 2^31 values
   - none are encrypted
   - none are using bloom filters

Given the above numbers I do not worry too much about limiting row groups
to 2^31 bytes. The simplest fix for a writer is to limit row groups to 2^31
logical bytes and only then run encoding/compression. Given that row groups
typically target a size of 64/128 MB, that should work rather well unless
the data in question has extremely low entropy and compresses too well.
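The writer-side fix above can be sketched as follows. This is a hypothetical helper, not the actual Parquet writer API: `chunk_rows` and `row_logical_size` are illustrative names, and the point is only that sizes are measured *before* encoding, so no rechunking is ever needed afterward.

```python
# Hypothetical sketch of capping row groups at 2^31 logical (pre-encoding)
# bytes. Because the logical size is known before encoding/compression,
# the writer can flush a row group before the cap is crossed.

MAX_ROW_GROUP_BYTES = 2**31

def chunk_rows(rows, row_logical_size, max_bytes=MAX_ROW_GROUP_BYTES):
    """Split rows into row groups whose logical byte size stays below max_bytes."""
    groups, group, group_bytes = [], [], 0
    for row in rows:
        size = row_logical_size(row)
        # Flush the current group before it would cross the cap.
        if group and group_bytes + size >= max_bytes:
            groups.append(group)
            group, group_bytes = [], 0
        group.append(row)
        group_bytes += size
    if group:
        groups.append(group)
    return groups
```

Since row groups typically target 64/128 MB, this cap only ever triggers for extremely compressible data whose encoded size is far below its logical size.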


On Thu, Aug 29, 2024 at 11:22 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Which fields are we talking about exactly?
>
> On the byte size front, I see `RowGroup.total_byte_size`,
> `RowGroup.total_compressed_size` but also
> `ColumnMetadata.total_compressed_size` and
> `ColumnMetadata.total_uncompressed_size`.
>
> On the value count front, I see `RowGroup.num_rows` and
> `ColumnChunk.num_values`.
>
> For byte sizes, having to rechunk if they end up larger than 2^31 could
> be really annoying, as the byte size is not known before encoding.
>
> For value counts, chunking can be decided purely based on the input
> data, which is better. That said, it would still be non-trivial if
> repeated values are involved (for example, you would have to
> recursively inspect the children of an Arrow List array to find out
> whether the number of leaf values is greater than 2^31).
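For illustration, the recursive leaf count described above can be sketched like this. Plain nested Python lists stand in for Arrow List arrays here; with real Arrow data you would instead recurse into each List array's values child until reaching the non-nested leaf array. All names are illustrative assumptions.

```python
# Sketch of the non-trivial part: deciding whether a repeated column holds
# more than 2^31 leaf values. Nested Python lists stand in for Arrow List
# arrays in this toy version.

VALUE_LIMIT = 2**31

def count_leaf_values(values):
    """Count leaf (non-list) values, recursing through nested lists."""
    total = 0
    for v in values:
        if isinstance(v, list):
            total += count_leaf_values(v)
        else:
            total += 1  # leaves include None entries (nulls)
    return total

def needs_chunking(column):
    # The writer would have to split the row group if this returns True.
    return count_leaf_values(column) > VALUE_LIMIT
```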
>
> In the end, we are preoccupied with metadata size not becoming much
> larger than with Thrift, but I don't know if we really want to go after
> the smallest possible footprint?
>
> (but if we do, I would perhaps suggest LZ4-compressing the Flatbuffers
> metadata :-))
>
> Regards
>
> Antoine.
>
>
>
> On Wed, 28 Aug 2024 11:24:49 +0200
> Alkis Evlogimenos
> <alkis.evlogime...@databricks.com.INVALID>
> wrote:
> > Yes the gains are substantial. This is one of the biggest optimizations.
> >
> > They range from 25% to 75% (a 4x reduction) depending on how much other
> > stuff the footer has. Footers without stats get about 4x smaller; with
> > stats they are 2x smaller.
> >
> > On Wed, Aug 28, 2024 at 10:32 AM Antoine Pitrou
> > <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> >
> > >
> > > Do you gain much from limiting row groups to 2^31 values and bytes? I
> > > generally find 32-bit lengths to be a bit of an anti-pattern, as they
> > > require dedicated logic in the writer to ensure sufficient chunking.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On Mon, 26 Aug 2024 10:35:38 +0200
> > > Alkis Evlogimenos
> > > <alkis.evlogime...@databricks.com.INVALID>
> > > wrote:
> > > > At the top of the benchmark code I have numbers and a short
> > > > description of each optimization:
> > > >
> > > > https://github.com/apache/arrow/blob/7f550da9980491a4167318db084e1b50cb100b0f/cpp/src/parquet/metadata3_benchmark.cc#L34-L129
> > > >
> > > > Summarizing them here:
> > > > - statistics min/max: use fixed 4/8 bytes for types of known length
> > > > and leave variable-length encoding for binary strings
> > > > - limit row groups to 2^31 values and 2^31 bytes and make all column
> > > > chunk offsets relative to the row group offset
> > > > - skip writing `num_values` in column chunk if it is the same as
> > > > `num_values` in row group (this is very common in practice)
> > > > - remove `encoding_stats` and replace with a boolean denoting whether
> > > > all pages are dict encoded (engines use this to do dictId-only
> > > > execution)
> > > > - remove `path_in_schema` (can be computed dynamically after parsing
> > > > as necessary)
> > > > - remove deprecated `file_offset` in column chunk
> > > > - statistics min/max for strings: encode as common prefix + fixed 8
> > > > bytes for min/max, zero padded
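A minimal sketch of the last bullet (an assumed layout, not the actual spec encoding): factor out the common prefix of the two bounds and store fixed 8-byte, zero-padded remainders. Note that a real encoder must also round the truncated max upward so it remains a valid upper bound; that adjustment is omitted here for brevity.

```python
# Illustrative encoding of string min/max statistics as a shared prefix
# plus fixed-width 8-byte suffixes. NOTE: naively truncating the max can
# make it smaller than actual values; a real encoder rounds it up.

SUFFIX_LEN = 8

def encode_min_max(min_s: bytes, max_s: bytes):
    # Longest common prefix of the two bounds.
    n = 0
    while n < min(len(min_s), len(max_s)) and min_s[n] == max_s[n]:
        n += 1
    prefix = min_s[:n]
    # Fixed 8-byte remainders, truncated and zero-padded.
    lo = min_s[n:n + SUFFIX_LEN].ljust(SUFFIX_LEN, b"\x00")
    hi = max_s[n:n + SUFFIX_LEN].ljust(SUFFIX_LEN, b"\x00")
    return prefix, lo, hi
```

The win comes from long shared prefixes (common in sorted or path-like data): the prefix is stored once and each bound costs a fixed 8 bytes instead of a full variable-length string.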
> > > >
> > > > Cheers,
> > > >
> > > > On Fri, Aug 23, 2024 at 6:59 PM Jan Finis
> > > > <jpfinis-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > > >
> > > > > Amazing, thanks Alkis!
> > > > >
> > > > > Can you give a quick comment on what specific fact made the
> > > > > footers so much smaller in their flatbuf representation? Given
> > > > > that flatbuf compresses way less aggressively than thrift, this
> > > > > seems counterintuitive; I would rather have expected quite some
> > > > > size increase.
> > > > >
> > > > > Cheers,
> > > > > Jan
> > > > >
> > > > > On Fri, Aug 23, 2024 at 03:41, Corwin Joy
> > > > > <corwinjoy-Re5JQEeQqe8-XMD5yJDbdMReXY1tMh2IBgC/g2k4z...@public.gmane.org> wrote:
> > > > >
> > > > > > This looks great! I have added some initial simple comments on
> > > > > > the PR that may help others who want to take a look.
> > > > > >
> > > > > > On Thu, Aug 22, 2024 at 5:46 PM Julien Le Dem
> > > > > > <julien-1oDqGaOF3LlQFI55V6+gNQ-XMD5yJDbdMReXY1tMh2IBti2O/jbr...@public.gmane.org> wrote:
> > > > > >
> > > > > > > this looks great,
> > > > > > > thank you for sharing.
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Aug 22, 2024 at 10:42 AM Alkis Evlogimenos
> > > > > > > <alkis.evlogimenos-z4fuwbjybqlnpcjqcok8iauzikbjl79t-xmd5yjdbdmrexy1tmh2...@public.gmane.org> wrote:
> > > > > > >
> > > > > > > > Hey folks.
> > > > > > > >
> > > > > > > > As promised I pushed a PR to the main repo with my attempt to
> > > > > > > > use flatbuffers for metadata for parquet:
> > > > > > > > https://github.com/apache/arrow/pull/43793
> > > > > > > >
> > > > > > > > The PR builds on top of the metadata extensions in parquet
> > > > > > > > https://github.com/apache/parquet-format/pull/254 and tests
> > > > > > > > how fast we can parse thrift, thrift+flatbuf, and flatbuf
> > > > > > > > alone, and also how much time it takes to encode flatbuf. In
> > > > > > > > addition, at the start of the benchmark it prints out the
> > > > > > > > number of row groups/column chunks and thrift/flatbuffer
> > > > > > > > serialized bytes.
> > > > > > > >
> > > > > > > > I structured the commits to contain one optimization each to
> > > > > > > > make their effects more visible. I have tracked the progress
> > > > > > > > at the top of the benchmark:
> > > > > > > > https://github.com/apache/arrow/blob/7f550da9980491a4167318db084e1b50cb100b0f/cpp/src/parquet/metadata3_benchmark.cc#L34-L129
> > > > > > > >
> > > > > > > > The current state is complete sans encryption support. All
> > > > > > > > the bugs are mine but the ideas are coming from a few folks
> > > > > > > > inside Databricks. As expected, parsing the thrift+extension
> > > > > > > > footer incurs a very small regression (~1%). Parsing/verifying
> > > > > > > > flatbuffers is >20x faster than thrift, so I haven't tried to
> > > > > > > > make changes to its structure for speed. In the last commit
> > > > > > > > the size of flatbuffer metadata is anywhere from slightly
> > > > > > > > smaller to more than 4x smaller (!!!).
> > > > > > > >
> > > > > > > > Unfortunately I can't share the footers I used yet. I am
> > > > > > > > going to wait for donations
> > > > > > > > <https://github.com/apache/parquet-benchmark/pull/1> to the
> > > > > > > > parquet-benchmarks repository and rerun the benchmark against
> > > > > > > > them.
> > > > > > > >
> > > > > > > > I would like to invite anyone interested in collaborating to
> > > > > > > > take a look at the PR, consider the design decisions made,
> > > > > > > > experiment with it, and contribute.
> > > > > > > >
> > > > > > > > Thank you!
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > >
> >
>
>
>
>
