Re: flatbuffers metadata: 32-bit vs. 64-bit sizes

Antoine Pitrou Thu, 29 Aug 2024 02:22:45 -0700


Which fields are we talking about exactly?


On the byte size front, I see `RowGroup.total_byte_size`,
`RowGroup.total_compressed_size` but also
`ColumnMetadata.total_compressed_size` and
`ColumnMetadata.total_uncompressed_size`.

On the value count front, I see `RowGroup.num_rows` and
`ColumnChunk.num_values`.

For byte sizes, having to rechunk if they end up larger than 2^31 could
be really annoying, as the byte size is not known before encoding.

For value counts, chunking can be decided purely based on the input
data, which is better. That said, it would still be non-trivial if
repeated values are involved (for example, you would have to
recursively inspect the children of an Arrow List array to find out
whether the number of leaf values is greater than 2^31).

In the end, we are preoccupied with metadata size not becoming much
larger than with Thrift, but I don't know if we really want to go after
the smallest possible footprint?

(but if we do, I would suggest perhaps LZ4-compress the Flatbuffers
metadata :-))

Regards

Antoine.



On Wed, 28 Aug 2024 11:24:49 +0200
Alkis Evlogimenos
<alkis.evlogime...@databricks.com.INVALID>
wrote:
> Yes the gains are substantial. This is one of the biggest optimizations.
> 
> They are between 25% to 75% (4x reduction) depending on how much other
> stuff the footer has. Footers without stats get about 4x smaller. With
> stats they are 2x smaller.
> 
> On Wed, Aug 28, 2024 at 10:32 AM Antoine Pitrou 
> <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> 
> >
> > Do you gain much from limiting row groups to 2^31 values and bytes? I
> > generally find 32-bit lengths to a bit an anti-pattern, as they require
> > dedicated logic in the writer to ensure sufficient chunking.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Mon, 26 Aug 2024 10:35:38 +0200
> > Alkis Evlogimenos
> > <alkis.evlogime...@databricks.com.INVALID>
> > wrote:  
> > > At the top of the benchmark code I have numbers and short description of
> > > each optimization:
> > >  
> > https://github.com/apache/arrow/blob/7f550da9980491a4167318db084e1b50cb100b0f/cpp/src/parquet/metadata3_benchmark.cc#L34-L129
> >   
> > >
> > > Summarizing them here:
> > > - statistics min/max: use fixed 4/8 bytes for types of known length and
> > > leave variable length for encoding for binary strings
> > > - limit row groups to 2^31 values and 2^31 bytes and make all column  
> > chunk  
> > > offsets relative to the row group offset
> > > - skip writing `num_values` in column chunk if it is the same as
> > > `num_values` in row group (this is very common in practice)
> > > - remove `encoding_stats` and replace with a boolean denoting that all
> > > pages are dict encoded or not (engines use this to do dictId only  
> > execution)  
> > > - remove `path_in_schema` (can be computed dynamically after parsing as
> > > necessary)
> > > - remove deprecated `file_offset` in column chunk
> > > - statistics min/max for strings: encode as common prefix + fixed 8 bytes
> > > for min/max, zero padded
> > >
> > > Cheers,
> > >
> > > On Fri, Aug 23, 2024 at 6:59 PM Jan Finis <  
> > jpfinis-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:  
> > >  
> > > > Amazing, thanks Alkis!
> > > >
> > > > Can you give a quick comment on what specific fact made the footers so  
> > much  
> > > > smaller in their flatbuf representation? Given that flatbuf compresses  
> > way  
> > > > less aggressively than thrift, this seems counterintuitive, I would  
> > have  
> > > > rather expected quite some size gain.
> > > >
> > > > Cheers,
> > > > Jan
> > > >
> > > > Am Fr., 23. Aug. 2024 um 03:41 Uhr schrieb Corwin Joy  
> > <corwinjoy-Re5JQEeQqe8-XMD5yJDbdMReXY1tMh2IBgC/g2k4z...@public.gmane.org  
> > > > >:  
> > > >  
> > > > > This looks great! I have added some initial simple comments on the  
> > PR  
> > > > that  
> > > > > may help others who want to take a look.
> > > > >
> > > > > On Thu, Aug 22, 2024 at 5:46 PM Julien Le Dem  
> > <julien-1oDqGaOF3LlQFI55V6+gNQ-XMD5yJDbdMReXY1tMh2IBti2O/jbr...@public.gmane.org>
> >  wrote:  
> > > > >  
> > > > > > this looks great,
> > > > > > thank you for sharing.
> > > > > >
> > > > > >
> > > > > > On Thu, Aug 22, 2024 at 10:42 AM Alkis Evlogimenos
> > > > > > <  
> > alkis.evlogimenos-z4fuwbjybqlnpcjqcok8iauzikbjl79t-xmd5yjdbdmrexy1tmh2...@public.gmane.org>
> > wrote:  
> > > > > >  
> > > > > > > Hey folks.
> > > > > > >
> > > > > > > As promised I pushed a PR to the main repo with my attempt to use
> > > > > > > flatbuffers for metadata for parquet:
> > > > > > > https://github.com/apache/arrow/pull/43793
> > > > > > >
> > > > > > > The PR builds on top of the metadata extensions in parquet
> > > > > > > https://github.com/apache/parquet-format/pull/254 and tests how  
> > fast  
> > > > > we  
> > > > > > > can
> > > > > > > parse thrift, thrift+flatbuf, flatbuf alone and also how much  
> > time it  
> > > > > > takes  
> > > > > > > to encode flatbuf. In addition at the start of the benchmark it  
> > > > prints  
> > > > > > out  
> > > > > > > the number of row groups/column chunks and thrift/flatbuffer  
> > > > serialized  
> > > > > > > bytes.
> > > > > > >
> > > > > > > I structured the commits to contain one optimization each to  
> > make  
> > > > their  
> > > > > > > effects more visible. I have tracked the progress at the top of  
> > the  
> > > > > > > benchmark
> > > > > > > <
> > > > > > >  
> > > > > >  
> > > > >  
> > > >  
> > https://github.com/apache/arrow/blob/7f550da9980491a4167318db084e1b50cb100b0f/cpp/src/parquet/metadata3_benchmark.cc#L34-L129
> >  
> > > > > > > >  
> > > > > > > .
> > > > > > >
> > > > > > > The current state is complete sans encryption support. All the  
> > bugs  
> > > > are  
> > > > > > > mine but ideas are coming from a few folks inside Databricks.  
> > As  
> > > > > expected  
> > > > > > > parsing the thrift+extension footer incurs a very small  
> > regression  
> > > > > (~1%).  
> > > > > > > Parsing/verifying flatbuffers is >20x faster than thrift so I  
> > haven't  
> > > > > > tried  
> > > > > > > to make changes to its structure for speed. In the last commit  
> > the  
> > > > size  
> > > > > > of  
> > > > > > > flatbuffer metadata is anywhere from slightly smaller to more  
> > than 4x  
> > > > > > > smaller (!!!).
> > > > > > >
> > > > > > > Unfortunately I can't share the footers I used yet. I am going  
> > to  
> > > > wait  
> > > > > > for  
> > > > > > > donations <https://github.com/apache/parquet-benchmark/pull/1>  
> > to  
> > > > the  
> > > > > > > parquet-benchmarks repository and rerun the benchmark against  
> > them.  
> > > > > > >
> > > > > > > I would like to invite anyone interested in collaborating to  
> > take a  
> > > > > look  
> > > > > > at  
> > > > > > > the PR, consider the design decisions made, experiment with it,  
> > and  
> > > > > > > contribute.
> > > > > > >
> > > > > > > Thank you!
> > > > > > >  
> > > > > >  
> > > > >  
> > > >  
> > >  
> >
> >
> >
> >  
>

Re: flatbuffers metadata: 32-bit vs. 64-bit sizes

Reply via email to