Hi again,

Ok, a quick summary of my current feedback on this:

- decoding speed measurements are given, but not footer size
  measurements; it would be interesting to have both

- it's not obvious whether the stated numbers are for reading all
  columns or a subset of them

- optional LZ4 compression is mentioned, but no numbers are given for
  it; it would be nice if numbers were available for both uncompressed
  and compressed footers

- the numbers seem quite underwhelming currently; I think most of us
  were expecting massive speed improvements given past discussions

- I'm firmly against narrowing sizes to 32 bits; making the footer more
  compact is worthwhile, but not at the cost of reducing its usefulness
  or generality


A more general proposal: given the slightly underwhelming performance
numbers, have nested Flatbuffers been considered as an alternative?

For example, the RowGroup table could become:
```
table ColumnChunk {
  file_path: string;
  meta_data: ColumnMetadata;
  // etc.
}

table EncodedColumnChunk {
  // Flatbuffers-encoded ColumnChunk, to be decoded/validated individually
  column: [ubyte];
}

table RowGroup {
  columns: [EncodedColumnChunk];
  total_byte_size: int;
  num_rows: int;
  sorting_columns: [SortingColumn];
  file_offset: long;
  total_compressed_size: int;
  ordinal: short = null;
}
```
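To illustrate the reading side, here is a rough C++ sketch of how a reader
could decode and validate only the column chunks it actually projects,
leaving all other nested buffers untouched. The `footer_generated.h` header,
the `DecodeProjectedColumns` helper, and treating RowGroup as the buffer root
are assumptions for illustration only, not part of the proposal; the `column`
field could also be declared with the `nested_flatbuffer` attribute so that
flatc generates a `column_nested_root()` accessor.

```
#include <cstdint>
#include <vector>

#include "flatbuffers/flatbuffers.h"
#include "footer_generated.h"  // hypothetical output of `flatc --cpp` on the schema above

// Sketch only: decode the metadata of a projected subset of columns,
// leaving every other nested ColumnChunk buffer untouched.
std::vector<const ColumnChunk*> DecodeProjectedColumns(
    const uint8_t* footer_buf, size_t footer_size,
    const std::vector<int>& projected_columns) {
  std::vector<const ColumnChunk*> result;

  // Verify the outer buffer first; this is cheap because the nested
  // payloads are just opaque byte vectors at this level.
  flatbuffers::Verifier outer(footer_buf, footer_size);
  if (!outer.VerifyBuffer<RowGroup>(nullptr)) return result;

  // Accessing the outer RowGroup only follows offsets; it does not
  // parse any nested ColumnChunk.
  const auto* row_group = flatbuffers::GetRoot<RowGroup>(footer_buf);
  const auto* columns = row_group->columns();

  for (int idx : projected_columns) {
    const EncodedColumnChunk* encoded = columns->Get(idx);
    const auto* bytes = encoded->column();

    // Validate just this nested buffer; unprojected columns are never
    // verified or decoded.
    flatbuffers::Verifier verifier(bytes->Data(), bytes->size());
    if (!verifier.VerifyBuffer<ColumnChunk>(nullptr)) {
      continue;  // corrupt chunk: skip or report, depending on policy
    }
    result.push_back(flatbuffers::GetRoot<ColumnChunk>(bytes->Data()));
  }
  return result;
}
```

The point being that decoding the outer RowGroup only touches offsets, while
each column's metadata is verified and materialized lazily, per column.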

Regards

Antoine.



On Thu, 11 Sep 2025 08:41:34 +0200
Alkis Evlogimenos
<[email protected]>
wrote:
> Hi all. I am sharing as a separate thread the proposal for the footer
> change we have been working on:
> https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit
> .
> 
> The proposal outlines the technical aspects of the design and the
> experimental results of shadow testing this in production workloads. I
> would like to discuss the proposal's most salient points in the next sync:
> 1. the use of flatbuffers as footer serialization format
> 2. the additional limitations imposed on parquet files (row group size
> limit, row group max num row limit)
> 
> I would prefer comments on the google doc to facilitate async discussion.
> 
> Thank you,
> 


