Sounds good to me. I left some comments at the beginning of the proposal:
<https://docs.google.com/document/d/1mYTEK5eA6IjOc6yxRCvEBIzdbJO-rjXbr3YHtgWKNdo/edit?tab=t.0>
I think this may be partly a misunderstanding of how exactly the data files
are stored in the root manifest, combined with how the tree does need to be
rebalanced so that a tenable root is maintained; that way the amount of work
done when reading the root can be bounded, even in the presence of hot data.
If there are still doubts, let's chat about it in the sync (I will schedule
this and notify in the existing single file commits thread).

Thanks,
Amogh Jahagirdar

On Thu, Jan 1, 2026 at 4:26 PM vaquar khan <[email protected]> wrote:

> Hi Amogh,
>
> I think it is best to meet and discuss this directly rather than
> continuing a long email trail, as it seems we are looking at this issue
> through two different lenses.
>
> My perspective comes from my day-to-day job helping big financial
> customers run billions of data records in production using Iceberg. My
> concerns regarding the Adaptive Metadata Tree and scanning overhead are
> rooted in the operational realities I see day to day in these high-scale
> environments.
>
> I am happy to pause here and join the upcoming sync to discuss in detail.
>
>
> Regards,
>
> Viquar khan
>
> On Thu, 1 Jan 2026 at 15:43, Amogh Jahagirdar <[email protected]> wrote:
>
>> >If a reader queries for bucket(user_id) = 5, the Parquet footer stats
>> for *every single file* will report a range of [0, 15]. Min/max pruning
>> eliminates nothing. To determine if bucket 5 actually exists in those
>> files, the REST server or engine must now project the column chunk and
>> decode the dictionary/data pages for all 100 entries.
>> >1. In V3, we could skip a whole manifest group using a single summary
>> tuple (O(1) skipping). In this V4 scenario, we move to O(N) internal
>> scanning. Does the spec intend to accept this linear scan cost as the
>> trade-off for write throughput, or is there a "pre-index" mechanism I’m
>> missing that avoids decoding data pages for every sub-second query?
>> >2. For a high-concurrency REST Catalog, the CPU and memory overhead of
>> performing "partial" Parquet decodes on hundreds of inlined entries per
>> request seems non-trivial. How do we ensure the catalog remains performant
>> if it has to become a "mini-query engine" just to perform basic partition
>> pruning?
>>
>> The provided example doesn't make sense to me since it contradicts
>> bucketing as a partition transform. If the table is partitioned by
>> bucket(user_id), that *necessarily* means all rows in a given data file
>> share the same bucket value of user_id (a file partitioned by something
>> must have a single partition value). So File 1 could *not* contain a
>> spread of buckets {0, 5, 14, 15}, and the same principle applies to the
>> rest of the files. Of course there can be many user_ids in a file, but
>> the *bucket* they fall into must be the same for the whole file,
>> otherwise we cannot say that file is bucketed.
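>>
>> To make that concrete, here's a rough Python sketch (illustrative only;
>> the hash below is a stand-in for Iceberg's actual Murmur3-based bucket
>> transform) of how a partitioned writer routes rows, which is why each
>> resulting file carries exactly one bucket value:
>>
>> from collections import defaultdict
>>
>> NUM_BUCKETS = 16
>>
>> def toy_bucket(user_id: int) -> int:
>>     # Stand-in for bucket(user_id, 16); Iceberg actually uses Murmur3.
>>     return hash(user_id) % NUM_BUCKETS
>>
>> rows = [{"user_id": uid} for uid in range(1000)]
>>
>> # A partitioned writer routes each row to the file for its partition
>> # value, so every file it produces contains rows from a single bucket.
>> files = defaultdict(list)
>> for row in rows:
>>     files[toy_bucket(row["user_id"])].append(row)
>>
>> for bucket_value, file_rows in sorted(files.items()):
>>     print(f"file for bucket={bucket_value}: {len(file_rows)} rows")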
>>
>> Let's say it wasn't bucketing, and it was some arbitrary clustering
>> function represented through expressions. We can't say it's a regression
>> because today this pruning based on column or derived stats is not even
>> possible at the root level of the metadata tree, since manifest lists only
>> have the upper/lower on partition values that exist in a given manifest. So
>> the new version should be a net improvement for planning costs.
>>
>> *In fact, even if this were possible today*, in this scenario we still
>> wouldn't be able to say it's a regression in planning, because we'd then
>> be comparing the cost of I/O and decoding for N manifest entries in the
>> Avro manifest list (with today's fast append, in the worst case there'd
>> be a single manifest per write, putting manifest rewrites aside) against
>> the cost of I/O and decoding for X data files in the root buffer plus M
>> manifests (the fanout from the root of the tree) in the Parquet root
>> manifest. So if we're analyzing the cost of reading the root of the
>> metadata tree, the core of it is really the cost of reading the V4
>> Parquet root manifest vs the cost of reading the Avro manifest list.
>> This comes down to numbers (more on that later), but assuming comparable
>> sizes and logical contents in the root, just from a theoretical
>> perspective we can see that planning cost is an improvement in V4.
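>>
>> Stated roughly as a back-of-envelope model (every symbol here is just a
>> placeholder for the analysis above, not a measured number):
>>
>> def v3_root_read_cost(n_manifests: int, avro_entry_cost: float) -> float:
>>     # Avro manifest list: decode one entry per manifest at the root.
>>     return n_manifests * avro_entry_cost
>>
>> def v4_root_read_cost(x_inlined_files: int, m_child_manifests: int,
>>                       parquet_entry_cost: float) -> float:
>>     # Parquet root manifest: decode the inlined data-file entries in the
>>     # root buffer plus one entry per child manifest (the tree's fanout).
>>     return (x_inlined_files + m_child_manifests) * parquet_entry_cost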
>>
>> >If the solution to the scan cost is to flush to leaf manifests more
>> frequently, don't we risk re-introducing the file-commit frequency issues
>> (and S3/GCS throttling) that Single-File Commits were specifically designed
>> to solve?
>>
>> Not necessarily: the flushing to leaf manifests doesn't need to happen
>> on the ingest path; it can be a background maintenance operation (the
>> choice of if/how/when to rebalance is part of the whole "adaptive"
>> aspect). But I do think this is where we should provide numbers to
>> better demonstrate it. It's a matter of an amortized analysis (root
>> size, expected commit latency, rebalancing cost at a given frequency and
>> clustering, etc.) for different types of workloads at different scale
>> factors, i.e. how large a buffer for single file commits in the root is
>> desirable at any given point in time; batch workloads generally won't
>> care much about this.
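>>
>> As a sketch of the kind of amortized analysis I mean (the numbers below
>> are made-up placeholders, just to show the shape of the calculation):
>>
>> def amortized_cost_ms(root_entries: int,
>>                       per_entry_decode_us: float,
>>                       commits_between_flushes: int,
>>                       flush_cost_ms: float) -> float:
>>     # Planning cost of scanning the root buffer, plus the cost of a
>>     # background flush/rebalance spread across the commits it absorbs.
>>     read_cost_ms = root_entries * per_entry_decode_us / 1000.0
>>     return read_cost_ms + flush_cost_ms / commits_between_flushes
>>
>> # Example: a streaming table keeping ~5,000 inlined entries in the root
>> # and flushing to leaf manifests every 500 commits in the background.
>> print(amortized_cost_ms(root_entries=5000,
>>                         per_entry_decode_us=2.0,
>>                         commits_between_flushes=500,
>>                         flush_cost_ms=2000.0))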
>>
>> I think it's important to emphasize here that while one of the goals is
>> to enable the table format to do single file commits for small writes,
>> this doesn't mean that *every* write *always* has to be a single file
>> commit; that choice imposes a different set of tradeoffs. The proposed
>> adaptive structure does allow for it if desired via background
>> maintenance, or writers can choose to incur that cost at a point of
>> their choosing.
>> Additionally, one of the other tests run a while back (which I don't
>> think is in the doc, but I'll add those details to the appendix) was a
>> simple S3 PUT latency test on different root sizes; I believe root sizes
>> ranging from 3 KB to 4 MB were tested, and while latency of course
>> increases, it did not increase *linearly* with respect to size, and the
>> latency differences are much smaller than one would expect.
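>>
>> Something along these lines reproduces that kind of measurement (a
>> sketch only, not the exact harness that was used; the bucket and key
>> names are placeholders):
>>
>> import time
>> import boto3
>>
>> s3 = boto3.client("s3")
>> BUCKET = "my-test-bucket"  # placeholder
>>
>> for size in (3_000, 100_000, 1_000_000, 4_000_000):  # ~3 KB to ~4 MB
>>     payload = b"x" * size
>>     samples = []
>>     for i in range(20):
>>         start = time.perf_counter()
>>         s3.put_object(Bucket=BUCKET,
>>                       Key=f"latency-test/root-{size}-{i}",
>>                       Body=payload)
>>         samples.append((time.perf_counter() - start) * 1000)
>>     samples.sort()
>>     median = samples[len(samples) // 2]
>>     print(f"{size:>9} bytes: median {median:.1f} ms, max {samples[-1]:.1f} ms")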
>>
>> >My proposal for the Compact Partition Summary (CPS) is essentially a
>> "fast-lane" index in the Root Manifest header. It provides group-level
>> exclusion for these "dirty" streaming buffers so we don't have to touch the
>> Parquet data pages at all unless we know the data is there.
>>
>> I'll take a look at this when I get a chance this week but I'm pretty
>> skeptical that this additional complexity at the root level really buys us
>> anything here. From a quick scan I did of the contents, parts of the
>> proposal look to be AI generated, which is fine, but the
>> assumptions/conclusions drawn by it aren't quite right imo.
>>
>> I do think this discussion has brought up a good point around numbers.
>> We didn't get to it in the last sync, but one of the topics we wanted to
>> discuss was inline bitmap representations (which have metadata footprint
>> implications) and how they relate to metadata maintenance costs for
>> DML-heavy operations. I was planning on setting up another sync when
>> more people are back from holidays, and since this topic also relates to
>> manifest entry sizes and scaling dynamics, perhaps we could discuss it
>> there as well? That also gives others more time to understand and
>> comment on the proposal.
>>
>> Thanks,
>>
>> Amogh Jahagirdar
>>
>> On Thu, Jan 1, 2026 at 11:07 AM vaquar khan <[email protected]>
>> wrote:
>>
>>> Hi Amogh,
>>>
>>> Thanks for the detailed perspective. I’ve updated the document
>>> permissions—I’d love to get your specific thoughts on the schema sections.
>>>
>>> I want to dig deeper into the "Read Regression" point because I think we
>>> might be looking at different use cases. I completely agree that for
>>> batch-processed, well-sorted data, Parquet's columnar stats are a win.
>>> However, the "hard blocker" I’m seeing is in the streaming ingest tail.
>>>
>>>  Imagine a buffer in the Root Manifest with 100 inlined files. Because
>>> streaming data arrives by time and not by partition, every file contains a
>>> random scatter of user_id buckets.
>>>
>>>    - *File 1:* contains buckets {0, 5, 14, 15} → Parquet Min/Max: [0, 15]
>>>    - *File 2:* contains buckets {1, 2, 8, 15} → Parquet Min/Max: [0, 15]
>>>    - ... and so on for all 100 files.
>>>
>>> If a reader queries for bucket(user_id) = 5, the Parquet footer stats
>>> for *every single file* will report a range of [0, 15]. Min/max pruning
>>> eliminates nothing. To determine if bucket 5 actually exists in those
>>> files, the REST server or engine must now project the column chunk and
>>> decode the dictionary/data pages for all 100 entries.
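>>>
>>> A toy sketch of what I mean (illustrative data only):
>>>
>>> import random
>>>
>>> random.seed(0)
>>> # 100 "files", each holding a random scatter of buckets that happens to
>>> # include 0 and 15, so every file-level min/max is [0, 15].
>>> files = [sorted({0, 15} | set(random.sample(range(1, 15), k=2)))
>>>          for _ in range(100)]
>>>
>>> target = 5
>>> survive_minmax = [f for f in files if min(f) <= target <= max(f)]
>>> actually_contain = [f for f in files if target in f]
>>>
>>> print(f"files surviving min/max pruning: {len(survive_minmax)} of {len(files)}")
>>> print(f"files actually containing bucket {target}: {len(actually_contain)}")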
>>>
>>> This leads to a few questions I’m struggling with regarding the current
>>> V4 vision:
>>>
>>>    1. In V3, we could skip a whole manifest group using a single
>>>    summary tuple (O(1) skipping). In this V4 scenario, we move to O(N)
>>>    internal scanning. Does the spec intend to accept this linear scan
>>>    cost as the trade-off for write throughput, or is there a
>>>    "pre-index" mechanism I’m missing that avoids decoding data pages
>>>    for every sub-second query?
>>>    2. For a high-concurrency REST Catalog, the CPU and memory overhead
>>>    of performing "partial" Parquet decodes on hundreds of inlined
>>>    entries per request seems non-trivial. How do we ensure the catalog
>>>    remains performant if it has to become a "mini-query engine" just to
>>>    perform basic partition pruning?
>>>    3. If the solution to the scan cost is to flush to leaf manifests
>>>    more frequently, don't we risk re-introducing the file-commit
>>>    frequency issues (and S3/GCS throttling) that Single-File Commits
>>>    were specifically designed to solve?
>>>
>>> My proposal for the Compact Partition Summary (CPS) is essentially a
>>> "fast-lane" index in the Root Manifest header. It provides group-level
>>> exclusion for these "dirty" streaming buffers so we don't have to touch the
>>> Parquet data pages at all unless we know the data is there.
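>>>
>>> As a rough sketch of the idea (the names and layout here are
>>> illustrative, not the exact format in the doc), the summary can be as
>>> small as a bitmap of the bucket values present in each buffer:
>>>
>>> def summarize_buffer(buffer_files):
>>>     # Bitmap with bit i set iff bucket i appears in any file of the buffer.
>>>     bitmap = 0
>>>     for file_buckets in buffer_files:
>>>         for b in file_buckets:
>>>             bitmap |= 1 << b
>>>     return bitmap
>>>
>>> def buffer_may_contain(bitmap, bucket):
>>>     return bool(bitmap & (1 << bucket))
>>>
>>> buffer_a = [[0, 1, 2], [3, 4]]          # no bucket 5 anywhere
>>> buffer_b = [[0, 5, 14, 15], [1, 2, 8]]  # bucket 5 present
>>>
>>> for name, buf in (("buffer_a", buffer_a), ("buffer_b", buffer_b)):
>>>     print(name, "may contain bucket 5:",
>>>           buffer_may_contain(summarize_buffer(buf), 5))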
>>>
>>> Does this scenario resonate with the performance goals you have for your
>>> proposal, or do you see a different way to handle the "random scatter"
>>> metadata problem?
>>> Regards,
>>> Viquar Khan
>>> Sr. Data Architect
>>> https://www.linkedin.com/in/vaquar-khan-b695577/
>>>
>>
>
> --
> Regards,
> Vaquar Khan
>
>
