Hi Amogh,
Thanks for the detailed perspective. I’ve updated the document
permissions—I’d love to get your specific thoughts on the schema sections.
I want to dig deeper into the "Read Regression" point because I think we
might be looking at different use cases. I completely agree that for
batch-processed, well-sorted data, Parquet's columnar stats are a win.
However, the "hard blocker" I’m seeing is in the streaming ingest tail.
Imagine a buffer in the Root Manifest with 100 inlined files. Because
streaming data arrives by time and not by partition, every file contains a
random scatter of user_id buckets.
- *File 1:* contains buckets {0, 5, 14, 15} → Parquet Min/Max: [0, 15]
- *File 2:* contains buckets {1, 2, 8, 15} → Parquet Min/Max: [1, 15]
- ... and so on for all 100 files.
If a reader queries for bucket(user_id) = 5, the Parquet footer stats for *every
single file* will report a wide range that straddles bucket 5, so min/max
pruning eliminates nothing. To determine whether bucket 5 actually exists in
those files, the REST server or engine must project the column chunk and decode
the dictionary/data pages for all 100 entries.
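To make the failure concrete, here is a quick Python sketch that simulates the
scatter and compares min/max pruning against exact membership. The file count,
bucket count, and buckets-per-file are made up for illustration, not taken from
any real manifest:

    import random

    NUM_FILES, NUM_BUCKETS, BUCKETS_PER_FILE = 100, 16, 4
    TARGET = 5

    # Streaming ingest: each inlined file holds a random scatter of buckets.
    files = [set(random.sample(range(NUM_BUCKETS), BUCKETS_PER_FILE))
             for _ in range(NUM_FILES)]

    # Footer-style min/max pruning: a file survives if [min, max]
    # straddles the target bucket.
    survivors = [f for f in files if min(f) <= TARGET <= max(f)]

    # Ground truth, only knowable after decoding dictionary/data pages.
    matches = [f for f in files if TARGET in f]

    print(f"min/max pruning kept {len(survivors)}/{NUM_FILES} files")
    print(f"only {len(matches)} actually contain bucket {TARGET}")

On a typical run this keeps roughly 85-90 of the 100 files while only about 25
actually contain the bucket; in the fully scattered case above, where every
file spans the whole [0, 15] range, pruning keeps all 100.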
This leads to a few questions I’m struggling with regarding the current V4
vision:
1. In V3, we could skip a whole manifest group using a single summary tuple
(O(1) skipping). In this V4 scenario, we move to O(N) internal scanning. Does
the spec intend to accept this linear scan cost as the trade-off for write
throughput, or is there a "pre-index" mechanism I'm missing that avoids
decoding data pages for every sub-second query?
2. For a high-concurrency REST Catalog, the CPU and memory overhead of
performing "partial" Parquet decodes on hundreds of inlined entries per request
seems non-trivial. How do we ensure the catalog remains performant if it has to
become a "mini-query engine" just to perform basic partition pruning?
3. If the solution to the scan cost is to flush to leaf manifests more
frequently, don't we risk re-introducing the file-commit frequency issues (and
S3/GCS throttling) that Single-File Commits were specifically designed to
solve?
My proposal for the Compact Partition Summary (CPS) is essentially a
"fast-lane" index in the Root Manifest header. It provides group-level
exclusion for these "dirty" streaming buffers so we don't have to touch the
Parquet data pages at all unless we know the data is there.
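To make the "fast lane" concrete, here is a rough Python sketch of the shape I
have in mind. PartitionGroupSummary and may_contain are illustrative names, not
spec, and the bitmap encoding is just one possible layout:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class PartitionGroupSummary:
        # One small bitmap per manifest group in the Root Manifest header:
        # bit i is set iff any inlined file in the group holds bucket i.
        bucket_bitmap: int  # 16 buckets -> 16 bits

        def may_contain(self, bucket: int) -> bool:
            return (self.bucket_bitmap >> bucket) & 1 == 1

    def prune_groups(summaries, bucket):
        # O(1) bit test per group: no column-chunk projection, no page decodes.
        return [i for i, s in enumerate(summaries) if s.may_contain(bucket)]

    # Group 0 holds buckets {0, 5, 14, 15}; group 1 holds {1, 2, 8, 15}.
    groups = [PartitionGroupSummary(sum(1 << b for b in {0, 5, 14, 15})),
              PartitionGroupSummary(sum(1 << b for b in {1, 2, 8, 15}))]
    print(prune_groups(groups, bucket=5))  # -> [0]; group 1 excluded outright

The exclusion decision stays a bit test per group no matter how many files are
inlined, which is exactly the property the min/max footer stats lose under
random scatter.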
Does this scenario resonate with the performance goals you have for your
proposal, or do you see a different way to handle the "random scatter"
metadata problem?
Regards,
Viquar Khan
Sr. Data Architect
https://www.linkedin.com/in/vaquar-khan-b695577/