[I] Core, Spark: Performant queries over Variant data [iceberg]

via GitHub Thu, 30 Apr 2026 05:45:07 -0700


steveloughran opened a new issue, #16172:
URL: https://github.com/apache/iceberg/issues/16172


   ### Feature Request / Improvement
   
   This issue groups together everything needed for queries over Variant data 
to perform well:
   1. Automatic generation of shredded fields.
   2. Unmarshalling performance.
   3. Row group and file skipping based on shredded field stats.
   4. Benchmarks to evaluate the above.
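
As a rough illustration of what shredding a field means (item 1), here is a minimal Python sketch. It is not Iceberg, Parquet or Spark code: `shred_field` and its arguments are hypothetical names, and plain dicts stand in for Variant values. The assumption is that a value matching the expected type moves into a typed column, while anything else stays in the untyped residual.

```python
from typing import Any


def shred_field(rows: list[dict[str, Any]], field: str, expected_type: type):
    """Split one field out of each variant-like row (illustrative only).

    Returns (typed_values, residuals). typed_values[i] holds the field when it
    has the expected type, otherwise None. residuals[i] holds whatever was not
    shredded, including a mistyped value of the field.
    """
    typed_values = []
    residuals = []
    for row in rows:
        value = row.get(field)
        # bool is a subclass of int in Python, so reject it unless bool was requested
        is_stray_bool = isinstance(value, bool) and expected_type is not bool
        if isinstance(value, expected_type) and not is_stray_bool:
            typed_values.append(value)
            residuals.append({k: v for k, v in row.items() if k != field})
        else:
            typed_values.append(None)
            residuals.append(dict(row))
    return typed_values, residuals


rows = [{"id": 1, "name": "a"}, {"id": "x"}, {"name": "b"}]
typed, rest = shred_field(rows, "id", int)
```

Only the typed column can carry useful min/max statistics, which is why the choice of fields to shred matters for the skipping work in item 3.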
   
   Iceberg query performance relies on Spark passing variant_get() calls down 
to the row group filter, so the changes are interrelated. This work will have 
to target Spark 4.2 only.
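
To make the dependency concrete, the sketch below shows in Python how a pushed-down equality predicate on a shredded field could prune row groups using min/max statistics. All names (`ColumnStats`, `may_contain_eq`, `select_row_groups`) are hypothetical and are not the Spark, Iceberg or Parquet APIs. It also assumes the statistics are only trustworthy when every row stored the field in the typed column.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Min/max statistics for one shredded column in one row group (hypothetical)."""
    min_value: Optional[int]
    max_value: Optional[int]
    all_typed: bool  # True if every row stored the field in the typed column


def may_contain_eq(stats: ColumnStats, literal: int) -> bool:
    """Return False only when the row group provably has no row equal to literal."""
    if not stats.all_typed:
        # Some rows kept the field in the untyped residual, so min/max is incomplete.
        return True
    if stats.min_value is None or stats.max_value is None:
        # Missing statistics: be conservative and read the row group.
        return True
    return stats.min_value <= literal <= stats.max_value


def select_row_groups(groups: list[ColumnStats], literal: int) -> list[int]:
    """Indexes of the row groups that still have to be read."""
    return [i for i, stats in enumerate(groups) if may_contain_eq(stats, literal)]


groups = [
    ColumnStats(1, 10, True),
    ColumnStats(20, 30, True),
    ColumnStats(20, 30, False),
]
to_read = select_row_groups(groups, 5)
```

Without the predicate reaching the filter in a recognisable form, no such pruning is possible, whatever statistics the file holds.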
   
   
   ## Iceberg
   
   #14297 
   #15628 
   #15510
   #15385
   
   ## Spark
   
   * [54598](https://github.com/apache/spark/pull/54598) Enable Parquet 
rowgroup skipping for variant filters
   * [54394](https://github.com/apache/spark/pull/54394) Support variant_get 
predicate for DSv2 filter pushdown
   
   ## Parquet: better unmarshalling
   
   * [3452](https://github.com/apache/parquet-java/pull/3452)
   * [3481](https://github.com/apache/parquet-java/pull/3481)
   
   ### Query engine
   
   Spark
   
   ### Willingness to contribute
   
   - [ ] I can contribute this improvement/feature independently
   - [x] I would be willing to contribute this improvement/feature with 
guidance from the Iceberg community
   - [ ] I cannot contribute this improvement/feature at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

