Re: [PR] Spark: Support writing shredded variant in Iceberg-Spark [iceberg]

via GitHub Thu, 30 Apr 2026 09:35:37 -0700


qlong commented on PR #14297:
URL: https://github.com/apache/iceberg/pull/14297#issuecomment-4354310893


   I looked at VariantShreddingAnalyzer and SparkVariantShreddingAnalyzer. The 
implementation looks good; I have just a minor nit. 
   
   The current strategy is to shred aggressively. This includes fields with 
multiple incompatible types, for which the analyzer picks the most common type. 
When a field has mixed types, the shredded typed_value is only populated for 
rows whose value matches the chosen type; other rows still carry the full 
binary value. This means bounded column reads are not available for mixed-type 
fields, and the performance gain relative to the added column overhead is not 
clear. 
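
To illustrate the mixed-type case, here is a minimal sketch in Python. It is not the code in this PR: the function names, the use of Python type names as stand-ins for variant types, and the sample column are all assumptions made up for illustration.

```python
from collections import Counter

def pick_shredded_type(values):
    """Return the name of the most common type among non-null values (hypothetical helper)."""
    counts = Counter(type(v).__name__ for v in values if v is not None)
    return counts.most_common(1)[0][0] if counts else None

def shred(values, chosen):
    """Split each row into a (typed_value, value) pair.

    Rows matching the chosen type populate typed_value; all other
    non-null rows keep the full value in the residual column.
    """
    rows = []
    for v in values:
        if v is not None and type(v).__name__ == chosen:
            rows.append((v, None))
        else:
            rows.append((None, v))
    return rows

# Hypothetical field with mixed types: four ints and two strings.
column = [1, 2, "three", 4, "five", 6]
chosen = pick_shredded_type(column)
rows = shred(column, chosen)

# Rows that still need the residual column to be read.
fallback = sum(1 for _, residual in rows if residual is not None)
```

In this sketch two of the six rows fall back to the residual value, so a reader cannot answer a query on this field from typed_value alone and has to consult both columns.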
   
   I am not suggesting the current design is flawed. Shredding parameters like 
MIN_FIELD_FREQUENCY and MAX_SHREDDED_FIELDS can be tuned or new strategies 
introduced in follow-ups without breaking existing files. But more performance 
testing on real query patterns would help inform whether these thresholds need 
to be user-tunable. I would not block merge on this, assuming the community 
agrees.
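
For discussion, a rough sketch of how two such thresholds could interact. This is an assumption about their meaning, not the PR's implementation: I treat the first as a minimum fraction of rows in which a field must appear and the second as a cap on the number of shredded fields. The function name, parameter names, and numbers are hypothetical.

```python
def select_fields(field_counts, total_rows, min_frequency, max_fields):
    """Pick fields to shred: frequent enough, then capped at max_fields."""
    if total_rows <= 0:
        return []
    eligible = [
        (name, count)
        for name, count in field_counts.items()
        if count / total_rows >= min_frequency
    ]
    # Most frequent first; break ties by name so the result is deterministic.
    eligible.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _ in eligible[:max_fields]]

# Hypothetical per-field occurrence counts over 1000 sampled rows.
counts = {"user_id": 1000, "event": 950, "referrer": 400, "debug": 12}
selected = select_fields(counts, total_rows=1000, min_frequency=0.1, max_fields=2)
```

With these made-up inputs, "debug" is dropped by the frequency floor and "referrer" by the cap. Whether those cut-offs suit a given workload is the kind of question the performance testing would answer.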


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

