qlong commented on PR #14297: URL: https://github.com/apache/iceberg/pull/14297#issuecomment-4354310893
I looked at VariantShreddingAnalyzer and SparkVariantShreddingAnalyzer. The implementation looks good; I have just one minor nit.

The current strategy is to shred aggressively, including fields with multiple incompatible types, by picking the most common type. When a field has mixed types, the shredded typed_value is populated only for rows whose value matches the chosen type; all other rows still carry the full binary value. This means bounded column reads are not available for mixed-type fields, and the performance gain relative to the added column overhead is not clear.

I am not suggesting the current design is flawed. Shredding parameters like MIN_FIELD_FREQUENCY and MAX_SHREDDED_FIELDS can be tuned, or new strategies introduced, in follow-ups without breaking existing files. But more performance testing on real query patterns would help determine whether these thresholds need to be user-tunable. I would not block the merge on this, assuming the community agrees.
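To make the mixed-type concern concrete, here is a minimal Python sketch of the "pick the most common type" idea described in the comment. It is not the code from the PR: the function names, the threshold value, and the use of Python type names as stand-ins for variant types are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical value; only the parameter name comes from the comment above.
MIN_FIELD_FREQUENCY = 0.5


def choose_shredded_type(values, total_rows, min_frequency=MIN_FIELD_FREQUENCY):
    """Pick the most common observed type for a field.

    `values` holds the field's values in the rows where it is present.
    Returns None when the field is too rare to be worth shredding.
    """
    if not values or total_rows == 0:
        return None
    if len(values) / total_rows < min_frequency:
        return None
    counts = Counter(type(v).__name__ for v in values)
    return counts.most_common(1)[0][0]


def shred_field(values, chosen_type):
    """Split each value into a (typed_value, value) pair.

    Rows matching the chosen type populate typed_value; every other row
    keeps its value in the fallback slot, which stands in here for the
    full binary variant encoding.
    """
    rows = []
    for v in values:
        if type(v).__name__ == chosen_type:
            rows.append((v, None))
        else:
            rows.append((None, v))
    return rows


# A field that is mostly integers but sometimes a string.
values = [1, 2, "three", 4, "five"]
chosen = choose_shredded_type(values, total_rows=5)
rows = shred_field(values, chosen)
fallback_rows = sum(1 for typed, _ in rows if typed is None)
```

In this toy input, two of the five rows still need the fallback value, so a reader cannot answer a query on this field from typed_value alone. That is the situation in which the benefit of the extra column is unclear.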
