Re: [PR] feat(variant): [DNM] auto-infer per-file variant shredding schemas on write [hudi]

via GitHub Thu, 11 Jun 2026 01:45:19 -0700


voonhous commented on PR #18961:
URL: https://github.com/apache/hudi/pull/18961#issuecomment-4678781999


   Heads up on a latent bug inherited from the #18938 read path, found while 
testing inference here and fixed in this PR:
   
   `VariantReconstruction` never engaged on real files. The reader derives the 
file schema by converting the parquet footer MessageType, and that conversion 
loses the variant logical type, so the `getType() == VARIANT` check never 
matched. The shredded base file was then read with the unshredded `{metadata, 
value}` projection, silently dropping all `typed_value` fields on the AVRO read 
path (a reconstructed row came back as just the residual, e.g. 
`{"bad-key":false}`).
   
   Fixed in this PR with shape-based detection anchored on the requested 
column: the requested side (from the table schema, logical type intact) must be 
a variant, and the on-disk side is matched by the shredded shape `{metadata: 
bytes, value: [nullable] bytes, typed_value}`, so a plain footer-derived record 
still triggers reconstruction.
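
   A minimal sketch of the detection rule described above, for reviewers who want to see the two checks side by side. It is not the code in this PR: schemas are modelled as plain Python dicts, and every name here (`field`, `record`, `old_check`, `has_shredded_shape`, `needs_reconstruction`) is hypothetical. Requiring exactly the three fields and a non-nullable `metadata` is my reading of the shape above, not something confirmed by the patch.

```python
# Hypothetical, simplified schema model: plain dicts, NOT Hudi/Avro/Parquet classes.

def field(name, type_, nullable=False):
    return {"name": name, "type": type_, "nullable": nullable}

def record(fields, logical_type=None):
    return {"type": "record", "logical_type": logical_type, "fields": fields}

def old_check(file_schema):
    # The check that never matched: the footer-derived schema has already
    # lost the variant logical type by the time it is inspected.
    return file_schema.get("logical_type") == "variant"

def has_shredded_shape(file_schema):
    # Match the on-disk side by shape:
    # {metadata: bytes, value: [nullable] bytes, typed_value}
    if file_schema.get("type") != "record":
        return False
    by_name = {f["name"]: f for f in file_schema["fields"]}
    # Assumption: exactly these three fields, nothing else.
    if set(by_name) != {"metadata", "value", "typed_value"}:
        return False
    metadata, value = by_name["metadata"], by_name["value"]
    # Assumption: metadata is required bytes; value may or may not be nullable.
    return (metadata["type"] == "bytes" and not metadata["nullable"]
            and value["type"] == "bytes")

def needs_reconstruction(requested, on_disk):
    # Anchor on the requested column: it comes from the table schema,
    # so its logical type is still intact.
    if requested.get("logical_type") != "variant":
        return False
    return has_shredded_shape(on_disk)

# Requested side: table schema, logical type intact.
requested_variant = record(
    [field("metadata", "bytes"), field("value", "bytes")],
    logical_type="variant",
)
# On-disk side: footer-derived, logical type lost, shredded shape present.
footer_shredded = record([
    field("metadata", "bytes"),
    field("value", "bytes", nullable=True),
    field("typed_value", "record", nullable=True),
])
# On-disk side: footer-derived, unshredded {metadata, value}.
footer_unshredded = record([field("metadata", "bytes"), field("value", "bytes")])
```

   With these inputs, `old_check(footer_shredded)` is false (the failure mode described above), while `needs_reconstruction(requested_variant, footer_shredded)` is true. An unshredded file, or a requested column that is not a variant, does not trigger reconstruction.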
   
   Worth noting why #18938 did not catch it: the new COW inference test here is 
the first to read a shredded base file end to end through the AVRO reader. The 
existing MOR compaction test compacts a log-only file group (no shredded base 
to read), and post-compaction queries go through the Spark native reader.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

