yadavay-amzn opened a new pull request, #55928:
URL: https://github.com/apache/spark/pull/55928

   ### What changes were proposed in this pull request?
   
   `getFieldByKey()` uses binary search for objects with >=32 fields, assuming 
field IDs are sorted alphabetically by key name. The Variant format spec allows 
unsorted objects (indicated by bit 4 of the object header). External producers 
(Parquet, Iceberg) may produce unsorted variants, causing binary search to 
silently return null for keys that exist.
   
   Fix: check the object header sort bit before choosing binary search vs 
linear scan. Fall back to linear scan when fields are unsorted.
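
A minimal sketch of the dispatch described above, not the actual patch. The class `FieldLookupSketch`, the method `findKey`, and the `sorted` parameter are hypothetical stand-ins: real code would read the sort bit from the variant bytes and compare keys through the metadata dictionary rather than a `String[]`.

```java
public final class FieldLookupSketch {
    // Hypothetical constant mirroring the ">=32 fields" cutoff mentioned in this description.
    static final int BINARY_SEARCH_THRESHOLD = 32;

    /**
     * Returns the index of `key` in `keys`, or -1 if it is absent.
     * `sorted` stands in for the sort bit; binary search is only
     * attempted when the producer declared the fields sorted.
     */
    static int findKey(String[] keys, String key, boolean sorted) {
        if (sorted && keys.length >= BINARY_SEARCH_THRESHOLD) {
            int low = 0;
            int high = keys.length - 1;
            while (low <= high) {
                int mid = (low + high) >>> 1;
                int cmp = keys[mid].compareTo(key);
                if (cmp < 0) {
                    low = mid + 1;
                } else if (cmp > 0) {
                    high = mid - 1;
                } else {
                    return mid;
                }
            }
            return -1;
        }
        // Unsorted (or small) object: fall back to a linear scan.
        for (int i = 0; i < keys.length; i++) {
            if (keys[i].equals(key)) {
                return i;
            }
        }
        return -1;
    }
}
```

The linear path costs O(n) per lookup, but it is only taken when the sorted ordering cannot be relied on, so sorted objects keep the O(log n) behavior.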
   
   ### Why are the changes needed?
   
   Data correctness bug -- `getFieldByKey` silently returns null for fields 
that exist in unsorted variant objects. This affects any variant data produced 
by external systems that do not sort field IDs.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes -- queries on variant columns with unsorted objects will now correctly 
return field values instead of null.
   
   ### How was this patch tested?
   
   Added a test in `VariantExpressionSuite` that constructs a 32-field unsorted 
variant object (sort bit=0, field IDs in reverse order) and verifies that 
`getFieldByKey` finds the keys correctly. The test fails without the fix (binary 
search returns null) and passes with it.
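
The failure mode this test exercises can be reproduced on plain strings. The sketch below is an illustration only, not the `VariantExpressionSuite` test itself, which builds real variant bytes. The class `UnsortedLookupDemo`, its methods, and the `keyNN` naming are all hypothetical.

```java
public final class UnsortedLookupDemo {
    /** Builds n keys in reverse lexicographic order: key(n-1), ..., key00. */
    static String[] buildReverseKeys(int n) {
        String[] keys = new String[n];
        for (int i = 0; i < n; i++) {
            keys[i] = String.format("key%02d", n - 1 - i);
        }
        return keys;
    }

    /** Binary search that wrongly assumes ascending order; returns -1 on a miss. */
    static int binarySearch(String[] keys, String key) {
        int low = 0;
        int high = keys.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            int cmp = keys[mid].compareTo(key);
            if (cmp < 0) {
                low = mid + 1;
            } else if (cmp > 0) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    /** Order-independent lookup; returns -1 on a miss. */
    static int linearScan(String[] keys, String key) {
        for (int i = 0; i < keys.length; i++) {
            if (keys[i].equals(key)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] keys = buildReverseKeys(32);
        // "key00" is stored last, but binary search discards that half immediately.
        System.out.println(binarySearch(keys, "key00")); // prints -1
        System.out.println(linearScan(keys, "key00"));   // prints 31
    }
}
```

With 32 reverse-ordered keys, the first probe lands on `key16`, which compares greater than `key00`, so the search moves toward the front of the array and never reaches index 31.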
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

