milenkovicm opened a new issue, #16980:
URL: https://github.com/apache/datafusion/issues/16980

   ### Describe the bug
   
   The current implementation of `ComposedPhysicalExtensionCodec` is unsound. 
Because it relies on `try_any`, it may decode a value as the wrong type by 
accident when the types are simple enough. 
   
   This is not just a theoretical issue: it happened in the [ballista codec], 
where an encoded parquet file was decoded as csv instead of parquet. The type 
was encoded by the last encoder in the list but decoded by the first encoder, 
purely by chance. (I guess I don't have to mention how hard this was to debug.)
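
A minimal sketch of how such a collision can arise. Everything here is a 
hypothetical stand-in (`Format`, `Codec`, `CsvCodec`, `ParquetCodec`, and the 
`try_any_*` helpers are not DataFusion or Ballista APIs), and it assumes both 
types serialise to an empty payload, which is one way for types to be "simple 
enough":

```rust
// Hypothetical stand-ins for illustration only; not DataFusion types.
#[derive(Debug, PartialEq)]
enum Format {
    Csv,
    Parquet,
}

trait Codec {
    fn encode(&self, f: &Format) -> Option<Vec<u8>>;
    fn decode(&self, buf: &[u8]) -> Option<Format>;
}

struct CsvCodec;
struct ParquetCodec;

impl Codec for CsvCodec {
    fn encode(&self, f: &Format) -> Option<Vec<u8>> {
        // Assumption: the type has no fields, so its payload is empty.
        matches!(f, Format::Csv).then(Vec::new)
    }
    fn decode(&self, buf: &[u8]) -> Option<Format> {
        // An empty payload is a valid encoding of this type.
        buf.is_empty().then_some(Format::Csv)
    }
}

impl Codec for ParquetCodec {
    fn encode(&self, f: &Format) -> Option<Vec<u8>> {
        matches!(f, Format::Parquet).then(Vec::new)
    }
    fn decode(&self, buf: &[u8]) -> Option<Format> {
        buf.is_empty().then_some(Format::Parquet)
    }
}

// "Try each codec in order and keep the first success", for both directions.
fn try_any_encode(codecs: &[Box<dyn Codec>], f: &Format) -> Option<Vec<u8>> {
    codecs.iter().find_map(|c| c.encode(f))
}

fn try_any_decode(codecs: &[Box<dyn Codec>], buf: &[u8]) -> Option<Format> {
    codecs.iter().find_map(|c| c.decode(buf))
}

// Encodes Parquet (handled by the last codec), then decodes the bytes.
fn round_trip_parquet() -> Option<Format> {
    let codecs: Vec<Box<dyn Codec>> = vec![Box::new(CsvCodec), Box::new(ParquetCodec)];
    let bytes = try_any_encode(&codecs, &Format::Parquet)?;
    try_any_decode(&codecs, &bytes)
}
```

In this sketch `round_trip_parquet()` returns `Some(Format::Csv)`: the bytes 
carry no record of which codec produced them, so the first codec that accepts 
them wins.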
   
   To make the current implementation sound, we would need to record which 
encoder in the list was used and do a reverse lookup when decoding. 
That is, we need to encode the tuple (position, serialised_blob). 
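
The (position, serialised_blob) idea could be sketched roughly as follows. 
This is only an illustration under stated assumptions: `Format`, `Codec`, 
`CsvCodec`, `ParquetCodec`, `encode_tagged` and `decode_tagged` are 
hypothetical names, not the DataFusion API, and the position is stored as a 
single leading byte, which limits this sketch to 256 codecs.

```rust
// Hypothetical stand-ins for illustration only; not DataFusion types.
#[derive(Debug, PartialEq)]
enum Format {
    Csv,
    Parquet,
}

trait Codec {
    fn encode(&self, f: &Format) -> Option<Vec<u8>>;
    fn decode(&self, buf: &[u8]) -> Option<Format>;
}

struct CsvCodec;
struct ParquetCodec;

impl Codec for CsvCodec {
    fn encode(&self, f: &Format) -> Option<Vec<u8>> {
        // Assumption: field-less type, so the payload is empty.
        matches!(f, Format::Csv).then(Vec::new)
    }
    fn decode(&self, buf: &[u8]) -> Option<Format> {
        buf.is_empty().then_some(Format::Csv)
    }
}

impl Codec for ParquetCodec {
    fn encode(&self, f: &Format) -> Option<Vec<u8>> {
        matches!(f, Format::Parquet).then(Vec::new)
    }
    fn decode(&self, buf: &[u8]) -> Option<Format> {
        buf.is_empty().then_some(Format::Parquet)
    }
}

// Writes (position, blob): one byte for the codec index, then its payload.
fn encode_tagged(codecs: &[Box<dyn Codec>], f: &Format) -> Option<Vec<u8>> {
    codecs.iter().enumerate().find_map(|(pos, c)| {
        if pos > usize::from(u8::MAX) {
            return None; // position does not fit in the one-byte tag
        }
        let blob = c.encode(f)?;
        let mut out = vec![pos as u8];
        out.extend_from_slice(&blob);
        Some(out)
    })
}

// Reverse lookup: only the codec recorded in the tag may decode the blob.
fn decode_tagged(codecs: &[Box<dyn Codec>], buf: &[u8]) -> Option<Format> {
    let (tag, blob) = buf.split_first()?;
    codecs.get(usize::from(*tag))?.decode(blob)
}

// Encodes Parquet (handled by the last codec), then decodes the bytes.
fn round_trip_parquet() -> Option<Format> {
    let codecs: Vec<Box<dyn Codec>> = vec![Box::new(CsvCodec), Box::new(ParquetCodec)];
    let bytes = encode_tagged(&codecs, &Format::Parquet)?;
    decode_tagged(&codecs, &bytes)
}
```

Here `round_trip_parquet()` returns `Some(Format::Parquet)` even though both 
payloads are empty, because decoding never consults a codec other than the one 
that encoded the value. The trade-off is that the codec list must have the 
same order on the encoding and decoding sides.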
   
   [ballista codec]: 
https://github.com/milenkovicm/arrow-ballista/blob/d1295f7d1ab5c40a433ab17a344494f39b18f0af/ballista/core/src/serde/mod.rs#L126-L127
   
   ### To Reproduce
   
   I don't have a reproducer at the moment, but I believe one could be written 
quite simply. 
   
   ### Expected behavior
   
   Types should not be decodable as a different type by accident. 
   
   ### Additional context
   
   _No response_

