tustvold commented on issue #2444:
URL: https://github.com/apache/arrow-rs/issues/2444#issuecomment-1214767431

   It appears this is a known issue 
([ARROW-15613](https://issues.apache.org/jira/browse/ARROW-15613)) where 
pyarrow writes ill-formed flatbuffers for extension types. There isn't really 
much we can do here: a flatbuffer string field should not contain non-UTF-8 
data, and in Rust permitting this would not be sound (it could lead to UB). 
Having spoken with @jorgecarleitao, I'm led to believe arrow2 also takes the 
approach of rejecting this.
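
   To illustrate why the reader has to reject these values, here is a minimal 
sketch using only the standard library. `read_string_field` is a hypothetical 
stand-in for a flatbuffer string accessor, not the real arrow-rs code, and the 
byte values are just the typical start of a pickle protocol 4 stream.

```rust
/// Hypothetical stand-in for a flatbuffer string accessor: the raw bytes are
/// only handed out as `&str` after they pass UTF-8 validation.
fn read_string_field(raw: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(raw)
}

fn main() {
    // A pickle protocol 4 stream starts with 0x80 0x04, and 0x80 can never
    // begin a UTF-8 sequence, so validation fails.
    let pickled: &[u8] = b"\x80\x04\x95\x10\x00";
    assert!(read_string_field(pickled).is_err());

    // Plain ASCII metadata passes validation unchanged.
    let text: &[u8] = b"ARROW:extension:name";
    assert_eq!(read_string_field(text), Ok("ARROW:extension:name"));
}
```

   Skipping the check (for example with `std::str::from_utf8_unchecked`) would 
produce a `&str` that violates the language's UTF-8 invariant, which is the 
unsoundness mentioned above.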
   
   The proper solution is either for pyarrow to base64 encode the payloads, 
or for the arrow specification to change `KeyValue.value` to be `bytes` 
rather than `string`. Both are probably going to be difficult to sell...
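
   As a rough sketch of why base64 would resolve this: the encoded form only 
uses ASCII characters, so it is always a valid flatbuffer string regardless of 
the payload. The encoder below is a hand-rolled illustration (standard padded 
alphabet from RFC 4648) so that the example has no dependencies; it is not 
what pyarrow does today.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Encode arbitrary bytes as standard, padded base64 (RFC 4648).
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into the low 24 bits of a u32,
        // treating missing bytes as zero.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        // Emit four 6-bit groups, padding with '=' for missing input bytes.
        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        if chunk.len() > 1 {
            out.push(ALPHABET[((n >> 6) & 63) as usize] as char);
        } else {
            out.push('=');
        }
        if chunk.len() > 2 {
            out.push(ALPHABET[(n & 63) as usize] as char);
        } else {
            out.push('=');
        }
    }
    out
}

fn main() {
    // The raw pickle header is not valid UTF-8...
    let pickled: &[u8] = b"\x80\x04\x95";
    assert!(std::str::from_utf8(pickled).is_err());

    // ...but its base64 form is plain ASCII and therefore a valid string.
    let encoded = base64_encode(pickled);
    assert!(encoded.is_ascii());
    println!("{encoded}");
}
```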
   
   That being said, the embedded metadata is a pickled python class, which 
likely isn't hugely useful to a rust client anyway, so I would recommend 
using 
[with_skip_arrow_metadata](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_skip_arrow_metadata)
 to tell the parquet reader to ignore the malformed embedded arrow schema 
and instead infer the arrow schema from the underlying parquet schema.
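
   For reference, a minimal sketch of what that could look like, assuming a 
recent release of the `parquet` crate with its `arrow` feature enabled. The 
file name `data.parquet` is a placeholder, and 
`ParquetRecordBatchReaderBuilder` is the builder type in current releases; 
older releases may expose the same option through a different reader type.

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path: point this at the file written by pyarrow.
    let file = File::open("data.parquet")?;

    // Ignore the embedded arrow schema and derive the arrow schema from the
    // parquet schema instead.
    let options = ArrowReaderOptions::new().with_skip_arrow_metadata(true);
    let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;

    // The schema here is inferred from the parquet types only, so any
    // extension type information from pyarrow is not present.
    println!("inferred schema: {:?}", builder.schema());

    let reader = builder.build()?;
    for batch in reader {
        let batch = batch?;
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}
```

   The trade-off is that columns come back with whatever arrow types the 
parquet schema maps to, rather than the types recorded in the embedded schema.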


