tustvold commented on issue #2444: URL: https://github.com/apache/arrow-rs/issues/2444#issuecomment-1214767431
Ok, so it would appear that this is a known issue ([ARROW-15613](https://issues.apache.org/jira/browse/ARROW-15613)) where pyarrow writes ill-formed flatbuffers for extension types. There isn't really much we can do here: a flatbuffer `string` field should not contain non-UTF-8 data, and in Rust permitting this would not be sound (it could lead to UB). Having spoken with @jorgecarleitao, I'm led to believe arrow2 also takes the approach of rejecting this.

The proper solution to the problem is for pyarrow to base64 encode the payloads, or for the arrow specification to change `KeyValue.value` from `string` to `bytes`. Both are probably going to be difficult to sell...

That being said, the embedded metadata is a pickled Python class, which likely isn't hugely useful to a Rust client anyway. I would therefore recommend using [with_skip_arrow_metadata](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_skip_arrow_metadata) to tell the parquet reader to ignore the malformed embedded arrow schema and instead infer the schema from the underlying parquet schema.
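To illustrate why a reader has to reject such a value, here is a minimal, std-only sketch. The helper names `validate_metadata_value` and `base64_encode` are hypothetical and not part of arrow-rs or arrow2, and the byte array is a made-up payload that only mimics the opening bytes of a pickle stream. It shows that those bytes fail UTF-8 validation, whereas a base64-encoded copy of the same bytes passes.

```rust
use std::str::Utf8Error;

// Standard base64 alphabet (26 + 26 + 10 + 2 = 64 symbols).
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Hypothetical helper: checked conversion of a metadata value to `&str`.
/// Skipping this check (e.g. with `from_utf8_unchecked`) on invalid input
/// would be undefined behaviour, which is why rejecting is the sound option.
fn validate_metadata_value(raw: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(raw)
}

/// Hypothetical helper: plain base64 encoding with `=` padding.
/// The output is ASCII only, so it is always valid UTF-8.
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 {
            ALPHABET[(n >> 6) as usize & 63] as char
        } else {
            '='
        });
        out.push(if chunk.len() > 2 {
            ALPHABET[n as usize & 63] as char
        } else {
            '='
        });
    }
    out
}

fn main() {
    // Made-up payload: 0x80 is a UTF-8 continuation byte, so it cannot
    // start a valid sequence.
    let pickled: [u8; 5] = [0x80, 0x04, 0x95, 0x0a, 0x00];

    match validate_metadata_value(&pickled) {
        Ok(s) => println!("valid UTF-8: {s}"),
        Err(e) => println!("rejected: {e}"),
    }

    // The same bytes, base64 encoded, are a well-formed string value.
    let encoded = base64_encode(&pickled);
    assert!(validate_metadata_value(encoded.as_bytes()).is_ok());
    println!("base64 form: {encoded}");
}
```

The encoding step would of course have to happen on the writer side (pyarrow); the sketch only demonstrates that doing so yields a value a strict reader can accept.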

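For the suggested workaround, a rough sketch of how skipping the embedded arrow schema might look. It assumes a recent release of the `parquet` crate with its `arrow` feature enabled, where `ParquetRecordBatchReaderBuilder` is the entry point (older releases exposed a different reader type), and `data.parquet` is a placeholder path rather than a file from this issue.

```rust
use std::error::Error;
use std::fs::File;

use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> Result<(), Box<dyn Error>> {
    // Placeholder path: substitute the file written by pyarrow.
    let file = File::open("data.parquet")?;

    // Ignore the arrow schema embedded in the parquet key-value metadata
    // and derive the schema from the parquet schema instead.
    let options = ArrowReaderOptions::new().with_skip_arrow_metadata(true);
    let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;

    println!("inferred schema: {:?}", builder.schema());

    // Read the file batch by batch.
    let reader = builder.build()?;
    for batch in reader {
        let batch = batch?;
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}
```

Note that with the embedded schema skipped, any extension type information stored only in that metadata is not available, so columns come back as the types inferred from the parquet schema.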