rok commented on code in PR #16738:
URL: https://github.com/apache/datafusion/pull/16738#discussion_r2310832534
##########
datafusion/datasource-parquet/src/file_format.rs:
##########
@@ -1654,7 +1636,8 @@ async fn output_single_parquet_file_parallelized(
object_store_writer: Box<dyn AsyncWrite + Send + Unpin>,
data: Receiver<RecordBatch>,
output_schema: Arc<Schema>,
- parquet_props: &WriterProperties,
+ writer_properties: &WriterProperties,
+ skip_arrow_metadata: bool,
Review Comment:
Previously we always set `allow_single_file_parallelism` to `false`. Now
that we also allow `true`, the `WriterProperties` go through [another
path](https://github.com/apache/datafusion/blob/25acb643585fe4460199a8731fc94c24e79466ef/datafusion/datasource-parquet/src/file_format.rs#L1127-L1134)
for creating the schema. We now fix this by calling:
```rust
let options = ArrowWriterOptions::new()
    .with_properties(writer_properties.clone())
    .with_skip_arrow_metadata(skip_arrow_metadata);
```
[here](https://github.com/apache/datafusion/pull/16738/files#diff-a8919cf6209fb777550056cdd7decca3e6ed94370a2821a9395763fdd6271967R1652).
I'm honestly not sure this is a good idea, but it was the simplest way I
could find to fix the schema mismatch that was occurring without this change.
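
To illustrate the kind of mismatch described above, here is a toy model in plain Rust. It is not DataFusion or parquet code: `ToySchema`, `build_schema`, `serial_path`, `parallel_path_before` and `parallel_path_after` are hypothetical names, and it assumes the mismatch comes from one path embedding the `ARROW:schema` metadata entry while the other skips it.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a file schema plus its key-value metadata.
#[derive(Debug, PartialEq)]
struct ToySchema {
    columns: Vec<String>,
    metadata: BTreeMap<String, String>,
}

/// Build a schema, embedding an "ARROW:schema" entry unless asked to skip it.
fn build_schema(columns: &[&str], skip_arrow_metadata: bool) -> ToySchema {
    let mut metadata = BTreeMap::new();
    if !skip_arrow_metadata {
        // The real writer stores an encoded Arrow schema; a joined column
        // list is enough for this sketch.
        metadata.insert("ARROW:schema".to_string(), columns.join(","));
    }
    ToySchema {
        columns: columns.iter().map(|c| c.to_string()).collect(),
        metadata,
    }
}

/// Serial path: honors the flag.
fn serial_path(columns: &[&str], skip_arrow_metadata: bool) -> ToySchema {
    build_schema(columns, skip_arrow_metadata)
}

/// Parallel path before the fix (assumed): the flag never reaches the
/// schema builder, so the metadata entry is always embedded.
fn parallel_path_before(columns: &[&str], _skip_arrow_metadata: bool) -> ToySchema {
    build_schema(columns, false)
}

/// Parallel path after the fix: the flag is threaded through explicitly.
fn parallel_path_after(columns: &[&str], skip_arrow_metadata: bool) -> ToySchema {
    build_schema(columns, skip_arrow_metadata)
}
```

In this model, with `skip_arrow_metadata == true`, `serial_path` and `parallel_path_before` disagree on the metadata, while `parallel_path_after` matches the serial result. Passing the flag as its own argument mirrors the new `skip_arrow_metadata: bool` parameter in the diff.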
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]