Re: [PR] Support Arrow IPC Stream Files [datafusion]

via GitHub Thu, 06 Nov 2025 16:35:11 -0800


corasaurus-hex commented on code in PR #18457:
URL: https://github.com/apache/datafusion/pull/18457#discussion_r2501303689



##########
datafusion/datasource-arrow/src/file_format.rs:
##########
@@ -344,40 +382,68 @@ impl DataSink for ArrowFileSink {
     }
 }
 
+// Custom implementation of inferring schema. Should eventually be moved upstream to arrow-rs.
+// See <https://github.com/apache/arrow-rs/issues/5021>

Review Comment:
   I looked into using the [various readers available](https://docs.rs/arrow-ipc/57.0.0/arrow_ipc/reader/index.html).
   `FileDecoder` requires a schema up front just to construct it, which defeats
   the point entirely, and `FileReader` requires the passed-in object to support
   `Read + Seek` (we're dealing with a stream of bytes here that only supports
   `Read`). I think I could keep the magic bytes handling here, then wrap the
   bytes already read in a `Cursor` and chain it with the remainder of the
   stream, passing that into a `StreamReader` to parse the schema. So there's
   still a little bit of parsing, but much less.
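
A minimal std-only sketch of the `Cursor`-and-chain idea, to make the shape concrete. It is not code from the PR: `sniff_prefix` and `PREFIX_LEN` are hypothetical names, it uses a synchronous `Read` where the real code may be async, and it assumes the file format opens with the `ARROW1` magic padded to 8 bytes. The returned reader is what would then be handed to a `StreamReader`.

```rust
use std::io::{self, Chain, Cursor, Read};

/// Magic bytes that open an Arrow IPC *file*; the stream format has no such prefix.
const ARROW_MAGIC: &[u8] = b"ARROW1";
/// Assumption: the file format pads the magic out to an 8-byte boundary.
const PREFIX_LEN: usize = 8;

/// Hypothetical helper: peek at the start of `reader`, report whether it looks
/// like the file format, and return a `Read`-only reader positioned at the
/// start of the stream-format payload.
fn sniff_prefix<R: Read>(mut reader: R) -> io::Result<(bool, Chain<Cursor<Vec<u8>>, R>)> {
    let mut prefix = Vec::with_capacity(PREFIX_LEN);
    // `take` + `read_to_end` copes with short reads, unlike a single `read` call.
    reader
        .by_ref()
        .take(PREFIX_LEN as u64)
        .read_to_end(&mut prefix)?;

    let is_file = prefix.starts_with(ARROW_MAGIC);
    // File format: the magic and padding are consumed and dropped.
    // Stream format: the peeked bytes are replayed ahead of the rest of the input.
    let replay = if is_file { Vec::new() } else { prefix };
    Ok((is_file, Cursor::new(replay).chain(reader)))
}
```

Because the chained reader only needs `Read`, nothing here depends on `Seek`, which is the constraint that rules out `FileReader`.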



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

