tustvold opened a new pull request, #1719:
URL: https://github.com/apache/arrow-rs/pull/1719
# Which issue does this PR close?
Closes #1717
Part of #1163
# Rationale for this change
See the tickets for details, but in short: the current write path relies on
many custom IO abstractions that can be hard to use correctly. In particular,
`TryClone` can easily lead to races when it is used to share a file descriptor
across threads.
# What changes are included in this PR?
This reworks the write path to use `std::io::Write` and nothing else.
Unfortunately, achieving this requires a few changes:
* A `TrackedWrite` that keeps track of how many bytes have been written,
allowing removal of `Seek`
* A callback-based approach to get metadata from a child writer to its parent
* Lifetimes, lots of lifetimes...
This last point becomes a bit obnoxious when it interacts with the
`RowGroupWriter` trait. In order to be object-safe, i.e. possible to construct
`Box<dyn RowGroupWriter>`, `RowGroupWriter::close` cannot take `self` by value.
This results in explicit scopes, or calls to `std::mem::drop`, in order to
truncate the writer's lifetime.
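
To make the annoyance concrete, here is a minimal sketch of the pattern. It is not the code in this PR: `RowGroupLike`, `FileLike`, `GroupImpl` and `row_group_demo` are hypothetical stand-ins, assuming a row group writer that mutably borrows its parent and is handed out as a boxed trait object.

```rust
/// Hypothetical stand-in for the row group writer trait (not the real API).
trait RowGroupLike {
    fn write_row(&mut self, row: u64);
    /// Takes `&mut self` so it can be called through `Box<dyn RowGroupLike>`.
    /// The box therefore outlives the call and keeps its borrow alive.
    fn close(&mut self);
}

/// Hypothetical parent writer.
struct FileLike {
    rows: Vec<u64>,
    closed_groups: usize,
}

/// Hypothetical child writer holding a mutable borrow of its parent.
struct GroupImpl<'a> {
    file: &'a mut FileLike,
    pending: Vec<u64>,
}

impl<'a> RowGroupLike for GroupImpl<'a> {
    fn write_row(&mut self, row: u64) {
        self.pending.push(row);
    }

    fn close(&mut self) {
        // Hand the buffered rows over to the parent.
        self.file.rows.append(&mut self.pending);
        self.file.closed_groups += 1;
    }
}

impl FileLike {
    fn next_row_group(&mut self) -> Box<dyn RowGroupLike + '_> {
        Box::new(GroupImpl { file: self, pending: Vec::new() })
    }
}

fn row_group_demo() -> (Vec<u64>, usize) {
    let mut file = FileLike { rows: Vec::new(), closed_groups: 0 };

    // Option 1: an explicit scope ends the borrow of `file`.
    {
        let mut group = file.next_row_group();
        group.write_row(1);
        group.write_row(2);
        group.close();
    }

    // Option 2: drop the box by hand before touching `file` again.
    let mut group = file.next_row_group();
    group.write_row(3);
    group.close();
    drop(group);

    (file.rows, file.closed_groups)
}
```

Without the inner scope or the `drop` call, the boxed writer would still hold its mutable borrow of `file` until the end of the function, so the second `next_row_group` call and the final read of `file` would be rejected by the borrow checker.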
I'm not entirely sure what the purpose of these traits is, but perhaps we
could simplify things, and fix this slight lifetime annoyance, by removing them?
Maybe something for a follow-up PR?
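
For the first two bullets above, a rough sketch of the idea may help. This is an illustration under assumptions, not the implementation in this PR: the `TrackedWrite` shown here is a simplified guess at its shape, and `ChildWriter`, `on_close` and `tracked_write_demo` are hypothetical names.

```rust
use std::io::{self, Write};

/// Wraps any `Write` and counts the bytes that pass through it, so the
/// current offset is known without needing `Seek`.
struct TrackedWrite<W: Write> {
    inner: W,
    bytes_written: usize,
}

impl<W: Write> TrackedWrite<W> {
    fn new(inner: W) -> Self {
        Self { inner, bytes_written: 0 }
    }

    fn bytes_written(&self) -> usize {
        self.bytes_written
    }
}

impl<W: Write> Write for TrackedWrite<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Count only what the inner writer actually accepted.
        let n = self.inner.write(buf)?;
        self.bytes_written += n;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Hypothetical child writer: borrows the parent's sink and reports
/// (start offset, length) to the parent through a callback on close.
struct ChildWriter<'a, W: Write> {
    sink: &'a mut TrackedWrite<W>,
    start: usize,
    on_close: Box<dyn FnOnce(usize, usize) + 'a>,
}

impl<'a, W: Write> ChildWriter<'a, W> {
    fn new(
        sink: &'a mut TrackedWrite<W>,
        on_close: Box<dyn FnOnce(usize, usize) + 'a>,
    ) -> Self {
        let start = sink.bytes_written();
        Self { sink, start, on_close }
    }

    fn write_chunk(&mut self, data: &[u8]) -> io::Result<()> {
        self.sink.write_all(data)
    }

    fn close(self) {
        let len = self.sink.bytes_written() - self.start;
        (self.on_close)(self.start, len);
    }
}

fn tracked_write_demo() -> io::Result<(Vec<(usize, usize)>, usize)> {
    let mut sink = TrackedWrite::new(Vec::new());
    sink.write_all(b"HEAD")?; // 4 bytes written before the child starts

    let mut chunks: Vec<(usize, usize)> = Vec::new();
    let mut child = ChildWriter::new(
        &mut sink,
        Box::new(|start: usize, len: usize| chunks.push((start, len))),
    );
    child.write_chunk(b"hello")?;
    child.close(); // consumes the child, releasing both borrows

    Ok((chunks, sink.bytes_written()))
}
```

Because this hypothetical `ChildWriter::close` takes `self` by value, the borrows of `sink` and `chunks` end at the call, which is the ergonomics the trait-object version cannot offer.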
# Are there any user-facing changes?
Yes, this makes non-trivial changes to the parquet write API.