westonpace commented on issue #36459:
URL: https://github.com/apache/arrow/issues/36459#issuecomment-1631602685
Thanks Alenka. I agree this will need C++ changes. Note that handling
different encodings on the read path is not actually a part of the CSV code.
Instead we have a `TransformInputStream` which is an input stream with some
arbitrary transformation applied to it. Then, in pyarrow, we have an
implementation of `TransformInputStream` which applies the decode-on-read.
This did end up being a little inconvenient for datasets since the user
doesn't create the input stream. So we ended up allowing file formats to
specify a stream transform function that we would apply on every file we read.
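Conceptually, the decode-on-read transform works like this. Below is a minimal pure-Python sketch of the idea (illustrative only, not Arrow's actual C++/pyarrow implementation): a wrapper stream that transcodes bytes from a source encoding to UTF-8 as they are read.

```python
import codecs
import io

class TransformInputStream(io.RawIOBase):
    """Sketch of a decode-on-read input stream (not the Arrow class).

    Wraps a binary stream containing text in `source_encoding` and
    yields UTF-8 bytes to the reader.
    """

    def __init__(self, raw, source_encoding):
        self._raw = raw
        # An incremental decoder copes with multi-byte sequences that
        # are split across read() calls.
        self._decoder = codecs.getincrementaldecoder(source_encoding)()
        self._buffer = b""

    def read(self, size=-1):
        if size is None or size < 0:
            chunk = self._raw.read()
            out = self._buffer + self._decoder.decode(chunk, final=True).encode("utf-8")
            self._buffer = b""
            return out
        while len(self._buffer) < size:
            chunk = self._raw.read(4096)
            final = not chunk
            self._buffer += self._decoder.decode(chunk, final=final).encode("utf-8")
            if final:
                break
        out, self._buffer = self._buffer[:size], self._buffer[size:]
        return out

# Usage: a UTF-16 CSV payload comes out as UTF-8 bytes.
raw = io.BytesIO("a,b\n1,2\n".encode("utf-16"))
stream = TransformInputStream(raw, "utf-16")
```

The key point is that the CSV reader on top of this stream only ever sees UTF-8; the transform is entirely outside the CSV code.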
So if we want an equivalent feature on the write path we would need to:
* Create a `TransformOutputStream` counterpart to `TransformInputStream`
* Create a pyarrow implementation of `TransformOutputStream` that applies the
encoding
* Add a `stream_transform_func` to the dataset writer
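A minimal sketch of what such a `TransformOutputStream` might look like, assuming the mirror image of the read side: it accepts UTF-8 bytes and transcodes them to the target encoding before they reach the wrapped sink. This class is hypothetical; nothing like it exists in Arrow yet.

```python
import codecs
import io

class TransformOutputStream(io.RawIOBase):
    """Hypothetical encode-on-write counterpart to the read-side transform.

    UTF-8 bytes written here are transcoded to `target_encoding`
    before being written to the wrapped binary sink.
    """

    def __init__(self, sink, target_encoding):
        self._sink = sink
        # Incremental codecs handle UTF-8 sequences split across writes.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._encoder = codecs.getincrementalencoder(target_encoding)()

    def write(self, data):
        text = self._decoder.decode(bytes(data))
        self._sink.write(self._encoder.encode(text))
        return len(data)

    def close(self):
        # Flush any trailing partial state from both codecs.
        self._sink.write(self._encoder.encode(self._decoder.decode(b"", True), True))
        super().close()

# Usage: a CSV writer emits UTF-8; the sink receives UTF-16-LE.
sink = io.BytesIO()
out = TransformOutputStream(sink, "utf-16-le")
out.write("a,b\n1,2\n".encode("utf-8"))
out.close()
```

The pyarrow piece in the second bullet would then just be the glue that builds this stream from a user-supplied encoding name.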
On the other hand, we could just make encoding a property of the CSV writer,
as Alenka suggested. This would probably be simpler in the long run and, at
the moment, CSV is the only file format that has any flexibility on encoding.
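For contrast, here is a sketch of the writer-property approach using only the Python stdlib (illustrative; `write_csv` and its `encoding` parameter are hypothetical names, not the pyarrow API). The writer owns the transcoding step, so no stream transform machinery is needed.

```python
import csv
import io

def write_csv(rows, sink, encoding="utf-8"):
    """Hypothetical writer with encoding as an option (not pyarrow).

    The text is produced once and encoded to the requested target
    encoding before it reaches the binary sink.
    """
    text = io.StringIO()
    writer = csv.writer(text)
    writer.writerows(rows)
    sink.write(text.getvalue().encode(encoding))

# Usage: non-ASCII rows land in the sink as latin-1 bytes.
sink = io.BytesIO()
write_csv([["name", "city"], ["José", "Zürich"]], sink, encoding="latin-1")
```

The tradeoff is that this only helps CSV, whereas a `TransformOutputStream` would work for any format, but since CSV is the only format where encoding varies, the simpler option may be enough.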
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]