westonpace commented on issue #36459:
URL: https://github.com/apache/arrow/issues/36459#issuecomment-1631602685

   Thanks Alenka.  I agree this will need C++ changes.  Note that handling 
different encodings on the read path is not actually part of the CSV code.  
Instead we have a `TransformInputStream`, which is an input stream with some 
arbitrary transformation applied to it.  Then, in pyarrow, we have an 
implementation of `TransformInputStream` that applies the decoding on read.
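
   The decode-on-read idea can be sketched in plain Python (a rough illustration, not Arrow's actual classes or API):

```python
import codecs
import io

class TransformInputStream(io.RawIOBase):
    """An input stream that applies an arbitrary transform to each chunk read.

    Illustrative Python sketch only; Arrow's real TransformInputStream is C++.
    """

    def __init__(self, raw, transform):
        self._raw = raw
        self._transform = transform  # callable: bytes -> bytes

    def read(self, size=-1):
        return self._transform(self._raw.read(size))

def decode_on_read(src_encoding, dest_encoding="utf-8"):
    # Re-encode chunks from src_encoding to dest_encoding; the incremental
    # decoder handles multi-byte characters split across chunk boundaries.
    decoder = codecs.getincrementaldecoder(src_encoding)()
    def transform(data):
        return decoder.decode(data, final=not data).encode(dest_encoding)
    return transform

# Usage: downstream readers see UTF-8 even though the file is Latin-1.
raw = io.BytesIO("a,b\n1,\u00e9\n".encode("latin-1"))
stream = TransformInputStream(raw, decode_on_read("latin-1"))
```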
   
   This did end up being a little inconvenient for datasets since the user 
doesn't create the input stream.  So we ended up allowing file formats to 
specify a stream transform function that we would apply on every file we read.
   
   So if we want an equivalent feature on the write path we would need:
   
    * Create a `TransformOutputStream` counterpart to the output stream
    * Create a pyarrow implementation of `TransformOutputStream` that applies 
the encoding
    * Add a `stream_transform_func` to the dataset writer
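
   The first item above could mirror the read path; here is a hypothetical sketch in Python of what a `TransformOutputStream` would do (the proposed class would be C++, and these names are illustrative):

```python
import codecs
import io

class TransformOutputStream(io.RawIOBase):
    """An output stream that transforms each chunk before writing to the sink.

    Hypothetical sketch of the proposed class, not an existing Arrow API.
    """

    def __init__(self, sink, transform):
        self._sink = sink
        self._transform = transform  # callable: bytes -> bytes

    def write(self, data):
        self._sink.write(self._transform(bytes(data)))
        return len(data)

def encode_on_write(dest_encoding, src_encoding="utf-8"):
    # The pyarrow-side transform would re-encode outgoing chunks.
    decoder = codecs.getincrementaldecoder(src_encoding)()
    def transform(data):
        return decoder.decode(data).encode(dest_encoding)
    return transform

# Usage: the writer emits UTF-8, but the sink receives UTF-16-LE bytes.
sink = io.BytesIO()
out = TransformOutputStream(sink, encode_on_write("utf-16-le"))
out.write("a,b\n1,\u00e9\n".encode("utf-8"))
```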
   
   On the other hand, we could just make encoding a property of the CSV writer, 
as Alenka suggested.  This would probably be simpler in the long run and, at 
the moment, CSV is the only file format that has any flexibility in encoding.
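
   For comparison, making encoding a writer-level option means the writer owns the text-to-bytes step itself rather than relying on a wrapped output stream; the Python stdlib analogue looks like this (not pyarrow code):

```python
import csv
import io

# With encoding as a property of the writer, the writer itself performs
# the text-to-bytes conversion; no output-stream wrapper is needed.
buf = io.BytesIO()
text = io.TextIOWrapper(buf, encoding="cp1252", newline="")
writer = csv.writer(text)
writer.writerow(["a", "b"])
writer.writerow(["1", "\u00e9"])
text.flush()  # buf now holds cp1252-encoded CSV bytes
```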


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
