[GitHub] [arrow] joosthooz commented on pull request #13709: ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner

GitBox Thu, 04 Aug 2022 02:37:46 -0700


joosthooz commented on PR #13709:
URL: https://github.com/apache/arrow/pull/13709#issuecomment-1205010971


   > Somewhere in the CSV reader itself we should also validate the option
   
   I tried this, but it doesn't work: to make the check pass, we would have to 
reset the field to `utf8` when adding a transcoder in Python. Otherwise, 
the error is triggered even though we are transcoding to utf8. But then we 
run into the same issue again, where the `ReadOptions` object that the user 
created is changed:
   ```
   >>> import pyarrow.dataset as ds
   >>> import pyarrow.csv as csv
   >>> ro = csv.ReadOptions(encoding='iso8859')
   >>> fo = ds.CsvFileFormat(read_options=ro)
   >>> dataset = ds.dataset("file.csv", format=fo)
   >>> ro.encoding
   'utf8'
   ```
   This would be really strange if you ask me. And if we accepted this strange 
behavior, we would not have needed to add the `encoding` field in the first place.
   So now the field is basically ignored in the CSV reader; the only check is 
in the dataset CSV reader, which requires a wrapping function to be set if 
the encoding is not utf8.
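
A minimal sketch of that check, in plain Python rather than the actual Arrow code. `ReadOptionsSketch`, `make_wrapper` and `open_csv_stream` are hypothetical names used only for illustration; the point is that the check requires a wrapping function for non-utf8 encodings and never writes to the user's options object:

```python
import codecs
import io
from dataclasses import dataclass
from typing import BinaryIO, Callable, Optional


@dataclass
class ReadOptionsSketch:
    """Hypothetical stand-in for csv.ReadOptions; only the encoding field."""
    encoding: str = "utf8"


def is_utf8(encoding: str) -> bool:
    # Normalize aliases such as "utf8", "UTF-8", "utf_8" to one codec name.
    return codecs.lookup(encoding).name == "utf-8"


def make_wrapper(encoding: str) -> Callable[[BinaryIO], BinaryIO]:
    """Build a wrapping function that transcodes a byte stream to utf8."""
    def wrap(stream: BinaryIO) -> BinaryIO:
        # Eager transcoding for brevity; a real reader would do this in chunks.
        return io.BytesIO(stream.read().decode(encoding).encode("utf-8"))
    return wrap


def open_csv_stream(
    stream: BinaryIO,
    read_options: ReadOptionsSketch,
    stream_wrapper: Optional[Callable[[BinaryIO], BinaryIO]] = None,
) -> BinaryIO:
    # The only validation: a non-utf8 encoding needs a wrapping function.
    if not is_utf8(read_options.encoding) and stream_wrapper is None:
        raise ValueError(
            f"encoding {read_options.encoding!r} requires a wrapping function"
        )
    # read_options is only read here, never modified.
    return stream_wrapper(stream) if stream_wrapper is not None else stream


ro = ReadOptionsSketch(encoding="iso8859")
raw = io.BytesIO("a,b\né,ü\n".encode("iso8859"))
transcoded = open_csv_stream(raw, ro, make_wrapper(ro.encoding))
text = transcoded.read().decode("utf-8")
```

With this shape, `ro.encoding` still reports the value the user set after the stream has been opened, which is the behavior argued for above.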


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

