[GitHub] [arrow] joosthooz commented on pull request #13820: ARROW-16000: [C++][Python] Dataset: Alternative implementation for adding transcoding function option to CSV scanner

GitBox Tue, 16 Aug 2022 06:57:35 -0700


joosthooz commented on PR #13820:
URL: https://github.com/apache/arrow/pull/13820#issuecomment-1216675451


   After taking a closer look, here's what seems to be happening:
   - The first part of the test checks whether parsing the file as binary 
still works. That doesn't work for UTF-16, because the column names are not 
valid UTF-8, so parsing the column names into the schema fails (silently!).
   - The second part tries to read the file without specifying an encoding 
and expects an exception. However, the dataset reader apparently has no 
problem with the null bytes at every other position; it just interprets the 
data as a strange UTF-8 string.
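
   To illustrate the byte-level behaviour behind both points, here is a minimal 
plain-Python sketch. It does not call Arrow's CSV or dataset reader, and the 
sample text and variable names are made up for illustration; it only shows 
which UTF-16 byte sequences a strict UTF-8 decoder accepts.

```python
# Plain-Python illustration only; this is not Arrow code.
text = "a,b\n1,2\n"  # hypothetical sample data

# BOM-less UTF-16 (little endian): every ASCII character is followed by 0x00.
utf16_le = text.encode("utf-16-le")

# 0x00 is a valid UTF-8 byte (NUL), so strict decoding succeeds and simply
# yields a string with NUL characters interleaved.
as_utf8 = utf16_le.decode("utf-8")
has_nulls = "\x00" in as_utf8
header_as_utf8 = as_utf8.split("\n")[0]

# With a byte order mark, the leading 0xFF 0xFE bytes are not valid UTF-8,
# so the start of the file (where the column names live) fails to decode.
utf16_bom = "\ufeff".encode("utf-16-le") + utf16_le
try:
    utf16_bom.decode("utf-8")
    bom_is_valid_utf8 = True
except UnicodeDecodeError:
    bom_is_valid_utf8 = False
```

   So a check that expects an exception for undecoded UTF-16 input can only 
rely on bytes such as the BOM being rejected; the NUL-interleaved data itself 
passes as UTF-8.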
   
   I've removed those 2 additional checks, so the test now only checks whether 
the data is transcoded properly. The 2nd check is still present in the new 
`test_column_names_encoding` test (which only tests latin-1).
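
   As a rough sketch of why the exception check can still work for latin-1, 
here is the same idea in plain Python. The sample data is hypothetical and is 
not taken from the Arrow test suite, and the standard `csv` module stands in 
for the real reader.

```python
import csv
import io

# Hypothetical sample data with non-ASCII column names.
text = "naïve,café\n1,2\n"
latin1_bytes = text.encode("latin-1")

# Without transcoding, a strict UTF-8 decoder rejects the lone 0xEF / 0xE9
# bytes that latin-1 uses for the accented characters.
try:
    latin1_bytes.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True

# Transcode to UTF-8 first, then parse; the column names survive intact.
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")
rows = list(csv.reader(io.StringIO(utf8_bytes.decode("utf-8"))))
```

   Unlike the UTF-16 case, non-ASCII latin-1 bytes are usually invalid UTF-8 
on their own, so reading without an encoding fails loudly.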


