[ https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571364#comment-17571364 ]

Joost Hoozemans commented on ARROW-16000:
-----------------------------------------

Hi all,

While working on this I ran into the following challenge: when creating a 
CsvFileFormat, the encoding is overwritten with utf8 by this line: 
https://github.com/apache/arrow/blob/bbf249e056315af0a18d5c0834de9adef117a25f/python/pyarrow/_csv.pyx#L310

>>> from pyarrow import dataset as ds
>>> from pyarrow import csv
>>> f = ds.CsvFileFormat(read_options=csv.ReadOptions(encoding="cp1252"),
...                      parse_options=csv.ParseOptions(delimiter='|'))
>>> f.default_fragment_scan_options.read_options.encoding
'utf8'

So when creating a dataset with that object, the dataset code does not know which 
encoding the user originally specified, and therefore it cannot create the 
transcoding wrapper. For now I am working around this by adding an extra 
'encoding' parameter to the dataset, but that is not a proper solution.

> [C++][Dataset] Support Latin-1 encoding
> ---------------------------------------
>
>                 Key: ARROW-16000
>                 URL: https://issues.apache.org/jira/browse/ARROW-16000
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nicola Crane
>            Assignee: Joost Hoozemans
>            Priority: Major
>
> In ARROW-15992 a user is reporting issues with trying to read in files with 
> Latin-1 encoding.  I had a look through the docs for the Dataset API and I 
> don't think this is currently supported.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
