[ 
https://issues.apache.org/jira/browse/ARROW-14644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454307#comment-17454307
 ] 

Weston Pace commented on ARROW-14644:
-------------------------------------

[~willjones127] Sorry for the assignment dance.  I hadn't realized you had 
assigned this to yourself, as I had earmarked it this morning.  I did a bit of 
investigation.  I agree it is a C++ issue.  The following Python reproduction 
runs into the same problem:

{noformat}
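import os

import pyarrow.csv as csv
import pyarrow.dataset as ds

# Create the dataset directory if it does not already exist.
os.makedirs('/tmp/my_dataset', exist_ok=True)

# Write a small CSV that starts with a UTF-8 byte order mark (BOM).
with open('/tmp/my_dataset/blah.csv', mode='wb') as f:
    f.write(b'\xef\xbb\xbfa,b\n1,2\n3,4\n')

# The plain CSV reader, for comparison.
print(csv.read_csv('/tmp/my_dataset/blah.csv').to_pydict())
# The datasets API, where the problem shows up.
dataset = ds.dataset('/tmp/my_dataset', format='csv')
print(dataset.to_table().to_pydict())
print(dataset.to_table())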
{noformat}

I had thought that maybe it was a streaming / file reader issue but I tested 
both the streaming and file readers and they seem to be properly skipping the 
BOM.  Here are those tests if you want them: 
https://github.com/apache/arrow/compare/master...westonpace:experiment/ARROW-14644-investigation?expand=1
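
For illustration only, here is a minimal pure-Python sketch of what "skipping the BOM" means for the bytes used in the reproduction above. It is not Arrow's implementation; the helper names skip_utf8_bom and header_names are hypothetical, and the header parsing is deliberately naive (no quoting support).

```python
import codecs

# The same bytes written in the reproduction above: UTF-8 BOM + CSV body.
RAW = b'\xef\xbb\xbfa,b\n1,2\n3,4\n'


def skip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def header_names(data: bytes) -> list:
    """Decode the first line and split it on commas (no quoting support)."""
    first_line = data.split(b'\n', 1)[0]
    return first_line.decode('utf-8').split(',')


# Without skipping, the BOM (U+FEFF) is glued onto the first column name.
names_with_bom = header_names(RAW)
# After skipping, the header contains only the intended names.
names_clean = header_names(skip_utf8_bom(RAW))

print(names_with_bom)
print(names_clean)
```

The point of the sketch is that a reader which forgets this step does not fail outright; it simply ends up with a first column whose name is not the one the user typed.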

So the readers themselves do not seem to be the problem.  I then thought that 
maybe it was the fact that the datasets API specifies the schema when 
reading the data (instead of inferring it), but some initial testing there 
wasn't very fruitful either.
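
To make the schema hypothesis concrete, the following toy model shows one way a reader that is handed a fixed list of column names could return an all-null first column when the BOM stays attached to the header. This is an assumption-laden sketch in plain Python, not Arrow's code, and it is not a confirmed root cause; read_with_fixed_schema and EXPECTED_COLUMNS are hypothetical names.

```python
import csv
import io

# The same bytes written in the reproduction above: UTF-8 BOM + CSV body.
RAW = b'\xef\xbb\xbfa,b\n1,2\n3,4\n'

# Hypothetical: the column names the caller expects, fixed before reading.
EXPECTED_COLUMNS = ['a', 'b']


def read_with_fixed_schema(data: bytes, columns: list) -> dict:
    """Toy reader: look up each expected column by name, None if missing."""
    # Decoding as plain 'utf-8' keeps the BOM as U+FEFF in the header.
    text = data.decode('utf-8')
    rows = list(csv.DictReader(io.StringIO(text)))
    # Values stay as strings; a missing column name yields None per row.
    return {name: [row.get(name) for row in rows] for name in columns}


result = read_with_fixed_schema(RAW, EXPECTED_COLUMNS)
print(result)
```

In this toy model the column named 'a' is never found, because the parsed header name is '\ufeffa', so every value in that column comes back as None while 'b' is read normally. Decoding with 'utf-8-sig' instead of 'utf-8' strips the BOM and makes the lookup succeed.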

So I'll let you take this.  I'm still not quite sure of the root cause.  My 
guess is that it is still some option being passed when the reader is called 
from the datasets API, but I don't know which one it would be.

> [C++] open_dataset doesn't ignore BOM in csv file
> -------------------------------------------------
>
>                 Key: ARROW-14644
>                 URL: https://issues.apache.org/jira/browse/ARROW-14644
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 6.0.0
>         Environment: macOS Mojave, R 4.1.1
>            Reporter: Andy Teucher
>            Assignee: Will Jones
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> DragosMG: I believe this is a bug that should be fixed in the C++ code as 
> there isn't an option we could leverage on the R side.
> I have a draft PR with a failing test, but it's identical to Andy's 
> _reproducible example_ below.
> Original description below:
> ======================
> When a CSV file starts with byte order mark, {{arrow::open_dataset()}} reads 
> the file but populates the first column with {{NA}} values. It appears a 
> similar issue was raised and fixed here: 
> https://issues.apache.org/jira/browse/ARROW-5413. {{read_csv_arrow()}} deals 
> with the BOM correctly.
> Reproducible Example:
> {code:java}
> library(arrow)
> library(dplyr)
> writeLines('\xef\xbb\xbfa,b\n1,2\n', con = "testfile.csv")
> read_csv_arrow("testfile.csv") # works
> #> # A tibble: 1 × 2
> #>       a     b
> #>   <int> <int>
> #> 1     1     2
> open_dataset("testfile.csv", format = "csv") |> 
>   collect()
> #> # A tibble: 1 × 2
> #>       a     b
> #>   <int> <int>
> #> 1    NA     2
> {code}


