[ https://issues.apache.org/jira/browse/ARROW-14644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454307#comment-17454307 ]
Weston Pace commented on ARROW-14644:
-------------------------------------

[~willjones127] Sorry for the assignment dance. I hadn't realized you had assigned this, as I had earmarked it this morning.

I did a bit of investigation. I agree it is a C++ issue. The following Python reproduction runs into the same problem:

{noformat}
import os

import pyarrow.csv as csv
import pyarrow.dataset as ds

# Make sure the dataset directory exists before writing into it.
os.makedirs('/tmp/my_dataset', exist_ok=True)

# Write a small CSV file that starts with a UTF-8 byte order mark.
with open('/tmp/my_dataset/blah.csv', mode='wb') as f:
    f.write(b'\xef\xbb\xbfa,b\n1,2\n3,4\n')

# The plain CSV reader handles the BOM.
print(csv.read_csv('/tmp/my_dataset/blah.csv').to_pydict())

# Reading the same file through the datasets API shows the problem.
dataset = ds.dataset('/tmp/my_dataset', format='csv')
print(dataset.to_table().to_pydict())
print(dataset.to_table())
{noformat}

I had thought that maybe it was a streaming / file reader issue, but I tested both the streaming and file readers and they seem to be properly skipping the BOM, so that doesn't seem to be the problem. Here are those tests if you want them: https://github.com/apache/arrow/compare/master...westonpace:experiment/ARROW-14644-investigation?expand=1

So then I thought that maybe it was the fact that the datasets API will specify the schema when reading the data (instead of inferring it), but some initial testing there wasn't very fruitful either.

So I'll let you take this; I'm still not quite sure of the root cause. My guess would be that it is still some option getting passed when the reader is called from the datasets API, but I don't know which one it would be.
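
For readers unfamiliar with why an unskipped BOM matters at all, here is a minimal standard-library sketch (no pyarrow involved, and not a diagnosis of the root cause in this issue). It only shows that if the three BOM bytes are decoded as part of the data, the first header name becomes '\ufeffa' rather than 'a'. The helper names `strip_utf8_bom` and `header_names` are hypothetical and exist only for this illustration; `csv` here is Python's built-in module, not `pyarrow.csv`.

```python
import codecs
import csv
import io

# Same bytes as in the reproduction: UTF-8 BOM followed by a tiny CSV.
SAMPLE = codecs.BOM_UTF8 + b"a,b\n1,2\n3,4\n"


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def header_names(data: bytes) -> list:
    """Decode data as plain UTF-8 and return the first CSV row."""
    reader = csv.reader(io.StringIO(data.decode("utf-8")))
    return next(reader)


# Without stripping, the BOM survives as U+FEFF in the first column name.
raw_names = header_names(SAMPLE)
# After stripping, the header matches the expected column names.
clean_names = header_names(strip_utf8_bom(SAMPLE))

print(raw_names)
print(clean_names)
```

A reader that matches columns by name would treat '\ufeffa' and 'a' as different columns, which is one plausible way a first column could end up empty. Whether that is what happens inside the datasets code path is exactly the open question above.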

> [C++] open_dataset doesn't ignore BOM in csv file
> -------------------------------------------------
>
>                 Key: ARROW-14644
>                 URL: https://issues.apache.org/jira/browse/ARROW-14644
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 6.0.0
>         Environment: macOS Mojave, R 4.1.1
>            Reporter: Andy Teucher
>            Assignee: Will Jones
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> DragosMG: I believe this is a bug that should be fixed in the C++ code as
> there isn't an option we could leverage on the R side.
> I have a draft PR with a failing test, but it's identical to Andy's
> _reproducible example_ below.
> Original description below:
> ======================
> When a CSV file starts with a byte order mark, {{arrow::open_dataset()}} reads
> the file but populates the first column with {{NA}} values. It appears a
> similar issue was raised and fixed here:
> https://issues.apache.org/jira/browse/ARROW-5413. {{read_csv_arrow()}} deals
> with the BOM correctly.
> Reproducible Example:
> {code:java}
> library(arrow)
> library(dplyr)
> writeLines('\xef\xbb\xbfa,b\n1,2\n', con = "testfile.csv")
> read_csv_arrow("testfile.csv") # works
> #> # A tibble: 1 × 2
> #>       a     b
> #>   <int> <int>
> #> 1     1     2
> open_dataset("testfile.csv", format = "csv") |>
>   collect()
> #> # A tibble: 1 × 2
> #>       a     b
> #>   <int> <int>
> #> 1    NA     2
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)