[jira] [Created] (ARROW-15088) Support for csv options on open_dataset

Carl Boettiger (Jira) Mon, 13 Dec 2021 14:14:05 -0800

Carl Boettiger created ARROW-15088:
--------------------------------------

             Summary: Support for csv options on open_dataset
                 Key: ARROW-15088
                 URL: https://issues.apache.org/jira/browse/ARROW-15088
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 6.0.2
            Reporter: Carl Boettiger



There are a lot of gotchas around the heterogeneity in arrow's support for
csv parsing options between read_csv_arrow() and open_dataset() (and further
issues arising from migrating from readr::read_csv()).  Not sure if it's more
helpful to report these in one place or as separate issues, but here are a few
that keep tripping me up:

 
 * "na" (defining the na-character choices) is not implemented on 
open_dataset(), though it is on read_csv_arrow()
 * somewhat confusingly, open_dataset() does support `null_strings`, though, which
appears to play the same role.  The docs, however, suggest that `open_dataset()`
`...` options are passed to `dataset_factory()`.  I think those docs should 
link to [https://arrow.apache.org/docs/r/reference/CsvReadOptions.html] .  
[https://arrow.apache.org/docs/r/reference/FileFormat.html] suggests that 
`null_strings` is not one of the recognized CsvReadOptions, but it seems that 
it now is.  I appreciate the challenge of supporting both the readr-like
options and the native arrow option names here, but the functionality and
documentation remain very confusing!

Another gotcha: in the arrow 6.0 release, if we supply an arrow schema,
open_dataset() assumes the first line of the csv is data and not column headers,
so we have to do skip=1.  I see the logic (the schema names the columns anyway, 
so assuming we're going with those names why parse the names from the csv), but 
it's surprising since reading without the schema we do not use skip=1, and it's 
natural to want to go and declare column types while preserving csv column 
names.  The error messages on doing so aren't helpful, since if you forget 
skip=1, you are just told that any column that is not a string is "the 
incorrect type".  The open_dataset() docs imply that we can use
read_csv_arrow() options, which suggests that we could provide types using
col_types instead of schema, but this appears not to be the case.  Also



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
