dgreiss commented on PR #36436: URL: https://github.com/apache/arrow/pull/36436#issuecomment-1636746387
> The implementation of `write_csv_arrow()` doesn't do this (maybe it should, but that hasn't been done yet), but we could do those things here.

Got it, I created a separate issue #36700 to add the other functions `write_delim_arrow()` and `write_tsv_arrow()`

> In the implementation of `open_csv_dataset()`, I also intentionally don't allow extra things to be passed in via the `...`, to try to keep things simpler to reason about (advanced users can use `open_dataset()` and pass in whatever they like there, if they need to).

Makes sense, I'll update it to follow that convention.

> What I'm wondering is, if we want to take the opportunity to expose the options differently here, to create an API for users that is easier to reason about.
>
> For example, `readr::write_csv` has the following options exposed in its parameters:
>
> * `na`
> * `append`
> * `col_names`
> * `quote`
> * `escape`
> * `eol`
>
> We could take these, work out if they map to the Arrow options nicely, and do the hard work inside of the function to convert them for the user, a bit like we do for the CSV reader.

Right now `na`, `eol` and `delim` map to Arrow's `null_string`, `eol` and `delimiter` and I've exposed all of them in the PR. I can make sure those options get added as well in #36700.

The mapping of `readr` options does complicate things if a user uses both Arrow and `readr` options (eg. `write_dataset(ds, file, delim = ',', delimiter = ';')`). So we'll have to handle that possibility by either throwing an error or defaulting to the Arrow option.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
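
For illustration only, the conflict mentioned in the comment above (a `readr`-style name and its Arrow equivalent both supplied for the same option) could be handled roughly as follows. This is a base-R sketch, not code from the PR: the helper name `resolve_csv_write_options()` and the lookup vector `readr_to_arrow` are hypothetical, and of the two strategies mentioned it picks "throw an error" rather than "default to the Arrow option". The only mappings used are the three stated in the comment (`na`, `eol`, `delim` to `null_string`, `eol`, `delimiter`).

```r
# Hypothetical lookup, not part of the arrow package:
# readr-style argument name -> Arrow option name, as listed in the comment.
readr_to_arrow <- c(na = "null_string", eol = "eol", delim = "delimiter")

# Hypothetical helper: rename readr-style arguments to their Arrow names
# and fail if the caller supplied both spellings of the same option.
resolve_csv_write_options <- function(...) {
  args <- list(...)
  for (readr_name in names(readr_to_arrow)) {
    arrow_name <- readr_to_arrow[[readr_name]]
    # Nothing to do when both APIs use the same name (`eol`)
    # or when the readr-style argument was not supplied.
    if (readr_name == arrow_name || !readr_name %in% names(args)) {
      next
    }
    if (arrow_name %in% names(args)) {
      stop(
        "Use either `", readr_name, "` or `", arrow_name, "`, not both",
        call. = FALSE
      )
    }
    # Rename the readr-style argument to its Arrow equivalent.
    names(args)[names(args) == readr_name] <- arrow_name
  }
  args
}
```

With this sketch, `resolve_csv_write_options(na = "", delim = ";")` returns a list named `null_string` and `delimiter`, while `resolve_csv_write_options(delim = ",", delimiter = ";")` stops with an error. Defaulting to the Arrow option instead would mean dropping the `readr`-style entry (possibly with a warning) in place of the `stop()` call.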
