amoeba commented on issue #37908: URL: https://github.com/apache/arrow/issues/37908#issuecomment-1756253850
Hi @angela-li, when I've run into situations like yours in the past, I've resorted to adding a cleanup step between the raw data and the less flexible system (in this case, arrow) to get the raw data into a form that can be read without issues. I can imagine this might not be practical for your use case.

This comment got me thinking:

> Changing this option is also not good for the rest of the data, where I do want the quote_char to be ".

One other thing you might try, which arrow can do right now, is to make use of arrow's UnionDataset functionality. As described above, you essentially need to parse some files with one set of rules and other files with another. `open_dataset` can actually open other Datasets, so you could do something like:

```r
my_ds <- open_dataset(
  list(
    open_dataset("good_file.txt", format = "text"),
    open_dataset("bad_file.txt", format = "text", parse_options = CsvParseOptions$create(...))
  )
) # <- this is a UnionDataset
```

From here you can work with `my_ds` normally.

This problem also reminds me of lubridate and its `orders` argument in [`lubridate::parse_date_time`](https://lubridate.tidyverse.org/reference/parse_date_time.html). One limitation of the above approach is that it requires you to know which files are problematic and which are not. So an idea would be to create a list of `CsvParseOptions` objects and open each file inside a `tryCatch`, working through the options in order until one succeeds. I've included a hacky example below.

<details>
<summary>flexible_open_dataset.R</summary>

```r
library(arrow)

# First create a set of CsvParseOptions to try. Order matters.
default_parse_options <- CsvParseOptions$create(delimiter = "|")
quirk_parse_options <- CsvParseOptions$create(delimiter = "|", quote_char = '')
my_parse_options <- list(default_parse_options, quirk_parse_options)

# Then we define two helper functions that attempt to call open_dataset
# until one succeeds
flexible_open_dataset_single <- function(file, parse_options) {
  for (parse_option in parse_options) {
    ds <- tryCatch(
      open_dataset(file, format = "text", parse_options = parse_option),
      error = function(e) {
        warning(
          "Failed to parse ", file,
          " with provided ParseOption. Trying any remaining options..."
        )
        NULL
      }
    )

    # Return as soon as one set of options works
    if (!is.null(ds)) {
      return(ds)
    }
  }

  # Fail loudly instead of handing NULL to the outer open_dataset call
  stop("Could not open ", file, " with any of the provided parse options.")
}

flexible_open_dataset <- function(files, parse_options) {
  open_dataset(lapply(files, function(f) {
    flexible_open_dataset_single(f, parse_options)
  }))
}

# Then, finally, we use our new helper and this should print a warning but
# otherwise work
my_ds <- flexible_open_dataset(c("test_data.txt", "test_data_good.txt"), my_parse_options)
```

</details>

If we wanted to provide something like this in arrow, one way would be to allow `parse_options` to take multiple values and use a similar mechanism internally to try each.
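
For what it's worth, the "try each option in turn" mechanism can be sketched on its own, without arrow. This is only an illustration under stated assumptions: `try_each` and `parse_int` are hypothetical names I made up (neither exists in arrow), and `strtoi` stands in for `open_dataset` so the snippet runs in base R.

```r
# Hypothetical helper (not part of arrow): call `f` with each option in turn
# and return the first result that does not raise an error.
try_each <- function(options, f) {
  for (i in seq_along(options)) {
    result <- tryCatch(
      f(options[[i]]),
      error = function(e) {
        warning("Option ", i, " failed: ", conditionMessage(e), call. = FALSE)
        NULL
      }
    )
    if (!is.null(result)) {
      return(result)
    }
  }
  stop("All ", length(options), " options failed.", call. = FALSE)
}

# Stand-in for open_dataset(): "12a" cannot be parsed as a base-10 integer,
# but it is a valid base-16 integer.
parse_int <- function(base) {
  value <- strtoi("12a", base = base)
  if (is.na(value)) stop("cannot parse in base ", base)
  value
}

# The first option (base 10) fails with a warning, the second (base 16) succeeds.
value <- suppressWarnings(try_each(list(10L, 16L), parse_int))
# value is 298 (0x12a)
```

Note that this sketch treats a `NULL` return value as a failure, which is fine for something like `open_dataset` but would need a different sentinel if `NULL` were a legitimate result.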