Stephanie Hazlitt created ARROW-11587:
-----------------------------------------

             Summary: Implement a fixed-width file reader
                 Key: ARROW-11587
                 URL: https://issues.apache.org/jira/browse/ARROW-11587
             Project: Apache Arrow
          Issue Type: Wish
          Components: R
            Reporter: Stephanie Hazlitt


Fixed-width files are a common data provisioning format for (very) large administrative data files. We have been converting provisioned fwf files to `.parquet` and then leveraging `arrow::open_dataset()` with good success. However, we still run into RAM issues at the read-in step and are keen to try new approaches to this memory problem (ideally without chunking files etc.).

A simple example workflow looks like this:

```
library(magrittr)  # provides the %>% pipe used below

sample_data <- "https://github.com/bcgov/dipr/raw/master/inst/extdata/starwars-fwf.dat.gz"

# Read the fixed-width file fully into memory, then write it out as
# parquet files partitioned by has_hair.
vroom::vroom_fwf(
  sample_data,
  col_positions = vroom::fwf_positions(
    c(1, 22, 25, 31),
    c(21, 24, 30, 35),
    c("name", "height", "mass", "has_hair")
  ),
  col_types = "cnnl"
) %>%
  dplyr::group_by(has_hair) %>%
  arrow::write_dataset(path = "starwars_parquet", format = "parquet")
```

With an {arrow} fixed-width reader, we could perhaps leverage `arrow::open_dataset(as_data_frame = FALSE)` directly on a large fwf file and then convert to partitioned `.parquet` files with `arrow::write_dataset()`?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)