[ https://issues.apache.org/jira/browse/ARROW-16833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Todd Farmer updated ARROW-16833:
--------------------------------
    Component/s: R

> [R] how to enforce type conversion in open_dataset()
> ----------------------------------------------------
>
>                 Key: ARROW-16833
>                 URL: https://issues.apache.org/jira/browse/ARROW-16833
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 8.0.0
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Minor
>
> Here is a small example:
> {code:java}
> library(arrow)
> df_numbers <- tibble::tibble(number = c(1,2,3,"error", 4, 5, NA, 6))
> str(df_numbers)
> #> tibble [8 x 1] (S3: tbl_df/tbl/data.frame)
> #> $ number: chr [1:8] "1" "2" "3" "error" ...
> write_parquet(df_numbers, "numbers.parquet")
> open_dataset("numbers.parquet")
> #> FileSystemDataset with 1 Parquet file
> #> number: string
> open_dataset("numbers.parquet", schema(number = int8())) |> dplyr::collect()
> #> Error in `dplyr::collect()`:
> #> ! Invalid: Failed to parse string: 'error' as a scalar of type int8
> {code}
> The expected result is an input column of integers in which the
> non-integer values are converted to NAs.
> How can this type conversion be enforced using a schema definition in
> {{open_dataset()}}?
> Rationale: I would like to include this in a code chunk that imports a CSV
> dataset and saves it as a Parquet dataset (open_dataset -> write_dataset), with
> the type conversion based on a preset schema done at the same time,
> and all of these steps performed without loading all the data into memory.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
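
One possible workaround for the open_dataset -> write_dataset workflow described in the report is sketched below. It is not a schema-based solution and not a confirmed answer from the Arrow developers. It assumes that the installed arrow version translates grepl(), if_else() and as.integer() inside a dplyr query on a dataset. The output directory name is hypothetical, and as.integer() yields a 32-bit integer column rather than the int8 requested in the report.

```r
library(arrow)
library(dplyr)

# Recreate the example file from the report: a string column holding one
# non-numeric entry and one missing value.
df_numbers <- tibble::tibble(number = c("1", "2", "3", "error", "4", "5", NA, "6"))
in_path <- file.path(tempdir(), "numbers.parquet")
out_dir <- file.path(tempdir(), "numbers_clean")  # hypothetical output location
write_parquet(df_numbers, in_path)

# Open the file lazily, replace every string that is not a plain integer
# with NA, and only then cast. The cast never sees "error", so it cannot
# fail on it.
# Assumption: these functions are supported in arrow dplyr queries in the
# installed version; the query is written out without collect().
open_dataset(in_path) |>
  mutate(
    number = as.integer(
      if_else(grepl("^-?[0-9]+$", number), number, NA_character_)
    )
  ) |>
  write_dataset(out_dir, format = "parquet")

# Read the converted dataset back to inspect the result.
result <- open_dataset(out_dir) |> collect()
```

The regular expression only accepts optionally signed whole numbers; values such as "1.5" or " 2" would also become NA under this sketch, which may or may not match the intended behaviour.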