[ https://issues.apache.org/jira/browse/ARROW-16833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Farmer updated ARROW-16833:
--------------------------------
    Component/s: R

> [R] how to enforce type conversion in open_dataset()
> ----------------------------------------------------
>
>                 Key: ARROW-16833
>                 URL: https://issues.apache.org/jira/browse/ARROW-16833
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 8.0.0
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Minor
>
> Here is a small example:
> {code:java}
> library(arrow)
> df_numbers <- tibble::tibble(number = c(1,2,3,"error", 4, 5, NA, 6))
> str(df_numbers)
> #> tibble [8 x 1] (S3: tbl_df/tbl/data.frame)
> #>  $ number: chr [1:8] "1" "2" "3" "error" ...
> write_parquet(df_numbers, "numbers.parquet")
> open_dataset("numbers.parquet") 
> #> FileSystemDataset with 1 Parquet file
> #> number: string
> open_dataset("numbers.parquet", schema(number = int8())) |> dplyr::collect()
> #> Error in `dplyr::collect()`:
> #> ! Invalid: Failed to parse string: 'error' as a scalar of type int8
> {code}
> The expected result is an input column of integers, where the
> non-integer values are converted to NA.
> How can this type conversion be enforced using a schema definition in
> {{open_dataset()}}?
> Rationale: I would like to include this in a code chunk that imports a CSV
> dataset and saves it as a Parquet dataset (open_dataset -> write_dataset),
> with the type conversion based on a preset schema done at the same time,
> and all of these steps without loading all the data into memory.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
