[ https://issues.apache.org/jira/browse/ARROW-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384873#comment-17384873 ]
Daniel Paierl edited comment on ARROW-13421 at 7/21/21, 12:55 PM:
------------------------------------------------------------------

Hi [~thisisnic],

thanks for the super fast reply! Sorry I forgot the reprex; it's easy to forget how insular these "," vs. "." problems are when European decimal commas and thousands separators are standard here. Sadly, I cannot change the format of the source data; even using .parquet files is a major departure from what has been done in the past.

Without further ado:

h2. Reprex

{code:r}
set.seed(1)
tbl <- tibble::tibble(x = rnorm(5))
tbl
#> # A tibble: 5 x 1
#>        x
#>    <dbl>
#> 1 -0.626
#> 2  0.184
#> 3 -0.836
#> 4  1.60
#> 5  0.330

# write to file in European format (separator = ";", decimal mark = ",")
readr::write_csv2(tbl, "arrow_repex.csv")

# read in with delim set to ";": the column comes back as <chr>, not <dbl>
arrow::read_delim_arrow(file = "arrow_repex.csv", delim = ";")
#> # A tibble: 5 x 1
#>   x
#>   <chr>
#> 1 -0,626453810742332
#> 2 0,183643324222082
#> 3 -0,835628612410047
#> 4 1,595280802137792
#> 5 0,329507771815361

# works with data.table::fread with sep = ";" and dec = ","
data.table::fread("arrow_repex.csv", sep = ";", dec = ",")
#>             x
#> 1: -0.6264538
#> 2:  0.1836433
#> 3: -0.8356286
#> 4:  1.5952808
#> 5:  0.3295078
{code}

h3. Session Info

{code:r}
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] arrow_4.0.1     forcats_0.5.1   stringr_1.4.0   dplyr_1.0.6
 [5] purrr_0.3.4     readr_1.4.0     tidyr_1.1.3     tibble_3.1.1
 [9] ggplot2_3.3.3   tidyverse_1.3.1
{code}

edit: Updated reprex

> [R] Add choice for decimal marker in read_delim_arrow
> -----------------------------------------------------
>
>                 Key: ARROW-13421
>                 URL: https://issues.apache.org/jira/browse/ARROW-13421
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 4.0.1
>            Reporter: Daniel Paierl
>            Priority: Minor
>              Labels: R
>
> In the R arrow package read_delim_arrow lacks the option to specify the
> decimal
marker (e.g. comma or point) in the parsing options.
> This is a major inconvenience for data with a _comma_ as the decimal marker
> (European users), since the data is read in as a string, which requires
> post-hoc conversion of the string to double.
>
> Request: Add a parsing option to set the decimal marker, if that is possible.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)