[ 
https://issues.apache.org/jira/browse/ARROW-13421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384873#comment-17384873
 ] 

Daniel Paierl edited comment on ARROW-13421 at 7/21/21, 12:54 PM:
------------------------------------------------------------------

Hi [~thisisnic], thanks for the super fast reply! Sorry I forgot the reprex; 
it's easy to forget how insular these "," vs. "." problems are when the European 
decimal comma and thousands separator are standard here. Sadly, I cannot change 
the format of the source data; even using .parquet files is a major departure 
from what has been done in the past.

 

Without further ado:
h2. Reprex
{code:r}
set.seed(1)

tbl <- tibble::tibble(x = rnorm(5))
tbl
#> # A tibble: 5 x 1
#>        x
#>    <dbl>
#> 1 -0.626
#> 2  0.184
#> 3 -0.836
#> 4  1.60 
#> 5  0.330

# write to file in European format (separator = ";", decimal mark = ",")
readr::write_csv2(tbl, "arrow_repex.csv")

# read in with delim set to ";"
arrow::read_delim_arrow(file = "arrow_repex.csv",
                        delim = ";")
#> # A tibble: 5 x 1
#>   x                 
#>   <chr>             
#> 1 -0,626453810742332
#> 2 0,183643324222082 
#> 3 -0,835628612410047
#> 4 1,595280802137792 
#> 5 0,329507771815361

# works with data.table::fread with sep = ";" and dec = ","
data.table::fread("arrow_repex.csv",
                  sep = ";", dec = ",")
#>             x
#> 1: -0.6264538
#> 2:  0.1836433
#> 3: -0.8356286
#> 4:  1.5952808
#> 5:  0.3295078

{code}
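For reference, the post-hoc conversion that the current behaviour forces could look roughly like the sketch below. The helper name parse_decimal_comma is hypothetical (it is not part of arrow), and the sketch assumes the strings contain a decimal comma but no thousands separators. It uses base R only.

```r
# Hypothetical helper, not part of arrow: turn strings with a decimal
# comma (e.g. "-0,626") into doubles. Assumes no thousands separators.
parse_decimal_comma <- function(x) {
  as.numeric(sub(",", ".", x, fixed = TRUE))
}

# Stand-in for the character column returned by read_delim_arrow() above
df <- data.frame(
  x = c("-0,626453810742332", "0,183643324222082", "-0,835628612410047"),
  stringsAsFactors = FALSE
)

# Convert the column in place; it is now a double vector
df$x <- parse_decimal_comma(df$x)
str(df)
```

This works, but it has to be repeated for every numeric column and bypasses type inference in the reader, which is why a parsing option would be preferable.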
 
h3. Session Info
{code:r}
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] arrow_4.0.1     forcats_0.5.1   stringr_1.4.0   dplyr_1.0.6    
 [5] purrr_0.3.4     readr_1.4.0     tidyr_1.1.3     tibble_3.1.1   
 [9] ggplot2_3.3.3   tidyverse_1.3.1
{code}
 

 edit: Updated reprex


was (Author: ruser):
Hi [~thisisnic], thanks for the super fast reply! Sorry I forgot the reprex; 
it's easy to forget how insular these "," vs. "." problems are when the European 
decimal comma and thousands separator are standard here. Sadly, I cannot change 
the format of the source data; even using .parquet files is a major departure 
from what has been done in the past.

 

Without further ado:
h2. Reprex
{code:r}
set.seed(1)
# random values 
tbl <- tibble::tibble(x = rnorm(5))
tbl
## # A tibble: 5 x 1
##        x
##    <dbl>
## 1 -0.626
## 2  0.184
## 3 -0.836
## 4  1.60 
## 5  0.330

# write to file in European format (separator = ";", decimal mark = ",")
readr::write_csv2(tbl, here::here("01_proc_data/arrow_repex.csv"))


# read in with delim set to ";"
arrow::read_delim_arrow(file = here::here("01_proc_data/arrow_repex.csv"),
                        delim = ";")
## # A tibble: 5 x 1
##   x                 
##   <chr>             
## 1 -0,626453810742332
## 2 0,183643324222082 
## 3 -0,835628612410047
## 4 1,595280802137792 
## 5 0,329507771815361

{code}
 
 

 

> [R] Add choice for decimal marker in read_delim_arrow
> -----------------------------------------------------
>
>                 Key: ARROW-13421
>                 URL: https://issues.apache.org/jira/browse/ARROW-13421
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 4.0.1
>            Reporter: Daniel Paierl
>            Priority: Minor
>              Labels: R
>
> In the R arrow package, read_delim_arrow lacks an option to specify the 
> decimal marker (e.g. comma or point) in the parsing options.
> This is a major inconvenience for data with a _comma_ as the decimal marker 
> (European users), since the data is read in as a string, which requires 
> post-hoc conversion of the string to double. 
>  
> Request: Add a parsing option to set the decimal marker if that is possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
