[jira] [Created] (ARROW-18268) [Poss]

2022-11-07 Thread Lorenzo Isella (Jira)
Lorenzo Isella created ARROW-18268:
--

 Summary: [Poss]
 Key: ARROW-18268
 URL: https://issues.apache.org/jira/browse/ARROW-18268
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Lorenzo Isella






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18267) [R] Possible bug in Handling Blank Conversion to Missing Value

2022-11-07 Thread Lorenzo Isella (Jira)
Lorenzo Isella created ARROW-18267:
--

 Summary: [R] Possible bug in Handling Blank Conversion to Missing 
Value
 Key: ARROW-18267
 URL: https://issues.apache.org/jira/browse/ARROW-18267
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Lorenzo Isella








[jira] [Created] (ARROW-18213) [R] Arrow 10 silently dropping missing values/blanks

2022-11-01 Thread Lorenzo Isella (Jira)
Lorenzo Isella created ARROW-18213:
--

 Summary: [R] Arrow 10 silently dropping missing values/blanks
 Key: ARROW-18213
 URL: https://issues.apache.org/jira/browse/ARROW-18213
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Lorenzo Isella


In the example below, a single-column text file is written to disk. It contains 
some blank values, and when the file is opened and collected, those blanks are 
silently dropped.

I did not test this behavior on arrow 9.0.
{code:java}



library(tidyverse)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

ll <- c("100", "1000", "200", "3000", "50",
        "500", "", "Not Range")


df <- tibble(x=rep(ll, 1000))

df
#> # A tibble: 8,000 × 1
#>    x
#>    <chr>
#>  1 "100"
#>  2 "1000"
#>  3 "200"
#>  4 "3000"
#>  5 "50"
#>  6 "500"
#>  7 ""
#>  8 "Not Range"
#>  9 "100"
#> 10 "1000"
#> # … with 7,990 more rows

df |> dim()
#> [1] 8000    1


write_tsv(df, "data.tsv")

data <- open_dataset("data.tsv", format = "tsv",
                     skip_rows = 1,
                     schema = schema(x = string()))

test <- data |>
  collect()

test
#> # A tibble: 7,000 × 1
#>    x
#>    <chr>
#>  1 100
#>  2 1000
#>  3 200
#>  4 3000
#>  5 50
#>  6 500
#>  7 Not Range
#>  8 100
#>  9 1000
#> 10 200
#> # … with 6,990 more rows

test |> dim()  ## the missing values/blanks have been dropped silently
#> [1] 7000    1




sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 11 (bullseye)
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#> 
#> other attached packages:
#>  [1] arrow_10.0.0    forcats_0.5.2   stringr_1.4.1   dplyr_1.0.10
#>  [5] purrr_0.3.5     readr_2.1.3     tidyr_1.2.1     tibble_3.1.8
#>  [9] ggplot2_3.3.6   tidyverse_1.3.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] lubridate_1.8.0     assertthat_0.2.1    digest_0.6.30
#>  [4] utf8_1.2.2          R6_2.5.1            cellranger_1.1.0
#>  [7] backports_1.4.1     reprex_2.0.2        evaluate_0.17
#> [10] httr_1.4.4          highr_0.9           pillar_1.8.1
#> [13] rlang_1.0.6         googlesheets4_1.0.1 readxl_1.4.1
#> [16] R.utils_2.12.1      R.oo_1.25.0         rmarkdown_2.17
#> [19] styler_1.8.0        googledrive_2.0.0   bit_4.0.4
#> [22] munsell_0.5.0       broom_1.0.1         compiler_4.2.2
#> [25] modelr_0.1.9        xfun_0.34           pkgconfig_2.0.3
#> [28] htmltools_0.5.3     tidyselect_1.2.0    fansi_1.0.3
#> [31] crayon_1.5.2        tzdb_0.3.0          dbplyr_2.2.1
#> [34] withr_2.5.0         R.methodsS3_1.8.2   grid_4.2.2
#> [37] jsonlite_1.8.3      gtable_0.3.1        lifecycle_1.0.3
#> [40] DBI_1.1.3           magrittr_2.0.3      scales_1.2.1
#> [43] vroom_1.6.0         cli_3.4.1           stringi_1.7.8
#> [46] fs_1.5.2            xml2_1.3.3          ellipsis_0.3.2
#> [49] generics_0.1.3      vctrs_0.5.0         tools_4.2.2
#> [52] bit64_4.0.5         R.cache_0.16.0      glue_1.6.2
#> [55] hms_1.1.2           parallel_4.2.2      fastmap_1.1.0
#> [58] yaml_2.3.6          colorspace_2.0-3    gargle_1.2.1
#> [61] rvest_1.0.3         knitr_1.40          haven_2.5.1
Created on 2022-11-01 with reprex v2.0.2



 {code}
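
One possible explanation for the reprex above, not verified against the Arrow source: in a one-column file written without quoting, a blank value ends up as an empty line, and a reader that skips empty lines would drop exactly those rows. The base-R sketch below only checks that arithmetic. It uses `writeLines()` as a stand-in for `write_tsv()` (an assumption about how the blank is written to disk) and does not call arrow at all.

```r
# Same values as in the reprex above.
ll <- c("100", "1000", "200", "3000", "50", "500", "", "Not Range")

# Stand-in for write_tsv(): header line plus 8000 unquoted values.
# Assumption: the blank string is written as an empty line.
path <- tempfile(fileext = ".tsv")
writeLines(c("x", rep(ll, 1000)), path)

lines <- readLines(path)[-1]     # drop the header line

n_total <- length(lines)         # number of data lines in the file
n_empty <- sum(lines == "")      # data lines that are completely empty
n_kept  <- n_total - n_empty     # rows left if empty lines are skipped

c(total = n_total, empty = n_empty, kept = n_kept)
```

If `n_kept` matches the 7,000 rows returned by `collect()`, skipped empty lines would account for the missing rows. `read_tsv_arrow()` has a `skip_empty_rows` argument that may be worth trying; whether it restores the rows in this case is untested.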





[jira] [Created] (ARROW-18202) Gsub does not work properly

2022-10-31 Thread Lorenzo Isella (Jira)
Lorenzo Isella created ARROW-18202:
--

 Summary: Gsub does not work properly
 Key: ARROW-18202
 URL: https://issues.apache.org/jira/browse/ARROW-18202
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Lorenzo Isella





