[ https://issues.apache.org/jira/browse/ARROW-14063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501145#comment-17501145 ]
Jared Lander commented on ARROW-14063:
--------------------------------------

I know this is marked as resolved, but I just tried with Arrow 7.0. If I want to use open_dataset() on CSVs with header rows and also specify the schema (which I have to, because the types are guessed incorrectly), then I have to set skip_rows=1. That seems not awesome, especially for someone who doesn't know about this issue. So I just wanted to put a note here that this is still an open issue.

> [R] open_dataset() does not work on CSVs without header rows
> ------------------------------------------------------------
>
>                 Key: ARROW-14063
>                 URL: https://issues.apache.org/jira/browse/ARROW-14063
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 5.0.0
>         Environment: sessionInfo()
> R version 4.0.5 (2021-03-31)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 18.04.5 LTS
> Matrix products: default
> BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
> LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
> locale:
>  [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
>  [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
>  [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] arrow_5.0.0.2 dplyr_1.0.5 magrittr_2.0.1 targets_0.6.0
> loaded via a namespace (and not attached):
>  [1] httr_1.4.2 rnaturalearth_0.1.0 sass_0.4.0 tidyr_1.1.3
>  [5] jsonlite_1.7.2 bit64_4.0.5 bslib_0.2.5.1 assertthat_0.2.1
>  [9] askpass_1.1 sp_1.4-5 blob_1.2.1 renv_0.13.2
> [13] yaml_2.2.1 globals_0.14.0 pillar_1.5.1 RSQLite_2.2.7
> [17] lattice_0.20-41 glue_1.4.2 digest_0.6.27 htmltools_0.5.1.1
> [21] pkgconfig_2.0.3 RPostgres_1.3.2 listenv_0.8.0 config_0.3.1
> [25] purrr_0.3.4 processx_3.5.1 openssl_1.4.3 tibble_3.1.0
> [29] proxy_0.4-25 aws.s3_0.3.21 colourvalues_0.3.7 generics_0.1.0
> [33] ellipsis_0.3.1 cachem_1.0.5 withr_2.4.1 furrr_0.2.3
> [37] cli_2.4.0 crayon_1.4.1 memoise_2.0.0 evaluate_0.14
> [41] ps_1.6.0 fs_1.5.0 future_1.21.0 fansi_0.4.2
> [45] parallelly_1.25.0 xml2_1.3.2 class_7.3-18 rsconnect_0.8.18
> [49] tools_4.0.5 data.table_1.14.0 hms_1.0.0 lifecycle_1.0.0
> [53] stringr_1.4.0 callr_3.6.0 jquerylib_0.1.4 compiler_4.0.5
> [57] e1071_1.7-6 rlang_0.4.10 classInt_0.4-3 units_0.7-1
> [61] grid_4.0.5 rstudioapi_0.13 visNetwork_2.0.9 htmlwidgets_1.5.3
> [65] aws.signature_0.6.0 crosstalk_1.1.1 igraph_1.2.6 base64enc_0.1-3
> [69] rmarkdown_2.7 codetools_0.2-18 DBI_1.1.1 curl_4.3
> [73] R6_2.5.0 lubridate_1.7.10 knitr_1.31 fastmap_1.1.0
> [77] rgeos_0.5-5 bit_4.0.4 utf8_1.2.1 tarchetypes_0.2.1
> [81] readr_1.4.0 KernSmooth_2.23-18 stringi_1.5.3 parallel_4.0.5
> [85] Rcpp_1.0.6 vctrs_0.3.7 sf_0.9-8 leaflet_2.0.4.1
> [89] dbplyr_2.1.1 tidyselect_1.1.0 xfun_0.22
>            Reporter: Jared Lander
>            Assignee: Nicola Crane
>            Priority: Major
>              Labels: bug, pull-request-available
>             Fix For: 6.0.0
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Using {{open_dataset()}} on a CSV without a header row, followed by {{collect()}}, results either in a {{tibble}} of {{NA}}s or an error, depending on whether the first row of data contains duplicate values. This affects reading one file or a directory of files.
> Here we use the `diamonds` data, where the first row of data does not have any repeated values.
> {code:java}
> library(arrow)
> library(magrittr)
> data(diamonds, package='ggplot2')
> readr::write_csv(head(diamonds), file='diamonds_with_header.csv', col_names=TRUE)
> readr::write_csv(head(diamonds), file='diamonds_without_header.csv', col_names=FALSE)
> diamond_schema <- schema(
>     carat=float32()
>     , cut=string()
>     , color=string()
>     , clarity=string()
>     , depth=float32()
>     , table=float32()
>     , price=float32()
>     , x=float32()
>     , y=float32()
>     , z=float32()
> )
> diamonds_with_headers <- open_dataset('diamonds_with_header.csv', schema=diamond_schema, format='csv')
> diamonds_without_headers <- open_dataset('diamonds_without_header.csv', schema=diamond_schema, format='csv')
> # this works
> diamonds_with_headers %>% collect()
> # A tibble: 6 x 10
>   carat cut       color clarity depth table price     x     y     z
>   <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 0.230 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
> 2 0.210 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
> 3 0.230 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
> 4 0.290 Premium   I     VS2      62.4    58   334  4.20  4.23  2.63
> 5 0.310 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
> 6 0.240 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
> # this gives a tibble with all NA values, though of the correct types
> diamonds_without_headers %>% collect()
> # A tibble: 5 x 10
>   carat cut   color clarity depth table price     x     y     z
>   <dbl> <chr> <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
> 2    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
> 3    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
> 4    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
> 5    NA NA    NA    NA         NA    NA    NA    NA    NA    NA
> {code}
> Now we use a simple dataset where two of the columns in the first row have the same value, 0.0.
>
> {code:java}
> randomDF <- tibble::tibble(
>     A=c(0.0, 2.3, 5.1)
>     , B=c('a', 'b', 'a')
>     , C=c(0.0, 3.1, 4.5)
> )
> readr::write_csv(randomDF, file='random_with_header.csv', col_names=TRUE)
> readr::write_csv(randomDF, file='random_without_header.csv', col_names=FALSE)
> random_schema <- schema(
>     A=float32()
>     , B=string()
>     , C=float32()
> )
> random_with_headers <- open_dataset('random_with_header.csv', schema=random_schema, format='csv')
> random_without_headers <- open_dataset('random_without_header.csv', schema=random_schema, format='csv')
> # gives a tibble with the proper values
> random_with_headers %>% collect()
> # A tibble: 3 x 3
>       A B         C
>   <dbl> <chr> <dbl>
> 1  0    a      0
> 2  2.30 b      3.10
> 3  5.10 a      4.5
> # results in an error
> random_without_headers %>% collect()
> Error: Invalid: Could not open CSV input source 'without_header.csv': Invalid: CSV file contained multiple columns named 0
> {code}
> Interestingly, {{read_csv_arrow()}} has the opposite problem. Providing the schema works for CSVs without headers, but not for CSVs with headers, despite the help file saying that providing a schema satisfies both {{col_names}} and {{col_types}}.
>
> {code:java}
> diamonds_read_with_header <- read_csv_arrow('diamonds_with_header.csv', schema=diamond_schema)
> Error: Invalid: In CSV column #0: CSV conversion error to float: invalid value 'carat'
> diamonds_read_without_header <- read_csv_arrow('diamonds_without_header.csv', schema=diamond_schema)
> # reads normally
> random_read_with_header <- read_csv_arrow('random_with_header.csv', schema=random_schema)
> Error: Invalid: In CSV column #0: CSV conversion error to float: invalid value 'A'
> random_read_without_header <- read_csv_arrow('random_without_header.csv', schema=random_schema)
> # reads normally
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)