Re: [R] incomplete reading of a large csv file

2020-02-21 Thread Uwe Ligges




On 21.02.2020 20:10, Christopher W. Ryan wrote:

sessionInfo at end of message.

I have data that I was given as an Excel .xlsx file. It contains 96266
lines and 24 columns. I opened it in OpenOffice.org and saved it in .csv
format, using the pipe character as a field separator. This produced a
file with 96266 lines.

When I read it into R thusly:


skip0.dd <- read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv",

sep = "|", header = TRUE, comment.char = "", skip = 0)

the resulting skip0.dd dataframe has only 58208 lines:


dim(skip0.dd)

[1] 5820824


I've tried a variety of things to troubleshoot. Using head() and tail(),
the expected first and last lines (comparing to the .csv file) do indeed
exist in skip0.dd.  Several arbitrary lines from the "middle" of the csv
file are also present in the skip0.dd dataframe.

I tried reading only the first column, which is integer, but still it
appears that not all lines are read in:


classes <- c(NA, rep("NULL", 23))
skip01.dd <- read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv",

sep = "|", header = TRUE, comment.char = "", skip = 0, colClasses = classes)

dim(skip01.dd)

[1] 582081

Skipping the first 5 lines nominally should give me a dataframe of
46266 lines, or at least one of 5 fewer lines than skip0.dd (i.e.
8208 lines), but it does neither:


skip5.dd <-

read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv", sep = "|",
header = TRUE, comment.char = "", skip = 5)

dim(skip5.dd)

[1] 2217024

Any thoughts on what might be going wrong? Some funky characters from
Excel or OpenOffice.org lurking in the .csv file?


quotes are a typical proiblem, what if you try with arg quote=""?






Perhaps I'd have more success with one of the packages that enables
reading directly from an .xlsx file.

Thanks.

--Chris Ryan
SUNY Upstate Medical University Binghamton Clinical Campus
Broome County Health Department
Binghamton University




sessionInfo()

R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

other attached packages:
[1] dplyr_0.8.3   stringr_1.4.0 Hmisc_4.2-0   ggplot2_3.2.1
[5] Formula_1.2-3 survival_2.44-1.1 lattice_0.20-38

loaded via a namespace (and not attached):
  [1] Rcpp_1.0.1  pillar_1.4.0compiler_3.5.3
  [4] RColorBrewer_1.1-2  tools_3.5.3 base64enc_0.1-3
  [7] digest_0.6.18   zeallot_0.1.0   rpart_4.1-13
[10] checkmate_1.9.3 tibble_2.1.1gtable_0.3.0
[13] htmlTable_1.13.1pkgconfig_2.0.2 rlang_0.4.0
[16] Matrix_1.2-15   rstudioapi_0.10 xfun_0.7
[19] gridExtra_2.3   knitr_1.23  withr_2.1.2
[22] cluster_2.0.7-1 htmlwidgets_1.3 vctrs_0.2.0
[25] grid_3.5.3  nnet_7.3-12 tidyselect_0.2.5
[28] data.table_1.12.2   glue_1.3.1  R6_2.4.0
[31] foreign_0.8-71  latticeExtra_0.6-28 purrr_0.3.2
[34] magrittr_1.5htmltools_0.3.6 backports_1.1.4
[37] scales_1.0.0splines_3.5.3   assertthat_0.2.1
[40] colorspace_1.4-1stringi_1.4.3   acepack_1.4.1
[43] lazyeval_0.2.2  munsell_0.5.0   crayon_1.3.4

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] incomplete reading of a large csv file

2020-02-21 Thread Christopher W. Ryan
sessionInfo at end of message.

I have data that I was given as an Excel .xlsx file. It contains 96266
lines and 24 columns. I opened it in OpenOffice.org and saved it in .csv
format, using the pipe character as a field separator. This produced a
file with 96266 lines.

When I read it into R thusly:

> skip0.dd <- read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv",
sep = "|", header = TRUE, comment.char = "", skip = 0)

the resulting skip0.dd dataframe has only 58208 lines:

> dim(skip0.dd)
[1] 5820824


I've tried a variety of things to troubleshoot. Using head() and tail(),
the expected first and last lines (comparing to the .csv file) do indeed
exist in skip0.dd.  Several arbitrary lines from the "middle" of the csv
file are also present in the skip0.dd dataframe.

I tried reading only the first column, which is integer, but still it
appears that not all lines are read in:

> classes <- c(NA, rep("NULL", 23))
> skip01.dd <- read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv",
sep = "|", header = TRUE, comment.char = "", skip = 0, colClasses = classes)
> dim(skip01.dd)
[1] 582081

Skipping the first 5 lines nominally should give me a dataframe of
46266 lines, or at least one of 5 fewer lines than skip0.dd (i.e.
8208 lines), but it does neither:

> skip5.dd <-
read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv", sep = "|",
header = TRUE, comment.char = "", skip = 5)
> dim(skip5.dd)
[1] 2217024

Any thoughts on what might be going wrong? Some funky characters from
Excel or OpenOffice.org lurking in the .csv file?

Perhaps I'd have more success with one of the packages that enables
reading directly from an .xlsx file.

Thanks.

--Chris Ryan
SUNY Upstate Medical University Binghamton Clinical Campus
Broome County Health Department
Binghamton University



> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

other attached packages:
[1] dplyr_0.8.3   stringr_1.4.0 Hmisc_4.2-0   ggplot2_3.2.1
[5] Formula_1.2-3 survival_2.44-1.1 lattice_0.20-38

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1  pillar_1.4.0compiler_3.5.3
 [4] RColorBrewer_1.1-2  tools_3.5.3 base64enc_0.1-3
 [7] digest_0.6.18   zeallot_0.1.0   rpart_4.1-13
[10] checkmate_1.9.3 tibble_2.1.1gtable_0.3.0
[13] htmlTable_1.13.1pkgconfig_2.0.2 rlang_0.4.0
[16] Matrix_1.2-15   rstudioapi_0.10 xfun_0.7
[19] gridExtra_2.3   knitr_1.23  withr_2.1.2
[22] cluster_2.0.7-1 htmlwidgets_1.3 vctrs_0.2.0
[25] grid_3.5.3  nnet_7.3-12 tidyselect_0.2.5
[28] data.table_1.12.2   glue_1.3.1  R6_2.4.0
[31] foreign_0.8-71  latticeExtra_0.6-28 purrr_0.3.2
[34] magrittr_1.5htmltools_0.3.6 backports_1.1.4
[37] scales_1.0.0splines_3.5.3   assertthat_0.2.1
[40] colorspace_1.4-1stringi_1.4.3   acepack_1.4.1
[43] lazyeval_0.2.2  munsell_0.5.0   crayon_1.3.4

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.