Re: [Rd] read.csv
I was horrified when I saw John Weinstein's article about Excel turning gene names into dates -- mainly because I had been complaining about that phenomenon for years, and it never remotely occurred to me that you could get a publication out of it. I eventually rectified the situation by publishing "Blasted Cell Line Names", describing how to match different researchers' recordings of cell line names by applying techniques for DNA or protein sequence alignment.

Best,
Kevin

On Tue, Apr 16, 2024, 4:51 PM Reed A. Cartwright wrote:
> Gene names being misinterpreted by spreadsheet software (read.csv is
> no different) is a classic issue in bioinformatics. It seems like
> every practitioner ends up encountering this issue in due time. E.g.
>
> https://pubmed.ncbi.nlm.nih.gov/15214961/
> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7
> https://www.nature.com/articles/d41586-021-02211-4
> https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates
>
> On Tue, Apr 16, 2024 at 3:46 AM jing hua zhao wrote:
> >
> > Dear R-developers,
> >
> > I came across a somewhat unexpected behaviour of read.csv() which is trivial but worthwhile to note -- my data involve a protein named "1433E", but to save space I drop the quotes so the file becomes:
> >
> > Gene,SNP,prot,log10p
> > YWHAE,13:62129097_C_T,1433E,7.35
> > YWHAE,4:72617557_T_TA,1433E,7.73
> >
> > Both read.csv() and readr::read_csv() treat the prot(ein) name as numeric 1433 (possibly confused by scientific notation), which only alerted me when I tried to combine data:
> >
> > all_data <- data.frame()
> > for (protein in proteins[1:7])
> > {
> >   cat(protein, ":\n")
> >   f <- paste0(protein, ".csv")
> >   if (file.exists(f))
> >   {
> >     p <- read.csv(f)
> >     print(p)
> >     if (nrow(p) > 0) all_data <- bind_rows(all_data, p)
> >   }
> > }
> >
> > proteins[1:7]
> > [1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"
> >
> > dplyr::bind_rows() failed to work due to incompatible types; nevertheless, rbind() went ahead without warnings.
> >
> > Best wishes,
> >
> > Jing Hua
> >
> > __
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] read.csv
Gene names being misinterpreted by spreadsheet software (read.csv is no different) is a classic issue in bioinformatics. It seems like every practitioner ends up encountering this issue in due time. E.g.

https://pubmed.ncbi.nlm.nih.gov/15214961/
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7
https://www.nature.com/articles/d41586-021-02211-4
https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates

On Tue, Apr 16, 2024 at 3:46 AM jing hua zhao wrote:
>
> Dear R-developers,
>
> I came across a somewhat unexpected behaviour of read.csv() which is trivial but worthwhile to note -- my data involve a protein named "1433E", but to save space I drop the quotes so the file becomes:
>
> Gene,SNP,prot,log10p
> YWHAE,13:62129097_C_T,1433E,7.35
> YWHAE,4:72617557_T_TA,1433E,7.73
>
> Both read.csv() and readr::read_csv() treat the prot(ein) name as numeric 1433 (possibly confused by scientific notation), which only alerted me when I tried to combine data:
>
> all_data <- data.frame()
> for (protein in proteins[1:7])
> {
>   cat(protein, ":\n")
>   f <- paste0(protein, ".csv")
>   if (file.exists(f))
>   {
>     p <- read.csv(f)
>     print(p)
>     if (nrow(p) > 0) all_data <- bind_rows(all_data, p)
>   }
> }
>
> proteins[1:7]
> [1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"
>
> dplyr::bind_rows() failed to work due to incompatible types; nevertheless, rbind() went ahead without warnings.
>
> Best wishes,
>
> Jing Hua
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
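[The misparse described above can be reproduced at the console; a minimal sketch, assuming the two-row file from the original post is saved as "1433E.csv" (a hypothetical filename) and an R version affected by the trailing-exponent quirk discussed later in this thread:]

```r
## The type guess at work: affected R versions accept a bare trailing
## exponent marker, so the string "1433E" scans as the number 1433.
as.numeric("1433E")

## Declaring column types up front sidesteps type guessing entirely:
p <- read.csv("1433E.csv",   # hypothetical per-protein file
              colClasses = c(Gene = "character", SNP = "character",
                             prot = "character", log10p = "numeric"))
```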
Re: [Rd] read.csv
Tangentially, your code will be more efficient if you add the data files to a *list* one by one and then apply bind_rows() or do.call(rbind, ...) after you have accumulated all of the information (see chapter 2 of the _R Inferno_). This may or may not be practically important in your particular case.

Burns, Patrick. 2012. The R Inferno. Lulu.com. http://www.burns-stat.com/pages/Tutor/R_inferno.pdf.

On 2024-04-16 6:46 a.m., jing hua zhao wrote:

Dear R-developers,

I came across a somewhat unexpected behaviour of read.csv() which is trivial but worthwhile to note -- my data involve a protein named "1433E", but to save space I drop the quotes so the file becomes:

Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73

Both read.csv() and readr::read_csv() treat the prot(ein) name as numeric 1433 (possibly confused by scientific notation), which only alerted me when I tried to combine data:

all_data <- data.frame()
for (protein in proteins[1:7])
{
  cat(protein, ":\n")
  f <- paste0(protein, ".csv")
  if (file.exists(f))
  {
    p <- read.csv(f)
    print(p)
    if (nrow(p) > 0) all_data <- bind_rows(all_data, p)
  }
}

proteins[1:7]
[1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"

dplyr::bind_rows() failed to work due to incompatible types; nevertheless, rbind() went ahead without warnings.

Best wishes,

Jing Hua

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
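[The accumulate-in-a-list pattern suggested above can be sketched as follows; `proteins` is assumed to exist as in the original post, so this is an illustration rather than runnable as-is:]

```r
## Collect each file's data frame in a list, then bind once at the end;
## growing a data frame inside a loop copies the whole accumulated
## object on every iteration, which is what makes it slow.
files  <- paste0(proteins[1:7], ".csv")
files  <- files[file.exists(files)]
pieces <- lapply(files, read.csv)
all_data <- do.call(rbind, pieces)   # or dplyr::bind_rows(pieces)
```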
Re: [Rd] read.csv
As an aside, the odd format does not seem to bother data.table::fread(), which also happens to be my personally preferred workhorse for these tasks:

> fname <- "/tmp/r/filename.csv"
> read.csv(fname)
   Gene             SNP prot log10p
1 YWHAE 13:62129097_C_T 1433   7.35
2 YWHAE 4:72617557_T_TA 1433   7.73
> data.table::fread(fname)
    Gene             SNP  prot log10p
1: YWHAE 13:62129097_C_T 1433E   7.35
2: YWHAE 4:72617557_T_TA 1433E   7.73
> readr::read_csv(fname)
Rows: 2 Columns: 4
── Column specification ──
Delimiter: ","
chr (2): Gene, SNP
dbl (2): prot, log10p
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 4
  Gene  SNP              prot log10p
1 YWHAE 13:62129097_C_T  1433   7.35
2 YWHAE 4:72617557_T_TA  1433   7.73
>

That's on Linux, everything current but the dev version of data.table.

Dirk

-- dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] read.csv
On 16/04/2024 7:36 a.m., Rui Barradas wrote:

At 11:46 on 16/04/2024, jing hua zhao wrote:

Dear R-developers,

I came across a somewhat unexpected behaviour of read.csv() which is trivial but worthwhile to note -- my data involve a protein named "1433E", but to save space I drop the quotes so the file becomes:

Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73

Both read.csv() and readr::read_csv() treat the prot(ein) name as numeric 1433 (possibly confused by scientific notation), which only alerted me when I tried to combine data:

all_data <- data.frame()
for (protein in proteins[1:7])
{
  cat(protein, ":\n")
  f <- paste0(protein, ".csv")
  if (file.exists(f))
  {
    p <- read.csv(f)
    print(p)
    if (nrow(p) > 0) all_data <- bind_rows(all_data, p)
  }
}

proteins[1:7]
[1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"

dplyr::bind_rows() failed to work due to incompatible types; nevertheless, rbind() went ahead without warnings.

Best wishes,

Jing Hua

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel

Hello,

I wrote a file with that content and read it back with

read.csv("filename.csv", as.is = TRUE)

There were no problems, it all worked as expected. What platform are you on?

I got the same output as Jing Hua:

Input filename.csv:

Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73

Output:

> read.csv("filename.csv")
   Gene             SNP prot log10p
1 YWHAE 13:62129097_C_T 1433   7.35
2 YWHAE 4:72617557_T_TA 1433   7.73

Duncan Murdoch

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] read.csv
Hum... This boils down to

> as.numeric("1.23e")
[1] 1.23
> as.numeric("1.23e-")
[1] 1.23
> as.numeric("1.23e+")
[1] 1.23

which in turn comes from this code in src/main/util.c (function R_strtod):

    if (*p == 'e' || *p == 'E') {
        int expsign = 1;
        switch(*++p) {
        case '-': expsign = -1;
        case '+': p++;
        default: ;
        }
        for (n = 0; *p >= '0' && *p <= '9'; p++)
            n = (n < MAX_EXPONENT_PREFIX) ? n * 10 + (*p - '0') : n;
        expn += expsign * n;
    }

which sets the exponent to zero even if the for loop terminates immediately. This might qualify as a bug, as it differs from the C function strtod, which accepts "A sequence of digits, optionally containing a decimal-point character (.), optionally followed by an exponent part (an e or E character followed by an optional sign and a sequence of digits)."

[Of course, there would be nothing to stop e.g. "1433E1" from being converted to numeric.]

-pd

> On 16 Apr 2024, at 12:46, jing hua zhao wrote:
>
> Dear R-developers,
>
> I came across a somewhat unexpected behaviour of read.csv() which is trivial but worthwhile to note -- my data involve a protein named "1433E", but to save space I drop the quotes so the file becomes:
>
> Gene,SNP,prot,log10p
> YWHAE,13:62129097_C_T,1433E,7.35
> YWHAE,4:72617557_T_TA,1433E,7.73
>
> Both read.csv() and readr::read_csv() treat the prot(ein) name as numeric 1433 (possibly confused by scientific notation), which only alerted me when I tried to combine data:
>
> all_data <- data.frame()
> for (protein in proteins[1:7])
> {
>   cat(protein, ":\n")
>   f <- paste0(protein, ".csv")
>   if (file.exists(f))
>   {
>     p <- read.csv(f)
>     print(p)
>     if (nrow(p) > 0) all_data <- bind_rows(all_data, p)
>   }
> }
>
> proteins[1:7]
> [1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"
>
> dplyr::bind_rows() failed to work due to incompatible types; nevertheless, rbind() went ahead without warnings.
>
> Best wishes,
>
> Jing Hua
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
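[The dangling-exponent strings Peter shows can be screened out before conversion; a minimal sketch, where strict_numeric is a hypothetical helper (not part of base R) that requires a complete exponent part, matching the C strtod grammar he quotes:]

```r
## Accept only strings that are complete numbers, including a complete
## exponent part if one is present; everything else becomes NA.
strict_numeric <- function(x) {
  ok <- grepl("^[+-]?([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][+-]?[0-9]+)?$", x)
  ifelse(ok, suppressWarnings(as.numeric(x)), NA_real_)
}

strict_numeric(c("1.23e4", "1.23e", "1433E", "7.35"))
## "1.23e" and "1433E" map to NA; the complete numbers convert normally
```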
Re: [Rd] read.csv
At 11:46 on 16/04/2024, jing hua zhao wrote:

Dear R-developers,

I came across a somewhat unexpected behaviour of read.csv() which is trivial but worthwhile to note -- my data involve a protein named "1433E", but to save space I drop the quotes so the file becomes:

Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73

Both read.csv() and readr::read_csv() treat the prot(ein) name as numeric 1433 (possibly confused by scientific notation), which only alerted me when I tried to combine data:

all_data <- data.frame()
for (protein in proteins[1:7])
{
  cat(protein, ":\n")
  f <- paste0(protein, ".csv")
  if (file.exists(f))
  {
    p <- read.csv(f)
    print(p)
    if (nrow(p) > 0) all_data <- bind_rows(all_data, p)
  }
}

proteins[1:7]
[1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"

dplyr::bind_rows() failed to work due to incompatible types; nevertheless, rbind() went ahead without warnings.

Best wishes,

Jing Hua

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel

Hello,

I wrote a file with that content and read it back with

read.csv("filename.csv", as.is = TRUE)

There were no problems, it all worked as expected.

Hope this helps,

Rui Barradas

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] read.csv
On 16 April 2024 at 10:46, jing hua zhao wrote:
| Dear R-developers,
|
| I came across a somewhat unexpected behaviour of read.csv() which is trivial but worthwhile to note -- my data involve a protein named "1433E", but to save space I drop the quotes so the file becomes:
|
| Gene,SNP,prot,log10p
| YWHAE,13:62129097_C_T,1433E,7.35
| YWHAE,4:72617557_T_TA,1433E,7.73
|
| Both read.csv() and readr::read_csv() treat the prot(ein) name as numeric 1433 (possibly confused by scientific notation), which only alerted me when I tried to combine data:
|
| all_data <- data.frame()
| for (protein in proteins[1:7])
| {
|   cat(protein, ":\n")
|   f <- paste0(protein, ".csv")
|   if (file.exists(f))
|   {
|     p <- read.csv(f)
|     print(p)
|     if (nrow(p) > 0) all_data <- bind_rows(all_data, p)
|   }
| }
|
| proteins[1:7]
| [1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"
|
| dplyr::bind_rows() failed to work due to incompatible types; nevertheless, rbind() went ahead without warnings.

You may want to consider aiding read.csv() (and alternate reading functions) by supplying column-type info instead of relying on educated heuristic guesses, which appear to fail here due to the nature of your data.

Other storage formats can store type info. That is generally safer and may be an option too.

I think this was more of an email for r-help than r-devel.

Dirk

-- dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
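[Dirk's advice translates directly into the colClasses / col_types arguments; a sketch assuming the two-row file from the original post is saved as "1433E.csv" (a hypothetical filename):]

```r
## Base R: name the type of each column instead of letting read.csv guess.
p1 <- read.csv("1433E.csv",
               colClasses = c("character", "character", "character", "numeric"))

## readr: the compact col_types string "cccd" means three character
## columns followed by one double column.
p2 <- readr::read_csv("1433E.csv", col_types = "cccd")
```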
[Rd] read.csv
Dear R-developers,

I came across a somewhat unexpected behaviour of read.csv() which is trivial but worthwhile to note -- my data involve a protein named "1433E", but to save space I drop the quotes so the file becomes:

Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73

Both read.csv() and readr::read_csv() treat the prot(ein) name as numeric 1433 (possibly confused by scientific notation), which only alerted me when I tried to combine data:

all_data <- data.frame()
for (protein in proteins[1:7])
{
  cat(protein, ":\n")
  f <- paste0(protein, ".csv")
  if (file.exists(f))
  {
    p <- read.csv(f)
    print(p)
    if (nrow(p) > 0) all_data <- bind_rows(all_data, p)
  }
}

proteins[1:7]
[1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"

dplyr::bind_rows() failed to work due to incompatible types; nevertheless, rbind() went ahead without warnings.

Best wishes,

Jing Hua

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] read.csv quadratic time in number of columns
Dear R-devel,

A number of people have observed anecdotally that read.csv is slow for large numbers of columns, for example:

https://stackoverflow.com/questions/7327851/read-csv-is-extremely-slow-in-reading-csv-files-with-large-numbers-of-columns

I did a systematic comparison of read.csv with similar functions and observed that read.csv is quadratic time (N^2) in the number of columns N, whereas the others are linear (N). Can read.csv be improved to use a linear-time algorithm, so it can handle CSV files with larger numbers of columns?

For more details including figures and session info, please see https://github.com/tdhock/atime/issues/8

Sincerely,

Toby Dylan Hocking

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
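[The scaling claim above is easy to check empirically; a rough sketch (time_read is a hypothetical helper; absolute timings are machine-dependent, so look at the growth rate rather than the numbers):]

```r
## Time read.csv on files with a doubling number of columns; quadratic
## behaviour shows up as roughly 4x time per doubling instead of 2x.
time_read <- function(ncol, nrow = 100) {
  f <- tempfile(fileext = ".csv")
  on.exit(unlink(f))
  write.csv(matrix(1, nrow, ncol), f, row.names = FALSE)
  system.time(read.csv(f))[["elapsed"]]
}

sapply(c(1000, 2000, 4000), time_read)
```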
Re: [Rd] read.csv, worrying behaviour?
I believe this is documented behavior. The 'read.csv' function is a front-end to 'read.table' with different default values. In this particular case, read.csv sets fill = TRUE, which means that it is supposed to fill incomplete lines with NAs. It also sets header = TRUE, which is presumably what it uses to determine the expected number of fields per row.

-- Kevin

On 2/25/2021 4:11 AM, TAYLOR, Benjamin (BLACKPOOL TEACHING HOSPITALS NHS FOUNDATION TRUST) via R-devel wrote:

Dear all

I've been using R for around 16 years now and I've only just become aware of a behaviour of read.csv that I find worrying, which is why I'm contacting this list. A simplified example of the behaviour is as follows. I created a "test.csv" file containing the following lines:

a,b,c,d,e,f,g
1,2,3,4

And then read it into R using:

d = read.csv("test.csv")
d
  a b c d  e  f  g
1 1 2 3 4 NA NA NA

I was surprised that this did not issue a warning. I can understand why the following csv would not issue a warning:

a,b,c,d,e,f,g
1,2,3,4,,,

But the missing commas in the first example? Thoughts from others would be welcome.

Kind regards

Ben

~~
Benjamin M. Taylor, MSci, MSc, PhD
Lead Data Scientist
Blackpool Teaching Hospitals NHS Foundation Trust
Home 15 Whinney Heys Road Blackpool FY3 8NR
Scholar: https://scholar.google.co.uk/citations?user=6Hf0CJkJ=en
Github: https://github.com/bentaylor1
Gitlab: https://gitlab.com/ben_taylor
ORCID: http://orcid.org/-0001-8667-4089

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
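[Kevin's point can be seen by calling read.table() with the defaults that read.csv() supplies; a minimal sketch using the example file's contents inline:]

```r
## read.csv() is read.table() with sep = ",", header = TRUE, fill = TRUE;
## fill = TRUE silently pads the short second row with NAs.
read.table(text = "a,b,c,d,e,f,g\n1,2,3,4",
           sep = ",", header = TRUE, fill = TRUE)

## With fill = FALSE the same input is an error rather than a silent pad.
try(read.table(text = "a,b,c,d,e,f,g\n1,2,3,4",
               sep = ",", header = TRUE, fill = FALSE))
```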
Re: [Rd] read.csv, worrying behaviour?
Dear all

I've been using R for around 16 years now and I've only just become aware of a behaviour of read.csv that I find worrying, which is why I'm contacting this list. A simplified example of the behaviour is as follows. I created a "test.csv" file containing the following lines:

a,b,c,d,e,f,g
1,2,3,4

And then read it into R using:

> d = read.csv("test.csv")
> d
  a b c d  e  f  g
1 1 2 3 4 NA NA NA

I was surprised that this did not issue a warning. I can understand why the following csv would not issue a warning:

a,b,c,d,e,f,g
1,2,3,4,,,

But the missing commas in the first example? Thoughts from others would be welcome.

Kind regards

Ben

~~
Benjamin M. Taylor, MSci, MSc, PhD
Lead Data Scientist
Blackpool Teaching Hospitals NHS Foundation Trust
Home 15 Whinney Heys Road Blackpool FY3 8NR
Scholar: https://scholar.google.co.uk/citations?user=6Hf0CJkJ=en
Github: https://github.com/bentaylor1
Gitlab: https://gitlab.com/ben_taylor
ORCID: http://orcid.org/-0001-8667-4089

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] read.csv reads more rows than indicated by wc -l
Ben,

> Somewhere on my wish/TO DO list is for someone to rewrite read.table for better robustness *and* efficiency ...

Wish granted. New in data.table 1.8.7:

=
New function fread(), a fast and friendly file reader.
* header, skip, nrows, sep and colClasses are all auto detected.
* integers > 2^31 are detected and read natively as bit64::integer64.
* accepts filenames, URLs and "A,B\n1,2\n3,4" directly
* new implementation entirely in C
* with a 50MB .csv, 1 million rows x 6 columns:
    read.csv("test.csv")                                      # 30-60 sec
    read.table("test.csv", all known tricks and known nrows)  # 10 sec
    fread("test.csv")                                         # 3 sec
* airline data: 658MB csv (7 million rows x 29 columns):
    read.table("2008.csv", all known tricks and known nrows)  # 360 sec
    fread("2008.csv")                                         # 50 sec
See ?fread. Many thanks to Chris Neff and Garrett See for ideas, discussions and beta testing.
=

The help page ?fread is fairly well developed:
https://r-forge.r-project.org/scm/viewvc.php/pkg/man/fread.Rd?view=markup&root=datatable

Comments, feedback and bug reports very welcome.

Matthew

http://datatable.r-forge.r-project.org/

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] read.csv reads more rows than indicated by wc -l
G See gsee000 at gmail.com writes:

When I have a csv file that is more than 6 lines long, not including the header, and one of the fields is blank for the last few lines, and there is an extra comma on one of the lines with the blank field, read.csv() creates an extra line. I attached an example file; I'll also paste the contents here:

A,apple
A,orange
A,orange
A,orange
A,orange
A,,,
A,,

wc -l reports that this file has 7 lines:

R> system("wc -l test.csv")
7 test.csv

But, read.csv reads 8:

R> read.csv("test.csv", header=FALSE, stringsAsFactors=FALSE)
  V1     V2
1  A  apple
2  A orange
3  A orange
4  A orange
5  A orange
6  A
7
8  A

If I increase the number of commas at the end of the line, it increases the number of rows. This R command to read a 7 line csv:

read.csv(header=FALSE, text="A,apple
A,orange
A,orange
A,orange
A,orange
A,
A,,")

will produce this:

  V1     V2
1  A  apple
2  A orange
3  A orange
4  A orange
5  A orange
6  A
7
8
9  A

But if the file has fewer than 7 lines, it doesn't increase the number of rows. This R command to read a 6 line csv:

read.csv(header=FALSE, text="A,apple
A,orange
A,orange
A,orange
A,
A,,")

will produce this:

  V1     V2 V3 V4 V5 V6
1  A  apple NA NA NA NA
2  A orange NA NA NA NA
3  A orange NA NA NA NA
4  A orange NA NA NA NA
5  A        NA NA NA NA
6  A        NA NA NA NA

Is this intended behavior?

Thanks,
Garrett See

[snip]

I don't know if it's exactly *intended* or not, but I think it's more or less as [IMPLICITLY] documented. From ?read.table:

The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of 'col.names' if it is specified and is longer. This could conceivably be wrong if 'fill' or 'blank.lines.skip' are true, so specify 'col.names' if necessary (as in the 'Examples').

txt <- "A,apple
A,orange
A,orange
A,orange
A,orange
A,
A,,"
read.csv(header=FALSE, text=txt)

What is happening here is that (1) read.table determines from the first five lines that there are two columns; (2) when it gets to line six, it reads each set of two fields as a separate row.

If you try

read.csv(header=FALSE, text=txt, fill=FALSE, blank.lines.skip=FALSE)

you at least get an error. But it gets worse:

txt2 <- "A,apple
A,orange
A,orange
A,orange
A,orange
A,b,c,d,e,f
A,g"
read.csv(header=FALSE, text=txt2, fill=FALSE, blank.lines.skip=FALSE)

produces bad results even though fill=FALSE and blank.lines.skip=FALSE ... Even specifying col.names explicitly doesn't help:

read.csv(header=FALSE, text=txt2, col.names=paste0("V", 1:2))

At least count.fields() does detect a problem ...

count.fields(textConnection(txt2), sep=",")

Somewhere on my wish/TO DO list is for someone to rewrite read.table for better robustness *and* efficiency ...

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
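[count.fields() can be promoted from a diagnostic into a pre-flight check; a minimal sketch, where check_csv is a hypothetical wrapper (not part of base R):]

```r
## Refuse to read a file whose rows do not all have the same field count,
## instead of letting read.csv pad or wrap ragged rows silently.
check_csv <- function(f, sep = ",") {
  nf <- count.fields(f, sep = sep)
  if (length(unique(nf)) > 1L)
    stop("ragged rows: field counts ", paste(unique(nf), collapse = ", "))
  read.csv(f)
}
```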
Re: [Rd] read.csv behaviour
Mehmet Suzen msuzen at mango-solutions.com writes:

This might be obvious but I was wondering if anyone knows a quick and easy way of writing out a CSV file with varying row lengths, ideally an initial data read from a CSV file which has the same format. See example below.

writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "3,a,i,j",
             "4,a,b,c",
             "5,d,e,f",
             "6,g,h,i,j,k,l,m,n"), con=file("test.csv"))
X <- read.csv("test.csv")

It's not that pretty, but something like

tmpf <- function(x) paste(x[nzchar(x)], collapse=",")
writeLines(apply(as.matrix(X), 1, tmpf), con="outfile.csv")

might work.

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] read.csv behaviour
This might be obvious but I was wondering if anyone knows a quick and easy way of writing out a CSV file with varying row lengths, ideally an initial data read from a CSV file which has the same format. See example below. I found it quite strange that R cannot write it in one go, so one must append blocks or post-process the file; is this true? (even Ruby can do it!!) Otherwise it puts ",," or similar for missing column values in the shorter rows, and the fill=FALSE option does not work! I don't want to post-process if possible. See this post:

http://r.789695.n4.nabble.com/Re-read-csv-trap-td3301924.html

Example that generated an error:

writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "3,a,i,j",
             "4,a,b,c",
             "5,d,e,f",
             "6,g,h,i,j,k,l,m,n"), con=file("test.csv"))
read.csv("test.csv")
try(read.csv("test.csv", fill=FALSE))

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] read.csv and FileEncoding in Windows version of R 2.13.0
Hello Duncan,

Thank you very much for your reply. The file is attached. Again, the issue is that opening this UTF-8 encoded file under R 2.13.0 yields an error, but opening it under R 2.12.2 works without any issues. The command I used to open the file is:

read.csv("test.csv", fileEncoding="UTF-8", header=FALSE)

(As you'll see, the file does have a byte order mark.)

Regards,
Alex

-----Original Message-----
From: Duncan Murdoch [mailto:murdoch.dun...@gmail.com]
Sent: Wednesday, June 01, 2011 7:35 PM
To: Alexander Peterhansl
Cc: R-devel@r-project.org
Subject: Re: [Rd] read.csv and FileEncoding in Windows version of R 2.13.0

On 01/06/2011 6:00 PM, Alexander Peterhansl wrote:

Dear R-devel List:

read.csv() seems to have changed in R version 2.13.0 as compared to version 2.12.2 when reading in simple CSV files. Suppose I read in a 2-column CSV file (test.csv), say

1, a
2, b

If the file is encoded as UTF-8 (on Windows 7), then under R 2.13.0

That file could be pure ASCII, or could include a byte order mark. I tried both, and I didn't get the error you saw. So I think I need to see the file to diagnose this. Could you put it in a .zip file and email it to me?

Duncan Murdoch

read.csv("test.csv", fileEncoding="UTF-8", header=FALSE)

yields the following output:

  V1
1  ?

Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, : invalid input found on input connection 'test.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'test.csv'

Under R 2.12.2 it runs problem-free and yields the expected:

  V1 V2
1  1  a
2  2  b

Please help.

Regards,
Alex

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] read.csv and FileEncoding in Windows version of R 2.13.0
Dear R-devel List:

read.csv() seems to have changed in R version 2.13.0 as compared to version 2.12.2 when reading in simple CSV files. Suppose I read in a 2-column CSV file (test.csv), say

1, a
2, b

If the file is encoded as UTF-8 (on Windows 7), then under R 2.13.0

read.csv("test.csv", fileEncoding="UTF-8", header=FALSE)

yields the following output:

  V1
1  ?

Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, : invalid input found on input connection 'test.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'test.csv'

Under R 2.12.2 it runs problem-free and yields the expected:

  V1 V2
1  1  a
2  2  b

Please help.

Regards,
Alex

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] read.csv and FileEncoding in Windows version of R 2.13.0
On 01/06/2011 6:00 PM, Alexander Peterhansl wrote:

Dear R-devel List:

read.csv() seems to have changed in R version 2.13.0 as compared to version 2.12.2 when reading in simple CSV files. Suppose I read in a 2-column CSV file (test.csv), say

1, a
2, b

If the file is encoded as UTF-8 (on Windows 7), then under R 2.13.0

That file could be pure ASCII, or could include a byte order mark. I tried both, and I didn't get the error you saw. So I think I need to see the file to diagnose this. Could you put it in a .zip file and email it to me?

Duncan Murdoch

read.csv("test.csv", fileEncoding="UTF-8", header=FALSE)

yields the following output:

  V1
1  ?

Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, : invalid input found on input connection 'test.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'test.csv'

Under R 2.12.2 it runs problem-free and yields the expected:

  V1 V2
1  1  a
2  2  b

Please help.

Regards,
Alex

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
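[For files written by Windows editors that prepend a byte-order mark, later R versions (3.0.0 onwards, to my understanding) accept a dedicated encoding name that strips it; a minimal sketch, not applicable to the R 2.13.0 discussed above:]

```r
## "UTF-8-BOM" removes the byte-order mark that a plain "UTF-8"
## fileEncoding can pass through as a garbage character in field one.
read.csv("test.csv", fileEncoding = "UTF-8-BOM", header = FALSE)
```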
Re: [Rd] read.csv trap
Ben Bolker bbolker at gmail.com writes:

> On 02/11/2011 03:37 PM, Laurent Gatto wrote:
> > On 11 February 2011 19:39, Ben Bolker bbolker at gmail.com wrote:

[snip]

Bump. Is there any opinion about this from R-core? Will I be scolded if I submit this as a bug ...?

What is dangerous/confusing is that R silently **wraps** longer lines if fill=TRUE (which is the default for read.csv). I encountered this when working with a colleague on a long, messy CSV file that had some phantom extra fields in some rows, which then turned into empty lines in the data frame.

[snip snip]

Here is an example and a workaround that runs count.fields on the whole file to find the maximum column length and set col.names accordingly. (It assumes you don't already have a file named test.csv in your working directory ...)

I haven't dug in to try to write a patch for this -- I wanted to test the waters and see what people thought first, and I realize that read.table() is a very complicated piece of code that embodies a lot of tradeoffs, so there could be lots of different approaches to trying to mitigate this problem. I appreciate very much how hard it is to write a robust and general function to read data files, but I also think it's really important to minimize the number of traps in read.table(), which will often be the first part of R that new users encounter ...

A quick fix for this might be to allow the number of lines analyzed for length to be settable by the user, or to allow a settable 'maxcols' parameter, although those would only help in the case where the user already knows there is a problem.

cheers
Ben Bolker

===
writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "3,a,i,j",
             "4,a,b,c",
             "5,d,e,f",
             "6,g,h,i,j,k,l,m,n"), con=file("test.csv"))
read.csv("test.csv")
try(read.csv("test.csv", fill=FALSE))

## assumes header=TRUE, fill=TRUE; should be a little more careful
## with comment, quote arguments (possibly explicit)
## ... contains information about quote, comment.char, sep
Read.csv <- function(fn, sep=",", ...)
{
  colnames <- scan(fn, nlines=1, what=character(), sep=sep, ...)
  ncolnames <- length(colnames)
  maxcols <- max(count.fields(fn, sep=sep, ...))
  if (maxcols > ncolnames) {
    colnames <- c(colnames, paste("V", (ncolnames+1):maxcols, sep=""))
  }
  ## assumes you don't have any other columns labeled V[large number]
  read.csv(fn, ..., col.names=colnames)
}

Read.csv("test.csv")

__ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] read.csv trap
Bump. It's been a week since I posted this to r-devel. Any thoughts/discussion? Would R-core be irritated if I submitted a bug report? cheers Ben Original Message Subject: read.csv trap Date: Fri, 04 Feb 2011 11:16:36 -0500 From: Ben Bolker bbol...@gmail.com To: r-de...@stat.math.ethz.ch r-de...@stat.math.ethz.ch, David Earn e...@math.mcmaster.ca This is not specifically a bug, but an (implicitly/obscurely) documented behavior of read.csv (or read.table with fill=TRUE) that can be quite dangerous/confusing for users. I would love to hear some discussion from other users and/or R-core about this ... As always, I apologize if I have missed some obvious workaround or reason that this is actually the desired behavior ... In a nutshell, when fill=TRUE R guesses the number of columns from the first 5 rows of the data set. That's fine, and ?read.table documents this: The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of ‘col.names’ if it is specified and is longer. This could conceivably be wrong if ‘fill’ or ‘blank.lines.skip’ are true, so specify ‘col.names’ if necessary. What is dangerous/confusing is that R silently **wraps** longer lines if fill=TRUE (which is the default for read.csv). I encountered this when working with a colleague on a long, messy CSV file that had some phantom extra fields in some rows, which then turned into empty lines in the data frame. Here is an example and a workaround that runs count.fields on the whole file to find the maximum column length and set col.names accordingly. (It assumes you don't already have a file named test.csv in your working directory ...) 
I haven't dug in to try to write a patch for this -- I wanted to test the waters and see what people thought first, and I realize that read.table() is a very complicated piece of code that embodies a lot of tradeoffs, so there could be lots of different approaches to trying to mitigate this problem. I appreciate very much how hard it is to write a robust and general function to read data files, but I also think it's really important to minimize the number of traps in read.table(), which will often be the first part of R that new users encounter ...

A quick fix for this might be to allow the number of lines analyzed for length to be settable by the user, or to allow a settable 'maxcols' parameter, although those would only help in the case where the user already knows there is a problem.

cheers
Ben Bolker

===
writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "3,a,i,j",
             "4,a,b,c",
             "5,d,e,f",
             "6,g,h,i,j,k,l,m,n"), con = file("test.csv"))
read.csv("test.csv")
try(read.csv("test.csv", fill = FALSE))

## assumes header=TRUE, fill=TRUE; should be a little more careful
## with comment, quote arguments (possibly explicit)
## ... contains information about quote, comment.char, sep
Read.csv <- function(fn, sep = ",", ...) {
    colnames <- scan(fn, nlines = 1, what = "character", sep = sep, ...)
    ncolnames <- length(colnames)
    maxcols <- max(count.fields(fn, sep = sep, ...))
    if (maxcols > ncolnames) {
        colnames <- c(colnames, paste("V", (ncolnames + 1):maxcols, sep = ""))
    }
    ## assumes you don't have any other columns labeled V[large number]
    read.csv(fn, ..., col.names = colnames)
}
Read.csv("test.csv")

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
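[Editorial aside: a rough Python sketch of the same workaround idea — scan the whole file for the maximum field count, pad the header, then parse — so over-long rows widen the table instead of silently wrapping. read_padded is an illustrative name, not an API from the thread:]

```python
import csv
import io

def read_padded(text):
    """Parse CSV text, padding the header and short rows to the widest row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    maxcols = max(len(r) for r in rows)
    # pad the header with V<k> names, as the R workaround does
    header = header + ["V%d" % k for k in range(len(header) + 1, maxcols + 1)]
    # pad short data rows with empty fields
    data = [r + [""] * (maxcols - len(r)) for r in data]
    return header, data

header, data = read_padded("A,B,C,D\n1,a,b,c\n6,g,h,i,j,k,l,m,n\n")
print(header)   # ['A', 'B', 'C', 'D', 'V5', 'V6', 'V7', 'V8', 'V9']
print(data[0])  # ['1', 'a', 'b', 'c', '', '', '', '', '']
```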
Re: [Rd] read.csv trap
On 2/11/11 1:39 PM, Ben Bolker <bbol...@gmail.com> wrote:

[snip]

-------- Original Message --------
Subject: read.csv trap
Date: Fri, 04 Feb 2011 11:16:36 -0500
From: Ben Bolker <bbol...@gmail.com>
To: r-de...@stat.math.ethz.ch, David Earn <e...@math.mcmaster.ca>

[snip] What is dangerous/confusing is that R silently **wraps** longer lines if fill=TRUE (which is the default for read.csv). [snip]

Based on your description, I would be very irritated if I encountered the behavior you describe. I would consider it a bug, though my opinion doesn't necessarily count for much.

--
Ken Williams
Senior Research Scientist
Thomson Reuters
Phone: 651-848-7712
ken.willi...@thomsonreuters.com
http://labs.thomsonreuters.com
Re: [Rd] read.csv trap
On 11 February 2011 19:39, Ben Bolker <bbol...@gmail.com> wrote:

[snip] What is dangerous/confusing is that R silently **wraps** longer lines if fill=TRUE (which is the default for read.csv). I encountered this when working with a colleague on a long, messy CSV file that had some phantom extra fields in some rows, which then turned into empty lines in the data frame.

As a matter of fact, this is exactly what happened to a colleague of mine yesterday and caused her quite a bit of trouble. On the other hand, it could also be considered a 'bug' in the csv file. Although no formal specification exists for the csv format, RFC 4180 [1] indicates that 'each line should contain the same number of fields throughout the file'.

[1] http://tools.ietf.org/html/rfc4180

Best wishes,
Laurent

Here is an example and a workaround that runs count.fields on the whole file to find the maximum column length and set col.names accordingly. (It assumes you don't already have a file named test.csv in your working directory ...) I haven't dug in to try to write a patch for this -- I wanted to test the waters and see what people thought first, and I realize that read.table() is a very complicated piece of code that embodies a lot of tradeoffs, so there could be lots of different approaches to trying to mitigate this problem. I appreciate very much how hard it is to write a robust and general function to read data files, but I also think it's really important to minimize the number of traps in read.table(), which will often be the first part of R that new users encounter ... A quick fix for this might be to allow the number of lines analyzed for length to be settable by the user, or to allow a settable 'maxcols' parameter, although those would only help in the case where the user already knows there is a problem.
cheers
Ben Bolker

===
writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "3,a,i,j",
             "4,a,b,c",
             "5,d,e,f",
             "6,g,h,i,j,k,l,m,n"), con = file("test.csv"))
read.csv("test.csv")
try(read.csv("test.csv", fill = FALSE))

## assumes header=TRUE, fill=TRUE; should be a little more careful
## with comment, quote arguments (possibly explicit)
## ... contains information about quote, comment.char, sep
Read.csv <- function(fn, sep = ",", ...) {
    colnames <- scan(fn, nlines = 1, what = "character", sep = sep, ...)
    ncolnames <- length(colnames)
    maxcols <- max(count.fields(fn, sep = sep, ...))
    if (maxcols > ncolnames) {
        colnames <- c(colnames, paste("V", (ncolnames + 1):maxcols, sep = ""))
    }
    ## assumes you don't have any other columns labeled V[large number]
    read.csv(fn, ..., col.names = colnames)
}
Read.csv("test.csv")

--
[ Laurent Gatto | slashhome.be ]
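[Editorial aside: a quick way to apply the RFC 4180 "same number of fields" expectation mentioned above is to validate field counts before parsing. A minimal Python sketch — ragged() is an illustrative helper, not part of any standard API:]

```python
import csv
import io

def ragged(text):
    """True if records have differing field counts (violating RFC 4180)."""
    counts = {len(row) for row in csv.reader(io.StringIO(text)) if row}
    return len(counts) > 1

print(ragged("A,B\n1,2\n"))    # False: every record has 2 fields
print(ragged("A,B\n1,2,3\n"))  # True: a phantom extra field
```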
Re: [Rd] read.csv trap
On 02/11/2011 03:37 PM, Laurent Gatto wrote:

On 11 February 2011 19:39, Ben Bolker <bbol...@gmail.com> wrote: [snip] What is dangerous/confusing is that R silently **wraps** longer lines if fill=TRUE (which is the default for read.csv). I encountered this when working with a colleague on a long, messy CSV file that had some phantom extra fields in some rows, which then turned into empty lines in the data frame.

As a matter of fact, this is exactly what happened to a colleague of mine yesterday and caused her quite a bit of trouble. On the other hand, it could also be considered a 'bug' in the csv file. Although no formal specification exists for the csv format, RFC 4180 [1] indicates that 'each line should contain the same number of fields throughout the file'. [1] http://tools.ietf.org/html/rfc4180 Best wishes, Laurent

Asserting that the bug is in the CSV file is logically consistent, but if this is true then the fill=TRUE argument (which is only needed when the lines contain different numbers of fields) should not be allowed.

I had never seen RFC 4180 before -- interesting! I note especially points 5-7, which define the handling of double quotation marks (but say nothing about single quotes or using backslashes as escape characters).

Dealing with read.[table|csv] seems a bit of an Augean task (http://en.wikipedia.org/wiki/Augeas). (Hmmm, maybe I should write a parallel document to Burns's _Inferno_ ...)

cheers
Ben

Here is an example and a workaround that runs count.fields on the whole file to find the maximum column length and set col.names accordingly. (It assumes you don't already have a file named test.csv in your working directory ...)
I haven't dug in to try to write a patch for this -- I wanted to test the waters and see what people thought first, and I realize that read.table() is a very complicated piece of code that embodies a lot of tradeoffs, so there could be lots of different approaches to trying to mitigate this problem. I appreciate very much how hard it is to write a robust and general function to read data files, but I also think it's really important to minimize the number of traps in read.table(), which will often be the first part of R that new users encounter ...

A quick fix for this might be to allow the number of lines analyzed for length to be settable by the user, or to allow a settable 'maxcols' parameter, although those would only help in the case where the user already knows there is a problem.

cheers
Ben Bolker

===
writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "3,a,i,j",
             "4,a,b,c",
             "5,d,e,f",
             "6,g,h,i,j,k,l,m,n"), con = file("test.csv"))
read.csv("test.csv")
try(read.csv("test.csv", fill = FALSE))

## assumes header=TRUE, fill=TRUE; should be a little more careful
## with comment, quote arguments (possibly explicit)
## ... contains information about quote, comment.char, sep
Read.csv <- function(fn, sep = ",", ...) {
    colnames <- scan(fn, nlines = 1, what = "character", sep = sep, ...)
    ncolnames <- length(colnames)
    maxcols <- max(count.fields(fn, sep = sep, ...))
    if (maxcols > ncolnames) {
        colnames <- c(colnames, paste("V", (ncolnames + 1):maxcols, sep = ""))
    }
    ## assumes you don't have any other columns labeled V[large number]
    read.csv(fn, ..., col.names = colnames)
}
Read.csv("test.csv")
[Rd] read.csv trap
This is not specifically a bug, but an (implicitly/obscurely) documented behavior of read.csv (or read.table with fill=TRUE) that can be quite dangerous/confusing for users. I would love to hear some discussion from other users and/or R-core about this ... As always, I apologize if I have missed some obvious workaround or reason that this is actually the desired behavior ...

In a nutshell: when fill=TRUE, R guesses the number of columns from the first 5 rows of the data set. That's fine, and ?read.table documents this:

    The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of ‘col.names’ if it is specified and is longer. This could conceivably be wrong if ‘fill’ or ‘blank.lines.skip’ are true, so specify ‘col.names’ if necessary.

What is dangerous/confusing is that R silently **wraps** longer lines if fill=TRUE (which is the default for read.csv). I encountered this when working with a colleague on a long, messy CSV file that had some phantom extra fields in some rows, which then turned into empty lines in the data frame.

Here is an example and a workaround that runs count.fields on the whole file to find the maximum column length and set col.names accordingly. (It assumes you don't already have a file named test.csv in your working directory ...)

I haven't dug in to try to write a patch for this -- I wanted to test the waters and see what people thought first, and I realize that read.table() is a very complicated piece of code that embodies a lot of tradeoffs, so there could be lots of different approaches to trying to mitigate this problem. I appreciate very much how hard it is to write a robust and general function to read data files, but I also think it's really important to minimize the number of traps in read.table(), which will often be the first part of R that new users encounter ...
A quick fix for this might be to allow the number of lines analyzed for length to be settable by the user, or to allow a settable 'maxcols' parameter, although those would only help in the case where the user already knows there is a problem.

cheers
Ben Bolker

===
writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "3,a,i,j",
             "4,a,b,c",
             "5,d,e,f",
             "6,g,h,i,j,k,l,m,n"), con = file("test.csv"))
read.csv("test.csv")
try(read.csv("test.csv", fill = FALSE))

## assumes header=TRUE, fill=TRUE; should be a little more careful
## with comment, quote arguments (possibly explicit)
## ... contains information about quote, comment.char, sep
Read.csv <- function(fn, sep = ",", ...) {
    colnames <- scan(fn, nlines = 1, what = "character", sep = sep, ...)
    ncolnames <- length(colnames)
    maxcols <- max(count.fields(fn, sep = sep, ...))
    if (maxcols > ncolnames) {
        colnames <- c(colnames, paste("V", (ncolnames + 1):maxcols, sep = ""))
    }
    ## assumes you don't have any other columns labeled V[large number]
    read.csv(fn, ..., col.names = colnames)
}
Read.csv("test.csv")
[Rd] read.csv('/dev/stdin') fails (PR#14218)
Full_Name: Eric Goldlust
Version: 2.10.1 (2009-12-14) x86_64-unknown-linux-gnu
OS: Linux 2.6.9-67.0.1.ELsmp x86_64
Submission from: (NULL) (64.22.160.1)

After upgrading from 2.9.1 to 2.10.1, I get unexpected results when calling read.csv('/dev/stdin'). These problems go away when I call read.csv(pipe('cat /dev/stdin')). Shell session follows (bash):

~$ echo -e "a,b,c\n1,2,3" | Rscript <(echo "read.csv('/dev/stdin')")
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  no lines available in input
Calls: read.csv -> read.table
Execution halted

~$ echo -e "a,b,c\n1,2,3" | Rscript <(echo "read.csv(pipe('cat /dev/stdin'))")
  a b c
1 1 2 3

Note that this code worked fine for me in 2.9.1.
[Rd] read.csv confused by newline characters in header (PR#14103)
Full_Name: George Russell
Version: 2.10.0
OS: Microsoft Windows XP Service Pack 2
Submission from: (NULL) (217.111.3.131)

The following code (typed into R --vanilla)

testString <- '"B1\nB2"\n1\n'
con <- textConnection(testString)
tab <- read.csv(con, stringsAsFactors = FALSE)

produces a data frame with one row and one column; the name of the column is "B1.B2" (alright so far). However, according to

print(tab[[1,1]])

the value of the entry in the first row and first column is "B2\n1\n". So B2 has somehow got into both the names of the data frame and its entry. Either R is confused or I am. What is going on?
Re: [Rd] read.csv confused by newline characters in header (PR#14103)
g.russ...@eos-solutions.com wrote:

Full_Name: George Russell. Version: 2.10.0. OS: Microsoft Windows XP Service Pack 2. Submission from: (NULL) (217.111.3.131). The following code (typed into R --vanilla): testString <- '"B1\nB2"\n1\n'; con <- textConnection(testString); tab <- read.csv(con, stringsAsFactors = FALSE) produces a data frame with one row and one column; the name of the column is "B1.B2" (alright so far). However, according to print(tab[[1,1]]) the value of the entry in the first row and first column is "B2\n1\n". So B2 has somehow got into both the names of the data frame and its entry. Either R is confused or I am. What is going on?

Presumably, read.table is not obeying quotes when removing what it thinks is the header line. Another variation is this:

tab <- read.table(stdin(), head=T)
0: "B1
0: B2"
1: 1
2:
tab
  B1.B2
1    B2
2     1

It's somehow connected to the pushBack(c(lines, lines), file) bits in readtable.R, but I don't quite get it.

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalga...@biostat.ku.dk)              FAX: (+45) 35327907
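[Editorial aside: as a point of comparison for the pushBack issue, Python's csv module keeps an embedded newline inside the quoted header field rather than letting it leak into the data rows. A minimal sketch:]

```python
import csv
import io

# The same input as in the bug report: a quoted header field that
# contains a newline, followed by one data line.
text = '"B1\nB2"\n1\n'
rows = list(csv.reader(io.StringIO(text)))
# The embedded newline stays inside the first field:
print(rows)  # [['B1\nB2'], ['1']]
```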
Re: [Rd] read.csv
On Sun, Jun 14, 2009 at 02:56:01PM -0400, Gabor Grothendieck wrote:

If read.csv's colClasses= argument is NOT used then read.csv accepts double quoted numerics:

read.csv(stdin())
0: A,B
1: "1","1"
2: "2","2"
3:
  A B
1 1 1
2 2 2

However, if colClasses is used then it seems that it does not:

read.csv(stdin(), colClasses = "numeric")
0: A,B
1: "1","1"
2: "2","2"
3:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  scan() expected 'a real', got '"1"'

Is this really intended? I would have expected that a csv file in which each field is surrounded with double quotes is acceptable in both cases. This may be documented as is, yet seems undesirable from both a consistency viewpoint and the viewpoint that it should be possible to double quote fields in a csv file.

The problem is not specific to read.csv(). The same difference appears for read.table().

read.table(stdin())
"1" "1"
"2" "2"

#   V1 V2
# 1  1  1
# 2  2  2

but

read.table(stdin(), colClasses = "numeric")
"1" "1"
"2" "2"

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  scan() expected 'a real', got '"1"'

The error occurs in the call of scan() at line 152 in src/library/utils/R/readtable.R, which is

data <- scan(file = file, what = what, sep = sep, quote = quote, ...

(This is the third call of scan() in the source code of read.table().) In this call, scan() gets the types of the columns in the what argument. If the type is specified, scan() performs the conversion itself and fails if a numeric field is quoted. If the type is not specified, the output of scan() is of type character, but with quotes eliminated if there are some in the input file. Columns with unknown type are then converted using type.convert(), which receives the data already without quotes.
The call of type.convert() is contained in a cycle

for (i in (1L:cols)[do]) {
    data[[i]] <-
        if (is.na(colClasses[i]))
            type.convert(data[[i]], as.is = as.is[i], dec = dec,
                         na.strings = character(0L))
        ## as na.strings have already been converted to NA
        else if (colClasses[i] == "factor") as.factor(data[[i]])
        else if (colClasses[i] == "Date") as.Date(data[[i]])
        else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]])
        else methods::as(data[[i]], colClasses[i])
}

which also contains lines that could perform conversion for columns with a specified type, but these lines are not used, since the vector do is defined as

do <- keep & !known

where known determines for which columns the type is known. It is possible to modify the code so that scan() is called with all types unspecified and leave the conversion to the lines

else if (colClasses[i] == "factor") as.factor(data[[i]])
else if (colClasses[i] == "Date") as.Date(data[[i]])
else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]])
else methods::as(data[[i]], colClasses[i])

above.
Since this solution is already prepared in the code, the patch is very simple:

--- R-devel/src/library/utils/R/readtable.R	2009-05-18 17:53:08.000000000 +0200
+++ R-devel-readtable/src/library/utils/R/readtable.R	2009-06-25 10:20:06.000000000 +0200
@@ -143,9 +143,6 @@
     names(what) <- col.names
     colClasses[colClasses %in% c("real", "double")] <- "numeric"
-    known <- colClasses %in%
-        c("logical", "integer", "numeric", "complex", "character")
-    what[known] <- sapply(colClasses[known], do.call, list(0))
     what[colClasses %in% "NULL"] <- list(NULL)
     keep <- !sapply(what, is.null)
@@ -189,7 +186,7 @@
         stop(gettextf("'as.is' has the wrong length %d != cols = %d",
                       length(as.is), cols), domain = NA)
-    do <- keep & !known # !as.is
+    do <- keep & !as.is
     if(rlabp) do[1L] <- FALSE # don't convert row.names
     for (i in (1L:cols)[do]) {
         data[[i]] <-

(Also in attachment.)

I did a test as follows:

d1 <- read.table(stdin())
1 TRUE 3.5
2 NA 0.1
NA FALSE 0.1
3 TRUE NA

sapply(d1, typeof)
#      V1      V2     V3
# integer logical double

is.na(d1)
#         V1    V2    V3
# [1,] FALSE FALSE FALSE
# [2,] FALSE  TRUE FALSE
# [3,]  TRUE FALSE FALSE
# [4,] FALSE FALSE  TRUE

d2 <- read.table(stdin(), colClasses = c("integer", "logical", "double"))
1 TRUE 3.5
2 NA 0.1
NA FALSE 0.1
3 TRUE NA

sapply(d2, typeof)
#      V1      V2     V3
# integer logical double

is.na(d2)
#         V1    V2    V3
# [1,] FALSE FALSE FALSE
# [2,] FALSE  TRUE FALSE
# [3,]  TRUE FALSE FALSE
# [4,] FALSE FALSE  TRUE

I think there was a reason to let scan() perform the type conversion; for example, it may be more efficient. So, if correct, the above patch is a possible solution, but some other may be more appropriate. In particular, function scan() may be modified to remove
Re: [Rd] read.csv
I am sorry for not including the attachment mentioned in my previous email. Attached now.

Petr.

--- R-devel/src/library/utils/R/readtable.R	2009-05-18 17:53:08.000000000 +0200
+++ R-devel-readtable/src/library/utils/R/readtable.R	2009-06-25 10:20:06.000000000 +0200
@@ -143,9 +143,6 @@
     names(what) <- col.names
     colClasses[colClasses %in% c("real", "double")] <- "numeric"
-    known <- colClasses %in%
-        c("logical", "integer", "numeric", "complex", "character")
-    what[known] <- sapply(colClasses[known], do.call, list(0))
     what[colClasses %in% "NULL"] <- list(NULL)
     keep <- !sapply(what, is.null)
@@ -189,7 +186,7 @@
         stop(gettextf("'as.is' has the wrong length %d != cols = %d",
                       length(as.is), cols), domain = NA)
-    do <- keep & !known # !as.is
+    do <- keep & !as.is
     if(rlabp) do[1L] <- FALSE # don't convert row.names
     for (i in (1L:cols)[do]) {
         data[[i]] <-
Re: [Rd] read.csv
On Sun, Jun 14, 2009 at 09:21:24PM +0100, Ted Harding wrote:

On 14-Jun-09 18:56:01, Gabor Grothendieck wrote: If read.csv's colClasses= argument is NOT used then read.csv accepts double quoted numerics:

read.csv(stdin())
0: A,B
1: "1","1"
2: "2","2"
3:
  A B
1 1 1
2 2 2

However, if colClasses is used then it seems that it does not:

read.csv(stdin(), colClasses = "numeric")
0: A,B
1: "1","1"
2: "2","2"
3:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  scan() expected 'a real', got '"1"'

Is this really intended? I would have expected that a csv file in which each field is surrounded with double quotes is acceptable in both cases. This may be documented as is, yet seems undesirable from both a consistency viewpoint and the viewpoint that it should be possible to double quote fields in a csv file.

Well, the default for colClasses is NA, for which ?read.csv says:

    [...] Possible values are 'NA' (when 'type.convert' is used), [...]

and then ?type.convert says:

    This is principally a helper function for 'read.table'. Given a character vector, it attempts to convert it to logical, integer, numeric or complex, and failing that converts it to factor unless 'as.is = TRUE'. The first type that can accept all the non-missing values is chosen.

It would seem that type 'logical' won't accept integer (naively one might expect 1 --> TRUE, but see experiment below), so the first acceptable type for "1" is integer, and that is what happens. So it is indeed documented (in the R[ecursive] sense of documented :))

However, presumably when colClasses is used then type.convert() is not called, in which case R sees itself being asked to assign a character entity to a destination which it has been told shall be integer. And since the default for as.is is

    as.is = !stringsAsFactors

but ?read.csv says that stringsAsFactors "is overridden bu [sic] 'as.is' and 'colClasses', both of which allow finer control", that wouldn't come to the rescue either.
Experiment:

X <- logical(10)
class(X)
# [1] "logical"
X[1] <- 1
X
# [1] 1 0 0 0 0 0 0 0 0 0
class(X)
# [1] "numeric"

so R has converted X from class 'logical' to class 'numeric' on being asked to assign a number to a logical; but in this case its hands were not tied by colClasses. Or am I missing something?!!

In my opinion, you explain how it happens that there is a difference in the behavior between read.csv(stdin(), colClasses = "numeric") and read.csv(stdin()), but not why it is so. The algorithm "use the smallest type which accepts all the non-missing values" may well be applied to the input values either literally or after removing the quotes. Is there a reason why read.csv(stdin(), colClasses = "numeric") removes quotes from the input values while read.csv(stdin()) does not?

Using double-quote characters is part of the definition of a CSV file; see, for example, http://en.wikipedia.org/wiki/Comma_separated_values where one may find: "Fields may always be enclosed within double-quote characters, whether necessary or not."

Petr.
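[Editorial aside: the consistency Petr argues for — strip the quotes first, apply any requested type conversion second — is, for what it's worth, how Python's csv layer behaves. A minimal sketch:]

```python
import csv
import io

text = '"A","B"\n"1",1\n"2",2\n'
rows = list(csv.reader(io.StringIO(text)))
# The parser has already removed the quotes, so an explicit numeric
# conversion sees '1', never '"1"', and succeeds either way:
nums = [[float(x) for x in row] for row in rows[1:]]
print(nums)  # [[1.0, 1.0], [2.0, 2.0]]
```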
[Rd] read.csv
If read.csv's colClasses= argument is NOT used then read.csv accepts double quoted numerics:

read.csv(stdin())
0: A,B
1: "1","1"
2: "2","2"
3:
  A B
1 1 1
2 2 2

However, if colClasses is used then it seems that it does not:

read.csv(stdin(), colClasses = "numeric")
0: A,B
1: "1","1"
2: "2","2"
3:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  scan() expected 'a real', got '"1"'

Is this really intended? I would have expected that a csv file in which each field is surrounded with double quotes is acceptable in both cases. This may be documented as is, yet seems undesirable from both a consistency viewpoint and the viewpoint that it should be possible to double quote fields in a csv file.
Re: [Rd] read.csv
On 14-Jun-09 18:56:01, Gabor Grothendieck wrote:

If read.csv's colClasses= argument is NOT used then read.csv accepts double quoted numerics:

read.csv(stdin())
0: A,B
1: "1","1"
2: "2","2"
3:
  A B
1 1 1
2 2 2

However, if colClasses is used then it seems that it does not:

read.csv(stdin(), colClasses = "numeric")
0: A,B
1: "1","1"
2: "2","2"
3:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  scan() expected 'a real', got '"1"'

Is this really intended? I would have expected that a csv file in which each field is surrounded with double quotes is acceptable in both cases. This may be documented as is, yet seems undesirable from both a consistency viewpoint and the viewpoint that it should be possible to double quote fields in a csv file.

Well, the default for colClasses is NA, for which ?read.csv says:

    [...] Possible values are 'NA' (when 'type.convert' is used), [...]

and then ?type.convert says:

    This is principally a helper function for 'read.table'. Given a character vector, it attempts to convert it to logical, integer, numeric or complex, and failing that converts it to factor unless 'as.is = TRUE'. The first type that can accept all the non-missing values is chosen.

It would seem that type 'logical' won't accept integer (naively one might expect 1 --> TRUE, but see experiment below), so the first acceptable type for "1" is integer, and that is what happens. So it is indeed documented (in the R[ecursive] sense of documented :))

However, presumably when colClasses is used then type.convert() is not called, in which case R sees itself being asked to assign a character entity to a destination which it has been told shall be integer. And since the default for as.is is

    as.is = !stringsAsFactors

but ?read.csv says that stringsAsFactors "is overridden bu [sic] 'as.is' and 'colClasses', both of which allow finer control", that wouldn't come to the rescue either.
Experiment:

X <- logical(10)
class(X)
# [1] "logical"
X[1] <- 1
X
# [1] 1 0 0 0 0 0 0 0 0 0
class(X)
# [1] "numeric"

so R has converted X from class 'logical' to class 'numeric' on being asked to assign a number to a logical; but in this case its hands were not tied by colClasses. Or am I missing something?!!

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) ted.hard...@manchester.ac.uk
Fax-to-email: +44 (0)870 094 0861
Date: 14-Jun-09  Time: 21:21:22
------------------------------ XFMail ------------------------------
Re: [Rd] read.csv
On Sun, Jun 14, 2009 at 4:21 PM, Ted Harding <ted.hard...@manchester.ac.uk> wrote:

Or am I missing something?!!

The point of this is that the current behavior is not desirable, since you can't have quoted numeric fields if you specify colClasses = "numeric", yet you can if you don't. The concepts are not orthogonal but should be. Whether or not you specify colClasses, numeric fields ought to be treated the same way; if the documentation says otherwise, that further means there is a problem with the design.

One could define one's own type, quotedNumeric, as a workaround (see below), but I think it would be better if specifying "numeric" or not specifying it had the same effect. The way it is now, the concepts are intertwined and not orthogonal.

library(methods)
setClass("quotedNumeric")
setAs("character", "quotedNumeric",
      function(from) as.numeric(gsub("\"", "", from)))
Lines <- 'A,B
"1",1
"2",2'
read.csv(textConnection(Lines), colClasses = c("quotedNumeric", "numeric"))