Re: [Rd] read.csv

2024-04-27 Thread Kevin Coombes
I was horrified when I saw John Weinstein's article about Excel turning
gene names into dates. Mainly because I had been complaining about that
phenomenon for years, and it never remotely occurred to me that you could
get a publication out of it.

I eventually rectified the situation by publishing "Blasted Cell Line
Names", describing how to match different researchers' recording of the
names of cell lines, by applying techniques for DNA or protein sequence
alignment.

Best,
   Kevin

On Tue, Apr 16, 2024, 4:51 PM Reed A. Cartwright 
wrote:

> Gene names being misinterpreted by spreadsheet software (read.csv is
> no different) is a classic issue in bioinformatics. It seems like
> every practitioner ends up encountering this issue in due time. E.g.
>
> https://pubmed.ncbi.nlm.nih.gov/15214961/
>
> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7
>
> https://www.nature.com/articles/d41586-021-02211-4
>
>
> https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates
>
>
> On Tue, Apr 16, 2024 at 3:46 AM jing hua zhao 
> wrote:
> >
> > Dear R-developers,
> >
> > I came to a somewhat unexpected behaviour of read.csv() which is trivial
> but worthwhile to note -- my data involves a protein named "1433E" but to
> save space I drop the quote so it becomes,
> >
> > Gene,SNP,prot,log10p
> > YWHAE,13:62129097_C_T,1433E,7.35
> > YWHAE,4:72617557_T_TA,1433E,7.73
> >
> > Both read.csv() and readr::read_csv() consider the prot(ein) name as
> > (possibly confused by scientific notation) numeric 1433, which only alerted
> > me when I tried to combine data,
> >
> > all_data <- data.frame()
> > for (protein in proteins[1:7])
> > {
> >cat(protein,":\n")
> >f <- paste0(protein,".csv")
> >if(file.exists(f))
> >{
> >  p <- read.csv(f)
> >  print(p)
> >  if(nrow(p)>0) all_data  <- bind_rows(all_data,p)
> >}
> > }
> >
> > proteins[1:7]
> > [1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"
> >
> > dplyr::bind_rows() failed to work due to incompatible types, whereas
> > rbind() went ahead without warnings.
> >
> > Best wishes,
> >
> >
> > Jing Hua
> >
> > __
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] read.csv

2024-04-16 Thread Reed A. Cartwright
Gene names being misinterpreted by spreadsheet software (read.csv is
no different) is a classic issue in bioinformatics. It seems like
every practitioner ends up encountering this issue in due time. E.g.

https://pubmed.ncbi.nlm.nih.gov/15214961/

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7

https://www.nature.com/articles/d41586-021-02211-4

https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates
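The failure mode those links describe can be screened for mechanically. Below is a small, hypothetical R helper (the function name and regex are my own, not from this thread) that flags values in a gene-symbol column which look like the dates Excel silently converts symbols such as SEPT1 or MARCH1 into:

```r
# Hypothetical helper: flag gene symbols that look like Excel-style date
# conversions, e.g. SEPT1 -> "1-Sep" or "Sep-01".
months <- "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
looks_date_mangled <- function(x)
  grepl(sprintf("^([0-9]{1,2}-(%s)|(%s)-[0-9]{1,2})$", months, months), x)

genes <- c("TP53", "1-Sep", "Dec-01", "MARCH1")
genes[looks_date_mangled(genes)]
#> [1] "1-Sep"  "Dec-01"
```

Running such a check right after import is cheap insurance before any joins on the gene column.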


On Tue, Apr 16, 2024 at 3:46 AM jing hua zhao  wrote:
>
> Dear R-developers,
>
> I came to a somewhat unexpected behaviour of read.csv() which is trivial but 
> worthwhile to note -- my data involves a protein named "1433E" but to save 
> space I drop the quote so it becomes,
>
> Gene,SNP,prot,log10p
> YWHAE,13:62129097_C_T,1433E,7.35
> YWHAE,4:72617557_T_TA,1433E,7.73
>
> Both read.csv() and readr::read_csv() consider the prot(ein) name as (possibly
> confused by scientific notation) numeric 1433, which only alerted me when I
> tried to combine data,
>
> all_data <- data.frame()
> for (protein in proteins[1:7])
> {
>cat(protein,":\n")
>f <- paste0(protein,".csv")
>if(file.exists(f))
>{
>  p <- read.csv(f)
>  print(p)
>  if(nrow(p)>0) all_data  <- bind_rows(all_data,p)
>}
> }
>
> proteins[1:7]
> [1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"
>
> dplyr::bind_rows() failed to work due to incompatible types, whereas
> rbind() went ahead without warnings.
>
> Best wishes,
>
>
> Jing Hua
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



Re: [Rd] read.csv

2024-04-16 Thread Ben Bolker
  Tangentially, your code will be more efficient if you add the data 
files to a *list* one by one and then apply bind_rows or 
do.call(rbind,...) after you have accumulated all of the information 
(see chapter 2 of the _R Inferno_). This may or may not be practically 
important in your particular case.


Burns, Patrick. 2012. The R Inferno. Lulu.com. 
http://www.burns-stat.com/pages/Tutor/R_inferno.pdf.
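A sketch of the pattern Ben recommends, assuming the same `proteins` vector and per-protein CSV files as in the quoted code (`chunks` is my own name):

```r
# Accumulate the pieces in a list, then bind once at the end; this avoids
# re-copying the growing data frame on every iteration of the loop.
chunks <- list()
for (protein in proteins[1:7]) {
  f <- paste0(protein, ".csv")
  if (file.exists(f)) {
    p <- read.csv(f)
    if (nrow(p) > 0) chunks[[protein]] <- p
  }
}
all_data <- do.call(rbind, chunks)   # or dplyr::bind_rows(chunks)
```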



On 2024-04-16 6:46 a.m., jing hua zhao wrote:

Dear R-developers,

I came to a somewhat unexpected behaviour of read.csv() which is trivial but worthwhile 
to note -- my data involves a protein named "1433E" but to save space I drop 
the quote so it becomes,

Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73

Both read.csv() and readr::read_csv() consider the prot(ein) name as (possibly 
confused by scientific notation) numeric 1433, which only alerted me when I tried 
to combine data,

all_data <- data.frame()
for (protein in proteins[1:7])
{
cat(protein,":\n")
f <- paste0(protein,".csv")
if(file.exists(f))
{
  p <- read.csv(f)
  print(p)
  if(nrow(p)>0) all_data  <- bind_rows(all_data,p)
}
}

proteins[1:7]
[1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"

dplyr::bind_rows() failed to work due to incompatible types, whereas 
rbind() went ahead without warnings.

Best wishes,


Jing Hua





Re: [Rd] read.csv

2024-04-16 Thread Dirk Eddelbuettel


As an aside, the odd format does not seem to bother data.table::fread() which
also happens to be my personally preferred workhorse for these tasks:

> fname <- "/tmp/r/filename.csv"
> read.csv(fname)
   Gene SNP prot log10p
1 YWHAE 13:62129097_C_T 1433   7.35
2 YWHAE 4:72617557_T_TA 1433   7.73
> data.table::fread(fname)
     Gene             SNP   prot log10p
   <char>          <char> <char>  <num>
1:  YWHAE 13:62129097_C_T  1433E   7.35
2:  YWHAE 4:72617557_T_TA  1433E   7.73
> readr::read_csv(fname)
Rows: 2 Columns: 4
── Column specification 
──
Delimiter: ","
chr (2): Gene, SNP
dbl (2): prot, log10p

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this 
message.
# A tibble: 2 × 4
  Gene  SNP              prot log10p
  <chr> <chr>           <dbl>  <dbl>
1 YWHAE 13:62129097_C_T  1433   7.35
2 YWHAE 4:72617557_T_TA  1433   7.73
> 

That's on Linux, everything current but dev version of data.table.

Dirk

-- 
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] read.csv

2024-04-16 Thread Duncan Murdoch

On 16/04/2024 7:36 a.m., Rui Barradas wrote:

Às 11:46 de 16/04/2024, jing hua zhao escreveu:

Dear R-developers,

I came to a somewhat unexpected behaviour of read.csv() which is trivial but worthwhile 
to note -- my data involves a protein named "1433E" but to save space I drop 
the quote so it becomes,

Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73

Both read.csv() and readr::read_csv() consider the prot(ein) name as (possibly 
confused by scientific notation) numeric 1433, which only alerted me when I tried 
to combine data,

all_data <- data.frame()
for (protein in proteins[1:7])
{
 cat(protein,":\n")
 f <- paste0(protein,".csv")
 if(file.exists(f))
 {
   p <- read.csv(f)
   print(p)
   if(nrow(p)>0) all_data  <- bind_rows(all_data,p)
 }
}

proteins[1:7]
[1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"

dplyr::bind_rows() failed to work due to incompatible types, whereas 
rbind() went ahead without warnings.

Best wishes,


Jing Hua


Hello,

I wrote a file with that content and read it back with


read.csv("filename.csv", as.is = TRUE)


There were no problems, it all worked as expected.


What platform are you on?  I got the same output as Jing Hua:

Input filename.csv:

Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73

Output:

> read.csv("filename.csv")
   Gene SNP prot log10p
1 YWHAE 13:62129097_C_T 1433   7.35
2 YWHAE 4:72617557_T_TA 1433   7.73

Duncan Murdoch



Re: [Rd] read.csv

2024-04-16 Thread peter dalgaard
Hum...

This boils down to

> as.numeric("1.23e")
[1] 1.23
> as.numeric("1.23e-")
[1] 1.23
> as.numeric("1.23e+")
[1] 1.23

which in turn comes from this code in src/main/util.c (function R_strtod)

if (*p == 'e' || *p == 'E') {
int expsign = 1;
switch(*++p) {
case '-': expsign = -1;
case '+': p++;
default: ;
}
for (n = 0; *p >= '0' && *p <= '9'; p++) n = (n < MAX_EXPONENT_PREFIX) 
? n * 10 + (*p - '0') : n;
expn += expsign * n;
}

which sets the exponent to zero even if the for loop terminates immediately.  

This might qualify as a bug, as it differs from the C function strtod which 
accepts

"A sequence of digits, optionally containing a decimal-point character (.), 
optionally followed by an exponent part (an e or E character followed by an 
optional sign and a sequence of digits)."

[Of course, there would be nothing to stop e.g. "1433E1" from being converted 
to numeric.]
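The boundary between the two grammars can be seen from R itself (a sketch; `strict` is my own illustrative function, approximating the C strtod rule that digits must follow the exponent marker):

```r
# R_strtod treats a bare exponent marker (with or without sign) as exponent zero:
as.numeric(c("1433", "1433E", "1433E+", "1433E-", "1433E1"))
#> [1]  1433  1433  1433  1433 14330

# A parser following the C strtod grammar would reject the middle three:
strict <- function(x)
  ifelse(grepl("^[0-9]+([.][0-9]*)?([eE][+-]?[0-9]+)?$", x),
         as.numeric(x), NA)
strict(c("1433", "1433E", "1433E1"))
#> [1]  1433    NA 14330
```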

-pd


> On 16 Apr 2024, at 12:46 , jing hua zhao  wrote:
> 
> Dear R-developers,
> 
> I came to a somewhat unexpected behaviour of read.csv() which is trivial but 
> worthwhile to note -- my data involves a protein named "1433E" but to save 
> space I drop the quote so it becomes,
> 
> Gene,SNP,prot,log10p
> YWHAE,13:62129097_C_T,1433E,7.35
> YWHAE,4:72617557_T_TA,1433E,7.73
> 
> Both read.csv() and readr::read_csv() consider the prot(ein) name as (possibly 
> confused by scientific notation) numeric 1433, which only alerted me when I 
> tried to combine data,
> 
> all_data <- data.frame()
> for (protein in proteins[1:7])
> {
>   cat(protein,":\n")
>   f <- paste0(protein,".csv")
>   if(file.exists(f))
>   {
> p <- read.csv(f)
> print(p)
> if(nrow(p)>0) all_data  <- bind_rows(all_data,p)
>   }
> }
> 
> proteins[1:7]
> [1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"
> 
> dplyr::bind_rows() failed to work due to incompatible types, whereas 
> rbind() went ahead without warnings.
> 
> Best wishes,
> 
> 
> Jing Hua
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com



Re: [Rd] read.csv

2024-04-16 Thread Rui Barradas

Às 11:46 de 16/04/2024, jing hua zhao escreveu:

Dear R-developers,

I came to a somewhat unexpected behaviour of read.csv() which is trivial but worthwhile 
to note -- my data involves a protein named "1433E" but to save space I drop 
the quote so it becomes,

Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73

Both read.csv() and readr::read_csv() consider the prot(ein) name as (possibly 
confused by scientific notation) numeric 1433, which only alerted me when I tried 
to combine data,

all_data <- data.frame()
for (protein in proteins[1:7])
{
cat(protein,":\n")
f <- paste0(protein,".csv")
if(file.exists(f))
{
  p <- read.csv(f)
  print(p)
  if(nrow(p)>0) all_data  <- bind_rows(all_data,p)
}
}

proteins[1:7]
[1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"

dplyr::bind_rows() failed to work due to incompatible types, whereas 
rbind() went ahead without warnings.

Best wishes,


Jing Hua


Hello,

I wrote a file with that content and read it back with


read.csv("filename.csv", as.is = TRUE)


There were no problems, it all worked as expected.

Hope this helps,

Rui Barradas







Re: [Rd] read.csv

2024-04-16 Thread Dirk Eddelbuettel


On 16 April 2024 at 10:46, jing hua zhao wrote:
| Dear R-developers,
| 
| I came to a somewhat unexpected behaviour of read.csv() which is trivial but 
worthwhile to note -- my data involves a protein named "1433E" but to save 
space I drop the quote so it becomes,
| 
| Gene,SNP,prot,log10p
| YWHAE,13:62129097_C_T,1433E,7.35
| YWHAE,4:72617557_T_TA,1433E,7.73
| 
| Both read.csv() and readr::read_csv() consider the prot(ein) name as (possibly 
| confused by scientific notation) numeric 1433, which only alerted me when I tried 
| to combine data,
| 
| all_data <- data.frame()
| for (protein in proteins[1:7])
| {
|cat(protein,":\n")
|f <- paste0(protein,".csv")
|if(file.exists(f))
|{
|  p <- read.csv(f)
|  print(p)
|  if(nrow(p)>0) all_data  <- bind_rows(all_data,p)
|}
| }
| 
| proteins[1:7]
| [1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"
| 
| dplyr::bind_rows() failed to work due to incompatible types, whereas 
| rbind() went ahead without warnings.

You may want to consider aiding read.csv() (and alternate reading
functions) by supplying column-type info instead of relying on educated
heuristic guesses, which appear to fail here due to the nature of your data.

Other storage formats can store type info. That is generally safer and may be
an option too.
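A minimal sketch of that advice, assuming the four-column layout from the original post (`f` is the filename as in the quoted loop):

```r
# Declare the column types up front so "1433E" is never guessed as numeric.
p <- read.csv(f, colClasses = c(Gene = "character", SNP = "character",
                                prot = "character", log10p = "numeric"))

# readr equivalent: compact col_types spec, c = character, d = double.
p2 <- readr::read_csv(f, col_types = "cccd")
```

Either form also silences readr's column-specification message, since nothing is being guessed.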

I think this was more of an email for r-help than r-devel.

Dirk

-- 
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org



[Rd] read.csv

2024-04-16 Thread jing hua zhao
Dear R-developers,

I came to a somewhat unexpected behaviour of read.csv() which is trivial but 
worthwhile to note -- my data involves a protein named "1433E" but to save 
space I drop the quote so it becomes,

Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73

Both read.csv() and readr::read_csv() consider the prot(ein) name as (possibly 
confused by scientific notation) numeric 1433, which only alerted me when I tried 
to combine data,

all_data <- data.frame()
for (protein in proteins[1:7])
{
   cat(protein,":\n")
   f <- paste0(protein,".csv")
   if(file.exists(f))
   {
 p <- read.csv(f)
 print(p)
 if(nrow(p)>0) all_data  <- bind_rows(all_data,p)
   }
}

proteins[1:7]
[1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"

dplyr::bind_rows() failed to work due to incompatible types, whereas 
rbind() went ahead without warnings.

Best wishes,


Jing Hua



[Rd] read.csv quadratic time in number of columns

2023-03-29 Thread Toby Hocking
Dear R-devel,
A number of people have observed anecdotally that read.csv is slow for
large number of columns, for example:
https://stackoverflow.com/questions/7327851/read-csv-is-extremely-slow-in-reading-csv-files-with-large-numbers-of-columns
I did a systematic comparison of read.csv with similar functions, and
observed that read.csv is quadratic time (N^2) in the number of columns N,
whereas the others are linear (N).
Can read.csv be improved to use a linear time algorithm, so it can handle
CSV files with larger numbers of columns?
For more details including figures and session info, please see
https://github.com/tdhock/atime/issues/8
Sincerely,
Toby Dylan Hocking
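A rough way to see the trend locally (a sketch using only base R; exact timings depend on machine and R version):

```r
# Time read.csv on files with a doubling number of columns; quadratic
# scaling shows up as elapsed time roughly quadrupling per doubling.
for (p in c(1000, 2000, 4000)) {
  f <- tempfile(fileext = ".csv")
  write.csv(as.data.frame(matrix(0, nrow = 10, ncol = p)), f, row.names = FALSE)
  print(c(ncol = p, elapsed = unname(system.time(read.csv(f))["elapsed"])))
}
```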




Re: [Rd] read.csv, worrying behaviour?

2021-02-25 Thread Kevin R. Coombes
I believe this is documented behavior. The 'read.csv' function is a 
front-end to 'read.table' with different default values. In this 
particular case, read.csv sets fill = TRUE, which means that it is 
supposed to fill incomplete lines with NAs. It also sets header = TRUE, 
which is presumably what it uses to determine the expected length of 
a line-row.

  -- Kevin
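That reading of the defaults can be checked directly (a sketch; it writes a throwaway test.csv into the working directory):

```r
# fill = TRUE (read.csv's default) silently pads the short row with NAs;
# fill = FALSE turns the same input into an error instead.
writeLines(c("a,b,c,d,e,f,g", "1,2,3,4"), "test.csv")
read.csv("test.csv")                     # columns e, f, g become NA, no warning
try(read.csv("test.csv", fill = FALSE))  # error: line 1 did not have 7 elements
```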

On 2/25/2021 4:11 AM, TAYLOR, Benjamin (BLACKPOOL TEACHING HOSPITALS NHS 
FOUNDATION TRUST) via R-devel wrote:

Dear all

I've been using R for around 16 years now and I've only just become aware of a 
behaviour of read.csv that I find worrying, which is why I'm contacting this 
list. A simplified example of the behaviour follows.

I created a "test.csv" file containing the following lines:

a,b,c,d,e,f,g
1,2,3,4

And then read it into R using:


d = read.csv("test.csv")
d

   a b c d  e  f  g
1 1 2 3 4 NA NA NA

I was surprised that this did not issue a warning. I can understand why the 
following csv would not issue a warning:

a,b,c,d,e,f,g
1,2,3,4,,,

But the missing commas in the first example? Thoughts from others would be 
welcome.

Kind regards

Ben


~~

Benjamin M. Taylor, MSci, MSc, PhD
Lead Data Scientist
Blackpool Teaching Hospitals NHS Foundation Trust
Home 15
Whinney Heys Road
Blackpool
FY3 8NR

Scholar: https://scholar.google.co.uk/citations?user=6Hf0CJkJ=en
Github: https://github.com/bentaylor1
Gitlab: https://gitlab.com/ben_taylor
ORCID: http://orcid.org/-0001-8667-4089









Re: [Rd] read.csv, worrying behaviour?

2021-02-25 Thread TAYLOR, Benjamin (BLACKPOOL TEACHING HOSPITALS NHS FOUNDATION TRUST) via R-devel
Dear all

I've been using R for around 16 years now and I've only just become aware of a 
behaviour of read.csv that I find worrying, which is why I'm contacting this 
list. A simplified example of the behaviour follows.

I created a "test.csv" file containing the following lines:

a,b,c,d,e,f,g
1,2,3,4

And then read it into R using:

> d = read.csv("test.csv")
> d
  a b c d  e  f  g
1 1 2 3 4 NA NA NA

I was surprised that this did not issue a warning. I can understand why the 
following csv would not issue a warning:

a,b,c,d,e,f,g
1,2,3,4,,,

But the missing commas in the first example? Thoughts from others would be 
welcome.

Kind regards

Ben


~~

Benjamin M. Taylor, MSci, MSc, PhD
Lead Data Scientist
Blackpool Teaching Hospitals NHS Foundation Trust
Home 15
Whinney Heys Road
Blackpool
FY3 8NR

Scholar: https://scholar.google.co.uk/citations?user=6Hf0CJkJ=en
Github: https://github.com/bentaylor1
Gitlab: https://gitlab.com/ben_taylor
ORCID: http://orcid.org/-0001-8667-4089








Re: [Rd] read.csv reads more rows than indicated by wc -l

2012-12-20 Thread Matthew Dowle


Ben,

Somewhere on my wish/TO DO list is for someone to rewrite read.table for
better robustness *and* efficiency ...


Wish granted. New in data.table 1.8.7 :

=
New function fread(), a fast and friendly file reader.
*  header, skip, nrows, sep and colClasses are all auto detected.
*  integers > 2^31 are detected and read natively as bit64::integer64.
*  accepts filenames, URLs and "A,B\n1,2\n3,4" directly
*  new implementation entirely in C
*  with a 50MB .csv, 1 million rows x 6 columns :
     read.csv("test.csv")                                     # 30-60 sec
     read.table("test.csv", all known tricks and known nrows) # 10 sec
     fread("test.csv")                                        # 3 sec

* airline data: 658MB csv (7 million rows x 29 columns)
     read.table("2008.csv", all known tricks and known nrows) # 360 sec
     fread("2008.csv")                                        # 50 sec
See ?fread. Many thanks to Chris Neff and Garrett See for ideas, 
discussions

and beta testing.
=

The help page ?fread is fairly well developed :
https://r-forge.r-project.org/scm/viewvc.php/pkg/man/fread.Rd?view=markuproot=datatable

Comments, feedback and bug reports very welcome.

Matthew

http://datatable.r-forge.r-project.org/



Re: [Rd] read.csv reads more rows than indicated by wc -l

2012-12-19 Thread Ben Bolker
G See gsee000 at gmail.com writes:

 
 When I have a csv file that is more than 6 lines long, not including
 the header, and one of the fields is blank for the last few lines, and
 there is an extra comma on one of the lines with the blank field,
 read.csv() creates an extra line.
 
 I attached an example file; I'll also paste the contents here:
 
 A,apple
 A,orange
 A,orange
 A,orange
 A,orange
 A,,,
 A,,
 
 -
 wc -l reports that this file has 7 lines
 
 R> system("wc -l test.csv")
 7 test.csv
 
 But, read.csv reads 8.
 
 R> read.csv("test.csv", header=FALSE, stringsAsFactors=FALSE)
   V1 V2
 1  A  apple
 2  A orange
 3  A orange
 4  A orange
 5  A orange
 6  A
 7
 8  A
 
 If I increase the number of commas at the end of the line, it
 increases the number of rows.
 
 This R command to read a 7 line csv:
 
 read.csv(header=FALSE, text="A,apple
 A,orange
 A,orange
 A,orange
 A,orange
 A,
 A,,")
 
 will produce this:
 
   V1 V2
 1  A  apple
 2  A orange
 3  A orange
 4  A orange
 5  A orange
 6  A
 7
 8
 9  A
 
 But if the file has fewer than 7 lines, it doesn't increase the number of 
 rows.
 
 This R command to read a 6 line csv:
 read.csv(header=FALSE, text=A,apple
 A,orange
 A,orange
 A,orange
 A,
 A,,)
 
 will produce this:
 
   V1 V2 V3 V4 V5 V6
 1  A  apple NA NA NA NA
 2  A orange NA NA NA NA
 3  A orange NA NA NA NA
 4  A orange NA NA NA NA
 5  ANA NA NA NA
 6  ANA NA NA NA
 
 Is this intended behavior?
 
 Thanks,
 Garrett See
 
 [snip]

I don't know if it's exactly *intended* or not, but I think it's
more or less as [IMPLICITLY] documented.  From ?read.table,

 The number of data columns is determined by looking at the first
 five lines of input (or the whole file if it has less than five
 lines), or from the length of ‘col.names’ if it is specified and
 is longer.  This could conceivably be wrong if ‘fill’ or
 ‘blank.lines.skip’ are true, so specify ‘col.names’ if necessary
 (as in the ‘Examples’).

txt <- "A,apple
 A,orange
 A,orange
 A,orange
 A,orange
 A,
 A,,"
read.csv(header=FALSE, text=txt)

What is happening here is that
(1) read.table is determining from the first five lines that
there are two columns;
(2) when it gets to line six, it reads each set of two fields as a
separate row

If you try

read.csv(header=FALSE, text=txt, fill=FALSE,blank.lines.skip=FALSE)

you at least get an error.

But it gets worse:

txt2 <- "A,apple
 A,orange
 A,orange
 A,orange
 A,orange
 A,b,c,d,e,f
 A,g"

read.csv(header=FALSE, text=txt2, fill=FALSE,blank.lines.skip=FALSE)

produces bad results even though fill=FALSE and blank.lines.skip=FALSE ...

Even specifying col.names explicitly doesn't help:

read.csv(header=FALSE, text=txt2, col.names=paste0("V",1:2))

At least count.fields() does detect a problem ...

count.fields(textConnection(txt2), sep=",")

Somewhere on my wish/TO DO list is for someone to rewrite read.table for
better robustness *and* efficiency ...



Re: [Rd] read.csv behaviour

2011-09-28 Thread Ben Bolker
Mehmet Suzen msuzen at mango-solutions.com writes:

 This might be obvious but I was wondering if anyone knows quick and easy
 way of writing out a CSV file with varying row lengths, ideally an
 initial data read from a CSV file which has the same format. See example
 below.
 
 writeLines(c("A,B,C,D",
  "1,a,b,c",
  "2,f,g,c",
  "3,a,i,j",
  "4,a,b,c",
  "5,d,e,f",
  "6,g,h,i,j,k,l,m,n"),
con=file("test.csv"))
 

X <- read.csv("test.csv")


  It's not that pretty, but something like


tmpf <- function(x) paste(x[nzchar(x)], collapse=",")
writeLines(apply(as.matrix(X), 1, tmpf), con="outfile.csv")

  might work



[Rd] read.csv behaviour

2011-09-27 Thread Mehmet Suzen

This might be obvious but I was wondering if anyone knows quick and easy
way of writing out a CSV file with varying row lengths, ideally an
initial data read from a CSV file which has the same format. See example
below.


I found it quite strange that R cannot write it in one go, so one must
append blocks or post-process the file, is this true? (even Ruby can do
it!!) 

Otherwise it puts ",," or similar for missing column values in the
shorter length rows, and the fill=FALSE option does not work!

I don't want to post-process if possible.

See this post:
http://r.789695.n4.nabble.com/Re-read-csv-trap-td3301924.html

Example that generated Error!

writeLines(c("A,B,C,D",
 "1,a,b,c",
 "2,f,g,c",
 "3,a,i,j",
 "4,a,b,c",
 "5,d,e,f",
 "6,g,h,i,j,k,l,m,n"),
   con=file("test.csv"))

read.csv("test.csv")
try(read.csv("test.csv", fill=FALSE))



Re: [Rd] read.csv and FileEncoding in Windows version of R 2.13.0

2011-06-06 Thread Alexander Peterhansl

Hello Duncan, thank you very much for your reply.  The file is attached.

Again, the issue is that opening this UTF-8 encoded file under R 2.13.0 yields 
an error, but opening it under R 2.12.2 works without any issues.

The command I used to open the file is:
read.csv("test.csv", fileEncoding="UTF-8", header=FALSE)

(As you'll see, the file does have a byte order mark.)

Regards,
Alex


-Original Message-
From: Duncan Murdoch [mailto:murdoch.dun...@gmail.com] 
Sent: Wednesday, June 01, 2011 7:35 PM
To: Alexander Peterhansl
Cc: R-devel@r-project.org
Subject: Re: [Rd] read.csv and FileEncoding in Windows version of R 2.13.0

On 01/06/2011 6:00 PM, Alexander Peterhansl wrote:

 Dear R-devel List:

 read.csv() seems to have changed in R version 2.13.0 as compared to version 
 2.12.2 when reading in simple CSV files.

 Suppose I read in a 2-column CSV file (test.csv), say
 1, a
 2, b

 If file is encoded as UTF-8 (on Windows 7), then under R 2.13.0

That file could be pure ASCII, or could include a byte order mark.  I tried 
both, and I didn't get the error you saw.  So I think I need to see the file 
to diagnose this.

Could you put it in a .zip file and email it to me?

Duncan Murdoch


 read.csv("test.csv", fileEncoding="UTF-8", header=FALSE) yields the following 
 output
V1
 1  ?
 Warning messages:
 1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
invalid input found on input connection 'test.csv'
 2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
incomplete final line found by readTableHeader on 'test.csv'

 Under R 2.12.2 it runs problem-free and yields the expected:
V1 V2
 1  1  a
 2  2  b

 Please help.

 Regards,
 Alex


 __
 R-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-devel



[Rd] read.csv and FileEncoding in Windows version of R 2.13.0

2011-06-01 Thread Alexander Peterhansl

Dear R-devel List:

read.csv() seems to have changed in R version 2.13.0 as compared to version 
2.12.2 when reading in simple CSV files.

Suppose I read in a 2-column CSV file (test.csv), say
1, a
2, b

If file is encoded as UTF-8 (on Windows 7), then under R 2.13.0
read.csv("test.csv", fileEncoding="UTF-8", header=FALSE) yields the following 
output
  V1
1  ?
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  invalid input found on input connection 'test.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'test.csv'

Under R 2.12.2 it runs problem-free and yields the expected:
  V1 V2
1  1  a
2  2  b

Please help.

Regards,
Alex




Re: [Rd] read.csv and FileEncoding in Windows version of R 2.13.0

2011-06-01 Thread Duncan Murdoch

On 01/06/2011 6:00 PM, Alexander Peterhansl wrote:


Dear R-devel List:

read.csv() seems to have changed in R version 2.13.0 as compared to version 
2.12.2 when reading in simple CSV files.

Suppose I read in a 2-column CSV file (test.csv), say
1, a
2, b

If file is encoded as UTF-8 (on Windows 7), then under R 2.13.0


That file could be pure ASCII, or could include a byte order mark.  I 
tried both, and I didn't get the error your saw.  So I think I need to 
see the file to diagnose this.


Could you put it in a .zip file and email it to me?

Duncan Murdoch



read.csv("test.csv", fileEncoding="UTF-8", header=FALSE) yields the following 
output
   V1
1  ?
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
   invalid input found on input connection 'test.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
   incomplete final line found by readTableHeader on 'test.csv'

Under R 2.12.2 it runs problem-free and yields the expected:
   V1 V2
1  1  a
2  2  b

Please help.

Regards,
Alex


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel




Re: [Rd] read.csv trap

2011-03-03 Thread Ben Bolker
Ben Bolker bbolker at gmail.com writes:

 On 02/11/2011 03:37 PM, Laurent Gatto wrote:
  On 11 February 2011 19:39, Ben Bolker bbolker at gmail.com wrote:
 
  [snip]
 


  Bump.  Is there any opinion about this from R-core??
Will I be scolded if I submit this as a bug ... ??


  What is dangerous/confusing is that R silently **wraps** longer lines if
  fill=TRUE (which is the default for read.csv).  I encountered this when
  working with a colleague on a long, messy CSV file that had some phantom
  extra fields in some rows, which then turned into empty lines in the
  data frame.
 

  [snip snip]

   Here is an example and a workaround that runs count.fields on the
  whole file to find the maximum column length and set col.names
  accordingly.  (It assumes you don't already have a file named test.csv
  in your working directory ...)
 
   I haven't dug in to try to write a patch for this -- I wanted to test
  the waters and see what people thought first, and I realize that
  read.table() is a very complicated piece of code that embodies a lot of
  tradeoffs, so there could be lots of different approaches to trying to
  mitigate this problem. I appreciate very much how hard it is to write a
  robust and general function to read data files, but I also think it's
  really important to minimize the number of traps in read.table(), which
  will often be the first part of R that new users encounter ...
 
   A quick fix for this might be to allow the number of lines analyzed
  for length to be settable by the user, or to allow a settable 'maxcols'
  parameter, although those would only help in the case where the user
  already knows there is a problem.
 
   cheers
 Ben Bolker
 
===
writeLines(c("A,B,C,D",
"1,a,b,c",
"2,f,g,c",
"3,a,i,j",
"4,a,b,c",
"5,d,e,f",
"6,g,h,i,j,k,l,m,n"),
  con=file("test.csv"))
 
 
read.csv("test.csv")
try(read.csv("test.csv", fill=FALSE))
 
## assumes header=TRUE, fill=TRUE; should be a little more careful
##  with comment, quote arguments (possibly explicit)
## ... contains information about quote, comment.char, sep
Read.csv <- function(fn, sep=",", ...) {
 colnames <- scan(fn, nlines=1, what=character(), sep=sep, ...)
 ncolnames <- length(colnames)
 maxcols <- max(count.fields(fn, sep=sep, ...))
 if (maxcols > ncolnames) {
   colnames <- c(colnames, paste("V", (ncolnames+1):maxcols, sep=""))
 }
 ## assumes you don't have any other columns labeled V[large number]
 read.csv(fn, ..., col.names=colnames)
}

Read.csv(test.csv)

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] read.csv trap

2011-02-11 Thread Ben Bolker

  Bump.

  It's been a week since I posted this to r-devel.  Any
thoughts/discussion?  Would R-core be irritated if I submitted a bug report?

  cheers
Ben


 Original Message 
Subject: read.csv trap
Date: Fri, 04 Feb 2011 11:16:36 -0500
From: Ben Bolker bbol...@gmail.com
To: r-de...@stat.math.ethz.ch r-de...@stat.math.ethz.ch,  David Earn
e...@math.mcmaster.ca

  This is not specifically a bug, but an (implicitly/obscurely)
documented behavior of read.csv (or read.table with fill=TRUE) that can
be quite dangerous/confusing for users.  I would love to hear some
discussion from other users and/or R-core about this ...  As always, I
apologize if I have missed some obvious workaround or reason that this
is actually the desired behavior ...

  In a nutshell, when fill=TRUE R guesses the number of columns from the
first 5 rows of the data set.  That's fine, and ?read.table documents this:

   The number of data columns is determined by looking at the first
 five lines of input (or the whole file if it has less than five
 lines), or from the length of ‘col.names’ if it is specified and
 is longer.  This could conceivably be wrong if ‘fill’ or
 ‘blank.lines.skip’ are true, so specify ‘col.names’ if necessary.

What is dangerous/confusing is that R silently **wraps** longer lines if
fill=TRUE (which is the default for read.csv).  I encountered this when
working with a colleague on a long, messy CSV file that had some phantom
extra fields in some rows, which then turned into empty lines in the
data frame.

  Here is an example and a workaround that runs count.fields on the
whole file to find the maximum column length and set col.names
accordingly.  (It assumes you don't already have a file named test.csv
in your working directory ...)

  I haven't dug in to try to write a patch for this -- I wanted to test
the waters and see what people thought first, and I realize that
read.table() is a very complicated piece of code that embodies a lot of
tradeoffs, so there could be lots of different approaches to trying to
mitigate this problem. I appreciate very much how hard it is to write a
robust and general function to read data files, but I also think it's
really important to minimize the number of traps in read.table(), which
will often be the first part of R that new users encounter ...

  A quick fix for this might be to allow the number of lines analyzed
for length to be settable by the user, or to allow a settable 'maxcols'
parameter, although those would only help in the case where the user
already knows there is a problem.

  cheers
Ben Bolker

===
writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "3,a,i,j",
             "4,a,b,c",
             "5,d,e,f",
             "6,g,h,i,j,k,l,m,n"),
           con=file("test.csv"))


read.csv("test.csv")
try(read.csv("test.csv",fill=FALSE))

## assumes header=TRUE, fill=TRUE; should be a little more careful
##  with comment, quote arguments (possibly explicit)
## ... contains information about quote, comment.char, sep
Read.csv <- function(fn,sep=",",...) {
  colnames <- scan(fn,nlines=1,what="character",sep=sep,...)
  ncolnames <- length(colnames)
  maxcols <- max(count.fields(fn,sep=sep,...))
  if (maxcols > ncolnames) {
    colnames <- c(colnames,paste("V",(ncolnames+1):maxcols,sep=""))
  }
  ## assumes you don't have any other columns labeled V[large number]
  read.csv(fn,...,col.names=colnames)
}

Read.csv("test.csv")



Re: [Rd] read.csv trap

2011-02-11 Thread Ken.Williams


On 2/11/11 1:39 PM, Ben Bolker bbol...@gmail.com wrote:

[snip]
 Original Message 
Subject: read.csv trap
Date: Fri, 04 Feb 2011 11:16:36 -0500
From: Ben Bolker bbol...@gmail.com
To: r-de...@stat.math.ethz.ch r-de...@stat.math.ethz.ch,  David Earn
e...@math.mcmaster.ca

[snip]
What is dangerous/confusing is that R silently **wraps** longer lines if
fill=TRUE (which is the default for read.csv).
[snip]


Based on your description, I would be very irritated if I encountered the
behavior you describe.  I would consider it a bug, though my opinion
doesn't necessarily count for much.

--
Ken Williams
Senior Research Scientist
Thomson Reuters
Phone: 651-848-7712
ken.willi...@thomsonreuters.com
http://labs.thomsonreuters.com



Re: [Rd] read.csv trap

2011-02-11 Thread Laurent Gatto
On 11 February 2011 19:39, Ben Bolker bbol...@gmail.com wrote:

[snip]

 What is dangerous/confusing is that R silently **wraps** longer lines if
 fill=TRUE (which is the default for read.csv).  I encountered this when
 working with a colleague on a long, messy CSV file that had some phantom
 extra fields in some rows, which then turned into empty lines in the
 data frame.


As a matter of fact, this is exactly what happened to a colleague of
mine yesterday and caused her quite a bit of trouble. On the other
hand, it could also be considered a 'bug' in the csv file. Although
no formal specification exists for the csv format, RFC 4180 [1]
indicates that 'each line should contain the same number of fields
throughout the file'.

[1] http://tools.ietf.org/html/rfc4180
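Since count.fields() is already the natural tool here, a caller can enforce RFC 4180's same-field-count expectation before reading at all. A small sketch (the helper name is my own, not anything base R or this thread provides):

```r
## Sketch: refuse to read a ragged CSV up front. count.fields() returns
## the number of fields found on each line of the file.
stopifnot_rectangular <- function(fn, sep = ",") {
  nf <- count.fields(fn, sep = sep)
  if (length(unique(nf)) > 1L)
    stop("ragged CSV: lines have ",
         paste(sort(unique(nf)), collapse = ", "), " fields")
  invisible(nf[1L])
}
```

Calling this before read.csv() turns the silent wrapping into an explicit error.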

Best wishes,

Laurent

 [snip]




-- 
[ Laurent Gatto | slashhome.be ]



Re: [Rd] read.csv trap

2011-02-11 Thread Ben Bolker
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 02/11/2011 03:37 PM, Laurent Gatto wrote:
 On 11 February 2011 19:39, Ben Bolker bbol...@gmail.com wrote:

 [snip]

 What is dangerous/confusing is that R silently **wraps** longer lines if
 fill=TRUE (which is the default for read.csv).  I encountered this when
 working with a colleague on a long, messy CSV file that had some phantom
 extra fields in some rows, which then turned into empty lines in the
 data frame.

 
 As a matter of fact, this is exactly what happened to a colleague of
 mine yesterday and caused her quite a bit of trouble. On the other
 hand, it could also be considered as a 'bug' in the csv file. Although
 no formal specification exist for the csv format, RFC 4180 [1]
 indicates that 'each line should contain the same number of fields
 throughout the file'.
 
 [1] http://tools.ietf.org/html/rfc4180
 
 Best wishes,
 
 Laurent

  Asserting that the bug is in the CSV file is logically consistent, but
if this is true then the fill=TRUE argument (which is only needed when
the lines contain different numbers of fields) should not be allowed.

 I had never seen RFC4180 before -- interesting!  I note especially
points 5-7 which define the handling of double quotation marks (but says
nothing about single quotes or using backslashes as escape characters).

  Dealing with read.[table|csv] seems a bit of an Augean task
http://en.wikipedia.org/wiki/Augeas (hmmm, maybe I should write a
parallel document to Burns's _Inferno_ ...)

  cheers
Ben

 
 [snip]




[Rd] read.csv trap

2011-02-04 Thread Ben Bolker
  This is not specifically a bug, but an (implicitly/obscurely)
documented behavior of read.csv (or read.table with fill=TRUE) that can
be quite dangerous/confusing for users.  I would love to hear some
discussion from other users and/or R-core about this ...  As always, I
apologize if I have missed some obvious workaround or reason that this
is actually the desired behavior ...

  In a nutshell, when fill=TRUE R guesses the number of columns from the
first 5 rows of the data set.  That's fine, and ?read.table documents this:

   The number of data columns is determined by looking at the first
 five lines of input (or the whole file if it has less than five
 lines), or from the length of ‘col.names’ if it is specified and
 is longer.  This could conceivably be wrong if ‘fill’ or
 ‘blank.lines.skip’ are true, so specify ‘col.names’ if necessary.

What is dangerous/confusing is that R silently **wraps** longer lines if
fill=TRUE (which is the default for read.csv).  I encountered this when
working with a colleague on a long, messy CSV file that had some phantom
extra fields in some rows, which then turned into empty lines in the
data frame.

  Here is an example and a workaround that runs count.fields on the
whole file to find the maximum column length and set col.names
accordingly.  (It assumes you don't already have a file named test.csv
in your working directory ...)

  I haven't dug in to try to write a patch for this -- I wanted to test
the waters and see what people thought first, and I realize that
read.table() is a very complicated piece of code that embodies a lot of
tradeoffs, so there could be lots of different approaches to trying to
mitigate this problem. I appreciate very much how hard it is to write a
robust and general function to read data files, but I also think it's
really important to minimize the number of traps in read.table(), which
will often be the first part of R that new users encounter ...

  A quick fix for this might be to allow the number of lines analyzed
for length to be settable by the user, or to allow a settable 'maxcols'
parameter, although those would only help in the case where the user
already knows there is a problem.

  cheers
Ben Bolker

===
writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "3,a,i,j",
             "4,a,b,c",
             "5,d,e,f",
             "6,g,h,i,j,k,l,m,n"),
           con=file("test.csv"))


read.csv("test.csv")
try(read.csv("test.csv",fill=FALSE))

## assumes header=TRUE, fill=TRUE; should be a little more careful
##  with comment, quote arguments (possibly explicit)
## ... contains information about quote, comment.char, sep
Read.csv <- function(fn,sep=",",...) {
  colnames <- scan(fn,nlines=1,what="character",sep=sep,...)
  ncolnames <- length(colnames)
  maxcols <- max(count.fields(fn,sep=sep,...))
  if (maxcols > ncolnames) {
    colnames <- c(colnames,paste("V",(ncolnames+1):maxcols,sep=""))
  }
  ## assumes you don't have any other columns labeled V[large number]
  read.csv(fn,...,col.names=colnames)
}

Read.csv("test.csv")



[Rd] read.csv('/dev/stdin') fails (PR#14218)

2010-02-20 Thread egoldlust
Full_Name: Eric Goldlust
Version: 2.10.1 (2009-12-14) x86_64-unknown-linux-gnu 
OS: Linux 2.6.9-67.0.1.ELsmp x86_64
Submission from: (NULL) (64.22.160.1)


After upgrading from 2.9.1 to 2.10.1, I get unexpected results when calling
read.csv('/dev/stdin').  These problems go away when I call read.csv(pipe('cat
/dev/stdin')).

Shell session follows (bash):

~$ echo -e "a,b,c\n1,2,3" | Rscript <(echo "read.csv('/dev/stdin')")
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  no lines available in input
Calls: read.csv -> read.table
Execution halted
~$ echo -e "a,b,c\n1,2,3" | Rscript <(echo "read.csv(pipe('cat /dev/stdin'))")
  a b c
1 1 2 3

Note that this code worked fine for me in 2.9.1.



[Rd] read.csv confused by newline characters in header (PR#14103)

2009-12-02 Thread g . russell
Full_Name: George Russell
Version: 2.10.0
OS: Microsoft Windows XP Service Pack 2
Submission from: (NULL) (217.111.3.131)


The following code (typed into R --vanilla)

testString <- '"B1\nB2"\n"1"\n'
con <- textConnection(testString)
tab <- read.csv(con, stringsAsFactors = FALSE)

produces a data frame with one row and one column; the name of the column
is "B1.B2" (alright so far). However, according to
print(tab[[1,1]])

the value of the entry in the first row and first column is

"B2\n1\n"

So "B2" has somehow got into both the names of the data frame and its entry.
Either R is confused or I am. What is going on?



Re: [Rd] read.csv confused by newline characters in header (PR#14103)

2009-12-02 Thread Peter Dalgaard
g.russ...@eos-solutions.com wrote:
 Full_Name: George Russell
 Version: 2.10.0
 OS: Microsoft Windows XP Service Pack 2
 Submission from: (NULL) (217.111.3.131)
 
 
 The following code (typed into R --vanilla)
 
 testString <- '"B1\nB2"\n"1"\n'
 con <- textConnection(testString)
 tab <- read.csv(con, stringsAsFactors = FALSE)
 
 produces a data frame with one row and one column; the name of the column
 is "B1.B2" (alright so far). However, according to
 print(tab[[1,1]])
 
 the value of the entry in the first row and first column is
 
 "B2\n1\n"
 
 So "B2" has somehow got into both the names of the data frame and its entry.
 Either R is confused or I am. What is going on?

Presumably, read.table is not obeying quotes when removing what it
thinks is the header line. Another variation is this:

> tab <- read.table(stdin(), head=T)
0: "B1
0: B2"
1: 1
2:
> tab
  B1.B2
1    B2
2     1


It's somehow connected to the

pushBack(c(lines, lines), file)

bits in readtable.R, but I don't quite get it.
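If it helps, the line-wise nature of pushBack() can be seen in isolation (a sketch; the connection contents are arbitrary):

```r
## pushBack() stores whole lines; a quoted field that spanned a line
## break in the original record comes back as two separate lines, so
## the quote context is lost on re-reading.
con <- textConnection("unused")
pushBack(c('"B1', 'B2"'), con)
readLines(con, n = 2)   # the two pushed-back lines, quotes still split
close(con)
```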

-- 
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - (p.dalga...@biostat.ku.dk)  FAX: (+45) 35327907



Re: [Rd] read.csv

2009-06-25 Thread Petr Savicky
On Sun, Jun 14, 2009 at 02:56:01PM -0400, Gabor Grothendieck wrote:
 If read.csv's colClasses= argument is NOT used then read.csv accepts
 double quoted numerics:
 
 1: > read.csv(stdin())
 0: "A","B"
 1: "1","1"
 2: "2","2"
 3:
   A B
 1 1 1
 2 2 2
 
 However, if colClasses is used then it seems that it does not:
 
 > read.csv(stdin(), colClasses = "numeric")
 0: "A","B"
 1: "1","1"
 2: "2","2"
 3:
 Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
   scan() expected 'a real', got '"1"'
 
 Is this really intended?  I would have expected that a csv file in which
 each field is surrounded with double quotes is acceptable in both
 cases.   This may be documented as is yet seems undesirable from
 both a consistency viewpoint and the viewpoint that it should be
 possible to double quote fields in a csv file.

The problem is not specific to read.csv(). The same difference appears
for read.table().
  read.table(stdin())
  "1" "1"
  "2" "2"
  
  #   V1 V2
  # 1  1  1
  # 2  2  2
but
  read.table(stdin(), colClasses = "numeric")
  "1" "1"
  "2" "2"
  
  Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'a real', got '"1"'

The error occurs in the call of scan() at line 152 in
src/library/utils/R/readtable.R, which is
  data <- scan(file = file, what = what, sep = sep, quote = quote, ...
(This is the third call of scan() in the source code of read.table().)

In this call, scan() gets the types of the columns in its 'what' argument.
If a type is specified, scan() performs the conversion itself and fails if
a numeric field is quoted. If the type is not specified, the output of
scan() is of type character, but with the quotes eliminated, if there are
some in the input file. Columns with unknown type are then converted using
type.convert(), which receives the data already without quotes.

The call of type.convert() is contained in a cycle
    for (i in (1L:cols)[do]) {
        data[[i]] <-
            if (is.na(colClasses[i]))
                type.convert(data[[i]], as.is = as.is[i], dec = dec,
                             na.strings = character(0L))
            ## as na.strings have already been converted to NA
            else if (colClasses[i] == "factor") as.factor(data[[i]])
            else if (colClasses[i] == "Date") as.Date(data[[i]])
            else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]])
            else methods::as(data[[i]], colClasses[i])
    }
which also contains lines that could perform the conversion for columns with
a specified type, but these lines are not used, since the vector 'do'
is defined as
  do <- keep & !known
where 'known' determines for which columns the type is known.

It is possible to modify the code so that scan() is called with all types
unspecified and leave the conversion to the lines
        else if (colClasses[i] == "factor") as.factor(data[[i]])
        else if (colClasses[i] == "Date") as.Date(data[[i]])
        else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]])
        else methods::as(data[[i]], colClasses[i])
above. Since this solution is already prepared in the code, the patch is very
simple
  --- R-devel/src/library/utils/R/readtable.R 2009-05-18 17:53:08.0 +0200
  +++ R-devel-readtable/src/library/utils/R/readtable.R 2009-06-25 10:20:06.0 +0200
  @@ -143,9 +143,6 @@
       names(what) <- col.names
   
       colClasses[colClasses %in% c("real", "double")] <- "numeric"
  -    known <- colClasses %in%
  -        c("logical", "integer", "numeric", "complex", "character")
  -    what[known] <- sapply(colClasses[known], do.call, list(0))
       what[colClasses %in% "NULL"] <- list(NULL)
       keep <- !sapply(what, is.null)
   
  @@ -189,7 +186,7 @@
          stop(gettextf("'as.is' has the wrong length %d  != cols = %d",
                        length(as.is), cols), domain = NA)
   
  -    do <- keep & !known # & !as.is
  +    do <- keep & !as.is
       if(rlabp) do[1L] <- FALSE # don't convert row.names
       for (i in (1L:cols)[do]) {
           data[[i]] <-
(Also in attachment)

I did a test as follows
  d1 <- read.table(stdin())
  1 TRUE   3.5
  2   NA 0.1
  NA  FALSE  0.1
  3   TRUE NA

  sapply(d1, typeof)
  #V1V2V3 
  # integer logical  double 
  is.na(d1)
  # V1V2V3
  # [1,] FALSE FALSE FALSE
  # [2,] FALSE  TRUE FALSE
  # [3,]  TRUE FALSE FALSE
  # [4,] FALSE FALSE  TRUE
  
  d2 <- read.table(stdin(), colClasses=c("integer", "logical", "double"))
  1 TRUE   3.5
  2   NA 0.1
  NA  FALSE  0.1
  3   TRUE NA

  sapply(d2, typeof)
  #V1V2V3 
  # integer logical  double 
  is.na(d2)
  # V1V2V3
  # [1,] FALSE FALSE FALSE
  # [2,] FALSE  TRUE FALSE
  # [3,]  TRUE FALSE FALSE
  # [4,] FALSE FALSE  TRUE

I think there was a reason to let scan() perform the type conversion; for
example, it may be more efficient. So, if correct, the above patch is a
possible solution, but some other may be more appropriate. In particular,
function scan() may be modified to remove 

Re: [Rd] read.csv

2009-06-25 Thread Petr Savicky
I am sorry for not including the attachment mentioned in my
previous email. Attached now. Petr.
--- R-devel/src/library/utils/R/readtable.R 2009-05-18 17:53:08.0 +0200
+++ R-devel-readtable/src/library/utils/R/readtable.R 2009-06-25 10:20:06.0 +0200
@@ -143,9 +143,6 @@
     names(what) <- col.names
 
     colClasses[colClasses %in% c("real", "double")] <- "numeric"
-    known <- colClasses %in%
-        c("logical", "integer", "numeric", "complex", "character")
-    what[known] <- sapply(colClasses[known], do.call, list(0))
     what[colClasses %in% "NULL"] <- list(NULL)
     keep <- !sapply(what, is.null)
 
@@ -189,7 +186,7 @@
        stop(gettextf("'as.is' has the wrong length %d  != cols = %d",
                      length(as.is), cols), domain = NA)
 
-    do <- keep & !known # & !as.is
+    do <- keep & !as.is
     if(rlabp) do[1L] <- FALSE # don't convert row.names
     for (i in (1L:cols)[do]) {
         data[[i]] <-


Re: [Rd] read.csv

2009-06-16 Thread Petr Savicky
On Sun, Jun 14, 2009 at 09:21:24PM +0100, Ted Harding wrote:
 On 14-Jun-09 18:56:01, Gabor Grothendieck wrote:
  If read.csv's colClasses= argument is NOT used then read.csv accepts
  double quoted numerics:
  
  1: > read.csv(stdin())
  0: "A","B"
  1: "1","1"
  2: "2","2"
  3:
    A B
  1 1 1
  2 2 2
  
  However, if colClasses is used then it seems that it does not:
  
  > read.csv(stdin(), colClasses = "numeric")
  0: "A","B"
  1: "1","1"
  2: "2","2"
  3:
  Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
  na.strings,  :
    scan() expected 'a real', got '"1"'
  
  Is this really intended?  I would have expected that a csv file
  in which each field is surrounded with double quotes is acceptable
  in both cases. This may be documented as is yet seems undesirable
  from both a consistency viewpoint and the viewpoint that it should
  be possible to double quote fields in a csv file.
 
 Well, the default for colClasses is NA, for which ?read.csv says:
   [...]
   Possible values are 'NA' (when 'type.convert' is used),
   [...]
 and then ?type.convert says:
   This is principally a helper function for 'read.table'. Given a
   character vector, it attempts to convert it to logical, integer,
   numeric or complex, and failing that converts it to factor unless
   'as.is = TRUE'.  The first type that can accept all the non-missing
   values is chosen.
 
 It would seem that type 'logical' won't accept integer (naively one
 might expect 1 --> TRUE, but see experiment below), so the first
 acceptable type for "1" is integer, and that is what happens.
 So it is indeed documented (in the R[ecursive] sense of documented :))
 
 However, presumably when colClasses is used then type.convert() is
 not called, in which case R sees itself being asked to assign a
 character entity to a destination which it has been told shall be
 integer, and therefore, since the default for as.is is
   as.is = !stringsAsFactors
 but for this ?read.csv says that stringsAsFactors is overridden
 bu [sic] 'as.is' and 'colClasses', both of which allow finer
 control., so that wouldn't come to the rescue either.
 
 Experiment:
   X <- logical(10)
   class(X)
   # [1] "logical"
   X[1] <- 1
   X
   # [1] 1 0 0 0 0 0 0 0 0 0
   class(X)
   # [1] "numeric"
 so R has converted X from class 'logical' to class 'numeric'
 on being asked to assign a number to a logical; but in this
 case its hands were not tied by colClasses.
 
 Or am I missing something?!!

In my opinion, you explain how it happens that there is a difference
in the behavior between
  read.csv(stdin(), colClasses = "numeric")
and
  read.csv(stdin())
but not why it is so.

The algorithm "use the smallest type which accepts all non-missing values"
may well be applied to the input values either literally or after removing
the quotes. Is there a reason why
  read.csv(stdin())
removes quotes from the input values and
  read.csv(stdin(), colClasses = "numeric")
does not?

Using double-quote characters is a part of the definition of a CSV file; see,
for example,
  http://en.wikipedia.org/wiki/Comma_separated_values
where one may find
  "Fields may always be enclosed within double-quote characters, whether
necessary or not."

Petr.



[Rd] read.csv

2009-06-14 Thread Gabor Grothendieck
If read.csv's colClasses= argument is NOT used then read.csv accepts
double quoted numerics:

1: > read.csv(stdin())
0: "A","B"
1: "1","1"
2: "2","2"
3:
  A B
1 1 1
2 2 2

However, if colClasses is used then it seems that it does not:

> read.csv(stdin(), colClasses = "numeric")
0: "A","B"
1: "1","1"
2: "2","2"
3:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  scan() expected 'a real', got '"1"'

Is this really intended?  I would have expected that a csv file in which
each field is surrounded with double quotes is acceptable in both
cases.   This may be documented as is yet seems undesirable from
both a consistency viewpoint and the viewpoint that it should be
possible to double quote fields in a csv file.



Re: [Rd] read.csv

2009-06-14 Thread Ted Harding
On 14-Jun-09 18:56:01, Gabor Grothendieck wrote:
 If read.csv's colClasses= argument is NOT used then read.csv accepts
 double quoted numerics:
 
 1: > read.csv(stdin())
 0: "A","B"
 1: "1","1"
 2: "2","2"
 3:
   A B
 1 1 1
 2 2 2
 
 However, if colClasses is used then it seems that it does not:
 
 > read.csv(stdin(), colClasses = "numeric")
 0: "A","B"
 1: "1","1"
 2: "2","2"
 3:
 Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
 na.strings,  :
   scan() expected 'a real', got '"1"'
 
 Is this really intended?  I would have expected that a csv file
 in which each field is surrounded with double quotes is acceptable
 in both cases. This may be documented as is yet seems undesirable
 from both a consistency viewpoint and the viewpoint that it should
 be possible to double quote fields in a csv file.

Well, the default for colClasses is NA, for which ?read.csv says:
  [...]
  Possible values are 'NA' (when 'type.convert' is used),
  [...]
and then ?type.convert says:
  This is principally a helper function for 'read.table'. Given a
  character vector, it attempts to convert it to logical, integer,
  numeric or complex, and failing that converts it to factor unless
  'as.is = TRUE'.  The first type that can accept all the non-missing
  values is chosen.

It would seem that type 'logical' won't accept integer (naively one
might expect 1 --> TRUE, but see experiment below), so the first
acceptable type for "1" is integer, and that is what happens.
So it is indeed documented (in the R[ecursive] sense of documented :))

However, presumably when colClasses is used then type.convert() is
not called, in which case R sees itself being asked to assign a
character entity to a destination which it has been told shall be
integer, and therefore, since the default for as.is is
  as.is = !stringsAsFactors
but for this ?read.csv says that stringsAsFactors is overridden
bu [sic] 'as.is' and 'colClasses', both of which allow finer
control., so that wouldn't come to the rescue either.

Experiment:
  X <- logical(10)
  class(X)
  # [1] "logical"
  X[1] <- 1
  X
  # [1] 1 0 0 0 0 0 0 0 0 0
  class(X)
  # [1] "numeric"
so R has converted X from class 'logical' to class 'numeric'
on being asked to assign a number to a logical; but in this
case its hands were not tied by colClasses.

Or am I missing something?!!

Ted.




E-Mail: (Ted Harding) ted.hard...@manchester.ac.uk
Fax-to-email: +44 (0)870 094 0861
Date: 14-Jun-09   Time: 21:21:22
-- XFMail --



Re: [Rd] read.csv

2009-06-14 Thread Gabor Grothendieck
On Sun, Jun 14, 2009 at 4:21 PM, Ted Harding
ted.hard...@manchester.ac.uk wrote:
 Or am I missing something?!!

The point of this is that the current behavior is not desirable since you can't
have quoted numeric fields if you specify colClasses = "numeric" yet you
can if you don't.  The concepts are not orthogonal but should be.  Whether
or not you specify colClasses, the numeric fields ought to be treated the
same way, and if the documentation says otherwise it further means there
is a problem with the design.

One could define one's own type "quotedNumeric" as a workaround
(see below), but I think it would be better if specifying "numeric" or
not specifying it had the same effect.  The way it is now, the concepts
are intertwined and not orthogonal.

library(methods)
setClass("quotedNumeric")
setAs("character", "quotedNumeric",
      function(from) as.numeric(gsub("\"", "", from)))
Lines <- '"A","B"
"1","1"
"2","2"'
read.csv(textConnection(Lines), colClasses = c("quotedNumeric", "numeric"))
