Re: [R] retaining characters in a csv file
On 9/23/2015 5:57 AM, Therneau, Terry M., Ph.D. wrote: Thanks for all for the comments, I hadn't intended to start a war. My summary: 1. Most important: I wasn't missing something obvious. This is always my first suspicion when I submit something to R-help, and it's true more often than not. 2. Obviously (at least it is now), the CSV standard does not specify that quotes should force a character result. R is not "wrong". Wrt to using what Excel does as litmus test, I consider that to be totally uninformative about standards: neither pro (like Duncan) or anti (like Rolf), but simply irrelevant. (Like many MS choices.) 3. I'll have to code in my own solution, either pre-scan the first few lines to create a colClasses, or use read_csv from the readr library (if there are leading zeros it keeps the string as character, which may suffice for my needs), or something else. 4. The source of the data is a "text/csv" field coming from an http POST request. This is an internal service on an internal Mayo server and coded by our own IT department; this will not be the first case where I have found that their definition of "csv" is not quite standard. Terry T. On 23/09/15 10:00, Therneau, Terry M., Ph.D. wrote: I have a csv file from an automatic process (so this will happen thousands of times), for which the first row is a vector of variable names and the second row often starts something like this: 5724550,"000202075214",2005.02.17,2005.02.17,"F", . Notice the second variable which is a character string (note the quotation marks) a sequence of numeric digits leading zeros are significant The read.csv function insists on turning this into a numeric. Is there any simple set of options that will turn this behavior off? I'm looking for a way to tell it to "obey the bloody quotes" -- I still want the first, third, etc columns to become numeric. There can be more than one variable like this, and not always in the second position. This happens deep inside the httr library; there is an easy way for me to add more options to the read.csv call but it is not so easy to replace it with something else. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. A fairly simple workaround is to add two lines of code to the process, and then add the colClasses parameter as you suggested in item 2 above. want <- read.csv('yourfile', quote='', stringsAsFactors= FALSE, nrows=1) classes <- sapply(want, class) want <- read.csv('yourfile', stringsAsFactors= FALSE, colClasses=classes) I don't know if you want your final file to convert strings to factors, so you can modify as needed. In addition, if your files aren't as regular as I inferred, you can increase the number of rows to read in the first line to ensure getting the classes right. Hope this is helpful, Dan -- Daniel Nordlund Bothell, WA USA __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] retaining characters in a csv file
> -Original Message- > From: r.tur...@auckland.ac.nz > Sent: Wed, 23 Sep 2015 13:26:58 +1200 > To: pda...@gmail.com .. > I would say that this phenomenon ("Excel does it") is *overwhelming* > evidence that it is bad practice!!! :-) Fortune? Can't remember your password? Do you need a strong and secure password? Use Password manager! It stores your passwords & protects your account. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] retaining characters in a csv file
Thanks for all for the comments, I hadn't intended to start a war. My summary: 1. Most important: I wasn't missing something obvious. This is always my first suspicion when I submit something to R-help, and it's true more often than not. 2. Obviously (at least it is now), the CSV standard does not specify that quotes should force a character result. R is not "wrong". Wrt to using what Excel does as litmus test, I consider that to be totally uninformative about standards: neither pro (like Duncan) or anti (like Rolf), but simply irrelevant. (Like many MS choices.) 3. I'll have to code in my own solution, either pre-scan the first few lines to create a colClasses, or use read_csv from the readr library (if there are leading zeros it keeps the string as character, which may suffice for my needs), or something else. 4. The source of the data is a "text/csv" field coming from an http POST request. This is an internal service on an internal Mayo server and coded by our own IT department; this will not be the first case where I have found that their definition of "csv" is not quite standard. Terry T. On 23/09/15 10:00, Therneau, Terry M., Ph.D. wrote: I have a csv file from an automatic process (so this will happen thousands of times), for which the first row is a vector of variable names and the second row often starts something like this: 5724550,"000202075214",2005.02.17,2005.02.17,"F", . Notice the second variable which is a character string (note the quotation marks) a sequence of numeric digits leading zeros are significant The read.csv function insists on turning this into a numeric. Is there any simple set of options that will turn this behavior off? I'm looking for a way to tell it to "obey the bloody quotes" -- I still want the first, third, etc columns to become numeric. There can be more than one variable like this, and not always in the second position. This happens deep inside the httr library; there is an easy way for me to add more options to the read.csv call but it is not so easy to replace it with something else. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] retaining characters in a csv file
On 23/09/15 11:19, peter dalgaard wrote: On 23 Sep 2015, at 00:33 , Rolf Turner wrote: [read.csv() doesn't distinguish "123.4" from 123.4] IMHO this is a bug in read.csv(). Dunno about that: pd$ cat ~/tmp/junk.csv "1";1 2;"2" pd$ open !$ open ~/tmp/junk.csv And lo and behold, Excel opens with 1 1 2 2 and all cells numeric. I would say that this phenomenon ("Excel does it") is *overwhelming* evidence that it is bad practice!!! :-) I don't think the CSV standard (if there is one...) specifies that quoted strings are necessarily text. Duncan Murdoch has pointed out that this is definitely *not* the case. I think we have been here before, and found that even if we decide that it is a bug (or misfeature), it would be hard to change, because the modus operandi of read.* is to first read everything as character and _then_ see (in type.convert()) which entries can be converted to numeric, logical, etc. As Arunkumar Srinivasan has pointed out, fread() from the data.table package can handle this, so it is *not impossible*. cheers, Rolf -- Technical Editor ANZJS Department of Statistics University of Auckland Phone: +64-9-373-7599 ext. 88276 __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] retaining characters in a csv file
On 23/09/15 12:48, Duncan Murdoch wrote: On 22/09/2015 6:33 PM, Rolf Turner wrote: IMHO this is a bug in read.csv(). No, it's a bug in "Rolf Turner", who believes in fairies at the end of his garden, rather than in documentation for file formats. Naturally, I beg to differ. The documentation for read.csv() refers to a quote character. Nowhere does it mention that quotes only serve to keep commas and white space from being interpreted as delimiters. The usual meaning of quotes in R is to enclose character strings and so it is a reasonable assumption that this would be their function in this instance. Before you fly off into some idiotic rant about how one "should never make assumptions" consider the fact that if one made no assumptions at all one could not get out of bed in the morning. One has to assume that the documentation is reasonably consistent and that any serious inconsistencies are drawn to the user's attention. If one had to read the (entire) documentation for each system called upon by a given piece of software (apply recursively!) then one would spend one's entire life reading documentation and never get any work done. Although I most definitely do not believe in fairies at the bottom of my garden, I am the first to admit that I am not all that bright and could have erringly missed something. HOWEVER Terry Therneau was flummoxed by the quirky and counter-intuitive nature of quotes in read.csv(), and Dr. Therneau is very bright indeed. So the fault is not in the user/reader but in the function and its documentation. cheers, Rolf -- Technical Editor ANZJS Department of Statistics University of Auckland Phone: +64-9-373-7599 ext. 88276 __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] retaining characters in a csv file
On 22/09/2015 6:33 PM, Rolf Turner wrote: > On 23/09/15 10:00, Therneau, Terry M., Ph.D. wrote: >> I have a csv file from an automatic process (so this will happen >> thousands of times), for which the first row is a vector of variable >> names and the second row often starts something like this: >> >> 5724550,"000202075214",2005.02.17,2005.02.17,"F", . >> >> Notice the second variable which is >>a character string (note the quotation marks) >>a sequence of numeric digits >>leading zeros are significant >> >> The read.csv function insists on turning this into a numeric. Is there >> any simple set of options that >> will turn this behavior off? I'm looking for a way to tell it to "obey >> the bloody quotes" -- I still want the first, third, etc columns to >> become numeric. There can be more than one variable like this, and not >> always in the second position. >> >> This happens deep inside the httr library; there is an easy way for me >> to add more options to the read.csv call but it is not so easy to >> replace it with something else. > > IMHO this is a bug in read.csv(). No, it's a bug in "Rolf Turner", who believes in fairies at the end of his garden, rather than in documentation for file formats. Duncan Murdoch > > A possible workaround: > > ccc <- c("integer","character",rep(NA,k)) > X <- read.csv("melvin.csv",colClasses=ccc) > > where "melvin.csv" is the file from which you are attempting to read and > where k+2 = the number of columns in that file. > > Kludgey, but it might work. > > Another workaround is to specify quote="", but this has the side effect > of making the 5th column character rather than logical. > > cheers, > > Rolf > __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] retaining characters in a csv file
On 22/09/2015 7:19 PM, peter dalgaard wrote: > >> On 23 Sep 2015, at 00:33 , Rolf Turner wrote: >> > > [read.csv() doesn't distinguish "123.4" from 123.4] > >> IMHO this is a bug in read.csv(). >> > > Dunno about that: > > pd$ cat ~/tmp/junk.csv > "1";1 > 2;"2" > pd$ open !$ > open ~/tmp/junk.csv > > And lo and behold, Excel opens with > > 1 1 > 2 2 > > and all cells numeric. > > I don't think the CSV standard (if there is one...) specifies that quoted > strings are necessarily text. It specifically does not. Quotes allow commas and spaces to be ignored as column separators. That's all. They say nothing about the type of data. Duncan Murdoch > > I think we have been here before, and found that even if we decide that it is > a bug (or misfeature), it would be hard to change, because the modus operandi > of read.* is to first read everything as character and _then_ see (in > type.convert()) which entries can be converted to numeric, logical, etc. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] retaining characters in a csv file
On 22/09/2015 6:00 PM, Therneau, Terry M., Ph.D. wrote: > I have a csv file from an automatic process (so this will happen thousands of > times), for > which the first row is a vector of variable names and the second row often > starts > something like this: > > 5724550,"000202075214",2005.02.17,2005.02.17,"F", . > > Notice the second variable which is >a character string (note the quotation marks) >a sequence of numeric digits >leading zeros are significant > > The read.csv function insists on turning this into a numeric. No it doesn't. All you need to do is specify colClasses and it will follow your instructions. Is there any simple set of > options that > will turn this behavior off? I'm looking for a way to tell it to "obey the > bloody quotes" > -- I still want the first, third, etc columns to become numeric. There can > be more than > one variable like this, and not always in the second position. No, because the bloody quotes are part of the "csv standard". They aren't meaningful. If you don't know what the data is, that's your fault. You shouldn't be analyzing data when you are so ignorant. Duncan Murdoch > This happens deep inside the httr library; there is an easy way for me to add > more options > to the read.csv call but it is not so easy to replace it with something else. > > Terry T > > __ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] retaining characters in a csv file
> On 23 Sep 2015, at 00:33 , Rolf Turner wrote: > [read.csv() doesn't distinguish "123.4" from 123.4] > IMHO this is a bug in read.csv(). > Dunno about that: pd$ cat ~/tmp/junk.csv "1";1 2;"2" pd$ open !$ open ~/tmp/junk.csv And lo and behold, Excel opens with 1 1 2 2 and all cells numeric. I don't think the CSV standard (if there is one...) specifies that quoted strings are necessarily text. I think we have been here before, and found that even if we decide that it is a bug (or misfeature), it would be hard to change, because the modus operandi of read.* is to first read everything as character and _then_ see (in type.convert()) which entries can be converted to numeric, logical, etc. -pd -- Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Email: pd@cbs.dk Priv: pda...@gmail.com __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] retaining characters in a csv file
data.table's fread reads this as expected. Quoted strings aren't coerced. sapply(fread('5724550,"000202075214",2005.02.17,2005.02.17,"F"\n'), class) # V1 V2 V3 V4 V5 # "integer" "character" "character" "character" "character" Best, Arun. On Wed, Sep 23, 2015 at 12:00 AM, Therneau, Terry M., Ph.D. wrote: > I have a csv file from an automatic process (so this will happen thousands > of times), for which the first row is a vector of variable names and the > second row often starts something like this: > > 5724550,"000202075214",2005.02.17,2005.02.17,"F", . > > Notice the second variable which is > a character string (note the quotation marks) > a sequence of numeric digits > leading zeros are significant > > The read.csv function insists on turning this into a numeric. Is there any > simple set of options that > will turn this behavior off? I'm looking for a way to tell it to "obey the > bloody quotes" -- I still want the first, third, etc columns to become > numeric. There can be more than one variable like this, and not always in > the second position. > > This happens deep inside the httr library; there is an easy way for me to > add more options to the read.csv call but it is not so easy to replace it > with something else. > > Terry T > > __ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] retaining characters in a csv file
On Sep 22, 2015, at 3:00 PM, Therneau, Terry M., Ph.D. wrote: > I have a csv file from an automatic process (so this will happen thousands of > times), for which the first row is a vector of variable names and the second > row often starts something like this: > > 5724550,"000202075214",2005.02.17,2005.02.17,"F", . > > Notice the second variable which is > a character string (note the quotation marks) > a sequence of numeric digits > leading zeros are significant > > The read.csv function insists on turning this into a numeric. Is there any > simple set of options that > will turn this behavior off? I'm looking for a way to tell it to "obey the > bloody quotes" -- I still want the first, third, etc columns to become > numeric. There can be more than one variable like this, and not always in > the second position. The last part about not knowing which col might be an issue might require inputting everything with character class, but if there is a way to pass in a colClasses argument this might help: > read.csv(text='5724550,"000202075214",2005.02.17,2005.02.17,"F"', > stringsAsFactors=FALSE, header=FALSE, colClasses=c("numeric", > rep("character", 4))) V1 V2 V3 V4 V5 1 5724550 000202075214 2005.02.17 2005.02.17 F Or you can create a class with an As method: > setClass('myChar') > setAs('character', 'myChar', def=function(from, to ) to <- I(from)) > read.csv(text='5724550,"000202075214",2005.02.17,2005.02.17,"F"', > stringsAsFactors=FALSE, header=FALSE, colClasses=c("numeric", > rep('myChar',4)) ) V1 V2 V3 V4 V5 1 5724550 000202075214 2005.02.17 2005.02.17 F (Neither of the third or fourth columns makes sense as a numeric, so now illustrating coercion to Date.) > setClass('dotDate') > setAs('character', 'dotDate', def=function(from, to ) to <- as.Date(from, > "%Y.%m.%d") ) > read.csv(text='5724550,"000202075214",2005.02.17,2005.02.17,"F"', > stringsAsFactors=FALSE, header=FALSE, colClasses=c("numeric", "character", > rep('dotDate',2), "character") ) V1 V2 V3 V4 V5 1 5724550 000202075214 2005-02-17 2005-02-17 F > > This happens deep inside the httr library; there is an easy way for me to add > more options to the read.csv call but it is not so easy to replace it with > something else. > > Terry T > > __ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. David Winsemius Alameda, CA, USA __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] retaining characters in a csv file
On 23/09/15 10:00, Therneau, Terry M., Ph.D. wrote: I have a csv file from an automatic process (so this will happen thousands of times), for which the first row is a vector of variable names and the second row often starts something like this: 5724550,"000202075214",2005.02.17,2005.02.17,"F", . Notice the second variable which is a character string (note the quotation marks) a sequence of numeric digits leading zeros are significant The read.csv function insists on turning this into a numeric. Is there any simple set of options that will turn this behavior off? I'm looking for a way to tell it to "obey the bloody quotes" -- I still want the first, third, etc columns to become numeric. There can be more than one variable like this, and not always in the second position. This happens deep inside the httr library; there is an easy way for me to add more options to the read.csv call but it is not so easy to replace it with something else. IMHO this is a bug in read.csv(). A possible workaround: ccc <- c("integer","character",rep(NA,k)) X <- read.csv("melvin.csv",colClasses=ccc) where "melvin.csv" is the file from which you are attempting to read and where k+2 = the number of columns in that file. Kludgey, but it might work. Another workaround is to specify quote="", but this has the side effect of making the 5th column character rather than logical. cheers, Rolf -- Technical Editor ANZJS Department of Statistics University of Auckland Phone: +64-9-373-7599 ext. 88276 __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] retaining characters in a csv file
The problem is that quotes in csv files are commonly held to me meaningless (i.e. they don't automatically force components to be strings). Earlier this morning I committed a fix to readr so that numbers starting with a sequence of zeros are read as character strings. You may want to try out the dev version: https://github.com/hadley/readr. Hadley On Tue, Sep 22, 2015 at 5:00 PM, Therneau, Terry M., Ph.D. wrote: > I have a csv file from an automatic process (so this will happen thousands > of times), for which the first row is a vector of variable names and the > second row often starts something like this: > > 5724550,"000202075214",2005.02.17,2005.02.17,"F", . > > Notice the second variable which is > a character string (note the quotation marks) > a sequence of numeric digits > leading zeros are significant > > The read.csv function insists on turning this into a numeric. Is there any > simple set of options that > will turn this behavior off? I'm looking for a way to tell it to "obey the > bloody quotes" -- I still want the first, third, etc columns to become > numeric. There can be more than one variable like this, and not always in > the second position. > > This happens deep inside the httr library; there is an easy way for me to > add more options to the read.csv call but it is not so easy to replace it > with something else. > > Terry T -- http://had.co.nz/ __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.