Re: [Rd] read.table() with quoted integers

Joris Meys Mon, 30 Sep 2013 08:48:40 -0700

On Mon, Sep 30, 2013 at 5:27 PM, Milan Bouchet-Valat <nalimi...@club.fr> wrote:


> On Monday, 30 September 2013 at 17:10 +0200, Joris Meys wrote:
> > Regardless of whether "stored as character" is interpreted the R way
> > or the ASCII way, the point Joshua makes is rather valid. Mainly
> > because read.table has an argument quote with default value \"'. This
> > means that at least according to R, everything between either " or '
> > should be seen as of type character and not integer.
> I don't think the problem is related to the quote argument at all:
> > read.table("file.csv", colClasses="integer", quote=NULL)
>

The quote argument is not the problem; the presence of quotes in the file
is. If you set quote to NULL, any quote character is read as part of the
value, which is even less what you want. I was just pointing out that R
treats everything that is quoted as character, which was basically Joshua's
point.
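
To make the cases concrete, here is a small sketch using the two-line file
from the original post. The temporary path and the variable names are mine,
and I use quote = "" (the documented way to disable quoting) rather than
NULL:

```r
# Write the two-line example file from the original post to a temporary path.
path <- tempfile(fileext = ".dat")
writeLines(c('"1"', '"2"'), path)

# Default quote handling, no colClasses: the quotes are stripped and
# type.convert() then turns the column into integer.
default_read <- read.table(path)

# Quoting disabled: the quote characters stay part of each value.
raw_read <- read.table(path, quote = "", colClasses = "character")

# colClasses = "integer": in the R 3.0.1 session quoted in this thread this
# call stops with the scan() error. tryCatch() keeps the script running
# either way and stores the error message if there is one.
strict_read <- tryCatch(
  read.table(path, colClasses = "integer"),
  error = function(e) conditionMessage(e)
)

str(default_read)
print(raw_read$V1)
print(strict_read)
```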


> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
>   scan() expected 'an integer' and got '"1"'
>
> > The only way these quotes can end up in a .csv file, is when in the
> > rendering program (often Excel), these integers are called "character"
> > inside the program as well. So they're not treated as integers by the
> > person that created the file, so R won't treat them
> > as integers either. Note that read.table does read the quoted integers
> > as characters, and only afterwards convert those.
> Yeah, I understand how the conversion happens, but I wonder whether the
> result really makes sense. The fact that you cannot set colClasses to
> the classes you are actually getting when reading the file is somewhat
> disturbing...
>

I see your point. It would be nice to have type.convert() at least try to
produce the requested classes, but that defeats the whole point of setting
colClasses, which is to let scan() process the data faster because it
doesn't have to "guess" what to read. colClasses is used to construct the
value of the what argument of scan(). So if you want to change the
behaviour of read.table(), you're actually looking at changing the
behaviour of scan(), or at a total rewrite of read.table(), e.g. reading
everything as character with scan() and then converting to the colClasses
specified.
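
A minimal sketch of that "read as character, then convert" idea, done as a
wrapper around read.table() rather than as a rewrite of it. The helper
read_table_as() and its coercers list are hypothetical names of mine, not
part of base R or ff, and only a few basic classes are handled:

```r
# Hypothetical helper, not part of base R or ff: read everything as
# character (so quotes are stripped), then coerce to the requested classes.
read_table_as <- function(file, colClasses, ...) {
  # Only these basic classes are handled in this sketch.
  coercers <- list(
    integer   = as.integer,
    numeric   = as.numeric,
    logical   = as.logical,
    character = as.character
  )
  unknown <- setdiff(colClasses, names(coercers))
  if (length(unknown) > 0L) {
    stop("unsupported class: ", paste(unknown, collapse = ", "))
  }
  # With colClasses = "character", quoting is applied to every field.
  df <- read.table(file, colClasses = "character", ...)
  # Recycle the requested classes over the columns.
  classes <- rep_len(colClasses, ncol(df))
  for (i in seq_along(df)) {
    df[[i]] <- coercers[[classes[i]]](df[[i]])
  }
  df
}

# Usage with the two-line file from the original post.
path <- tempfile(fileext = ".dat")
writeLines(c('"1"', '"2"'), path)
converted <- read_table_as(path, colClasses = "integer")
str(converted)
```

Note that as.integer() turns unparsable fields into NA with a warning, where
scan() would stop with an error, so this is not a drop-in replacement.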

Although your suggestion would intuitively be the "more logical" approach
to some, I consider "character" to be the correct colClass when an integer
is surrounded by quotes. I also consider it bad coding practice to make
your function depend on the guessing done by other functions in a way that
goes against the documentation of said function. So I guess the best
solution is to rewrite read.table.ffdf() in the way described above.
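
For the chunked case, a rough sketch using only base R could look like the
following. It is not the ff implementation: read_in_chunks() and its
arguments are hypothetical names of mine, and the tiny chunk sizes are only
there to make the example exercise several reads.

```r
# Hypothetical chunked reader (base R only; not how ff actually does it).
read_in_chunks <- function(file, first_rows = 2L, chunk_rows = 2L) {
  # Step 1: let read.table() guess the classes from the first rows.
  first <- read.table(file, nrows = first_rows, stringsAsFactors = FALSE)
  classes <- vapply(first, function(col) class(col)[1L], character(1))
  chunks <- list(first)
  skip <- first_rows
  repeat {
    # Step 2: read later chunks as character so quoted fields are accepted.
    # read.table() raises an error once no lines are left; this sketch
    # treats any error as "end of file".
    chunk <- tryCatch(
      read.table(file, skip = skip, nrows = chunk_rows,
                 colClasses = "character"),
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0L) break
    # Step 3: coerce each column to the class guessed in step 1.
    for (i in seq_along(chunk)) {
      chunk[[i]] <- methods::as(chunk[[i]], classes[i])
    }
    chunks[[length(chunks) + 1L]] <- chunk
    skip <- skip + nrow(chunk)
  }
  do.call(rbind, chunks)
}

# Usage: five quoted integers, read two rows at a time.
path <- tempfile(fileext = ".dat")
writeLines(sprintf('"%d"', 1:5), path)
chunked <- read_in_chunks(path)
str(chunked)
```

Swallowing every error as end-of-file is a shortcut for the sketch; a real
rewrite would have to distinguish an exhausted file from a malformed one.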

My 2 cents, as I'm not going to do the rewrite...
Cheers
Joris



>
> > So yes, this is an issue with read.table.ffdf more than with R itself.
> > And the problem is indeed how integers are treated the moment they are
> > stored. This refers to the presence/absence of the quote character.
> Of course this could be fixed in read.table.ffdf(), but that would be
> quite hacky since it could no longer rely cleanly on read.table() as it
> does now: it would need to read the file directly to check whether
> the fields are quoted or not (since the result of read.table() does not
> allow distinguishing their presence). To me this tends to indicate
> something is wrong in the way read.table() works.
>
> FWIW, changing the behavior of read.table() to skip quotes when
> colClasses="integer" would not break any existing program since it would
> only avoid an error where it previously happened, without modifying
> working cases.
>




>
>
> Regards
>
> >
> > Regards
> > Joris
> >
> >
> > On Mon, Sep 30, 2013 at 4:45 PM, Milan Bouchet-Valat
> > <nalimi...@club.fr> wrote:
> >         On Monday, 30 September 2013 at 08:38 -0500, Joshua Ulrich wrote:
> >         > On Mon, Sep 30, 2013 at 7:33 AM, Milan Bouchet-Valat <nalimi...@club.fr> wrote:
> >         > > Hi!
> >         > >
> >         > >
> >         > > It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
> >         > > quoted integers as an acceptable value for columns for which
> >         > > colClasses="integer". But when colClasses is omitted, these columns are
> >         > > read as integer anyway.
> >         > >
> >         > > For example, let's consider a file named file.dat, containing:
> >         > > "1"
> >         > > "2"
> >         > >
> >         > >> read.table("file.dat", colClasses="integer")
> >         > > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
> >         > >   scan() expected 'an integer' and got '"1"'
> >         > >
> >         > > But:
> >         > >> str(read.table("file.dat"))
> >         > > 'data.frame':   2 obs. of  1 variable:
> >         > >  $ V1: int  1 2
> >         > >
> >         > > The latter result is indeed documented in ?read.table:
> >         > >      Unless colClasses is specified, all columns are read as
> >         > >      character columns and then converted using type.convert to
> >         > >      logical, integer, numeric, complex or (depending on as.is)
> >         > >      factor as appropriate.  Quotes are (by default) interpreted in all
> >         > >      fields, so a column of values like "42" will result in an
> >         > >      integer column.
> >         > >
> >         > >
> >         > > Should the former behavior be considered a bug?
> >         > >
> >         > No. If you tell read.table the column is integer and it's actually
> >         > character on disk, it should be an error.
> >
> >         All values in a CSV file are stored as characters on disk, regardless
> >         of whether they are surrounded by quotes. 1 is saved as
> >         00110001 (ASCII character #49), not 00000001, nor 00000000 00000000
> >         00000000 00000001 (as 32-bit storage of integers would imply, for
> >         example).
> >
> >         So, with all due respect, please refrain from formulating such blatantly
> >         erroneous statements.
> >
> >
> >         Regards
> >
> >
> >         > > This creates problems when combined with read.table.ffdf from package
> >         > > ff, since this function tries to guess the column classes by reading the
> >         > > first rows of the file, and then passes colClasses to read.table to read
> >         > > the remaining rows by chunks. A column of quoted integers is correctly
> >         > > detected as integer in the first read, but read.table() fails in
> >         > > subsequent reads.
> >         > >
> >         > This sounds like an issue with read.table.ffdf.  The column of quoted
> >         > integers is *incorrectly* detected as integer because they're actually
> >         > character on disk.  read.table.ffdf should rely on how the data are
> >         > actually stored on disk (via as.is=TRUE), not how read.table might
> >         > convert them once they're read into R.
> >         >
> >         > >
> >         > > Regards
> >         > >
> >         > > ______________________________________________
> >         > > R-devel@r-project.org mailing list
> >         > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >         >
> >         > --
> >         > Joshua Ulrich  |  about.me/joshuaulrich
> >         > FOSS Trading  |  www.fosstrading.com
> >
> >         ______________________________________________
> >         R-devel@r-project.org mailing list
> >         https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> >
> >
> >
> >
> > --
> > Joris Meys
> > Statistical consultant
> >
> > Ghent University
> > Faculty of Bioscience Engineering
> > Department of Mathematical Modelling, Statistics and Bio-Informatics
> >
> > tel : +32 9 264 59 87
> > joris.m...@ugent.be
> > -------------------------------
> > Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>
>


-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel : +32 9 264 59 87
joris.m...@ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php


______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
