On 20/04/2014, 2:22 PM, Gábor Csárdi wrote:
How about using the quoting to decide what should be character, and what
not? You do not need to quote numbers, logical values, only characters, so
this would make sense imo.

That explicitly violates some of the CSV "standards". The quotes must have no effect on the interpretation.

Duncan Murdoch


How about something like this:
- if it is quoted (and not specified otherwise in colClasses), then it is a
character/factor
- if it is not quoted (and not specified otherwise in colClasses), then the
type is automatically detected, according to the pre-3.1.x method, and a
(suppressible) warning or error is given if information is lost, when
coercing to numbers.

Just an idea.

Gabor
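
A minimal sketch of the rule proposed above, for illustration only. The
helper name convert_column() is hypothetical and not part of base R. It
assumes the raw fields of one column are available with their quotes still
attached, and it checks for lost information only on integer-like tokens,
which is narrower than a real implementation would need to be.

```r
# Hypothetical helper, not an R API: 'tokens' holds the raw CSV fields of
# one column, with any surrounding double quotes still present.
convert_column <- function(tokens, warn = TRUE) {
  quoted <- grepl('^".*"$', tokens)
  if (any(quoted)) {
    # Any quoted field makes the whole column character; strip the quotes.
    return(sub('^"(.*)"$', "\\1", tokens))
  }
  # Unquoted: try the plain numeric conversion.
  num <- suppressWarnings(as.numeric(tokens))
  if (anyNA(num)) return(tokens)  # not all numeric, keep as character
  # Assumption: only integer-like tokens are checked, by printing the
  # double back out and comparing it with the original text.
  int_like <- grepl("^[+-]?[0-9]+$", tokens)
  lossy <- int_like & (sprintf("%.0f", num) != sub("^\\+", "", tokens))
  if (warn && any(lossy)) {
    warning("information lost when coercing to numeric", call. = FALSE)
  }
  num
}

convert_column(c('"01"', '"2"'))      # quoted, stays character
convert_column(c("1", "2"))           # unquoted, becomes numeric
convert_column("72057594037927937")   # unquoted, numeric with a warning
```

Note that this check also flags tokens with leading zeros such as 007, since
the printed number no longer matches the text.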

On Sun, Apr 20, 2014 at 3:24 AM, Murray Stokely <mur...@stokely.org> wrote:

Yes, I'm also strongly in favor of having an option for this.  If
there was an option in base R for controlling this we would just use
that and get rid of the separate RProtoBuf.int64AsString option we use
in the RProtoBuf package on CRAN to control whether 64-bit int types
from C++ are returned to R as numerics or character vectors.

I agree that reasonable people can disagree about the default, but I
found my original bug report about this, so I will counter Robert's
example with my favorite example of what was wrong with the previous
behavior:

tmp<-data.frame(n=c("72057594037927936", "72057594037927937"),
name=c("foo", "bar"))
length(unique(tmp$n))
# 2
write.csv(tmp, "/tmp/foo.csv", quote=FALSE, row.names=FALSE)
data <- read.csv("/tmp/foo.csv")
length(unique(data$n))
# 1

           - Murray
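
The collapse in the example above can be reproduced without the CSV round
trip. The two values are 2^56 and 2^56 + 1, and a double has a 53-bit
significand, so both strings convert to the same number. A short sketch (the
variable names are illustrative):

```r
x <- c("72057594037927936", "72057594037927937")  # 2^56 and 2^56 + 1

# As character strings the two values are distinct.
length(unique(x))

# Converted to double, both strings map to the same number.
as_num <- as.numeric(x)
as_num[1] == as_num[2]
length(unique(as_num))
```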


On Sat, Apr 19, 2014 at 10:06 AM, Simon Urbanek
<simon.urba...@r-project.org> wrote:
On Apr 19, 2014, at 9:00 AM, Martin Maechler <maech...@stat.math.ethz.ch>
wrote:

McGehee, Robert <robert.mcge...@geodecapital.com>
    on Thu, 17 Apr 2014 19:15:47 -0400 writes:

This is all application specific and
sort of beyond the scope of type.convert(), which now behaves as it
has been documented to behave.

That's only a true statement because the documentation was changed to
reflect the new behavior! The new feature in type.convert certainly does
not behave according to the documentation as of R 3.0.3. Here's a snippet:

The first type that can accept all the
non-missing values is chosen (numeric and complex return values
will represented approximately, of course).

The key phrase is in parentheses, which reminds the user to expect a
possible loss of precision. That important parenthetical was removed from
the documentation in R 3.1.0 (among other changes).

Putting aside the fact that this introduces a large amount of
unnecessary work rewriting SQL / data import code, SQL packages, my biggest
conceptual problem is that I can no longer rely on a particular function
call returning a particular class. In my example querying stock prices,
about 5% of prices came back as factors and the remaining 95% as numeric,
so we had random errors popping in throughout the morning.

Here's a short example showing us how the new behavior can be
unreliable. I pass a character representation of a uniformly distributed
random variable to type.convert. 90% of the time it is converted to
"numeric" and 10% it is a "factor" (in R 3.1.0). In the 10% of cases in
which type.convert converts to a factor the leading non-zero digit is
always a 9. So if you were expecting a numeric value, then 1 in 10 times
you may have a bug in your code that didn't exist before.

options(digits=16)
cl <- NULL; for (i in 1:10000) cl[i] <-
class(type.convert(format(runif(1))))
table(cl)
cl
factor numeric
990    9010

Yes.

Murray's point is valid, too.

But in my view, with the reasoning we have seen here,
*and* with the well known software design principle of
"least surprise" in mind,
I also do think that the default for type.convert() should be what
it has been for > 10 years now.


I think there should be two separate discussions:

a) have an option (argument to type.convert and possibly read.table) to
enable/disable this behavior. I'm strongly in favor of this.

b) decide what the default for a) will be. I have no strong opinion; I
can see arguments in both directions.

But most importantly I think a) is better than the status quo - even if
the discussion about b) drags out.

Cheers,
Simon




______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

