Re: [Rd] table(exclude = NULL) always includes NA

2016-08-12 Thread Martin Maechler
> Martin Maechler 
> on Fri, 12 Aug 2016 10:12:01 +0200 writes:

> Suharto Anggono Suharto Anggono via R-devel 
> on Thu, 11 Aug 2016 16:19:49 + writes:

>> I stand corrected. The part "If set to 'NULL', it implies
>> 'useNA="always"'." is even in the documentation in R
>> 2.8.0. It was my fault not to check carefully.  I wonder,
>> why "always" was chosen for 'useNA' for exclude = NULL.

> me too.  "ifany" would seem more logical, and I am
> considering changing to that as a 2nd step (if the 1st
> step, below, proves to be feasible).

>> Why exclude = NULL is so special? What about another
>> 'exclude' of length zero, like character(0) (not c(),
>> because c() is NULL)? I thought that, too. But then, I
>> have no opinion about making it general.

> As mentioned, I entirely agree with that {and you are
> right about c() !!}.

>> It fits my expectation to override 'useNA' only if the
>> user doesn't explicitly specify 'useNA'.

>> Thank you for looking into this.

> you are welcome.  As first step, I plan to commit the
> change to (*)

>  useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "always"

> as proposed yesterday, and I'll eventually see / be
> notified about the effect in CRAN space.

And as I'm finding now, 20 minutes too late, doing step 1
without also doing step 2 is not feasible:
it gives many 0 counts, e.g., for  exclude = "foo".
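
To make the zero-count problem concrete, here is a minimal sketch. It passes 'useNA' explicitly instead of relying on how any particular R version derives it from 'exclude', and the vector x is made-up illustration data, not something from this thread.

```r
# Illustration data without any NA (hypothetical, not from the thread)
x <- c("a", "b", "a")

# "always" adds an NA entry even though its count is zero
t_always <- table(x, useNA = "always")

# "ifany" adds the NA entry only when NA actually occurs
t_ifany <- table(x, useNA = "ifany")

print(t_always)
print(t_ifany)
```

With "always", every table gets an extra NA cell whether or not the data contain NA, which is where the many 0 counts come from.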


> --
> (*) slightly more efficiently, I'll be using match()
> directly instead of %in%

>> My points: Could R 2.7.2 behavior of table(,
>> exclude = NULL) be brought back? But R 3.3.1 behavior is
>> in R since version 2.8.0, rather long.

> you are right... but then, the places / cases where the
> behavior would change back should be quite rare.

>> If not, I suggest changing summary().
>> 

> Thank you for your feedback, Suharto!  Martin

>> On Thu, 11/8/16, Martin Maechler wrote:
>> 
>> Subject: Re: [Rd] table(exclude = NULL) always includes NA
>> @r-project.org
>> Cc: "Martin Maechler"
>> Date: Thursday, 11 August, 2016, 12:39 AM
>> 
>> > Martin Maechler
>> > on Tue, 9 Aug 2016 15:35:41 +0200 writes:
>> 
>> > > Suharto Anggono Suharto Anggono via R-devel
>> > > on Sun, 7 Aug 2016 15:32:19 + writes:
>> 
>> > > This is an example from
>> > > https://stat.ethz.ch/pipermail/r-help/2007-May/132573.html .
>> > 
>> > > With R 2.7.2:
>> > 
>> > > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
>> > > > table(a, b, exclude = NULL)
>> > >       b
>> > > a      1 2
>> > >   1    1 1
>> > >   2    2 0
>> > >   3    1 0
>> > >   <NA> 1 0
>> > 
>> > > With R 3.3.1:
>> > 
>> > > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
>> > > > table(a, b, exclude = NULL)
>> > >       b
>> > > a      1 2 <NA>
>> > >   1    1 1    0
>> > >   2    2 0    0
>> > >   3    1 0    0
>> > >   <NA> 1 0    0
>> > > > table(a, b, useNA = "ifany")
>> > >       b
>> > > a      1 2
>> > >   1    1 1
>> > >   2    2 0
>> > >   3    1 0
>> > >   <NA> 1 0
>> > > > table(a, b, exclude = NULL, useNA = "ifany")
>> > >       b
>> > > a      1 2 <NA>
>> > >   1    1 1    0
>> > >   2    2 0    0
>> > >   3    1 0    0
>> > >   <NA> 1 0    0
>> > 
>> > > For the example, in R 3.3.1, the result of 'table' with
>> > > exclude = NULL includes NA even if NA is not present. It is
>> > > different from R 2.7.2, that comes from factor(exclude = NULL),
>> > > that includes NA only if NA is present.
>> > 
>> > I agree that this (R 3.3.1 behavior) seems undesirable and looks
>> > wrong, and the old (<= 2.7.2) behavior for table(a,b,
>> > exclude=NULL) seems desirable to me.
>> > 
>> > 
>> > > From R 3.3.1 help on 'table', in "Details" section:
>> > > 'useNA' controls if the table includes counts of 'NA'
>> values: the allowed values correspond to never, only if
>> the count is positive and even for zero counts.  This is
>> overridden by specifying 'exclude = NULL'.
>> > 
>> > > Specifying 'exclude = NULL' overrides 'useNA' to what
>> value? The documentation doesn't say. Looking at the code
>> of function 'table', the value is "always".
>> > 
>> > Yes, it should be documented what happens for this case,
>> > (but read on ...)
>> 
>> and it is *not* true that the documentation does not say,
>> since 2013, it has contained
>> 
>> exclude: levels to remove for all factors in ‘...’.  If
>> set to ‘NULL’, it implies ‘useNA = "always"’.  See
>> ‘Details’ for its interpretation for non-factor
>> arguments.
>> 
>> 
>> > > For the example, in R 3.3.1, the result like in R
>> 2.7.2 can be obtained with useNA = "ifany" and 'exclude'
>> unspecified.
>> > 
>> > Yes.  What should we do?
>> > I currently think that we'd want to change the line
>> > 
>> > useNA <- if (!m

Re: [Rd] ifelse() woes ... can we agree on a ifelse2() ?

2016-08-12 Thread Hadley Wickham
> >> One possibility would also be to consider  a "numbers-only" or
> >> rather "same type"-only {e.g., would also work for characters}
> >> version.
>
> > I don't know what you mean by these.
>
> In the mean time, Bob Rudis mentioned   dplyr::if_else(),
> which is very relevant, thank you Bob!
>
> As I have found, that actually works in such a "same type"-only way:
> It does not try to coerce, but gives an error when the classes
> differ, even in this somewhat debatable case :
>
>> dplyr::if_else(c(TRUE, FALSE), 2:3, 0+10:11)
>Error: `false` has type 'double' not 'integer'
>>
>
> As documented, if_else() is clearly stricter than ifelse()
> and e.g., also does no recycling (but of length() 1).

I agree that if_else() is currently too strict - it's particularly
annoying if you want to replace some values with a missing:

x <- sample(10)
if_else(x > 5, NA, x)
#  Error: `false` has type 'integer' not 'logical'

But I would like to make sure that this remains an error:

if_else(x > 5, x, "BLAH")

Because that seems more likely to be a user error (but reasonable
people might certainly believe that it should just work)

dplyr is more accommodating in other places (i.e. in bind_rows(),
collapse() and the joins) but it's surprisingly hard to get all the
details right. For example, what should the result of this call be?

if_else(c(TRUE, FALSE), factor(c("a", "b")), factor(c("c", "b")))

Strictly speaking I think you could argue it's an error, but that's
not very user-friendly. Should it be a factor with the union of the
levels? Should it be a character vector + warning? Should the
behaviour change if one set of levels is a subset of the other set?

There are similar issues for POSIXct (if the time zones are different,
which should win?), and difftimes (similarly for units).  Ideally
you'd like the behaviour to be extensible for new S3 classes, which
suggests it should be a generic (and for the most general case, it
would need to dispatch on both arguments).
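
One of the options raised above, a factor with the union of the levels, can be sketched in base R as follows. This is only an illustration of that one choice: the helper name if_else_union() is hypothetical, it is not part of dplyr, and it assumes test, yes and no all have the same length.

```r
# Hypothetical helper: combine two factors element-wise and keep the
# union of their levels. Not a dplyr function; illustration only.
if_else_union <- function(test, yes, no) {
  stopifnot(is.factor(yes), is.factor(no))
  lev <- union(levels(yes), levels(no))
  # Work on the labels, not the integer codes, then rebuild the factor
  out <- ifelse(test, as.character(yes), as.character(no))
  factor(out, levels = lev)
}

res <- if_else_union(c(TRUE, FALSE),
                     factor(c("a", "b")),
                     factor(c("c", "b")))
print(res)
```

Going through the character labels avoids mixing up integer codes that mean different things in the two factors, at the cost of an extra conversion.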

Hadley

-- 
http://hadley.nz

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] ifelse() woes ... can we agree on a ifelse2() ?

2016-08-12 Thread Martin Maechler
Excuse the delay;  I had waited for other / additional
comments and reactions (and been distracted with other urgent issues),
but do want to keep this thread alive  [inline] ..

> Duncan Murdoch 
> on Sat, 6 Aug 2016 11:30:08 -0400 writes:

> On 06/08/2016 10:18 AM, Martin Maechler wrote:
>> Dear R-devel readers,
>> ( = people interested in the improvement and development of R).
>> 
>> This is not the first time that this topic is raised,
>> and I am in no state to promise that anything will result from
>> this thread ...
>> 
>> Still, I think the majority among us has agreed that
>> 
>> 1) you should never use ifelse(test, yes, no)
>> if you know that length(test) == 1, in which case
>> if(test) yes else no
>> is much preferable  (though not equivalent: ifelse(NA, 1, 0) !)
>> 
>> 2) it is potentially inefficient by design since it (almost
>> always) evaluates both 'yes' and 'no' independent of 'test'.
>> 
>> 3) it is a nice syntax in principle, and so is often used, also by
>> myself, in spite of '2)'  just because nicely self-explaining
>> code is sometimes clearly preferable to more efficient but
>> less readable code.
>> 
>> 4) it is too late to change ifelse() fundamentally, because it
>> works according to its documentation
>> (and I think very much the same as in S and S-PLUS) and has
>> done so for ages.
>> 
>>  and if you don't agree with  1) -- 4)  you may pretend for
>> a moment instead of starting to discuss them thoroughly.
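
The remark in 1) that the two forms are not equivalent can be checked with a short sketch; the tryCatch() wrapper is only there so that the example runs to the end.

```r
# ifelse() propagates a missing test as NA
r1 <- ifelse(NA, 1, 0)

# if () refuses a missing condition; capture the error instead of stopping
r2 <- tryCatch(if (NA) 1 else 0, error = function(e) "error")

print(r1)
print(r2)
```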
>> 
>> Recently, a useR has alerted me to the fact that my Rmpfr
>> package's arbitrary (high) precision numbers don't work for a
>> relatively simple function.
>> 
>> As I found, the reason was that that simple function used
>> ifelse(.,.,.)
>> and the problem was that the (*simplified*) gist of ifelse(test, yes, no)
>> is
>> 
>> test <- as.logical(test)
>> ans <- test
>> ans[ test] <- yes
>> ans[!test] <- no
>> 
>> and in case of Rmpfr, the problem is that
>> 
>> <logical>[<logical>]  <-  <mpfr>
>> 
>> cannot work correctly
>> 
>> [[ maybe it could in a future R, if I could define a method
>> 
>> setReplaceMethod("[", c("logical","logical","mpfr"),
>> function(x,i,value) .)
>> 
>> but that currently fails as the C-low-level dispatch for '[<-'
>> does not look at the full signature
>> ]]
>> 
>> I vaguely remember having seen proposals for
>> light weight substitutes for ifelse(),  called
>> ifelse1() or
>> ifelse2() etc...
>> 
>> and I wonder if we should not try to see if there was a version
>> that could go into "base R" (maybe the 'utils' package, not
>> 'base'; that's not so important).
>> 
>> One difference to ifelse() would be that the type/mode/class of the result
>> is not initialized by logical, by default but rather by the
>> "common type" of  yes and no ... maybe determined  by  c()'ing
>> parts of those.
>> The idea was that this would work for most S3 and S4 objects for
>> which logical 'length', (logical) indexing '[', and 'rep()' works.

> I think your description is more or less:

> test <- as.logical(test)
> ans <- c(yes, no)[seq_along(test)]
> ans <- ans[seq_along(test)]
> ans[ test] <- yes[test]
> ans[!test] <- no[!test]

> (though the implementation details would vary, and recycling rules would 
> apply if the lengths of test, yes and no weren't all equal).

Yes, more or less,  notably, conceptually a version of  c(yes, no) 
to get a common mode/class but as you mention below, c()
cannot be used alone because it famously "misbehaves" e.g., for factors.

> You didn't mention what happens with attributes.  Currently we keep the 
> attributes from test, which probably doesn't make a lot of sense. In 
> particular,

> ifelse(c(TRUE, FALSE), factor(2:3), factor(3:4))

> returns nonsense, as does my translation of your idea above.

yes.   factor()s  or "Date" or "POSIXt" objects are  'base R'
examples where an alternative  ifelse() would have to work
(ideally automatically with no special-case code!) by "keeping
the class".
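
As a small illustration of the class being lost, here is what current ifelse() does with a "Date"; the particular date is arbitrary.

```r
d <- as.Date("2016-08-12")

# The result is initialized from the logical 'test', so the Date class
# of 'yes' and 'no' does not survive
r <- ifelse(c(TRUE, FALSE), d, d + 1)

print(class(r))
print(r)
```

The values are still the underlying day counts, so the dates can be recovered by hand, but the result no longer prints or behaves as a "Date".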


> That implementation also drops attributes. I'd say this definition would 
> make more sense:

> test <- as.logical(test)
> ans <- yes
> ans[!test] <- no[!test]

> (and this is suggested as an alternative in ?ifelse).  It generates an 
> error in my test example, which seems reasonable.  It gives the "right" 
> thing in

> ifelse(c(TRUE, FALSE), factor(2:3), factor(3:2))

> because the factors have the same levels.
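
The definition quoted above can be wrapped into a function for experimenting. This is only a sketch: the name ifelse_keep() is hypothetical, and it assumes that test, yes and no have equal length and that test contains no NA.

```r
# Hypothetical wrapper around the three-line definition quoted above.
# Assumes equal lengths and no NA in 'test'.
ifelse_keep <- function(test, yes, no) {
  test <- as.logical(test)
  ans <- yes                 # attributes and class come from 'yes'
  ans[!test] <- no[!test]    # uses the '[<-' method of class(yes)
  ans
}

res <- ifelse_keep(c(TRUE, FALSE), factor(2:3), factor(3:2))
print(res)
```

Because the assignment dispatches on the class of yes, the result stays a factor with the levels of yes.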

> The lack of symmetry between yes and no is slightly irksome, but I would 
> think in most cases you could choose attributes from just one of yes and 
> no to be what you want in the result (and use !test to swap the order if 

Re: [Rd] table(exclude = NULL) always includes NA

2016-08-12 Thread Martin Maechler
> Suharto Anggono Suharto Anggono via R-devel 
> on Thu, 11 Aug 2016 16:19:49 + writes:

> I stand corrected. The part "If set to 'NULL', it implies
> 'useNA="always"'." is even in the documentation in R
> 2.8.0. It was my fault not to check carefully.  I wonder,
> why "always" was chosen for 'useNA' for exclude = NULL.

me too.  "ifany" would seem more logical, and I am considering
changing to that as a 2nd step (if the 1st step, below, proves to
be feasible).

> Why exclude = NULL is so special? What about another
> 'exclude' of length zero, like character(0) (not c(),
> because c() is NULL)? I thought that, too. But then, I
> have no opinion about making it general.

As mentioned, I entirely agree with that {and you are right
about c() !!}.

> It fits my expectation to override 'useNA' only if the
> user doesn't explicitly specify 'useNA'.

> Thank you for looking into this.

you are welcome.
As first step, I plan to commit the change to (*)

 useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "always"

as proposed yesterday,  and I'll eventually see / be notified
about the effect in CRAN space.

--
(*) slightly more efficiently, I'll be using match() directly instead of %in%
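
For reference, the two spellings of the "is NA in exclude" test mentioned in the footnote can be compared side by side. The helper names na_in() and na_match() and the list of test cases are made up for this sketch.

```r
# Hypothetical helpers: the same test written with %in% and with match()
na_in    <- function(exclude) NA %in% exclude
na_match <- function(exclude) match(NA, exclude, nomatch = 0L) > 0L

# A few 'exclude' values discussed in this thread, plus the default-like c(NA, NaN)
cases <- list(NULL, "foo", c(NA, NaN), character(0))

res_in    <- vapply(cases, na_in, logical(1))
res_match <- vapply(cases, na_match, logical(1))

print(res_in)
print(res_match)
```

Both give FALSE for NULL, "foo" and character(0), and TRUE for c(NA, NaN), so the condition treats every NA-free 'exclude' the same way, not just NULL.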

> My points:
> Could R 2.7.2 behavior of table(, exclude = NULL) be brought
> back? But R 3.3.1 behavior is in R since version 2.8.0, rather long.

you are right... but then, the places / cases where the behavior
would change back should be quite rare.

> If not, I suggest changing summary().
> 

Thank you for your feedback, Suharto!
Martin

> On Thu, 11/8/16, Martin Maechler  wrote:
> 
>  Subject: Re: [Rd] table(exclude = NULL) always includes NA
> 
> @r-project.org
>  Cc: "Martin Maechler" 
>  Date: Thursday, 11 August, 2016, 12:39 AM
> 
> > Martin Maechler 
> > on Tue, 9 Aug 2016 15:35:41 +0200 writes:
> 
> > Suharto Anggono Suharto Anggono via R-devel 
> > on Sun, 7 Aug 2016 15:32:19 + writes:
> 
> > > This is an example from
> > > https://stat.ethz.ch/pipermail/r-help/2007-May/132573.html .
> > 
> > > With R 2.7.2:
> > 
> > > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
> > > > table(a, b, exclude = NULL)
> > >       b
> > > a      1 2
> > >   1    1 1
> > >   2    2 0
> > >   3    1 0
> > >   <NA> 1 0
> > 
> > > With R 3.3.1:
> > 
> > > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
> > > > table(a, b, exclude = NULL)
> > >       b
> > > a      1 2 <NA>
> > >   1    1 1    0
> > >   2    2 0    0
> > >   3    1 0    0
> > >   <NA> 1 0    0
> > > > table(a, b, useNA = "ifany")
> > >       b
> > > a      1 2
> > >   1    1 1
> > >   2    2 0
> > >   3    1 0
> > >   <NA> 1 0
> > > > table(a, b, exclude = NULL, useNA = "ifany")
> > >       b
> > > a      1 2 <NA>
> > >   1    1 1    0
> > >   2    2 0    0
> > >   3    1 0    0
> > >   <NA> 1 0    0
> > 
> > > For the example, in R 3.3.1, the result of 'table' with
> > > exclude = NULL includes NA even if NA is not present. It is
> > > different from R 2.7.2, that comes from factor(exclude = NULL), 
> > > that includes NA only if NA is present.
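
The factor() behavior referred to here can be seen directly; the input vectors are arbitrary illustration data.

```r
# With exclude = NULL, NA becomes a level only when NA occurs in the data
f_with    <- factor(c(1, 2, NA), exclude = NULL)
f_without <- factor(c(1, 2),     exclude = NULL)

print(levels(f_with))
print(levels(f_without))
```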
> > 
> > I agree that this (R 3.3.1 behavior) seems undesirable and looks
> > wrong, and the old (<= 2.7.2) behavior for  table(a,b,
> > exclude=NULL) seems desirable to me.
> > 
> > 
> > > From R 3.3.1 help on 'table', in "Details" section:
> > > 'useNA' controls if the table includes counts of 'NA' values: the
> > > allowed values correspond to never, only if the count is positive and
> > > even for zero counts.  This is overridden by specifying 'exclude = NULL'.
> > 
> > > Specifying 'exclude = NULL' overrides 'useNA' to what value? The
> > > documentation doesn't say. Looking at the code of function 'table',
> > > the value is "always".
> > 
> > Yes, it should be documented what happens for this case,
> > (but read on ...)
> 
> and it is *not* true that the documentation does not say, since
> 2013, it has contained
> 
> exclude: levels to remove for all factors in ‘...’.  If set to ‘NULL’,
>   it implies ‘useNA = "always"’.  See ‘Details’ for its
>   interpretation for non-factor arguments.
> 
> 
> > > For the example, in R 3.3.1, the result like in R 2.7.2 can be
> > > obtained with useNA = "ifany" and 'exclude' unspecified.
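
A quick check of that workaround on the example data from this thread, comparing the table dimensions for the two 'useNA' settings with 'exclude' left unspecified:

```r
a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)

# NA row for 'a' (which has an NA), no NA column for 'b' (which has none)
t_ifany <- table(a, b, useNA = "ifany")

# NA row and NA column regardless of the data
t_always <- table(a, b, useNA = "always")

print(dim(t_ifany))    # 4 2
print(dim(t_always))   # 4 3
```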
> > 
> > Yes.  What should we do?
> > I currently think that we'd want to change the line
> > 
> >  useNA <- if (!missing(exclude) && is.null(exclude)) "always"
> > 
> > to
> > 
> >  useNA <- if (!missing(exclude) && is.null(exclude)) "ifany" # was "always"
> > 
> > 
> > which would not even contradict