Re: [Rd] complex NA's match(), etc: not back-compatible change proposal
>>>>> Suharto Anggono Suharto Anggono via R-devel
>>>>>     on Sat, 1 Apr 2017 14:10:06 + writes:

    > I am raising this again.
    > With
    > z <- complex(real = c(0,NaN,NaN), imaginary = c(NA,NA,0)) ,
    > results of
    > sapply(z, match, table = z)
    > and
    > match(z, z)
    > are different in R 3.4.0 alpha. I think they should be the same.

    > I suggest changing 'cequal' in unique.c such that a
    > complex number that has both NA and NaN matches NA and
    > doesn't match NaN, as such complex number is printed as NA.

Thank you very much, Suharto, for the reminder.

I committed a change to R-devel yesterday, though your suggestion
above had not been 100% clear to me.

What I think we want, and what I decided to commit as

  r72473 | maechler | 2017-04-02 22:23:56 +0200 (Sun, 02 Apr 2017)

is to entirely mimic how R format()s and print()s complex numbers:

1) If a complex number has a real or imaginary part which is NA,
   then it is formatted / printed as "NA".
   ==> All such complex numbers should match, i.e.,
   match(), unique(), duplicated() treat such complex numbers as "the same".

2) The picture is very different with (non-NA) NaN:
   There, R formats and prints NaN+1i or NaN+99i or 0+1i*NaN
   differently, and [in R-devel only, planned in R 3.4.0 alpha in
   a day or two!] match(), unique(), duplicated() now treat them
   as different.

The change is more consistent; notably, it gives the same result for
  match(z,z)  and  sapply(z, match, table = z)
for a variety of z (permutations).

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
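The two rules above can be sketched in plain R. This is only a model of the described behavior, not the C code in unique.c: `na_not_nan()` and `ckey()` are hypothetical helpers introduced here, and building the key with paste() is an assumption that is good enough for small examples (it goes through as.character(), so doubles that differ only beyond 15 significant digits would share a key).

```r
# Hypothetical helpers (not part of R): TRUE for NA but not for NaN
na_not_nan <- function(x) is.na(x) & !is.nan(x)

# A string key that is the same for two complex numbers exactly when
# rules 1) and 2) above would treat them as matching:
#  - any NA in the real or imaginary part -> the single key "NA"
#  - otherwise the pair (Re, Im), where NaN only equals NaN in the same place
ckey <- function(z) {
  re <- Re(z)
  im <- Im(z)
  ifelse(na_not_nan(re) | na_not_nan(im), "NA", paste(re, im))
}

z <- complex(real = c(0, NaN, NaN), imaginary = c(NA, NA, 0))

model       <- match(ckey(z), ckey(z))     # model of the rule: 1 1 3
builtin     <- match(z, z)                 # what this R version does
elementwise <- sapply(z, match, table = z) # one element at a time

print(rbind(model, builtin, elementwise))
```

On an R version that contains the change (3.4.0 or later), the three rows would be expected to agree; on older versions the `builtin` and `elementwise` rows may differ, as reported in this thread.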
Re: [Rd] complex NA's match(), etc: not back-compatible change proposal
I am raising this again.

With
  z <- complex(real = c(0,NaN,NaN), imaginary = c(NA,NA,0)) ,
results of
  sapply(z, match, table = z)
and
  match(z, z)
are different in R 3.4.0 alpha. I think they should be the same.

I suggest changing 'cequal' in unique.c such that a complex number that has both NA and NaN matches NA and doesn't match NaN, as such complex number is printed as NA.
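Why the two calls can disagree may be easier to see with a small model. The sketch below is an assumption based on how 'cequal' is described in this thread (it is not a copy of the C code): `old_eq()` lets a value match when both sides contain an NA or both contain a NaN, `suggested_eq()` lets a value that contains any NA count as NA only, and `first_match()` is a hypothetical linear search standing in for matching one element at a time.

```r
# Hypothetical helpers for a single complex number
has_na  <- function(z) (is.na(Re(z)) & !is.nan(Re(z))) |
                       (is.na(Im(z)) & !is.nan(Im(z)))
has_nan <- function(z) is.nan(Re(z)) | is.nan(Im(z))

# Model of the comparison as described in the thread (assumption)
old_eq <- function(a, b) {
  if (!is.na(a) && !is.na(b)) return(a == b)
  (has_na(a) && has_na(b)) || (has_nan(a) && has_nan(b))
}

# Model of the suggested change: anything containing an NA is "NA" only
suggested_eq <- function(a, b) {
  if (!is.na(a) && !is.na(b)) return(a == b)
  if (has_na(a) || has_na(b)) return(has_na(a) && has_na(b))
  has_nan(a) && has_nan(b)
}

# Position of the first table element that `eq` regards as equal to x[i]
first_match <- function(x, table, eq) {
  vapply(x, function(xi) {
    hit <- which(vapply(table, function(tj) eq(xi, tj), logical(1)))
    if (length(hit)) hit[1L] else NA_integer_
  }, integer(1))
}

z <- complex(real = c(0, NaN, NaN), imaginary = c(NA, NA, 0))

# old_eq is not transitive: z[2] ~ z[1] and z[3] ~ z[2], but not z[3] ~ z[1]
c(old_eq(z[2], z[1]), old_eq(z[3], z[2]), old_eq(z[3], z[1]))

first_match(z, z, old_eq)        # 1 1 2
first_match(z, z, suggested_eq)  # 1 1 3
```

A hash-based match() that stores only one representative per group of equal values relies on the comparison being transitive, which the modeled `suggested_eq()` is and the modeled `old_eq()` is not.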
Re: [Rd] complex NA's match(), etc: not back-compatible change proposal
With 'z' of length 8 below, or of length 12 previously, one may try

  sapply(rev(z), match, table = rev(z))
  match(rev(z), rev(z))

I found that the two results were different in R devel r70604.

A shorter one:

> z <- complex(real = c(0,NaN,NaN), imaginary = c(NA,NA,0))
> sapply(z, match, table = z)
[1] 1 1 2
> match(z, z)
[1] 1 1 3

An explanation of the behavior: With normal equality, if z[2] is equal to z[1] and z[3] is not equal to z[1], z[3] is not equal to z[2]. It is not the case here with 'cequal'. However, it seems that the property is assumed in usual case of 'match'.

For it, just changing 'cequal' so that a complex number that has both NA and NaN matches NA and doesn't match NaN is enough. It also makes length(unique(.)) not order-dependent.

For more change, I am fine with '1 A'.

On Mon, 30/5/16, Martin Maechler <maech...@stat.math.ethz.ch> wrote:

 Subject: Re: [Rd] complex NA's match(), etc: not back-compatible change proposal
 Cc: R-devel@r-project.org
 Date: Monday, 30 May, 2016, 5:48 PM

>>>>> Suharto Anggono
>>>>>     on Sat, 28 May 2016 09:34:08 + writes:

    > On 'factor', I meant the case where 'levels' is not
    > specified, where 'unique' is called.

I see, thank you.

    >> factor(c(complex(real=NaN), complex(imaginary=NaN)))
    > [1] NaN+0i
    > Levels: NaN+0i

    > Look at in the result above. Yes, it happens in
    > earlier versions of R, too.

Yes; let's call this "problem 1"

    > On matching both NA and NaN, another consequence is that
    > length(unique(.)) may depend on order.
    > Example using R devel r70604:

    >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
    >> (z <- z[is.na(z)])
    > [1] NA NaN+ 0i NA NaN+ 1i NA NA NA NA
    > [9] 0+NaNi 1+NaNi NA NaN+NaNi
    >> length(print(unique(z)))
    > [1] NA NaN+0i
    > [1] 2
    >> length(print(unique(c(z[8], z[-8]))))
    > [1] NA
    > [1] 1

Thank you, Suharto. I agree these are even more convincing reasons to consider changing. Let's call this ("matching both NA and NaN") "problem 2".
I think we agree that the R-devel -- comparted to previous versions -- *is* consistent in its (C level) functions cequal() and chash() and also is consistent with the documentation of match()/unique()/duplicated(). Hence I think a change would have to affect all of the above, including a change of documentation. Also, resolution of "problem 1" and "problem 2" are related, but --I think-- almost separate. For the following, let's use a vector notation for complex numbers, say (a, b) :== complex(real = a, imaginary = b) With R (showing relevant examples): ##-- options(width = max(85, getOption("width"))) # so 'z' prints in one line p.z <- function(z) print(noquote(paste0("(",Re(z),",",Im(z),")"))) z <- c(1,NA,NaN); z <- outer(z,z, complex, length.out=1); (z <- z[is.na(z)]) ## NA NaN+ 1i NA NA NA 1+NaNi NA NaN+NaNi p.z(z) ## (NA,1) (NaN,1) (1,NA) (NA,NA) (NaN,NA) (1,NaN) (NA,NaN) (NaN,NaN) length(p.z(unique(z[ 1:8 ]))) ## [1] (NA,1) (NaN,1) ## [1] 2 length(p.z(unique(z[ c(8,1:7) ]))) ## [1] (NaN,NaN) (NA,1) ## [1] 2 length(p.z(unique(z[ c(7:8,1:6) ]))) ## [1] (NA,NaN) ## [1] 1 ##-- Problem 1: To me, at the moment, it would seem most "natural" to consider a change where the match()/unique()/duplicated() behavior matched the behavior of print()/format()/as.character() for such complex vectors. I think this would automatically solve the issue that sometimes length(unique(as.character(x))) > length(unique(x)) The are principally two solutions to this: A: change match()/unique()/duplicated() B: change print()/format()/as.character() For A -- which seems "less disruptive" and more desirable to me -- we would have to change cequal() {and chash()!} and say that complex numbers with NA|NaN "match" if they have any NA, but otherwise, both the regular (r,i) and the NaN must be at the exact same places (and *different* NaNs should match, of course). Problem 2: unique(z[i]) depends on the permutation 'i' What should a change be here ... 
notably after the "proposed" (rather only "considered") change '1 A' above ? Can "the" new behavior easily be described in words (if '1 A' a
Re: [Rd] complex NA's match(), etc: not back-compatible change proposal
>>>>> Suharto Anggono
>>>>>     on Sat, 28 May 2016 09:34:08 + writes:

    > On 'factor', I meant the case where 'levels' is not
    > specified, where 'unique' is called.

I see, thank you.

    >> factor(c(complex(real=NaN), complex(imaginary=NaN)))
    > [1] NaN+0i
    > Levels: NaN+0i

    > Look at in the result above. Yes, it happens in
    > earlier versions of R, too.

Yes; let's call this "problem 1"

    > On matching both NA and NaN, another consequence is that
    > length(unique(.)) may depend on order.
    > Example using R devel r70604:

    >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
    >> (z <- z[is.na(z)])
    > [1] NA NaN+ 0i NA NaN+ 1i NA NA NA NA
    > [9] 0+NaNi 1+NaNi NA NaN+NaNi
    >> length(print(unique(z)))
    > [1] NA NaN+0i
    > [1] 2
    >> length(print(unique(c(z[8], z[-8]))))
    > [1] NA
    > [1] 1

Thank you, Suharto. I agree these are even more convincing
reasons to consider changing.
Let's call this ("matching both NA and NaN") "problem 2".

I think we agree that R-devel -- compared to previous
versions -- *is* consistent in its (C level) functions cequal() and
chash(), and is also consistent with the documentation of
match()/unique()/duplicated().

Hence I think a change would have to affect all of the above,
including a change of documentation.

Also, the resolutions of "problem 1" and "problem 2" are related,
but --I think-- almost separate.

For the following, let's use a vector notation for complex
numbers, say

   (a, b)  :==  complex(real = a, imaginary = b)

With R (showing relevant examples):

##--
options(width = max(85, getOption("width"))) # so 'z' prints in one line
p.z <- function(z) print(noquote(paste0("(",Re(z),",",Im(z),")")))
z <- c(1,NA,NaN); z <- outer(z,z, complex, length.out=1); (z <- z[is.na(z)])
## NA NaN+ 1i NA NA NA 1+NaNi NA NaN+NaNi
p.z(z)
## (NA,1) (NaN,1) (1,NA) (NA,NA) (NaN,NA) (1,NaN) (NA,NaN) (NaN,NaN)
length(p.z(unique(z[ 1:8 ])))
## [1] (NA,1) (NaN,1)
## [1] 2
length(p.z(unique(z[ c(8,1:7) ])))
## [1] (NaN,NaN) (NA,1)
## [1] 2
length(p.z(unique(z[ c(7:8,1:6) ])))
## [1] (NA,NaN)
## [1] 1
##--

Problem 1:

  To me, at the moment, it would seem most "natural" to consider
  a change where the match()/unique()/duplicated() behavior
  matched the behavior of print()/format()/as.character() for
  such complex vectors.
  I think this would automatically solve the issue that sometimes

     length(unique(as.character(x))) > length(unique(x))

  There are principally two solutions to this:

    A: change match()/unique()/duplicated()
    B: change print()/format()/as.character()

  For A -- which seems "less disruptive" and more desirable to me --
  we would have to change cequal() {and chash()!} and say that
  complex numbers with NA|NaN "match" if they have any NA,
  but otherwise, both the regular (r,i) and the NaN must be
  at the exact same places (and *different* NaNs should match, of course).

Problem 2:  unique(z[i]) depends on the permutation 'i'

  What should a change be here ... notably after the "proposed"
  (rather only "considered") change '1 A' above ?

  Can "the" new behavior easily be described in words (if '1 A'
  above is already assumed)?

  At the moment, I would not tackle Problem 2.
  It would become less problematic once Problem 1 is solved
  according to '1 A', because at least length(unique(.)) would
  not change: It would contain *one* z[] with an NA, and all the
  other z[]s.

Opinions ?
Thank you in advance for chiming in.. Martin Maechler, ETH Zurich > On Mon, 23/5/16, Martin Maechler <maech...@stat.math.ethz.ch> wrote: > Subject: Re: [Rd] complex NA's match(), etc: not back-compatible change proposal > Cc: R-devel@r-project.org > Date: Monday, 23 May, 2016, 11:06 PM >>>>>> > Suharto Anggono Suharto Anggono via R-devel <r-devel@r-project.org> >>>>>> on Fri, 13 > May 2016 16:33:05 + writes: > > That, for example, complex(real=NaN) > and complex(imaginary=NaN) are regarded as equal makes it > possible that > > > length(unique(as.character(x))) > length(unique(x)) > > (current code of > function 'factor' doesn't expect it). > Thank you, that is an > interesting remark - but is already true,
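One way to probe "Problem 2" on a given R installation is to count the unique values under many random orderings of the length-8 'z' from the message above. This is a hedged sketch: the seed and the number of permutations are arbitrary choices made here, and the counts it reports depend on the R version being run.

```r
x0 <- c(1, NA, NaN)
z <- outer(x0, x0, complex, length.out = 1)
z <- z[is.na(z)]   # the 8 values with an NA or NaN in some part

set.seed(1)        # arbitrary, only for reproducibility
perms <- replicate(200, sample(length(z)), simplify = FALSE)

# number of unique values for each ordering of z
n_unique <- vapply(perms, function(i) length(unique(z[i])), integer(1))
print(table(n_unique))

# "Problem 1": compare with the number of distinct printed representations
print(c(as_character = length(unique(as.character(z))),
        direct       = length(unique(z))))
```

If the comparison used by unique() is a proper equivalence relation, the table has a single entry. Under behavior '1 A' as described above, one would expect that entry to be 4 (one NA plus the three NaN patterns) and the two counts in the last line to agree; with R-devel r70604 the thread reports order-dependent results instead.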
Re: [Rd] complex NA's match(), etc: not back-compatible change proposal
On 'factor', I meant the case where 'levels' is not specified, where 'unique' is called. > factor(c(complex(real=NaN), complex(imaginary=NaN))) [1] NaN+0i Levels: NaN+0i Look at in the result above. Yes, it happens in earlier versions of R, too. On matching both NA and NaN, another consequence is that length(unique(.)) may depend on order. Example using R devel r70604: > x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0) > (z <- z[is.na(z)]) [1] NA NaN+ 0i NA NaN+ 1i NA NA NA NA [9] 0+NaNi 1+NaNi NA NaN+NaNi > length(print(unique(z))) [1] NA NaN+0i [1] 2 > length(print(unique(c(z[8], z[-8] [1] NA [1] 1 On Mon, 23/5/16, Martin Maechler <maech...@stat.math.ethz.ch> wrote: Subject: Re: [Rd] complex NA's match(), etc: not back-compatible change proposal Cc: R-devel@r-project.org Date: Monday, 23 May, 2016, 11:06 PM >>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel@r-project.org> >>>>> on Fri, 13 May 2016 16:33:05 + writes: > That, for example, complex(real=NaN) and complex(imaginary=NaN) are regarded as equal makes it possible that > length(unique(as.character(x))) > length(unique(x)) > (current code of function 'factor' doesn't expect it). Thank you, that is an interesting remark - but is already true, in [[elided Yahoo spam]] .. and of course this is because we do *print* 0+NaNi etc, i.e., we differentiate the non-NA-but-NaN complex values in formatting / printing but not in match(), unique() ... and indeed, with the 'z' example below, fz <- factor(z,z) gives a warnings about duplicated levels and gives such warnings also in current (and previous) versions of R, at least for the slightly larger z I've used in the tests/reg-tests-1c.R example. For the moment I can live with that warning, as I don't think factor()s are constructed from complex numbers "often"... and the performance of factor() in the more regular cases is important. > Yes, an argument for the behavior is that NA and NaN are of one kind. 
> On my system, using 32-bit R for Windows from binary from CRAN, the result of sapply(z, match, table = z) (not in current R-devel) may be different from below: > 1 2 3 4 1 3 7 8 2 4 8 12 # R 2.10.1, different from below > 1 2 3 4 1 3 7 8 2 4 8 12 # R 3.2.5, different from below interesting, thank you... and another reason why the change (currently only in R-devel) may have been a good one: More uniformity. > I noticed that, by function 'cequal' in unique.c, a complex number that has both NA and NaN matches NA and also matches NaN. >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0) >> (z <- z[is.na(z)]) > [1] NA NaN+ 0i NA NaN+ 1i NA NA NA NA > [9] 0+NaNi 1+NaNi NA NaN+NaNi >> sapply(z, match, table = z[8]) > [1] 1 1 1 1 1 1 1 1 1 1 1 1 >> match(z, z[8]) > [1] 1 1 1 1 1 1 1 1 1 1 1 1 Yes, I see the same. But is n't it what we expect: All of our z[] entries has at least one NA or a NaN in its real or imaginary, and since z[8] has both, it does match with all z[]'s either because of the NA or because of the NaN in common. Hence, currently, I don't think this needs to be changed... but if there are other reasons / arguments ... Thank you again, Martin Maechler >> sessionInfo() > R Under development (unstable) (2016-05-12 r70604) > Platform: i386-w64-mingw32/i386 (32-bit) > Running under: Windows XP (build 2600) Service Pack 2 > locale: > [1] LC_COLLATE=English_United States.1252 > [2] LC_CTYPE=English_United States.1252 > [3] LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > attached base packages: > [1] stats graphics grDevices utils datasets methods base > - >>>>> Martin Maechler >>>>> on Tue, 10 May 2016 16:08:39 +0200 writes: >> This is an RFC / announcement related to the 2nd part of PR#16885 >> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16885 >> about complex NA's. 
>> The (somewhat rare) incompatibility in R's 3.3.0 match() behavior for the >> case of complex numbers with NA & NaN's {which has been fixed for R 3.3.0 >> patched in the mean time} triggered some more comprehensive "research". >> I found that we have had a long-standing inconsistency at least between the >> documented and the real behavior.
Re: [Rd] complex NA's match(), etc: not back-compatible change proposal
> Martin Maechler> on Tue, 10 May 2016 16:08:39 +0200 writes: > This is an RFC / announcement related to the 2nd part of PR#16885 > https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16885 > about complex NA's. > The (somewhat rare) incompatibility in R's 3.3.0 match() behavior for the > case of complex numbers with NA & NaN's {which has been fixed for R 3.3.0 > patched in the mean time} triggered some more comprehensive "research". > I found that we have had a long-standing inconsistency at least between the > documented and the real behavior. I am claiming that the documented > behavior is desirable and hence R's current "real" behavior is bugous, and > I am proposing to change it, in R-devel (to be 3.4.0) for now. After the "roaring unanimous" assent (one private msg encouraging me to go forward, no dissenting voice, hence an "odds ratio" of +Inf in favor ;-) I have now committed my proposal to R-devel (svn rev. 70597) and some of us will be seeing the effect in package space within a day or so, in the CRAN checks against R-devel (not for bioconductor AFAIK; their checks using R-devel only when it less than ca 6 months from release). It's still worthwhile to discuss the issue, if you come late to it, notably as ---paraphrasing Dirk on the R-package-devel list--- the release of 3.4.0 is almost a year away, and so now is the best time to tinker with the API, in other words, consider breaking rarely used legacy APIs.. Martin > In help(match) we have been saying > | Exactly what matches what is to some extent a matter of definition. > | For all types, \code{NA} matches \code{NA} and no other value. > | For real and complex values, \code{NaN} values are regarded > | as matching any other \code{NaN} value, but not matching \code{NA}. > for at least 10 years. But we don't do that at all in the > complex case (and AFAIK never got a bug report about it). > Also, e.g., print(.) or format(.) 
do simply use "NA" for all > the different complex NA-containing numbers, where OTOH, > non-NA NaN's { <=> !is.nan(z) & is.na(z) } > in format() or print() do show the NaN in real and/or imaginary > parts; for an example, look at the "format" column of the matrix > below, after 'print(cbind' ... > The current match()---and duplicated(), unique() which are based on the same > C code---*do* distinguish almost all complex NA / NaN's which is > NOT according to documentation. I have found that this is just because of > of our hashing function for the complex case, chash() in R/src/main/unique.c, > is bogous in the sense that it is not compatible with the above documentation > and also not with the cequal() function (in the same file uniqu.c) for checking > equality of complex numbers. > As I have found,, a *simplified* version of the chash() function > to make it compatible with cequal() does solve all the problems I've > indicated, and the current plan is to commit that change --- after some > discussion time, here on R-devel --- to the code base. > My change passes 'make check-all' fine, but I'm 100% sure that there will > be effects in package-space. ... one reason for this posting. > As mentioned above, note that the chash() function has been in > use for all three functions > match() > duplicated() > unique() > and the change will affect all three --- but just for the case of complex > vectors with NA or NaN's. > To show more, a small R session -- using my version of R-devel > == the proposition: > The R script ('complex-NA-short.R') for (a bit more than) the > session is attached {{you can attach text/plain easily}}: >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0) >> ## --- = NA_real_ but that does not exist e.g., in R 2.3.1 >> ## similarly, '1L', '2L', .. do not exist e.g., in R 2.3.1 >> (z <- z[is.na(z)]) > [1] NA NaN+ 0i NA NaN+ 1i NA NA NA NA > [9] 0+NaNi 1+NaNi NA NaN+NaNi >> outerID <- function(x,y, ...) 
{ ## ugly; can we get outer() to work ? > + r <- matrix( , length(x), length(y)) > + for(i in seq(along=x)) > + for(j in seq(along=y)) > + r[i,j] <- identical(z[i], z[j], ...) > + r > + } >> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ: >> ## a version that works in older versions of R, where identical() had fewer arguments! >> outerID.picky <- function(x,y) { > + nF <- length(formals(identical)) - 2 > + do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF > + } >> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is a wild guess
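The remark that chash() was "not compatible with" cequal() can be illustrated with a toy bucketed lookup. Everything in this sketch is hypothetical: `eq()` models the comparison as it is described in this thread, `hash_fine()` and `hash_coarse()` are two made-up hash keys, and `bucket_match()` only compares values that landed in the same bucket, which is the property of a hash table that matters here.

```r
# Hypothetical helpers for a single complex number
has_na  <- function(z) (is.na(Re(z)) & !is.nan(Re(z))) |
                       (is.na(Im(z)) & !is.nan(Im(z)))
has_nan <- function(z) is.nan(Re(z)) | is.nan(Im(z))

# Model of the equality test (assumption based on the thread's description)
eq <- function(a, b) {
  if (!is.na(a) && !is.na(b)) return(a == b)
  (has_na(a) && has_na(b)) || (has_nan(a) && has_nan(b))
}

# Two made-up hash keys:
hash_fine   <- function(z) paste(Re(z), Im(z))   # separates every NA/NaN pattern
hash_coarse <- function(z) if (is.na(z)) "missing" else paste(Re(z), Im(z))

# Toy lookup: only candidates in the same bucket as x[i] are compared with eq()
bucket_match <- function(x, table, hash, eq) {
  keys <- vapply(table, hash, character(1))
  vapply(x, function(xi) {
    cand <- which(keys == hash(xi))
    hit  <- cand[vapply(cand, function(j) eq(xi, table[[j]]), logical(1))]
    if (length(hit)) hit[1L] else NA_integer_
  }, integer(1))
}

z <- complex(real = c(NA, NaN, NA, NaN), imaginary = c(0, 0, 1, 1))

bucket_match(z, z, hash_fine,   eq)  # 1 2 3 4 : equal values never meet
bucket_match(z, z, hash_coarse, eq)  # 1 2 1 2 : follows eq()
```

With the fine-grained key, values that `eq()` regards as equal sit in different buckets and are never compared, so every element only finds itself. A key that never separates equal values lets the lookup follow `eq()`, which is the kind of compatibility the simplified chash() is meant to restore.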
[Rd] complex NA's match(), etc: not back-compatible change proposal
This is an RFC / announcement related to the 2nd part of PR#16885
  https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16885
about complex NA's.

The (somewhat rare) incompatibility in R 3.3.0's match() behavior for the
case of complex numbers with NA & NaN's {which has been fixed for R 3.3.0
patched in the meantime} triggered some more comprehensive "research".

I found that we have had a long-standing inconsistency, at least between the
documented and the real behavior. I am claiming that the documented
behavior is desirable and hence R's current "real" behavior is buggy, and
I am proposing to change it, in R-devel (to be 3.4.0) for now.

In help(match) we have been saying

 | Exactly what matches what is to some extent a matter of definition.
 | For all types, \code{NA} matches \code{NA} and no other value.
 | For real and complex values, \code{NaN} values are regarded
 | as matching any other \code{NaN} value, but not matching \code{NA}.

for at least 10 years. But we don't do that at all in the
complex case (and AFAIK never got a bug report about it).

Also, e.g., print(.) or format(.) simply use "NA" for all
the different complex NA-containing numbers, whereas OTOH,
non-NA NaN's { <=> !is.nan(z) & is.na(z) }
in format() or print() do show the NaN in real and/or imaginary
parts; for an example, look at the "format" column of the matrix
below, after 'print(cbind' ...

The current match()---and duplicated(), unique() which are based on the same
C code---*do* distinguish almost all complex NA / NaN's, which is
NOT according to documentation. I have found that this is just because
our hashing function for the complex case, chash() in R/src/main/unique.c,
is bogus in the sense that it is not compatible with the above documentation
and also not with the cequal() function (in the same file unique.c) for checking
equality of complex numbers.
As I have found, a *simplified* version of the chash() function
to make it compatible with cequal() does solve all the problems I've
indicated, and the current plan is to commit that change --- after some
discussion time, here on R-devel --- to the code base.

My change passes 'make check-all' fine, but I'm 100% sure that there will
be effects in package-space. ... one reason for this posting.

As mentioned above, note that the chash() function has been in
use for all three functions
    match()
    duplicated()
    unique()
and the change will affect all three --- but just for the case of complex
vectors with NA or NaN's.

To show more, a small R session -- using my version of R-devel
== the proposition:
The R script ('complex-NA-short.R') for (a bit more than) the
session is attached {{you can attach text/plain easily}}:

> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
> ## --- = NA_real_ but that does not exist e.g., in R 2.3.1
> ## similarly, '1L', '2L', .. do not exist e.g., in R 2.3.1
> (z <- z[is.na(z)])
 [1] NA NaN+ 0i NA NaN+ 1i NA NA NA NA
 [9] 0+NaNi 1+NaNi NA NaN+NaNi
> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
+     r <- matrix( , length(x), length(y))
+     for(i in seq(along=x))
+         for(j in seq(along=y))
+             r[i,j] <- identical(x[i], y[j], ...)
+     r
+ }
> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
> ## a version that works in older versions of R, where identical() had fewer arguments!
> outerID.picky <- function(x,y) {
+     nF <- length(formals(identical)) - 2
+     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
+ }
> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is a wild guess
> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix [newer versions of R]

 [1,] | . . . . . . . . . . .
 [2,] . | . . . . . . . . . .
 [3,] . . | . . . . . . . . .
 [4,] . . . | . . . . . . . .
 [5,] . . . . | . . . . . . .
 [6,] . . . . . | . . . . . .
 [7,] . . . . . . | . . . . .
 [8,] . . . . . . . | . . . .
 [9,] . . . . . . . . | . . .
[10,] . . . . . . . . . | . .
[11,] . . . . . . . . . . | .
[12,] . . . . . . . . . . . |
> try(# for older R versions
+ stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
+ )
> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
 [1] 1 2 1 2 1 1 1 1 2 2 1 2
> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)
      format   Re   Im   mz
 [1,] NA       <NA> 0    1
 [2,] NaN+ 0i  NaN  0    2
 [3,] NA       <NA> 1    1
 [4,] NaN+ 1i  NaN  1    2
 [5,] NA       0    <NA> 1
 [6,] NA       1    <NA> 1
 [7,] NA       <NA> <NA> 1
 [8,] NA       NaN  <NA> 1
 [9,] 0+NaNi   0    NaN  2
[10,] 1+NaNi   1    NaN  2
[11,] NA       <NA> NaN  1
[12,] NaN+NaNi NaN  NaN  2
> --- Note that 'mz <-