Re: [R] duplicated() on zero-column data frames returns empty

2024-05-13 Thread Mark Webster via R-help
> If you would like to try your hand at developing a patch and make a
> case for it at R-devel or the Bugzilla, the resources at
>  can be helpful.

I am attempting to get admitted onto the Bugzilla at the moment for the data
frame cases, fingers crossed!

Best Regards,
Mark Webster

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] duplicated() on zero-column data frames returns empty

2024-05-12 Thread Ivan Krylov via R-help
(Sorry for only getting back to this more than a month later.)

On Mon, 8 Apr 2024 17:03:00 +
Jorgen Harmse wrote:

> What is the policy for changing something that is wrong? There is a
> trade-off between breaking old code that worked around a problem and
> breaking new code written by people who make reasonable assumptions.

First of all, quantify the breakage. Does the proposed change break
`make check-devel`? Does it break CRAN and Bioconductor? (This one is
hard to measure properly: someone will have to run >2 R CMD checks
times two, for "before the change" and "after the change".) Given a
persuasive case, breaking changes can still be made, but will require a
deprecation period to let the packages adjust.

If you would like to try your hand at developing a patch and make a
case for it at R-devel or the Bugzilla, the resources at
 can be helpful.

-- 
Best regards,
Ivan



Re: [R] duplicated() on zero-column data frames returns empty

2024-04-08 Thread Jorgen Harmse via R-help
I appreciate the compliment from Ivan and still share the puzzlement at the 
empty return.

What is the policy for changing something that is wrong? There is a trade-off 
between breaking old code that worked around a problem and breaking new code 
written by people who make reasonable assumptions. Mathematically, it seems 
obvious to me that duplicated.matrix(A) should do something like this:

nr <- nrow(A)
v <- matrix(FALSE, nrow = nr, ncol = 1L) # or an ordinary vector?
if (nr > 1L) { # Check because 2:0 and 2:1 do not do what we want.
  for (i in 2:nr) {
    for (j in 1:(i - 1)) {
      # or something more complicated to handle incomparables
      if (identical(A[i, ], A[j, ])) {
        v[i] <- TRUE
        break
      }
    }
  }
}
v

Of course my code is horribly inefficient, but the difference should be just in 
computing the same result faster. An empty vector of some type is identical to 
an empty vector of the same type, so this computes

      [,1]
[1,] FALSE
[2,]  TRUE
[3,]  TRUE
[4,]  TRUE
[5,]  TRUE

and I argue that that is correct.
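
The same result can be had without the double loop. What follows is only a sketch: duplicated_rows is a hypothetical helper name, not part of base R, and it assumes the desired answer for a zero-column matrix is "every row after the first is a duplicate". For matrices that do have columns it simply defers to duplicated() with MARGIN = 1 and drops the dim attribute.

```r
# Hypothetical helper (not part of base R): row-wise duplicated() for a
# matrix that returns one logical per row even when there are no columns.
duplicated_rows <- function(A) {
  stopifnot(is.matrix(A))
  if (ncol(A) == 0L) {
    # All rows are empty, hence identical: only the first is not a duplicate.
    return(seq_len(nrow(A)) > 1L)
  }
  # Otherwise defer to base R and drop the dim attribute.
  as.vector(duplicated(A, MARGIN = 1L))
}

duplicated_rows(matrix(0, 5, 0))
duplicated_rows(matrix(c(1, 1, 2, 1, 1, 2), nrow = 3))
```

Using seq_len() rather than c(FALSE, rep(TRUE, nr - 1)) keeps the zero-row case from failing.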

A gap in documentation makes a change to the correct behaviour easier. (If the 
current behaviour were documented then the first step in changing the behaviour 
would be to issue a warning that the change is coming in a future version.) The 
protection for old code could be just a warning that can be turned off with a 
call to options. The new documentation should be more explicit.

Regards,
Jorgen.




Re: [R] duplicated() on zero-column data frames returns empty

2024-04-07 Thread Mark Webster via R-help
With respect to duplicated.data.frame taking account of row names to return
all the rows as unique: thinking about this some more, I can see that making
sense in isolation, but it's at odds with the usual behaviour of duplicated for
other classes, e.g. primitive vectors, where it doesn't take account of names.

> Would you suggest similar changes to duplicated.matrix too? Currently
> it too returns 0-length output for 0-column inputs:

duplicated.matrix is an interesting one. I think a similar change would make
sense, because it would have the dimensions that people would expect when using
the default MARGIN = 1. However, it could be argued that it's not a needed
change, because the Value section of its documentation only guarantees the
dimensions of the output when using MARGIN = 0. In that case, duplicated.matrix
does indeed return the expected 5x0 matrix for your example:

str(duplicated(matrix(0, 5, 0), MARGIN = 0))
# logi[1:5, 0 ]

Best Regards,
Mark Webster


Re: [R] duplicated() on zero-column data frames returns empty

2024-04-07 Thread Ivan Krylov via R-help
On Fri, 5 Apr 2024 16:08:13 +
Jorgen Harmse wrote:

> if duplicated really treated a row name as part of the row then
> any(duplicated(data.frame(…))) would always be FALSE. My expectation
> is that if key1 is a subset of key2 then all(duplicated(df[key1]) >=
> duplicated(df[key2])) should always be TRUE.

That's a good argument, thank you!

Would you suggest similar changes to duplicated.matrix too? Currently
it too returns 0-length output for 0-column inputs:

# 0-column matrix for 0-column input
str(duplicated(matrix(0, 5, 0)))
# logi[1:5, 0 ] 

# 1-column matrix for 1-column input
str(duplicated(matrix(0, 5, 1)))
# logi [1:5, 1] FALSE TRUE TRUE TRUE TRUE

# a dim-1 array for >1-column input
str(duplicated(matrix(0, 5, 10)))
# logi [1:5(1d)] FALSE TRUE TRUE TRUE TRUE

-- 
Best regards,
Ivan



Re: [R] duplicated() on zero-column data frames returns empty

2024-04-05 Thread Jorgen Harmse via R-help
(I do not know how to make Outlook send plain text, so I avoid apostrophes.)

For what it is worth, I agree with Mark Webster. The discussion by Ivan Krylov
is interesting, but if duplicated really treated a row name as part of the row
then any(duplicated(data.frame(…))) would always be FALSE. My expectation is
that if key1 is a subset of key2 then all(duplicated(df[key1]) >=
duplicated(df[key2])) should always be TRUE.
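
A small check of that expectation, as a sketch only: the data frame and the key names below are made up for illustration. The last two lines do not call duplicated() on the empty key at all; d0 is the result argued for in this thread, written out by hand, to show that it would satisfy the same inequality.

```r
# Illustrative data: key1 is a subset of key2.
df <- data.frame(a = c(1, 1, 2, 2), b = c("x", "x", "x", "y"))
key1 <- "a"
key2 <- c("a", "b")

d1 <- duplicated(df[key1])
d2 <- duplicated(df[key2])
all(d1 >= d2)

# The empty key is a subset of every key. The result proposed in this
# thread (an assumption here, not what duplicated() returned when the
# thread was written) fits the same pattern:
d0 <- seq_len(nrow(df)) > 1L
all(d0 >= d1)
```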

Incidentally, the examples for duplicated and the documentation of unique hint
that unique(x) is the same as (but more efficient than) x[!duplicated(x)] (for
a vector) or x[!duplicated(x), , drop=FALSE] (for a data frame), and this seems
to be true even in the corner case (with what I consider incorrect output from
both functions). On the other hand, I do not see any explicit guarantee about
the order of entries in unique(x) (or setdiff(…) or intersect(…)). Code using
these functions could be more efficient with explicit guarantees, but maybe the
core team wants to preserve its own flexibility. My suggestion is to include
some options so users can at least lock in the current behaviour (with a note
that future versions may achieve it less efficiently). Other options might
include sort=TRUE in case the core team develops something more efficient than
sort(unique(…)).
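
The relationship between unique and duplicated can be checked directly in the ordinary (non-corner) case. This is a sketch with made-up inputs; same_vec and same_df are illustrative names only.

```r
# Vector case: unique(x) versus x[!duplicated(x)]
x <- c(3, 1, 3, 2, 1)
same_vec <- identical(unique(x), x[!duplicated(x)])

# Data frame case: unique(df) versus df[!duplicated(df), , drop = FALSE]
df <- data.frame(a = c(1, 1, 2), b = c("u", "u", "v"))
same_df <- identical(unique(df), df[!duplicated(df), , drop = FALSE])

c(same_vec, same_df)
```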

Regards,
Jorgen.




Re: [R] duplicated() on zero-column data frames returns empty vector

2024-04-05 Thread Mark Webster via R-help
Hello Ivan, thanks for this.

> Part of the problem is that it's not obvious what a zero-column but
> non-zero-row data.frame should mean.
> 
> On the one hand, your database relation use case is entirely valid. On
> the other hand, if data.frames are considered to be tables of data with
> row.names as their identifiers, then duplicated(d) should be returning
> logical(nrow(d)) for zero-column data.frames, since row.names are
> required to be unique. I'm sure that more interpretations can be
> devised, requiring some other behaviour for duplicated() and friends.

Do you mean the row names should mean all the rows should be counted as
non-duplicates? Yes, I can see the argument for that, thanks. I must say I'm
still puzzled at what interpretation would motivate the current behaviour of
returning a logical(0), however.

> Thankfully, duplicated() and anyDuplicated() are generic functions, and
> you can subclass your data frames to change their behaviour:
> > ...

Indeed, I'm already doing something along these lines!

Best Regards,
Mark


Re: [R] duplicated() on zero-column data frames returns empty vector

2024-04-05 Thread Ivan Krylov via R-help
Hello Mark,

On Fri, 5 Apr 2024 03:58:36 + (UTC)
Mark Webster via R-help wrote:

> I found what looks to me like an odd edge case for duplicated(),
> unique() etc. on data frames with zero columns, due to duplicated()
> returning a zero-length vector for them, regardless of the number of
> rows:

> df <- data.frame(a = 1:5)
> df$a <- NULL
> nrow(df)
> # 5 (row count preserved by row.names)
> duplicated(df)
> # logical(0), should be c(FALSE, TRUE, TRUE, TRUE, TRUE)
> anyDuplicated(df)
> # 0, should be 2

> This behaviour isn't mentioned in the documentation; is there a
> reason for it to work like this?

<...>

> I admit this is a case we rarely care about. However, for an example
> of this being an issue, I've been running into it when treating data
> frames as database relations, where they have one or more candidate
> keys (irreducible subsets of the columns for which every row must
> have a unique value set).

Part of the problem is that it's not obvious what a zero-column but
non-zero-row data.frame should mean.

On the one hand, your database relation use case is entirely valid. On
the other hand, if data.frames are considered to be tables of data with
row.names as their identifiers, then duplicated(d) should be returning
logical(nrow(d)) for zero-column data.frames, since row.names are
required to be unique. I'm sure that more interpretations can be
devised, requiring some other behaviour for duplicated() and friends.

Thankfully, duplicated() and anyDuplicated() are generic functions, and
you can subclass your data frames to change their behaviour:

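duplicated.database_relation <- function(x, incomparables = FALSE, ...) {
 # With columns present, defer to the data.frame method.
 if (length(x)) return(NextMethod())
 # With zero columns, every row after the first duplicates it;
 # seq_len() also copes with zero rows, where rep(TRUE, -1) would fail.
 seq_len(nrow(x)) > 1L
}
.S3method('duplicated', 'database_relation')

anyDuplicated.database_relation <- function(x, incomparables = FALSE, ...) {
 # Same guard as above, so relations that do have columns still work.
 if (length(x)) return(NextMethod())
 if (nrow(x) > 1L) 2L else 0L
}
.S3method('anyDuplicated', 'database_relation')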

x <- data.frame(row.names = 1:5)
class(x) <- c('database_relation', class(x))

duplicated(x)
# [1] FALSE  TRUE  TRUE  TRUE  TRUE
anyDuplicated(x)
# [1] 2
unique(x)
# data frame with 0 columns and 1 row

> [[alternative HTML version deleted]]

Since this mailing list eats the HTML parts of the e-mails, we only get
the plain text version automatically prepared by your mailer. This one
didn't look so good:
https://stat.ethz.ch/pipermail/r-help/2024-April/479143.html

Composing your messages to the list in plain text will help avoid the
problem.

-- 
Best regards,
Ivan
