[R] Comparing entire row sets at once efficiently

2006-09-28 Thread Dirk Eddelbuettel

Dear useRs,

I am having a hard time coming up with a nice and efficient solution to
a problem on entire matrices or data.frames. In spirit, this is similar to
what setdiff() and setequal() do, but I need it in more dimensions.

Here's a brief description.

  * given a set of factors or sequences, expand.grid() gives me the set
    of combinations in a data.frame;

    in my case all arguments are numeric so I could convert the data frame to
    a matrix

    let's call this one Candidates

  * I have a second matrix (or data frame) to compare to; this second
    set may be a subset of the first, or a superset, but it is guaranteed to
    contain the same columns

    let's call this one Comparison

  * I want to know which rows in Candidates are not yet in Comparison.

A toy example:

> Comparison <- matrix(1:30, ncol=5)
> Candidates <- Comparison[c(2,4), ]
> checkRow <- function(r, M) {
+     any( (r[1] == M[,1]) & (r[2] == M[,2]) &
+          (r[3] == M[,3]) & (r[4] == M[,4]) )
+ }
> checkRow( Candidates[1,], Comparison)
[1] TRUE
> falseRow <- Candidates[1,]
> falseRow[2] <- 42
> checkRow( falseRow, Comparison)
[1] FALSE

The checkRow function works but is a) klunky, b) hardcodes the dimension and
c) works only on one row at a time.

There must be better ways, at least for a) and b).  What am I missing?  
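
For points a) and b), one possible direction is sketched here. This is only an illustration, not something posted in the thread: the name checkRowAll is made up, and it assumes r has one element per column of M and that there are no NA values. Unlike checkRow above, which looks at the first four columns only, it compares every column.

```r
# Hypothetical helper (not from the thread): TRUE if row r occurs anywhere in M.
checkRowAll <- function(r, M) {
  stopifnot(length(r) == ncol(M))
  # t(M) holds one row of M per column, so r is recycled down each column
  # and every row of M is compared with r element by element.
  matches <- t(M) == r
  # A row of M equals r when all ncol(M) comparisons are TRUE.
  any(colSums(matches) == ncol(M))
}

Comparison <- matrix(1:30, ncol = 5)
Candidates <- Comparison[c(2, 4), ]
falseRow <- Candidates[1, ]
falseRow[2] <- 42

checkRowAll(Candidates[1, ], Comparison)  # TRUE
checkRowAll(falseRow, Comparison)         # FALSE
```

This still handles one row per call, so point c) would need something like apply(Candidates, 1, checkRowAll, M = Comparison) on top.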

Feel free to reply off-list and I'd gladly summarize back to the list. If you
don't want your reply (or email) summarized back, please indicate.

Thanks, Dirk



-- 
Hell, there are no rules here - we're trying to accomplish something. 
  -- Thomas A. Edison

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Comparing entire row sets at once efficiently

2006-09-28 Thread Gabor Grothendieck
If Comparison and Candidates each have no duplicated rows (which
is the situation in the example) then try this:

tail(!duplicated(rbind(Comparison, Candidates)), nrow(Candidates))
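
To see what the one-liner returns, here is a small worked run on the toy data. The third candidate row and the name isNew are additions for illustration only; they are not part of the original example.

```r
Comparison <- matrix(1:30, ncol = 5)

# Two rows taken from Comparison plus one altered row that is not in it.
Candidates <- Comparison[c(2, 4, 2), ]
Candidates[3, 2] <- 42L

# duplicated() flags a row when an identical row appears earlier in the
# stacked matrix; tail() keeps only the flags that belong to Candidates.
isNew <- tail(!duplicated(rbind(Comparison, Candidates)), nrow(Candidates))
isNew  # FALSE FALSE TRUE

# The candidate rows that are not yet in Comparison.
Candidates[isNew, , drop = FALSE]
```

The drop = FALSE keeps the result a matrix even when only one row is new.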





Re: [R] Comparing entire row sets at once efficiently

2006-09-28 Thread Dirk Eddelbuettel

I should have known that  Gabor would reply within minutes with a nice
one-line solution ... :) 

On 28 September 2006 at 12:05, Gabor Grothendieck wrote:
| If Comparison and Candidates each have no duplicated rows (which
| is the situation in the example) then try this:
| 
| tail(!duplicated(rbind(Comparison, Candidates)), nrow(Candidates))

Excellent.  That will work.  Candidates has no dupes because expand.grid()
constructs it.  Comparison may have dupes, but we can ignore that.

By putting Candidates, the set we want to check, second in the rbind(),
duplicated() marks each of its rows that already occurs in Comparison, and we
then pick out just those markers via tail().  That's exactly what I needed.
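
As a rough sketch of how this might be wrapped up for reuse, the following assumes matrix inputs with the same columns and no duplicated rows in Candidates. The function name newRows and the small grid are invented for illustration; Comparison is given a duplicated row on purpose to show that it does not affect the result.

```r
# Hypothetical wrapper around the one-liner from the thread.
newRows <- function(Candidates, Comparison) {
  flags <- !duplicated(rbind(Comparison, Candidates))
  # Only the trailing flags refer to Candidates; duplicates inside
  # Comparison are flagged too, but tail() discards those positions.
  Candidates[tail(flags, nrow(Candidates)), , drop = FALSE]
}

# expand.grid() yields every combination exactly once, so no dupes here.
Candidates <- as.matrix(expand.grid(a = 1:2, b = 1:3))

# Comparison repeats its first row deliberately.
Comparison <- Candidates[c(1, 1, 4), ]

res <- newRows(Candidates, Comparison)
res  # the four rows (2,1), (1,2), (1,3), (2,3)
```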

Thanks, and chapeau for a very elegant one-liner,  Dirk
