Re: [R] select rows with identical columns from a data frame

David Winsemius Sun, 20 Jan 2013 10:39:23 -0800


On Jan 20, 2013, at 9:27 AM, David Winsemius wrote:

On Jan 20, 2013, at 8:26 AM, Sam Steingold wrote:
* Bert Gunter <thagre.ore...@trar.pbz> [2013-01-19 22:26:46 -0800]:
But David W. and Bill Dunlap gave you solutions that also work and are
much faster, no?!
Yes, indeed, and I am now using David's solution as it is fast
(enough), simple and concise.
I am a bit surprised by that. I do agree that it was simple and concise, two programming virtues that I occasionally achieve. However, when I tested it against either of Bill Dunlap's suggestions, mine was 15-40 times slower. (So I saved Bill's code and made a mental note to study its superiority.) I could see why the f2 version was superior, since it progressively shrank the index candidates for further comparison, but his first function used no such logic and was still 15 times faster.
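
For readers without Bill's code to hand, the index-shrinking idea can be sketched roughly as follows. This is not Bill Dunlap's f2; the function name `same_rows_shrinking` and the small data frame `d` are made up for illustration, and rows containing NA are assumed to count as non-matching.

```r
# Hypothetical helper, NOT Bill Dunlap's f2: returns the indices of rows whose
# columns are all equal, re-checking only rows that are still candidates.
same_rows_shrinking <- function(df) {
  cand <- which(!is.na(df[[1]]))           # rows with NA in column 1 cannot match
  for (j in seq_len(ncol(df))[-1]) {
    ok <- df[[j]][cand] == df[[1]][cand]   # compare the surviving rows only
    cand <- cand[!is.na(ok) & ok]          # drop mismatches and NA comparisons
  }
  cand
}

# Made-up example data (not from the thread)
d <- data.frame(a = c(1, 2, NA, 3), b = c(1, 2, 1, 2), c = c(1, 2, 1, 3))
idx <- same_rows_shrinking(d)
d[idx, ]
```

Each pass compares fewer elements than the one before, so the saving should grow with the number of columns and with how quickly mismatches show up.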
My test included the creation of the smaller data.frame, which his did not, but when I modified mine to return only the index vector, computing that vector turned out to be the step that consumed all the time. I wondered if it were `which` that consumed the time, but it appears that the inner step of x==x[[1]] was the culprit.
> x <- data.frame(lapply(structure(1:10,names=letters[1:10]),function(i) sample(c(NA,1,1,1,2,2,2,3), replace=TRUE, size=1e6)))
> system.time({ keep <- x[[1]] == x[[2]]
+    for (i in seq_len(ncol(x))[-(1:2)]) {
+        keep <- keep & x[[i - 1]] == x[[i]]
+    }
+    z2 <- !is.na(keep) & keep})
  user  system elapsed
 0.179   0.056   0.240

> system.time({z <- rowSums(x==x[[1]]) })
  user  system elapsed
 3.535   0.535   4.067

> system.time({z <- x==x[[1]] })
  user  system elapsed
 3.540   0.524   4.061

A further note: I was able to recover most of the timing efficiency with an initial coercion of the data frame to a matrix before the "==" operation:


> system.time({z <- as.matrix(x)==x[[1]] })
   user  system elapsed
  0.181   0.140   0.320

So it's really `==.data.frame` that is the resource hog.
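
Putting the matrix coercion together with the row test, a minimal sketch might look like the following. The data frame is made up for illustration, and the sketch assumes all columns share one type, so that as.matrix() does not silently convert everything to character.

```r
# Made-up example data (not from the thread)
x <- data.frame(a = c(1, 2, NA, 3), b = c(1, 2, 1, 2), c = c(1, 2, 1, 3))

m    <- as.matrix(x)               # coerce once, so == runs on a plain matrix
eq   <- m == m[, 1]                # column 1 is recycled down every column
keep <- rowSums(eq) == ncol(m)     # any NA in a row makes its row sum NA
keep <- !is.na(keep) & keep        # treat rows containing NA as non-matching
x[keep, ]
```

The comparison on the data frame itself is dispatched through Ops.data.frame, which is the slow path timed above; after the coercion the same expression is an ordinary vectorised matrix comparison.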

--
David.

--
David


Thanks a lot to David, Bill, Rui, and arun for their answers (to this
question, my many previous questions, and, I hope, my future questions
in advance)!

On Sat, Jan 19, 2013 at 9:41 PM, Sam Steingold <s...@gnu.org> wrote:

* Rui Barradas <ehvconeen...@fncb.cg> [2013-01-18 21:02:20 +0000]:

Try the following.

complete.cases(f) & apply(f, 1, function(x) all(x == x[1]))


thanks, this works, but is horribly slow (dim(f) is 766,950x2)

--

David Winsemius, MD
Alameda, CA, USA

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

