Re: [R] select rows with identical columns from a data frame
* Bert Gunter thagre.ore...@trar.pbz [2013-01-19 22:26:46 -0800]:
>
> But David W. and Bill Dunlap gave you solutions that also work and are
> much faster, no?!

Yes, indeed, and I am now using David's solution, as it is fast (enough),
simple, and concise.

Thanks a lot to David, Bill, Rui, and arun for their answers (to this
question, my many previous questions, and, I hope, my future questions in
advance)!

> On Sat, Jan 19, 2013 at 9:41 PM, Sam Steingold s...@gnu.org wrote:
>> * Rui Barradas ehvconeen...@fncb.cg [2013-01-18 21:02:20 +]:
>>>
>>> Try the following.
>>>
>>>     complete.cases(f) & apply(f, 1, function(x) all(x == x[1]))
>>
>> thanks, this works, but is horribly slow (dim(f) is 766,950x2)

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://americancensorship.org
http://palestinefacts.org http://thereligionofpeace.com http://camera.org
http://think-israel.org
Lisp is a way of life.  C is a way of death.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] select rows with identical columns from a data frame
On Jan 20, 2013, at 8:26 AM, Sam Steingold wrote:

>> * Bert Gunter thagre.ore...@trar.pbz [2013-01-19 22:26:46 -0800]:
>>
>> But David W. and Bill Dunlap gave you solutions that also work and are
>> much faster, no?!
>
> Yes, indeed, and I am now using David's solution as it is fast (enough),
> simple and concise.

I am a bit surprised by that. I do agree that it was simple and concise, two
programming virtues that I occasionally achieve. However, when I tested it
against either of Bill Dunlap's suggestions, mine was 15-40 times slower. (So
I saved Bill's code and made a mental note to study its superiority.)

I could see why the f2 version was superior, since it progressively shrank
the index candidates for further comparison, but his first function used no
such logic and was still 15 times faster. My test included the creation of
the smaller data.frame, which his did not, but even when I modified mine to
return only the index vector, that step still consumed all the time. I
wondered whether `which` was consuming the time, but it appears that the
inner step, x==x[[1]], was the culprit.

    x <- data.frame(lapply(structure(1:10,names=letters[1:10]),
                    function(i) sample(c(NA,1,1,1,2,2,2,3), replace=TRUE, size=1e6)))

    system.time({
        keep <- x[[1]] == x[[2]]
        for (i in seq_len(ncol(x))[-(1:2)]) {
            keep <- keep & x[[i - 1]] == x[[i]]
        }
        z2 <- !is.na(keep) & keep
    })
       user  system elapsed
      0.179   0.056   0.240

    system.time({z <- rowSums(x==x[[1]]) })
       user  system elapsed
      3.535   0.535   4.067

    system.time({z <- x==x[[1]] })
       user  system elapsed
      3.540   0.524   4.061

-- 
David Winsemius, MD
Alameda, CA, USA
Re: [R] select rows with identical columns from a data frame
On Jan 20, 2013, at 9:27 AM, David Winsemius wrote:

> I wondered whether `which` was consuming the time, but it appears that the
> inner step, x==x[[1]], was the culprit.
>
>     system.time({z <- x==x[[1]] })
>        user  system elapsed
>       3.540   0.524   4.061

A further note: I was able to recover most of the timing efficiency by
coercing the data frame to a matrix before the == operation:

    system.time({z <- as.matrix(x)==x[[1]] })
       user  system elapsed
      0.181   0.140   0.320

So it's really `==.data.frame` that is the resource hog.

-- 
David Winsemius, MD
Alameda, CA, USA
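The effect David describes can be checked on a smaller data frame. The sketch below is not from the thread: it assumes 1e5 rows rather than 1e6 and a fixed seed, and the names m, z_df, z_m, t_df, and t_m are made up for illustration. Timings depend on the machine and the R version, so none are quoted here.

```r
set.seed(1)
x <- data.frame(lapply(structure(1:10, names = letters[1:10]),
                       function(i) sample(c(NA, 1, 1, 1, 2, 2, 2, 3),
                                          replace = TRUE, size = 1e5)))

# Data-frame route: == is dispatched on the data.frame class.
t_df <- system.time(z_df <- rowSums(x == x[[1]]) == ncol(x))

# Matrix route: coerce once, then compare every column with the first.
m <- as.matrix(x)
t_m <- system.time(z_m <- rowSums(m == m[, 1]) == ncol(m))

# Both routes should flag the same rows (NA where a row contains an NA).
stopifnot(identical(unname(z_df), unname(z_m)))

print(t_df)
print(t_m)
```

The coercion only makes sense when all columns share one type, as they do here; with mixed column types as.matrix() would convert everything to character.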
Re: [R] select rows with identical columns from a data frame
* Rui Barradas ehvconeen...@fncb.cg [2013-01-18 21:02:20 +]:
>
> Try the following.
>
>     complete.cases(f) & apply(f, 1, function(x) all(x == x[1]))

thanks, this works, but is horribly slow (dim(f) is 766,950x2)

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
Re: [R] select rows with identical columns from a data frame
But David W. and Bill Dunlap gave you solutions that also work and are much
faster, no?!

-- Bert

On Sat, Jan 19, 2013 at 9:41 PM, Sam Steingold s...@gnu.org wrote:
>> * Rui Barradas ehvconeen...@fncb.cg [2013-01-18 21:02:20 +]:
>>
>> Try the following.
>>
>>     complete.cases(f) & apply(f, 1, function(x) all(x == x[1]))
>
> thanks, this works, but is horribly slow (dim(f) is 766,950x2)

-- 
Bert Gunter
Genentech Nonclinical Biostatistics
[R] select rows with identical columns from a data frame
I have a data frame with several columns.
I want to select the rows with no NAs (as with complete.cases)
and all columns identical.
E.g., for

--8<---------------cut here---------------start------------->8---
f <- data.frame(a=c(1,NA,NA,4),b=c(1,NA,3,40),c=c(1,NA,5,40))
f
   a  b  c
1  1  1  1
2 NA NA NA
3 NA  3  5
4  4 40 40
--8<---------------cut here---------------end--------------->8---

I want the vector TRUE,FALSE,FALSE,FALSE selecting just the first
row because there all 3 columns are the same and none is NA.

thanks!

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
Re: [R] select rows with identical columns from a data frame
I can do

    Reduce("==",f[complete.cases(f),])

but that creates an intermediate data frame which I would love to avoid
(to save memory).

> * Sam Steingold f...@tah.bet [2013-01-18 15:53:21 -0500]:
>
> I have a data frame with several columns.
> I want to select the rows with no NAs (as with complete.cases)
> and all columns identical.

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
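One thing to watch for when there are more than two columns: Reduce with == chains the comparisons, so the third column is compared with the logical result of the first comparison rather than with the second column. The sketch below is an illustration, not code from the thread. The data frame g and the names chained and pairwise are made up, and the Map/Reduce combination is just one possible way to compare adjacent columns instead.

```r
g <- data.frame(a = c(2, 0), b = c(2, 5), c = c(2, 0))

# Chained: evaluates (a == b) == c, so a logical is compared with a number.
chained <- Reduce("==", g)

# Pairwise: compare each column with its left neighbour, then AND the results.
pairwise <- Reduce(`&`, Map(`==`, g[-1], g[-ncol(g)]))

print(chained)    # FALSE TRUE: row 1 (2,2,2) is missed, row 2 (0,5,0) is accepted
print(pairwise)   # TRUE FALSE
```

With exactly two columns, as in the 766,950x2 case mentioned elsewhere in the thread, the two forms give the same answer. The pairwise result can still contain NA for rows with missing values, so it needs the same !is.na() guard as the other approaches.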
Re: [R] select rows with identical columns from a data frame
Hello,

Try the following.

    complete.cases(f) & apply(f, 1, function(x) all(x == x[1]))

Hope this helps,

Rui Barradas

On 18-01-2013 20:53, Sam Steingold wrote:
> I have a data frame with several columns.
> I want to select the rows with no NAs (as with complete.cases)
> and all columns identical.
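To see why the complete.cases() term is needed, it may help to look at what apply() returns on its own for the example data from the original question. A small sketch (the names same and sel are illustrative, not from the thread):

```r
f <- data.frame(a = c(1, NA, NA, 4), b = c(1, NA, 3, 40), c = c(1, NA, 5, 40))

# apply() converts f to a matrix and visits it row by row.
same <- apply(f, 1, function(x) all(x == x[1]))
print(same)   # TRUE NA NA FALSE: a row whose first value is NA gives NA

# FALSE & NA is FALSE, so the incomplete rows drop out.
sel <- complete.cases(f) & same
print(sel)    # TRUE FALSE FALSE FALSE
```

The row-by-row function call is also the likely reason this form is slow on several hundred thousand rows.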
Re: [R] select rows with identical columns from a data frame
On Jan 18, 2013, at 1:02 PM, Rui Barradas wrote:

> Try the following.
>
>     complete.cases(f) & apply(f, 1, function(x) all(x == x[1]))
>
> Em 18-01-2013 20:53, Sam Steingold escreveu:
>> I have a data frame with several columns.
>> I want to select the rows with no NAs (as with complete.cases)
>> and all columns identical.
>> E.g., for
>>
>> f <- data.frame(a=c(1,NA,NA,4),b=c(1,NA,3,40),c=c(1,NA,5,40))

    f[ which( rowSums(f==f[[1]]) == length(f) ), ]
      a b c
    1 1 1 1

>> I want the vector TRUE,FALSE,FALSE,FALSE selecting just the first
>> row because there all 3 columns are the same and none is NA.

David Winsemius
Alameda, CA, USA
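David's expression returns the matching rows themselves. If the logical vector from the original question is wanted instead, the same comparison can be turned into one. A sketch under that assumption (cnt, idx, and sel are illustrative names, not from the thread):

```r
f <- data.frame(a = c(1, NA, NA, 4), b = c(1, NA, 3, 40), c = c(1, NA, 5, 40))

# Count, per row, how many columns equal the first column.
cnt <- rowSums(f == f[[1]])      # 3 NA NA 1: any NA in a row makes the sum NA

# which() drops the NA entries, so only fully matching rows survive.
idx <- which(cnt == length(f))   # row 1

# Expand the index back into a logical vector over all rows.
sel <- seq_len(nrow(f)) %in% idx
print(sel)                       # TRUE FALSE FALSE FALSE
```

Because the NA rows are removed by which(), no separate complete.cases() call is needed in this form.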
Re: [R] select rows with identical columns from a data frame
Here are two related approaches to your problem. The first uses a logical
vector, keep, to say which rows to keep. The second uses an integer vector;
it can be considerably faster when the columns are not well correlated with
one another (so the number of desired rows is a small proportion of the
input rows).

    f1 <- function (x) {
        # sieve with logical 'keep' vector
        stopifnot(is.data.frame(x), ncol(x) > 1)
        keep <- x[[1]] == x[[2]]
        for (i in seq_len(ncol(x))[-(1:2)]) {
            keep <- keep & x[[i - 1]] == x[[i]]
        }
        !is.na(keep) & keep
    }
    f2 <- function (x) {
        # sieve with integer 'keep' vector
        stopifnot(is.data.frame(x), ncol(x) > 1)
        keep <- which(x[[1]] == x[[2]])
        for (i in seq_len(ncol(x))[-(1:2)]) {
            keep <- keep[which(x[[i - 1]][keep] == x[[i]][keep])]
        }
        seq_len(nrow(x)) %in% keep
    }

E.g., for a 10 million by 10 data.frame I get:

    x <- data.frame(lapply(structure(1:10,names=letters[1:10]),
                    function(i)sample(c(NA,1,1,1,2,2,2,3), replace=TRUE, size=1e7)))
    system.time(v1 <- f1(x))
       user  system elapsed
       4.04    0.16    4.19
    system.time(v2 <- f2(x))
       user  system elapsed
       0.80    0.00    0.79
    identical(v1, v2)
    [1] TRUE
    head(x[v1,])
          a b c d e f g h i j
    4811  2 2 2 2 2 2 2 2 2 2
    41706 1 1 1 1 1 1 1 1 1 1
    56633 1 1 1 1 1 1 1 1 1 1
    70859 1 1 1 1 1 1 1 1 1 1
    83848 1 1 1 1 1 1 1 1 1 1
    84767 1 1 1 1 1 1 1 1 1 1

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Sam Steingold
Sent: Friday, January 18, 2013 12:53 PM
To: r-help@r-project.org
Subject: [R] select rows with identical columns from a data frame

> I have a data frame with several columns.
> I want to select the rows with no NAs (as with complete.cases)
> and all columns identical.
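Tracing the two sieves by hand on the four-row example from the original question may make the NA handling easier to follow. This is only a walk-through of the same steps with the loop unrolled for three columns. The data frame f comes from Sam's post; the names idx, sel1, and sel2 are illustrative.

```r
f <- data.frame(a = c(1, NA, NA, 4), b = c(1, NA, 3, 40), c = c(1, NA, 5, 40))

# Logical sieve (as in f1): NA marks a comparison that involved a missing value.
keep <- f[[1]] == f[[2]]            # TRUE NA NA FALSE
keep <- keep & f[[2]] == f[[3]]     # TRUE NA FALSE FALSE
sel1 <- !is.na(keep) & keep         # TRUE FALSE FALSE FALSE

# Integer sieve (as in f2): which() drops NA and FALSE at every step,
# so later comparisons only touch the surviving rows.
idx <- which(f[[1]] == f[[2]])                   # 1
idx <- idx[which(f[[2]][idx] == f[[3]][idx])]    # 1
sel2 <- seq_len(nrow(f)) %in% idx                # TRUE FALSE FALSE FALSE

stopifnot(identical(sel1, sel2))
```

The logical sieve always compares full-length columns, while the integer sieve compares only the rows still in idx, which is consistent with Bill's remark that it pays off when few rows survive.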
Re: [R] select rows with identical columns from a data frame
    apply(f,1,function(x) all(duplicated(x)|duplicated(x,fromLast=TRUE)&!is.na(x)))
    #[1]  TRUE FALSE FALSE FALSE

A.K.

----- Original Message -----
From: Sam Steingold s...@gnu.org
To: r-help@r-project.org
Cc:
Sent: Friday, January 18, 2013 3:53 PM
Subject: [R] select rows with identical columns from a data frame

> I have a data frame with several columns.
> I want to select the rows with no NAs (as with complete.cases)
> and all columns identical.
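One caveat, offered as an observation rather than a correction from the thread: the duplicated() test checks that every value in a row occurs at least twice. That coincides with "all columns identical" for two or three columns, but not necessarily for more. The sketch below assumes the operator lost before !is.na in the message above is &, uses a made-up four-column data frame g, and the stricter check with unique() is an assumption about what is wanted, not arun's code.

```r
g <- data.frame(a = c(1, 1), b = c(1, 1), c = c(1, 2), d = c(1, 2))

# Test from the message (with & assumed): TRUE when every value has a duplicate.
dup <- apply(g, 1, function(x)
  all(duplicated(x) | duplicated(x, fromLast = TRUE) & !is.na(x)))

# Stricter test: no NA and exactly one distinct value in the row.
strict <- apply(g, 1, function(x) !any(is.na(x)) && length(unique(x)) == 1)

print(dup)      # TRUE TRUE: the row 1,1,2,2 passes as well
print(strict)   # TRUE FALSE
```

For the three-column example in the question the two tests agree, so the posted output is unaffected.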