Thanks, Greg. Turns out that there's an even faster alternative. Note that the OP asked whether one could include in the result the counts of each unordered pair, which I assume could be either 2 or 1. This can be done easily using table(), and it's quite a bit faster for my 1 million pair example. Herewith the details, which I'll define as functions for convenience.
## my earlier attempt using unique(): g <- function(x) { w <- x[,2] > x[, 1] x[w,] <- x[w, 2:1] unique(x) } ## present version using table(): f <- function(x){ w <- x[,2] > x[,1] x[w, ] <- x[w, 2:1] x$counts <- as.vector(table(x)) ## drop the dim x[x$counts>0, ] } > y <- expand.grid(source =1:4, target = 1:3) > g(y) source target 1 1 1 2 2 1 3 3 1 4 4 1 6 2 2 7 3 2 8 4 2 11 3 3 12 4 3 > f(y) source target counts 1 1 1 1 2 2 1 2 3 3 1 2 4 4 1 1 6 2 2 1 7 3 2 2 8 4 2 1 11 3 3 1 12 4 3 1 ## Timing: > y <- expand.grid(sample.int(1000), sample.int(1000)) ## > system.time(g(y)) user system elapsed 0.896 0.027 0.924 ## > system.time(f(y)) user system elapsed 0.142 0.009 0.151 And, yes, I was surprised by this, too. Again, it may not matter, but it is interesting. Your mileage may vary, of course. Cheers, Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Fri, Aug 20, 2021 at 12:39 PM Greg Minshall <minsh...@umich.edu> wrote: > > Bert, > > > The efficiency gains are due to vectorization and the use of more > > efficient primitives. None of this may matter of course, but it seemed > > worth mentioning. > > thanks very much! the varieties of code, and disparities of > performance, are truly wonderful. > > Rui's point that what works better for small n is not necessarily what > will work better for large n is important to keep in [my] mind. > > as a "so-far" summary, here are some timings. the relevant code is below. > ---- > my apply > user system elapsed > 8.397 0.124 8.531 > Bert's !duplicated > user system elapsed > 2.367 0.000 2.370 > Bert's x[,2]>x[,1] > user system elapsed > 1.052 0.000 1.054 > my a.d.f(unique(cbind(do.call))) > user system elapsed > 3.909 0.000 3.914 > Eric Berger's unique(...pmin...pmax) > user system elapsed > 0.848 0.000 0.850 > Eric Berger's transmuting tibble... > user system elapsed > 0.986 0.000 0.988 > Kimmo Elo's [OP] mutating paste > user system elapsed > 52.079 0.000 52.136 > Rui Barradas' sort-based > user system elapsed > 42.327 0.080 42.450 > ---- > > cheers, Greg > > ---- > n <- 1000 > x <- expand.grid(Source = 1:n, Target = 1:n) > > cat("my apply\n") > system.time({ > y <- apply(x, 1, function(y) return (c(A=min(y), B=max(y)))) > unique(t(y))}) > # user system elapsed > # 5.075 0.034 5.109 > > cat("Bert's !duplicated\n") > system.time({ > x[!duplicated(cbind(do.call(pmin, x), do.call(pmax, x))), ] > }) > # user system elapsed > # 1.340 0.013 1.353 > > # Still more efficient and still returning a data frame is: > cat("Bert's x[,2]>x[,1]\n") > system.time({ > w <- x[, 2] > x[,1] > x[w, ] <- x[w, 2:1] > unique(x)}) > # user system elapsed > # 0.693 0.011 0.703 > > cat("my a.d.f(unique(cbind(do.call)))\n") > system.time({ > as.data.frame(unique(cbind(A=do.call(pmin,x), B=do.call(pmax,x)))) > }) > > cat("Eric Berger's unique(...pmin...pmax)\n") > system.time({ > unique(data.frame(V1=pmin(x$Source,x$Target), V2=pmax(x$Source,x$Target))) > }) > > cat("Eric Berger's transmuting tibble...\n") > require(dplyr) > xt<-tibble(x) > system.time({ > xt %>% transmute( a=pmin(Source,Target), b=pmax(Source,Target)) %>% > unique() %>% rename(Source=a, Target=b) > }) > > cat("Kimmo Elo's [OP] mutating paste\n") > system.time({ > xt %>% > mutate(pair=mapply(function(x,y) > paste0(sort(c(x,y)),collapse="-"), Source, Target)) %>% > distinct(pair, > .keep_all = T) %>% > mutate(Source=sapply(pair, function(x) > unlist(strsplit(x, split="-"))[1]), Target=sapply(pair, function(x) > unlist(strsplit(x, split="-"))[2])) %>% > select(-pair) > }) > > cat("Rui Barradas' sort-based\n") > system.time({ > apply(x, 1, sort) |> t() |> unique() > }) ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.