I would appreciate it if you would follow the Posting Guide and give a reproducible example and post all messages using plain text.
Try m1 <- matrix(sample(0:999,2*1057837,TRUE),ncol=2) m2 <- matrix(sample(0:999,2*951980,TRUE),ncol=2) df1 <- as.data.frame(m1) df2 <- as.data.frame(m2) library(sqldf) system.time(df3 <- sqldf("SELECT DISTINCT df1.V1, df1.V2 FROM df1 INNER JOIN df2 ON df1.V1=df2.V1 AND df1.V2=df2.V2") ) The speed seems heavily dependent on how many rows are duplicated within the input data frames... so if the range of values is small then the query runs slower. Note also that moving the data from R to the database and back takes time... you may be able to import the data directly from your source data to the database and save some time. Read ?sqldf and ?read.csv.sql examples for more info. --------------------------------------------------------------------------- Jeff Newmiller The ..... ..... Go Live... DCN:<jdnew...@dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go... Live: OO#.. Dead: OO#.. Playing Research Engineer (Solar/Batteries O.O#. #.O#. with /Software/Embedded Controllers) .OO#. .OO#. rocks...1k --------------------------------------------------------------------------- Sent from my phone. Please excuse my brevity. c char <charlie.hsia...@gmail.com> wrote: >I am not familiar with R's sort and sql libs. appreciate if you can >post a >code snippet when you got time. Thanks a lot! > > >On Tue, Jul 30, 2013 at 10:36 AM, Jeff Newmiller ><jdnew...@dcn.davis.ca.us>wrote: > >> In that case, you should be looking at a relational inner join, >perhaps >> with SQLite (see package sqldf). >> >--------------------------------------------------------------------------- >> Jeff Newmiller The ..... ..... Go >Live... >> DCN:<jdnew...@dcn.davis.ca.us> Basics: ##.#. ##.#. Live >> Go... >> Live: OO#.. Dead: OO#.. >Playing >> Research Engineer (Solar/Batteries O.O#. #.O#. with >> /Software/Embedded Controllers) .OO#. .OO#. >rocks...1k >> >--------------------------------------------------------------------------- >> Sent from my phone. Please excuse my brevity. >> >> c char <charlie.hsia...@gmail.com> wrote: >> >Thanks a lot. >> >Still looking for some super fast and memory efficient solution, as >the >> >matrix I have in real world has billions of rows. >> > >> > >> >On Mon, Jul 29, 2013 at 6:24 PM, William Dunlap <wdun...@tibco.com> >> >wrote: >> > >> >> I haven't looked at the size-time relationship, but im2 (below) is >> >faster >> >> than your >> >> function on at least one example: >> >> >> >> intersectMat <- function(mat1, mat2) >> >> { >> >> #mat1 and mat2 are both deduplicated >> >> nr1 <- nrow(mat1) >> >> nr2 <- nrow(mat2) >> >> mat2[duplicated(rbind(mat1, mat2))[(nr1 + 1):(nr1 + nr2)], , >> >> drop=FALSE] >> >> } >> >> >> >> im2 <- function(mat1, mat2) >> >> { >> >> stopifnot(ncol(mat1)==2, ncol(mat1)==ncol(mat2)) >> >> toChar <- function(twoColMat) paste(sep="\1", twoColMat[,1], >> >> twoColMat[,2]) >> >> mat1[match(toChar(mat2), toChar(mat1), nomatch=0), , >drop=FALSE] >> >> } >> >> >> >> > m1 <- cbind(1:1e7, rep(1:10, len=1e7)) >> >> > m2 <- cbind(1:1e7, rep(1:20, len=1e7)) >> >> > system.time(r1 <- intersectMat(m1,m2)) >> >> user system elapsed >> >> 430.37 1.96 433.98 >> >> > system.time(r2 <- im2(m1,m2)) >> >> user system elapsed >> >> 27.89 0.20 28.13 >> >> > identical(r1, r2) >> >> [1] TRUE >> >> > dim(r1) >> >> [1] 5000000 2 >> >> >> >> Bill Dunlap >> >> Spotfire, TIBCO Software >> >> wdunlap tibco.com >> >> >> >> >> >> > -----Original Message----- >> >> > From: r-help-boun...@r-project.org >> >[mailto:r-help-boun...@r-project.org] >> >> On Behalf >> >> > Of c char >> >> > Sent: Monday, July 29, 2013 4:04 PM >> >> > To: r-help@r-project.org >> >> > Subject: [R] Intersecting two matrices >> >> > >> >> > Dear all, >> >> > >> >> > I am interested to know a faster matrix intersection package for >R >> >> handles >> >> > intersection of two integer matrices with ncol=2. Currently I am >> >using my >> >> > homemade code adapted from a previous thread: >> >> > >> >> > >> >> > intersectMat <- function(mat1, mat2){#mat1 and mat2 are both >> >> > deduplicated nr1 <- nrow(mat1) nr2 <- nrow(mat2) >> >> > mat2[duplicated(rbind(mat1, mat2))[(nr1 + 1):(nr1 + nr2)], ]} >> >> > >> >> > >> >> > which handles: >> >> > size A= 10578373 >> >> > size B= 9519807 >> >> > expected intersecting time= 251.2272 >> >> > intersecting for corssing MPRs took 409.602 seconds. >> >> > >> >> > scale a little bit worse than linearly but atomic operation is >not >> >good. >> >> > Wonder if a super fast C/C++ extension exists for this task. >Your >> >ideas >> >> are >> >> > appreciated. >> >> > >> >> > Thanks! >> >> > >> >> > [[alternative HTML version deleted]] >> >> > >> >> > ______________________________________________ >> >> > R-help@r-project.org mailing list >> >> > https://stat.ethz.ch/mailman/listinfo/r-help >> >> > PLEASE do read the posting guide >> >> http://www.R-project.org/posting-guide.html >> >> > and provide commented, minimal, self-contained, reproducible >code. >> >> >> > >> > [[alternative HTML version deleted]] >> > >> >______________________________________________ >> >R-help@r-project.org mailing list >> >https://stat.ethz.ch/mailman/listinfo/r-help >> >PLEASE do read the posting guide >> >http://www.R-project.org/posting-guide.html >> >and provide commented, minimal, self-contained, reproducible code. >> >> > > [[alternative HTML version deleted]] > >______________________________________________ >R-help@r-project.org mailing list >https://stat.ethz.ch/mailman/listinfo/r-help >PLEASE do read the posting guide >http://www.R-project.org/posting-guide.html >and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.