Re: [R] Fast multiple match function
Hi Jeff, Indeed the data.table package does provide a much cleaner way to achieve the same functionality, and a lot of other functionality as bonus. Thanks for letting me know about it. On Tue, 7 Apr 2015 at 15:41 Jeff Newmiller wrote: > You might find the data.table package helpful. It uses an index sorted > with a radix sort and minimizes moving the data around in memory. > --- > Jeff NewmillerThe . . Go Live... > DCN:Basics: ##.#. ##.#. Live > Go... > Live: OO#.. Dead: OO#.. Playing > Research Engineer (Solar/BatteriesO.O#. #.O#. with > /Software/Embedded Controllers) .OO#. .OO#. rocks...1k > --- > Sent from my phone. Please excuse my brevity. > > On April 7, 2015 1:50:39 PM PDT, Keshav Dhandhania > wrote: > >Hi all, > > > >Thanks for the responses. > >Herve's example is a good small size example of what I wanted. > > > >> y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2) > >> someCoolFunc(-2, y) > >[1] 3 9 > >> someCoolFunc(15, y) > >[1] 4 5 8 > > > >The requirement is that I want someCoolFunc() to run in O(number of > >matches) time, instead of O(size of y). > >This is because y is big. And I don't know all the queries I want to > >do up-front. And the results of some queries might change the queries > >I want to do in the future. > > > >@David: I hope the above description is more clear. > >@Enrico, Herve: I want both the functionality provided by one function. > >- On repeated calls, fmatch() does give O(1) performance, but it does > >not give all matches. > >- findMatches() gives all matches, but I need to know the entire > >vector x beforehand. I don't have that luxury. > > > > > >I do have something that works now, using split and fmatch (package > >fastmatch). So just posting that in case anyone in the future has the > >same problem. > >> y.unique <- unique(y) > >> > >> # create a map from the unique elements of y to the locations of all > >occurrences of the element > >> y.map <- split(1:length(y), match(y, y.unique)) > >> > >> # write a wrapper function that does a look-up on the unique list. > >and then returns all matches using the map. > >> someCoolFunc <- function(x) { y.map[[ fmatch(x, y.unique) ]] } > > > > > > > >On Tue, 7 Apr 2015 at 13:21 Hervé Pagès wrote: > >> > >> Hi Keshav, > >> > >> findMatches() in the S4Vectors/IRanges packages (Bioconductor) I > >think > >> does what you want: > >> > >>library(IRanges) > >>y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L) > >>x <- c(unique(y), 999L) > >>hits <- findMatches(x, y) > >> > >> Then: > >> > >>> hits > >>Hits object with 9 hits and 0 metadata columns: > >> queryHits subjectHits > >> > >> [1] 1 1 > >> [2] 2 2 > >> [3] 3 3 > >> [4] 3 9 > >> [5] 4 4 > >> [6] 4 5 > >> [7] 4 8 > >> [8] 5 6 > >> [9] 6 7 > >> --- > >> queryLength: 7 > >> subjectLength: 9 > >> > >> The Hits object can be turned into a list with: > >> > >>> as.list(hits) > >>[[1]] > >>[1] 1 > >> > >>[[2]] > >>[1] 2 > >> > >>[[3]] > >>[1] 3 9 > >> > >>[[4]] > >>[1] 4 5 8 > >> > >>[[5]] > >>[1] 6 > >> > >>[[6]] > >>[1] 7 > >> > >>[[7]] > >>integer(0) > >> > >> H. > >> > >> > sessionInfo() > >> R version 3.2.0 beta (2015-04-05 r68151) > >> Platform: x86_64-unknown-linux-gnu (64-bit) > >> Running under: Ubuntu 14.04.2 LTS > >> > >> locale: > >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > >> [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 > >> [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 > >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > >> [9] LC_ADDRESS=C LC_TELEPHONE=C > >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > >> > >> attached base packages: > >> [1] parallel stats4stats graphics grDevices utils > >datasets > >> [8] methods base > >> > >> other attached packages: > >> [1] IRanges_2.1.43 S4Vectors_0.5.22 BiocGenerics_0.13.11 > >> > >> loaded via a namespace (and not attached): > >> [1] tools_3.2.0 > >> > >> On 04/06/2015 01:56 PM, Keshav Dhandhania wrote: > >> > Hi, > >> > > >> > I know that one can find all occurrences of x in a vector v by > >doing > >> >> which(x == v). > >> > > >> > However, if I need to do this again and again, where v is remaining > >the > >> > same, then this is quite inefficient. In my particular case, I need > >to do > >> > this millions of times, and length(v) = 100 million. > >> > > >> > Does anyone have suggestion on how to go about it? > >> > I know of a package called f
Re: [R] Fast multiple match function
You might find the data.table package helpful. It uses an index sorted with a radix sort and minimizes moving the data around in memory. --- Jeff NewmillerThe . . Go Live... DCN:Basics: ##.#. ##.#. Live Go... Live: OO#.. Dead: OO#.. Playing Research Engineer (Solar/BatteriesO.O#. #.O#. with /Software/Embedded Controllers) .OO#. .OO#. rocks...1k --- Sent from my phone. Please excuse my brevity. On April 7, 2015 1:50:39 PM PDT, Keshav Dhandhania wrote: >Hi all, > >Thanks for the responses. >Herve's example is a good small size example of what I wanted. > >> y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2) >> someCoolFunc(-2, y) >[1] 3 9 >> someCoolFunc(15, y) >[1] 4 5 8 > >The requirement is that I want someCoolFunc() to run in O(number of >matches) time, instead of O(size of y). >This is because y is big. And I don't know all the queries I want to >do up-front. And the results of some queries might change the queries >I want to do in the future. > >@David: I hope the above description is more clear. >@Enrico, Herve: I want both the functionality provided by one function. >- On repeated calls, fmatch() does give O(1) performance, but it does >not give all matches. >- findMatches() gives all matches, but I need to know the entire >vector x beforehand. I don't have that luxury. > > >I do have something that works now, using split and fmatch (package >fastmatch). So just posting that in case anyone in the future has the >same problem. >> y.unique <- unique(y) >> >> # create a map from the unique elements of y to the locations of all >occurrences of the element >> y.map <- split(1:length(y), match(y, y.unique)) >> >> # write a wrapper function that does a look-up on the unique list. >and then returns all matches using the map. >> someCoolFunc <- function(x) { y.map[[ fmatch(x, y.unique) ]] } > > > >On Tue, 7 Apr 2015 at 13:21 Hervé Pagès wrote: >> >> Hi Keshav, >> >> findMatches() in the S4Vectors/IRanges packages (Bioconductor) I >think >> does what you want: >> >>library(IRanges) >>y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L) >>x <- c(unique(y), 999L) >>hits <- findMatches(x, y) >> >> Then: >> >>> hits >>Hits object with 9 hits and 0 metadata columns: >> queryHits subjectHits >> >> [1] 1 1 >> [2] 2 2 >> [3] 3 3 >> [4] 3 9 >> [5] 4 4 >> [6] 4 5 >> [7] 4 8 >> [8] 5 6 >> [9] 6 7 >> --- >> queryLength: 7 >> subjectLength: 9 >> >> The Hits object can be turned into a list with: >> >>> as.list(hits) >>[[1]] >>[1] 1 >> >>[[2]] >>[1] 2 >> >>[[3]] >>[1] 3 9 >> >>[[4]] >>[1] 4 5 8 >> >>[[5]] >>[1] 6 >> >>[[6]] >>[1] 7 >> >>[[7]] >>integer(0) >> >> H. >> >> > sessionInfo() >> R version 3.2.0 beta (2015-04-05 r68151) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> Running under: Ubuntu 14.04.2 LTS >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats4stats graphics grDevices utils >datasets >> [8] methods base >> >> other attached packages: >> [1] IRanges_2.1.43 S4Vectors_0.5.22 BiocGenerics_0.13.11 >> >> loaded via a namespace (and not attached): >> [1] tools_3.2.0 >> >> On 04/06/2015 01:56 PM, Keshav Dhandhania wrote: >> > Hi, >> > >> > I know that one can find all occurrences of x in a vector v by >doing >> >> which(x == v). >> > >> > However, if I need to do this again and again, where v is remaining >the >> > same, then this is quite inefficient. In my particular case, I need >to do >> > this millions of times, and length(v) = 100 million. >> > >> > Does anyone have suggestion on how to go about it? >> > I know of a package called fmatch that does the above for the match >> > function. But they don't handle multiple matches. >> > >> > Thanks >> > >> > [[alternative HTML version deleted]] >> > >> > __ >> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see >> > https://stat.ethz.ch/mailman/listinfo/r-help >> > PLEASE do read the posting guide >http://www.R-project.org/posting-guide.html >> > and provide commented, minimal, self-contained, reproducible code. >> > >> >> -- >> Hervé Pagès >> >> Pro
Re: [R] Fast multiple match function
Hi all, Thanks for the responses. Herve's example is a good small size example of what I wanted. > y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2) > someCoolFunc(-2, y) [1] 3 9 > someCoolFunc(15, y) [1] 4 5 8 The requirement is that I want someCoolFunc() to run in O(number of matches) time, instead of O(size of y). This is because y is big. And I don't know all the queries I want to do up-front. And the results of some queries might change the queries I want to do in the future. @David: I hope the above description is more clear. @Enrico, Herve: I want both the functionality provided by one function. - On repeated calls, fmatch() does give O(1) performance, but it does not give all matches. - findMatches() gives all matches, but I need to know the entire vector x beforehand. I don't have that luxury. I do have something that works now, using split and fmatch (package fastmatch). So just posting that in case anyone in the future has the same problem. > y.unique <- unique(y) > > # create a map from the unique elements of y to the locations of all > occurrences of the element > y.map <- split(1:length(y), match(y, y.unique)) > > # write a wrapper function that does a look-up on the unique list. and then > returns all matches using the map. > someCoolFunc <- function(x) { y.map[[ fmatch(x, y.unique) ]] } On Tue, 7 Apr 2015 at 13:21 Hervé Pagès wrote: > > Hi Keshav, > > findMatches() in the S4Vectors/IRanges packages (Bioconductor) I think > does what you want: > >library(IRanges) >y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L) >x <- c(unique(y), 999L) >hits <- findMatches(x, y) > > Then: > >> hits >Hits object with 9 hits and 0 metadata columns: > queryHits subjectHits > > [1] 1 1 > [2] 2 2 > [3] 3 3 > [4] 3 9 > [5] 4 4 > [6] 4 5 > [7] 4 8 > [8] 5 6 > [9] 6 7 > --- > queryLength: 7 > subjectLength: 9 > > The Hits object can be turned into a list with: > >> as.list(hits) >[[1]] >[1] 1 > >[[2]] >[1] 2 > >[[3]] >[1] 3 9 > >[[4]] >[1] 4 5 8 > >[[5]] >[1] 6 > >[[6]] >[1] 7 > >[[7]] >integer(0) > > H. > > > sessionInfo() > R version 3.2.0 beta (2015-04-05 r68151) > Platform: x86_64-unknown-linux-gnu (64-bit) > Running under: Ubuntu 14.04.2 LTS > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats4stats graphics grDevices utils datasets > [8] methods base > > other attached packages: > [1] IRanges_2.1.43 S4Vectors_0.5.22 BiocGenerics_0.13.11 > > loaded via a namespace (and not attached): > [1] tools_3.2.0 > > On 04/06/2015 01:56 PM, Keshav Dhandhania wrote: > > Hi, > > > > I know that one can find all occurrences of x in a vector v by doing > >> which(x == v). > > > > However, if I need to do this again and again, where v is remaining the > > same, then this is quite inefficient. In my particular case, I need to do > > this millions of times, and length(v) = 100 million. > > > > Does anyone have suggestion on how to go about it? > > I know of a package called fmatch that does the above for the match > > function. But they don't handle multiple matches. > > > > Thanks > > > > [[alternative HTML version deleted]] > > > > __ > > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpa...@fredhutch.org > Phone: (206) 667-5791 > Fax:(206) 667-1319 __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Fast multiple match function
Hi Keshav, findMatches() in the S4Vectors/IRanges packages (Bioconductor) I think does what you want: library(IRanges) y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L) x <- c(unique(y), 999L) hits <- findMatches(x, y) Then: > hits Hits object with 9 hits and 0 metadata columns: queryHits subjectHits [1] 1 1 [2] 2 2 [3] 3 3 [4] 3 9 [5] 4 4 [6] 4 5 [7] 4 8 [8] 5 6 [9] 6 7 --- queryLength: 7 subjectLength: 9 The Hits object can be turned into a list with: > as.list(hits) [[1]] [1] 1 [[2]] [1] 2 [[3]] [1] 3 9 [[4]] [1] 4 5 8 [[5]] [1] 6 [[6]] [1] 7 [[7]] integer(0) H. > sessionInfo() R version 3.2.0 beta (2015-04-05 r68151) Platform: x86_64-unknown-linux-gnu (64-bit) Running under: Ubuntu 14.04.2 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4stats graphics grDevices utils datasets [8] methods base other attached packages: [1] IRanges_2.1.43 S4Vectors_0.5.22 BiocGenerics_0.13.11 loaded via a namespace (and not attached): [1] tools_3.2.0 On 04/06/2015 01:56 PM, Keshav Dhandhania wrote: Hi, I know that one can find all occurrences of x in a vector v by doing which(x == v). However, if I need to do this again and again, where v is remaining the same, then this is quite inefficient. In my particular case, I need to do this millions of times, and length(v) = 100 million. Does anyone have suggestion on how to go about it? I know of a package called fmatch that does the above for the match function. But they don't handle multiple matches. Thanks [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Fast multiple match function
On Mon, 06 Apr 2015, Keshav Dhandhania writes: > Hi, > > I know that one can find all occurrences of x in a vector v by doing >> which(x == v). > > However, if I need to do this again and again, where v is remaining the > same, then this is quite inefficient. In my particular case, I need to do > this millions of times, and length(v) = 100 million. > > Does anyone have suggestion on how to go about it? > I know of a package called fmatch that does the above for the match > function. But they don't handle multiple matches. > Perhaps 'match(x, v)' is what you want? In which 'x' may be a vector of length > 1. In any case, have you actually tried package 'fastmatch'? The function 'fmatch', which that package provides, is very fast for repeated lookups in a table 'v'. -- Enrico Schumann Lucerne, Switzerland http://enricoschumann.net __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Fast multiple match function
split() might help, but you should give a more complete explanation of your problem. Bill Dunlap TIBCO Software wdunlap tibco.com On Mon, Apr 6, 2015 at 1:56 PM, Keshav Dhandhania wrote: > Hi, > > I know that one can find all occurrences of x in a vector v by doing > > which(x == v). > > However, if I need to do this again and again, where v is remaining the > same, then this is quite inefficient. In my particular case, I need to do > this millions of times, and length(v) = 100 million. > > Does anyone have suggestion on how to go about it? > I know of a package called fmatch that does the above for the match > function. But they don't handle multiple matches. > > Thanks > > [[alternative HTML version deleted]] > > __ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Fast multiple match function
On Apr 6, 2015, at 1:56 PM, Keshav Dhandhania wrote: > Hi, > > I know that one can find all occurrences of x in a vector v by doing >> which(x == v). > > However, if I need to do this again and again, where v is remaining the > same, then this is quite inefficient. In my particular case, I need to do > this millions of times, and length(v) = 100 million. > > Does anyone have suggestion on how to go about it? > I know of a package called fmatch that does the above for the match > function. But they don't handle multiple matches. > You should explain why you need to do it millions of times and you should pose a small sample problem that presents the level of complexity needed in a minimal size. > Thanks > > [[alternative HTML version deleted]] And you should read the Posting Guide where it is strongly advised that you not post in HTML format. I have used gmail and I do know that it is fairly easy to post in plain text. -- David Winsemius Alameda, CA, USA __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Fast multiple match function
Hi, I know that one can find all occurrences of x in a vector v by doing > which(x == v). However, if I need to do this again and again, where v is remaining the same, then this is quite inefficient. In my particular case, I need to do this millions of times, and length(v) = 100 million. Does anyone have suggestion on how to go about it? I know of a package called fmatch that does the above for the match function. But they don't handle multiple matches. Thanks [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.