Re: [R] Fast multiple match function

2015-04-17 Thread Keshav Dhandhania
Hi Jeff,

Indeed, the data.table package does provide a much cleaner way to achieve
the same functionality, with a lot of other functionality as a bonus.

Thanks for letting me know about it.

On Tue, 7 Apr 2015 at 15:41 Jeff Newmiller wrote:

> You might find the data.table package helpful. It uses an index sorted
> with a radix sort and minimizes moving the data around in memory.

Re: [R] Fast multiple match function

2015-04-07 Thread Jeff Newmiller
You might find the data.table package helpful. It uses an index sorted with a 
radix sort and minimizes moving the data around in memory.
---
Jeff Newmiller
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---
Sent from my phone. Please excuse my brevity.
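
A minimal sketch of how the data.table suggestion might look for this problem. The column names y and idx are illustrative assumptions, not something from the thread, and the small vector is borrowed from the example used elsewhere in the discussion.

```r
library(data.table)

y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)

# Keep the original positions in a column, then key (sort) the table on y.
dt <- data.table(y = y, idx = seq_along(y))
setkey(dt, y)

# Keyed subsets use binary search on the sorted column instead of scanning y.
dt[.(15), idx, nomatch = 0L]   # 4 5 8
dt[.(-2), idx, nomatch = 0L]   # 3 9
dt[.(999), idx, nomatch = 0L]  # integer(0): value not present in y
```

The key is built once; each later query then costs roughly a binary search plus the number of matches, which is the behaviour the original question asks for.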


Re: [R] Fast multiple match function

2015-04-07 Thread Keshav Dhandhania
Hi all,

Thanks for the responses.
Herve's example is a good small size example of what I wanted.

> y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)
> someCoolFunc(-2, y)
[1] 3 9
> someCoolFunc(15, y)
[1] 4 5 8

The requirement is that I want someCoolFunc() to run in O(number of
matches) time, instead of O(size of y).
This is because y is big. And I don't know all the queries I want to
do up-front. And the results of some queries might change the queries
I want to do in the future.

@David: I hope the above description is clearer.
@Enrico, Herve: I want both pieces of functionality provided by one function.
- On repeated calls, fmatch() does give O(1) performance, but it does
not give all matches.
- findMatches() gives all matches, but I need to know the entire
vector x beforehand. I don't have that luxury.


I do have something that works now, using split and fmatch (package
fastmatch). So just posting that in case anyone in the future has the
same problem.
> library(fastmatch)
> y.unique <- unique(y)
>
> # create a map from the unique elements of y to the locations of all occurrences of the element
> y.map <- split(seq_along(y), match(y, y.unique))
>
> # write a wrapper function that does a look-up on the unique list and then returns all matches using the map
> someCoolFunc <- function(x) { y.map[[ fmatch(x, y.unique) ]] }
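
The same idea can be sketched with two small additions that are assumptions of this note rather than part of the posted code: the index is built inside a constructor (called makeLookup here, a hypothetical name), and a query value that does not occur in y returns integer(0) instead of NULL. Base match() stands in for fmatch() so the snippet runs without extra packages.

```r
# Build the index once; return a function that answers single-value queries.
makeLookup <- function(y) {
  y.unique <- unique(y)
  # Position i of y.map holds every index of y equal to y.unique[i].
  y.map <- unname(split(seq_along(y), match(y, y.unique)))
  function(x) {
    # fastmatch::fmatch(x, y.unique) takes the same arguments and is the
    # faster choice for repeated look-ups in the same table.
    i <- match(x, y.unique)
    if (is.na(i)) integer(0) else y.map[[i]]
  }
}

y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)
someCoolFunc <- makeLookup(y)
someCoolFunc(-2)   # 3 9
someCoolFunc(15)   # 4 5 8
someCoolFunc(999)  # integer(0)
```

Wrapping the index in a closure keeps y.unique and y.map out of the global environment, so several large vectors can each have their own look-up function.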




__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Fast multiple match function

2015-04-07 Thread Hervé Pagès

Hi Keshav,

findMatches() in the S4Vectors/IRanges packages (Bioconductor) I think
does what you want:

  library(IRanges)
  y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L)
  x <- c(unique(y), 999L)
  hits <- findMatches(x, y)

Then:

  > hits
  Hits object with 9 hits and 0 metadata columns:
        queryHits subjectHits
        <integer>   <integer>
    [1]         1           1
    [2]         2           2
    [3]         3           3
    [4]         3           9
    [5]         4           4
    [6]         4           5
    [7]         4           8
    [8]         5           6
    [9]         6           7
    ---
    queryLength: 7
    subjectLength: 9

The Hits object can be turned into a list with:

  > as.list(hits)
  [[1]]
  [1] 1

  [[2]]
  [1] 2

  [[3]]
  [1] 3 9

  [[4]]
  [1] 4 5 8

  [[5]]
  [1] 6

  [[6]]
  [1] 7

  [[7]]
  integer(0)

H.

> sessionInfo()
R version 3.2.0 beta (2015-04-05 r68151)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] IRanges_2.1.43       S4Vectors_0.5.22     BiocGenerics_0.13.11

loaded via a namespace (and not attached):
[1] tools_3.2.0
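
For readers without Bioconductor installed, the list shown by as.list(hits) above can be reproduced for this small example in base R. This is only an illustration of what the list contains, not how findMatches() works internally; the name hits.list is made up for this note.

```r
y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L)
x <- c(unique(y), 999L)

# One list element per query value, holding the matching positions in y.
# Each which() call scans all of y, so this is only sensible for small inputs.
hits.list <- lapply(x, function(q) which(y == q))

hits.list[[3]]  # 3 9          (positions of -2)
hits.list[[4]]  # 4 5 8        (positions of 15)
hits.list[[7]]  # integer(0)   (999 does not occur in y)
```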


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



Re: [R] Fast multiple match function

2015-04-07 Thread Enrico Schumann
On Mon, 06 Apr 2015, Keshav Dhandhania writes:

> Hi,
>
> I know that one can find all occurrences of x in a vector v by doing
>> which(x == v).
>
> However, if I need to do this again and again, where v is remaining the
> same, then this is quite inefficient. In my particular case, I need to do
> this millions of times, and length(v) = 100 million.
>
> Does anyone have suggestion on how to go about it?
> I know of a package called fmatch that does the above for the match
> function. But they don't handle multiple matches.
>

Perhaps 'match(x, v)' is what you want? Here 'x' may be a vector of
length > 1.

In any case, have you actually tried package 'fastmatch'? The function
'fmatch', which that package provides, is very fast for repeated
lookups in a table 'v'.


-- 
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net
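
A short illustration of the difference between the two calls discussed above, using a small vector borrowed from elsewhere in the thread. match() reports only the first position of each element of x, whereas which(x == v) reports every position; fmatch() from package fastmatch follows the match() behaviour.

```r
v <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)

match(15, v)         # 4: first position only
match(c(15, -2), v)  # 4 3: one (first) position per element of x
which(v == 15)       # 4 5 8: every position, at the cost of scanning all of v
```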

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Fast multiple match function

2015-04-06 Thread William Dunlap
split() might help, but you should give a more complete
explanation of your problem.

Bill Dunlap
TIBCO Software
wdunlap tibco.com
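
One way the split() hint might be used, sketched under the assumption that the values of v are integer-like so that their character form is a safe key. The names pos, index and lookup are made up for this note. The hashed environment is an addition, included so that a name look-up does not have to scan all the names of the list.

```r
v <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)

# One pass over v: group the positions by value; names are as.character(value).
pos <- split(seq_along(v), v)

# Copy the list into a hashed environment for fast look-up by name.
index <- list2env(pos, hash = TRUE)

lookup <- function(x) {
  # ifnotfound supplies integer(0) for values that do not occur in v.
  mget(as.character(x), envir = index, ifnotfound = list(integer(0)))[[1]]
}

lookup(15)   # 4 5 8
lookup(-2)   # 3 9
lookup(999)  # integer(0)
```

Because the keys are strings, non-integer doubles would need more care (two values that print the same would share a key).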



Re: [R] Fast multiple match function

2015-04-06 Thread David Winsemius

On Apr 6, 2015, at 1:56 PM, Keshav Dhandhania wrote:

> Hi,
> 
> I know that one can find all occurrences of x in a vector v by doing
>> which(x == v).
> 
> However, if I need to do this again and again, where v is remaining the
> same, then this is quite inefficient. In my particular case, I need to do
> this millions of times, and length(v) = 100 million.
> 
> Does anyone have suggestion on how to go about it?
> I know of a package called fmatch that does the above for the match
> function. But they don't handle multiple matches.
> 

You should explain why you need to do it millions of times and you should pose 
a small sample problem that presents the level of complexity needed in a 
minimal size.

> Thanks
> 
>   [[alternative HTML version deleted]]

And you should read the Posting Guide where it is strongly advised that you not 
post in HTML format. I have used gmail and I do know that it is fairly easy to 
post in plain text.

-- 
David Winsemius
Alameda, CA, USA



[R] Fast multiple match function

2015-04-06 Thread Keshav Dhandhania
Hi,

I know that one can find all occurrences of x in a vector v by doing
> which(x == v).

However, if I need to do this again and again, where v is remaining the
same, then this is quite inefficient. In my particular case, I need to do
this millions of times, and length(v) = 100 million.

Does anyone have suggestion on how to go about it?
I know of a package called fmatch that does the above for the match
function. But they don't handle multiple matches.

Thanks

[[alternative HTML version deleted]]
