[R] Needing a better solution to a lookup problem.

Davis, Brian Wed, 14 Mar 2012 12:29:36 -0700

I have a solution (actually a few) to this problem, but none are 
computationally efficient enough to be useful.  I'm hoping someone can 
point me to a better one.


I have a data frame of chromosome/position pairs (along with other data for the 
location).  For each pair I need to determine whether it falls within any of 
the ranges in a given data frame of ranges.  I need to keep only the pairs that 
are within at least one of the ranges for further processing.

Example:
snps<-NULL
snps$CHR<-c("1","2","2","3","X")
snps$POS<-as.integer(c(295,640,670,100,1100))
snps$DAT<-seq_along(snps$CHR)
snps<-as.data.frame(snps, stringsAsFactors=FALSE)

 snps
  CHR  POS DAT
1   1  295   1
2   2  640   2
3   2  670   3
4   3  100   4
5   X 1100   5

region<-NULL
region$CHR<-c("1","1","2","2","2","X")
region$START<-as.integer(c(10,210,430,650,810,1090))
region$STOP<-as.integer(c(100,350,630,675,850,1111))
region<-as.data.frame(region, stringsAsFactors=FALSE)

region
  CHR START STOP
1   1    10  100
2   1   210  350
3   2   430  630
4   2   650  675
5   2   810  850
6   X  1090 1111


The result I need would look like

Res

 CHR  POS DAT
   1  295   1
   2  670   3
   X 1100   5
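
To pin down the rule behind that result, here is a minimal base-R sketch of the check. It assumes the bounds are inclusive (START <= POS <= STOP) and that CHR must match exactly; the helper name in_any_region is hypothetical and is used only for illustration. It tests one SNP at a time against every region, so it is meant as a statement of the requirement, not as something that will scale to the real data.

```r
# Hypothetical helper: TRUE for each SNP that lies inside at least one
# region on the same chromosome. Bounds are assumed to be inclusive.
in_any_region <- function(snps, region) {
  vapply(seq_len(nrow(snps)), function(i) {
    any(region$CHR == snps$CHR[i] &
        region$START <= snps$POS[i] &
        region$STOP >= snps$POS[i])
  }, logical(1))
}

# Example data from the post
snps <- data.frame(CHR = c("1", "2", "2", "3", "X"),
                   POS = c(295L, 640L, 670L, 100L, 1100L),
                   DAT = 1:5,
                   stringsAsFactors = FALSE)
region <- data.frame(CHR   = c("1", "1", "2", "2", "2", "X"),
                     START = c(10L, 210L, 430L, 650L, 810L, 1090L),
                     STOP  = c(100L, 350L, 630L, 675L, 850L, 1111L),
                     stringsAsFactors = FALSE)

# Keep only the SNPs that fall inside some region
Res <- snps[in_any_region(snps, region), ]
print(Res)
```

On the example data this keeps the rows with DAT 1, 3 and 5, matching the table above.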


I have a solution that works reasonably well on small sets, but my current data 
set has ~100K SNP entries, and my regions table has ~200K entries.  I have 
~1500 files to go through.

I haven't found a good way to solve this problem efficiently.  I've tried 
various versions of mapply/lapply, for loops, etc., which get the answer for 
small sets but take hours (per file) on my real data.  Bioconductor seemed 
like the obvious place to look, but my GoogleFu must not be that great.  I 
never found anything relevant.

Any ideas or pointers in the right direction would be greatly appreciated.



Brian Davis

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
