On Mar 14, 2012, at 3:27 PM, Davis, Brian wrote:
I have a solution (actually a few) to this problem, but none are
computationally efficient enough to be useful. I'm hoping someone
can enlighten me to a better solution.
I have a data frame of chromosome/position pairs (along with other
data for the location). For each pair I need to determine if it is
within a given data frame of ranges. I need to keep only the pairs
that are within any of the ranges for further processing.
Example:
snps<-NULL
snps$CHR<-c("1","2","2","3","X")
snps$POS<-as.integer(c(295,640,670,100,1100))
snps$DAT<-seq_along(snps$CHR)
snps<-as.data.frame(snps, stringsAsFactors=FALSE)
snps
  CHR  POS DAT
1   1  295   1
2   2  640   2
3   2  670   3
4   3  100   4
5   X 1100   5
region<-NULL
region$CHR<-c("1","1","2","2","2","X")
region$START<-as.integer(c(10,210,430,650,810,1090))
region$STOP<-as.integer(c(100,350,630,675,850,1111))
region<-as.data.frame(region, stringsAsFactors=FALSE)
region
  CHR START STOP
1   1    10  100
2   1   210  350
3   2   430  630
4   2   650  675
5   2   810  850
6   X  1090 1111
The result I need would look like
Res
CHR  POS DAT
  1  295   1
  2  670   3
  X 1100   5
I have a solution that works reasonably well on small sets, but my
current data set is ~100K snp entries, and my regions table has
~200K entries. I have ~1500 files to go through.
I haven't found a good way to efficiently solve this problem. I've
tried various versions of mapply/lapply, for loops, etc., which get
the answer for small sets but take hours (per file) on my real
data. Bioconductor seemed like the obvious place to look, but my
GoogleFu must not be that great. I never found anything relevant.
Any ideas or pointers in the right direction would be greatly
appreciated.
The usual BioC recommendation for this sort of problem is the IRanges
package. The Bioconductor mailing list probably has many readers who
have used that package, unlike this mailing list.
It purports to handle overlapping ranges as well as the
non-overlapping problem you pose.
http://www.googlesyndicatedsearch.com/u/newcastlemaths?q=+chromosome+position+iranges&sa=Google+Search
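For what it's worth, here is a minimal, untested-on-your-data sketch of how
that might look. It assumes the Bioconductor package GenomicRanges (which
builds on IRanges) is installed, and it treats each SNP as a range of width 1.
The object names snps_gr, region_gr, keep and res are my own choices for
illustration, not anything from your code.

```r
# Sketch only: assumes GenomicRanges is installed from Bioconductor.
library(GenomicRanges)

# Rebuild the example data from the question.
snps <- data.frame(CHR = c("1", "2", "2", "3", "X"),
                   POS = as.integer(c(295, 640, 670, 100, 1100)),
                   stringsAsFactors = FALSE)
snps$DAT <- seq_along(snps$CHR)

region <- data.frame(CHR   = c("1", "1", "2", "2", "2", "X"),
                     START = as.integer(c(10, 210, 430, 650, 810, 1090)),
                     STOP  = as.integer(c(100, 350, 630, 675, 850, 1111)),
                     stringsAsFactors = FALSE)

# Each SNP becomes a range of width 1 on its chromosome.
snps_gr <- GRanges(seqnames = snps$CHR,
                   ranges   = IRanges(start = snps$POS, width = 1))

# Each region becomes a closed interval [START, STOP] on its chromosome.
region_gr <- GRanges(seqnames = region$CHR,
                     ranges   = IRanges(start = region$START,
                                        end   = region$STOP))

# TRUE for every SNP that falls inside at least one region
# on the same chromosome.
keep <- overlapsAny(snps_gr, region_gr)

# Keep only those rows of the original data frame.
res <- snps[keep, ]
res
```

On the toy data this should keep the rows with DAT 1, 3 and 5. The overlap
search is done in one vectorized call rather than pair by pair, which is why
it is usually suggested for tables of the size you describe, but I have not
timed it on anything like your ~100K by ~200K case.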
--
David Winsemius, MD
West Hartford, CT
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.