You could try doing it without an explicit loop (in R, .C or otherwise), using merge() followed by a vectorized filter:

 (rgnsnp <- merge(region,snps))
 (rgnsnp[with(rgnsnp,STOP>=POS & POS >= START),])
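
For reference, here is a minimal, self-contained sketch of those two lines applied to the example data from the original post. The names hits and res are mine, and the last step (dropping START/STOP and de-duplicating with unique()) is my assumption about the desired output, not something the merge itself requires.

```r
# Example data from the original post
snps <- data.frame(CHR = c("1", "2", "2", "3", "X"),
                   POS = as.integer(c(295, 640, 670, 100, 1100)),
                   DAT = 1:5,
                   stringsAsFactors = FALSE)
region <- data.frame(CHR   = c("1", "1", "2", "2", "2", "X"),
                     START = as.integer(c(10, 210, 430, 650, 810, 1090)),
                     STOP  = as.integer(c(100, 350, 630, 675, 850, 1111)),
                     stringsAsFactors = FALSE)

# Pair each SNP with every region on the same chromosome (join on CHR)
rgnsnp <- merge(region, snps, by = "CHR")

# Keep only the pairs whose position lies inside the region (inclusive bounds)
hits <- rgnsnp[rgnsnp$POS >= rgnsnp$START & rgnsnp$POS <= rgnsnp$STOP, ]

# Assumption: only the SNP columns are wanted, once per SNP even if
# it falls inside several overlapping regions
res <- unique(hits[, c("CHR", "POS", "DAT")])
rownames(res) <- NULL
res
```

On the example data this should leave the three SNPs listed under Res in the original post (DAT 1, 3 and 5). Keep in mind that merge() pairs each SNP with every region on the same chromosome before the filter runs, so with only a few distinct chromosomes the intermediate data frame can be much larger than either input.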

Here is my timing test for merge + filter with 100k rows in one data frame and 200k in the other:

fdf1 <- data.frame(chr=1:100000,p=runif(100000),d=sample(100000))
fdf2 <- data.frame(chr=rep(1:100000,2),s=runif(200000),t=runif(200000))
system.time(with(FDF <- merge(fdf2,fdf1),FDF[s>=p & p >= t,]))
   user  system elapsed
  2.560   0.152   2.905

 Hope this helps
Elai

On Wed, Mar 14, 2012 at 1:27 PM, Davis, Brian <brian.da...@uth.tmc.edu> wrote:
> I have a solution (actually a few) to this problem, but none are 
> computationally efficient enough to be useful.  I'm hoping someone can 
> enlighten me to a better solution.
>
> I have a data frame of chromosome/position pairs (along with other data for the 
> location).  For each pair I need to determine if it is within a given data 
> frame of ranges.  I need to keep only the pairs that are within any of the 
> ranges for further processing.
>
> Example:
> snps<-NULL
> snps$CHR<-c("1","2","2","3","X")
> snps$POS<-as.integer(c(295,640,670,100,1100))
> snps$DAT<-seq(1:length(snps$CHR))
> snps<-as.data.frame(snps, stringsAsFactors=FALSE)
>
>  snps
>  CHR  POS DAT
> 1   1  295   1
> 2   2  640   2
> 3   2  670   3
> 4   3  100   4
> 5   X 1100   5
>
> region<-NULL
> region$CHR<-c("1","1","2","2","2","X")
> region$START<-as.integer(c(10,210,430,650,810,1090))
> region$STOP<-as.integer(c(100,350,630,675,850,1111))
> region<-as.data.frame(region, stringsAsFactors=FALSE)
>
> region
>  CHR START STOP
> 1   1    10  100
> 2   1   210  350
> 3   2   430  630
> 4   2   650  675
> 5   2   810  850
> 6   X  1090 1111
>
>
> The result I need would look like
>
> Res
>
>  CHR  POS DAT
>   1  295   1
>   2  670   3
>   X 1100   5
>
>
> I have a solution that works reasonably well on small sets, but my current 
> data set is ~100K snp entries, and my regions table has ~200K entries. I have 
> ~1500 files to go through
>
> I haven't found a good way to efficiently solve this problem.  I've tried 
> various versions of mapply/lapply, for loops, etc., which get the answer for 
> small sets but take hours (per file) on my real data.  Bioconductor seemed 
> like the obvious place to look, but my GoogleFu must not be that great.  I 
> never found anything relevant.
>
> Any ideas or pointers in the right direction would be greatly appreciated.
>
>
>
> Brian Davis
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
