On 1/12/11 6:12 PM, Martin Morgan wrote:
The Bioconductor project has many tools for dealing with sequence-related data. With the data

k <- read.table(textConnection(
"chr1    3237546    3237547    rs52310428    0    +
chr1    3237549    3237550    rs52097582    0    +
chr2    4513326    4513327    rs29769280    0    +
chr2    4513337    4513338    rs33286009    0    +"))

f <- read.table(textConnection(
"chr1    3213435    G    C
chr1    3237547    T    C
chr1    3237549    G    T
chr2    4513326    A    G
chr2    4513337    C    G"))

One might use the GenomicRanges package as

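library(GenomicRanges)
# known SNPs: V2/V3 are start/end, V4 the rs ID, V6 the strand, V5 a score
kgr <- with(k, GRanges(V1, IRanges(V2, V3, names=V4), V6, score=V5))
# query positions: width-1 ranges that carry the two base columns along
fgr <- with(f, GRanges(V1, IRanges(V2, width=1), V3=V3, V4=V4))
olaps <- findOverlaps(fgr, kgr)      # (fgr index, kgr index) pairs that overlap
idx <- countOverlaps(fgr, kgr) != 0  # TRUE where a query hits at least one known SNP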

resulting in

> idx
[1] FALSE  TRUE  TRUE  TRUE  TRUE

This will be fast.
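
For a small table like this one, the flags can be cross-checked in base R without Bioconductor. The following is only a sketch: it assumes V2 and V3 of k are closed start/end coordinates (as IRanges treats them above), and the name idx_base is made up for illustration. It loops over every row of f, so it will not scale the way the GenomicRanges version does.

```r
# Same example data as above; stringsAsFactors = FALSE keeps the
# chromosome names as plain character vectors in every R version.
k <- read.table(text = "chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +", stringsAsFactors = FALSE)

f <- read.table(text = "chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G", stringsAsFactors = FALSE)

# For each row of f: is there a row of k on the same chromosome whose
# closed interval [V2, V3] contains the position f$V2?
idx_base <- vapply(seq_len(nrow(f)), function(i) {
  any(k$V1 == f$V1[i] & k$V2 <= f$V2[i] & f$V2[i] <= k$V3)
}, logical(1))

idx_base
# [1] FALSE  TRUE  TRUE  TRUE  TRUE
```

The result agrees with idx above.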

Thanks so much for your suggestion, Martin. I have Bioconductor installed, but I honestly do not know everything it can do. Anyway, I am testing GenomicRanges with my data now and will report back when I have the result.


One could write foundY with as.data.frame(fgr[idx]) (maybe a little editing) but likely one would want to stay in R / Bioc and do something more interesting...


I suppose that would be foundN <- as.data.frame(fgr[!idx]) and foundY <- as.data.frame(fgr[idx]), as you suggested, but I don't really understand your last comment :).
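
One possible reading of "something more interesting" is to keep working with the overlap object rather than exporting right away, for example to attach the matching rs ID to each position that was found. This is a sketch under assumptions, not necessarily what Martin had in mind: it needs a GenomicRanges version that provides queryHits() and subjectHits(), and the names ref, alt, rsid and hits are invented here for illustration.

```r
library(GenomicRanges)

k <- read.table(text = "chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +", stringsAsFactors = FALSE)

f <- read.table(text = "chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G", stringsAsFactors = FALSE)

# Known SNPs, named by rs ID; query positions as width-1 ranges.
# "ref" and "alt" are hypothetical labels for the two base columns of f.
kgr <- GRanges(k$V1, IRanges(k$V2, k$V3, names = k$V4),
               strand = k$V6, score = k$V5)
fgr <- GRanges(f$V1, IRanges(f$V2, width = 1), ref = f$V3, alt = f$V4)

olaps <- findOverlaps(fgr, kgr)

# One row per overlapping (query, known SNP) pair, with the rs ID attached.
hits <- data.frame(as.data.frame(fgr[queryHits(olaps)]),
                   rsid = names(kgr)[subjectHits(olaps)])
hits
```

A query position that overlaps several known SNPs would appear once per match in hits, which the plain TRUE/FALSE vector idx cannot express.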

Thanks,

D.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
