On 1/12/11 6:12 PM, Martin Morgan wrote:
The Bioconductor project has many tools for dealing with
sequence-related data. With the data
k <- read.table(textConnection(
"chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +"))
f <- read.table(textConnection(
"chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G"))
One might use the GenomicRanges package as
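library(GenomicRanges)
## k: intervals named by the V4 identifiers, with strand (V6) and score (V5)
kgr <- with(k, GRanges(V1, IRanges(V2, V3, names=V4), V6, score=V5))
## f: single-base positions; V3 and V4 are carried along as metadata columns
fgr <- with(f, GRanges(V1, IRanges(V2, width=1), V3=V3, V4=V4))
## every (query, subject) pair of overlapping ranges
olaps <- findOverlaps(fgr, kgr)
## TRUE for each position in f that overlaps at least one range in k
idx <- countOverlaps(fgr, kgr) != 0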
resulting in
> idx
[1] FALSE TRUE TRUE TRUE TRUE
This will be fast.
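As a rough sketch, the olaps object computed above could also be used to label each position in f with the identifier of the range it overlaps. The column name rs_id is made up for illustration, read.table(text=) stands in for the textConnection() calls, and queryHits() / subjectHits() assume a reasonably recent Bioconductor release.

```r
library(GenomicRanges)

## Same toy data as above, rebuilt so the sketch runs on its own.
k <- read.table(text = "
chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +")
f <- read.table(text = "
chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G")

kgr <- with(k, GRanges(V1, IRanges(V2, V3, names = V4), V6, score = V5))
fgr <- with(f, GRanges(V1, IRanges(V2, width = 1), V3 = V3, V4 = V4))

olaps <- findOverlaps(fgr, kgr)

## rs_id is a hypothetical column name; NA marks positions without an overlap.
rs_id <- rep(NA_character_, length(fgr))
rs_id[queryHits(olaps)] <- names(kgr)[subjectHits(olaps)]
mcols(fgr)$rs_id <- rs_id
```

If a position overlapped more than one range in k, this would keep only the last match; with the toy data each position overlaps at most one range, so that case does not arise.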
Thanks so much for your suggestion Martin. I had Bioconductor installed
but I honestly do not know all its applications. Anyway, I am testing
GenomicRanges with my data now. I will report back when I get the result.
One could write foundY with as.data.frame(fgr[idx]) (maybe a little
editing) but likely one would want to stay in R / Bioc and do
something more interesting...
I suppose that means foundN <- as.data.frame(fgr[!idx]) and foundY <-
as.data.frame(fgr[idx]), as you suggested, but I don't really understand
your last comment :).
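For what it's worth, here is a minimal sketch of that split, assuming the toy data from earlier in the thread (with read.table(text=) in place of textConnection()). I have not tried it on my real data yet.

```r
library(GenomicRanges)

## Toy data from earlier in the thread, repeated so this runs on its own.
k <- read.table(text = "
chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +")
f <- read.table(text = "
chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G")

kgr <- with(k, GRanges(V1, IRanges(V2, V3, names = V4), V6, score = V5))
fgr <- with(f, GRanges(V1, IRanges(V2, width = 1), V3 = V3, V4 = V4))

## TRUE where a position in f falls inside at least one range in k
idx <- countOverlaps(fgr, kgr) != 0

foundY <- as.data.frame(fgr[idx])   # positions found in k
foundN <- as.data.frame(fgr[!idx])  # positions not found in k
```

Each data frame should have the columns seqnames, start, end, width and strand, followed by V3 and V4.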
Thanks,
D.
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help