Re: [R] speed up subsetting with certain conditions

Duke Wed, 12 Jan 2011 15:45:19 -0800

On 1/12/11 6:12 PM, Martin Morgan wrote:

The Bioconductor project has many tools for dealing with sequence-related data. With the data


k <- read.table(textConnection(
"chr1    3237546    3237547    rs52310428    0    +
chr1    3237549    3237550    rs52097582    0    +
chr2    4513326    4513327    rs29769280    0    +
chr2    4513337    4513338    rs33286009    0    +"))

f <- read.table(textConnection(
"chr1    3213435    G    C
chr1    3237547    T    C
chr1    3237549    G    T
chr2    4513326    A    G
chr2    4513337    C    G"))

One might use the GenomicRanges package as

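library(GenomicRanges)
# ranges from k: V2..V3 on chromosome V1, named by V4, with strand V6 and score V5
kgr <- with(k, GRanges(V1, IRanges(V2, V3, names=V4), V6, score=V5))
# positions from f as width-1 ranges, carrying columns V3 and V4 along
fgr <- with(f, GRanges(V1, IRanges(V2, width=1), V3=V3, V4=V4))
# (query, subject) hit pairs, handy for follow-up work on the matches
olaps <- findOverlaps(fgr, kgr)
# TRUE where a position in f falls inside at least one range in k
idx <- countOverlaps(fgr, kgr) != 0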

resulting in

> idx
[1] FALSE  TRUE  TRUE  TRUE  TRUE

This will be fast.
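As a sanity check on the idx shown above, the same answer can be reproduced in base R without Bioconductor. This is only a minimal sketch for the toy data in this thread: the name idx_check is made up for illustration, and the row-by-row comparison is quadratic in the table sizes, so it is not a substitute for the GenomicRanges approach on real data.

```r
# Same toy data as above, read with the text= argument of read.table
k <- read.table(text =
"chr1    3237546    3237547    rs52310428    0    +
chr1    3237549    3237550    rs52097582    0    +
chr2    4513326    4513327    rs29769280    0    +
chr2    4513337    4513338    rs33286009    0    +", stringsAsFactors = FALSE)

f <- read.table(text =
"chr1    3213435    G    C
chr1    3237547    T    C
chr1    3237549    G    T
chr2    4513326    A    G
chr2    4513337    C    G", stringsAsFactors = FALSE)

# For each row of f, ask whether any row of k on the same chromosome
# has V2 <= position <= V3 (closed interval, matching IRanges(V2, V3)).
idx_check <- vapply(seq_len(nrow(f)), function(i) {
  any(k$V1 == f$V1[i] & k$V2 <= f$V2[i] & f$V2[i] <= k$V3)
}, logical(1))

idx_check
# [1] FALSE  TRUE  TRUE  TRUE  TRUE
```

Only the first position (chr1 3213435) lies outside every range in k, which agrees with the countOverlaps result.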

Thanks so much for your suggestion Martin. I had Bioconductor installed but I honestly do not know all its applications. Anyway, I am testing GenomicRanges with my data now. I will report back when I get the result.

One could write foundY with as.data.frame(fgr[idx]) (maybe a little editing) but likely one would want to stay in R / Bioc and do something more interesting...

I suppose foundN <- as.data.frame(fgr[!idx]) and foundY <- as.data.frame(fgr[idx]) as you suggested, but I don't really understand your last comment :).
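For the record, here is a sketch of the foundY / foundN split on the toy data, followed by one possible reading of "something more interesting": staying with the GRanges objects and using the hits from findOverlaps to label each position in f with the name of the range it hits in k. The second part is a guess at what was meant, not something stated in the thread, and the column name rsid is made up for illustration. It also assumes each position hits at most one range, which holds for this toy data.

```r
library(GenomicRanges)

k <- read.table(text =
"chr1    3237546    3237547    rs52310428    0    +
chr1    3237549    3237550    rs52097582    0    +
chr2    4513326    4513327    rs29769280    0    +
chr2    4513337    4513338    rs33286009    0    +", stringsAsFactors = FALSE)

f <- read.table(text =
"chr1    3213435    G    C
chr1    3237547    T    C
chr1    3237549    G    T
chr2    4513326    A    G
chr2    4513337    C    G", stringsAsFactors = FALSE)

kgr <- with(k, GRanges(V1, IRanges(V2, V3, names = V4), V6, score = V5))
fgr <- with(f, GRanges(V1, IRanges(V2, width = 1), V3 = V3, V4 = V4))

# Split the positions in f into those found / not found in k
idx <- countOverlaps(fgr, kgr) != 0
foundY <- as.data.frame(fgr[idx])
foundN <- as.data.frame(fgr[!idx])

# Possible next step (assumption): attach the name of the overlapping
# range in k to each position in f, leaving NA where there is no hit.
olaps <- findOverlaps(fgr, kgr)
rsid <- rep(NA_character_, length(fgr))
rsid[queryHits(olaps)] <- names(kgr)[subjectHits(olaps)]
mcols(fgr)$rsid <- rsid
```

Here foundN holds the single unmatched position (chr1 3213435) and foundY the other four. If a position could hit several ranges, the assignment above would keep only the last hit, so the hits object would need different handling in that case.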


Thanks,

D.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
