Re: [R] speed up subsetting with certain conditions

Martin Morgan Wed, 12 Jan 2011 15:13:08 -0800

On 1/12/2011 2:52 PM, Duke wrote:

Hi folks,
I am working on a project that requires subsetting of a found filebased on some known file. The known file contains several lines likebelow:
chr1    3237546    3237547    rs52310428    0    +
chr1    3237549    3237550    rs52097582    0    +
chr2    4513326    4513327    rs29769280    0    +
chr2    4513337    4513338    rs33286009    0    +
where the first column can be chr2, chr1, chr12 etc... The second andthird are numbers (cordinates). The found file contains lines like:
chr1    3213435    G    C
chr1    3237547    T    C
chr1    3237549    G    T
chr2    4513326    A    G
chr2    4513337    C    G
where the first column, again, can be chr1, chr2, chr12 etc... and thesecond is a number. What I have to do is to separate the found file totwo files: one (foundY) contains lines that have the same first columnand the second column in range of the two columns 2 and 3 of any lineof known file, and one (foundN) contains lines that do not meet theprevious condition. For the two examples above, foundN will be thefirst line, and foundY will be the next 4 lines.
What I came up with is this algorithm:
* get the uniq item in the first column of found file (chr1, chr2,chr12, chr13 etc...)* for each of the uniq item, set subset of the known file and thefound file that have same first column, then scanning each item in theknown subset to see if any line meets any condition
The code is like below:

########## CODE START###########
# import known and found files to data frames
known <- read.table( "known.txt", sep="\t", header=FALSE )
found <- read.table( "found.txt", sep="\t", header=FALSE, fill=TRUE )

# get the uniq item in first column of found file
found.Chr <- as.character(found[!duplicated(found[[1]]),1])

# create two empty result data frames
foundN <- found[0,]
foundY <- found[0,]

# scan for each of the uniq items
for ( iChr in found.Chr ) {
  # subset of known and found with specific item
  found.iChr <- found[found[[1]]==iChr,]
  known.iChr <- known[known[[1]]==iChr,]

  # scan through all found subset items
  if ( nrow(known.iChr)>0 ) {
    for ( i in 1:nrow(found.iChr) ) {
if ( nrow(known.iChr[known.iChr[[3]]>=found.iChr[i,2] &known.iChr[[2]]<=found.iChr[i,2],])==0 ) {
          foundN <- rbind( foundN, found.iChr[i,] )
      } else {
          foundY <- rbind( foundN, found.iChr[i,] )
      }
    }
  }
}

########## CODE END###########
The code works well, but I tested it for only small known and foundfiles. When trying with larger files (the known file can contains ~ 15million lines, the found ~ 15k lines), it takes like hrs to run.
I want to speed up the process, and I believe there must be a betteralgorithm to do this with R. My questions are:
* any body has a better algorithm or comments or suggestion?

The Bioconductor project has many tools for dealing withsequence-related data. With the data


k <- read.table(textConnection(
"chr1    3237546    3237547    rs52310428    0    +
chr1    3237549    3237550    rs52097582    0    +
chr2    4513326    4513327    rs29769280    0    +
chr2    4513337    4513338    rs33286009    0    +"))

f <- read.table(textConnection(
"chr1    3213435    G    C
chr1    3237547    T    C
chr1    3237549    G    T
chr2    4513326    A    G
chr2    4513337    C    G"))

One might use the GenomicRanges package as

library(GenomicRanges)
kgr <- with(k, GRanges(V1, IRanges(V2, V3, names=V4), V6, score=V5))
fgr <- with(f, GRanges(V1, IRanges(V2, width=1), V3=V3, V4=V4))
olaps <- findOverlaps(fgr, kgr)
idx <- countOverlaps(fgr, kgr) != 0

resulting in

> idx
[1] FALSE  TRUE  TRUE  TRUE  TRUE

This will be fast.

One could write foundY with as.data.frame(fgr[idx]) (maybe a littleediting) but likely one would want to stay in R / Bioc and do somethingmore interesting...


See

http://bioconductor.org/install/index.html

Martin

* I read (google) that matrices work faster than data frame. Can I usematrices for this case? (is matrices for numbers only?)* I read (google) that I should avoid rbind, and prelocate data framefor faster speed. How would I do that in this case?
Thank you very much in advance,

Bests,

D.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guidehttp://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] speed up subsetting with certain conditions

Reply via email to