Re: [Bioc-sig-seq] Extracting DNA sequences from BSgenome.Mmusculus.UCSC.mm9_1.3.11

Hervé Pagès Mon, 01 Jun 2009 11:18:31 -0700

Hi Paul,

Thanks for the feedback! I just set up a script that does something similar
to your code and was able to trigger the spooky/infamous bug. The first time
after 140 iterations (55 sec.), the second time after 73, etc... Seems like
I always get it after a few dozens of iterations. This is great because
that's going to make troubleshooting this issue a little bit easier!


BTW, if you are interested in speeding up your code (and at the same time, you
might get rid of the spooky bug), you can now take advantage of the fact that
getSeq() is vectorized (starting with BSgenome >= 1.12.2). Given that there is
also a vectorized version of countPattern() (it's the vcountPattern() function),
then you don't need the "for" loop anymore:

  myb.consen.count <- character(nrow(peaks))
  all.genomic <- DNAStringSet(getSeq(Mmusculus, peaks$chr,
                                     peaks$start, peaks$end))
  c.1 <- vcountPattern(myb.dna, all.genomic, fixed=FALSE)
  c.2 <- vcountPattern(myb.dna.comp, all.genomic, fixed=FALSE)
  c.3 <- vcountPattern(myb.dna.rev, all.genomic, fixed=FALSE)
  c.4 <- vcountPattern(myb.dna.comp.rev, all.genomic, fixed=FALSE)
  ## Now c.1, c.2, c.3 and c.4 are integer vectors of the same length
  ## as 'all.genomic'
  the.sum <- c.1 + c.2 + c.3 + c.4
  the.consen.count <- paste("F:", c.1, "FC:", c.2, "FRev:", c.3, "FCRev:", c.4)
  myb.consen.count[the.sum > 0] <- the.consen.count[the.sum > 0]

This is thousands of times faster than calling getSeq() and countPattern()
inside a "for" loop. Also, it doesn't seem to trigger the spooky/infamous
bug :-)

Thanks again,
H.


Paul Leo wrote:

Sorry if this is not on topic but I think I can generate *this* spooky
bug with this simple code, I post in case it helps:

library(BSgenome.Mmusculus.UCSC.mm9)
all.genomic<-getSeq(Mmusculus, the.chrom, starts, ends)

myb<-"YAACKG"myb.dna<-DNAString(myb)

myb.dna.rev<-reverse(myb.dna)
myb.dna.comp<-complement(myb.dna)
myb.dna.comp.rev<-reverseComplement(myb.dna)

### peaks is just a set of regions - not large but lots of them#> peaks[1:5,1:4]

 #     chr     start       end length
#7568   14 119073660 119074474    815
#11980  19  41922512  41923422    911
#9375   16  70351968  70352514    547
#158     1  33635080  33636130   1051
#15576   3 151996276 151997236    961

for(i in 1:dim(peaks)[1]){

genomic<-getSeq(Mmusculus,the.chrom[i],start=starts[i],end=ends[i],as.character=FALSE)
print(length(genomic)  )

c.1<-countPattern(myb.dna,genomic,max.mismatch=0, fixed=FALSE)

c.2<-countPattern(myb.dna.comp,genomic,max.mismatch=0, fixed=FALSE)c.3<-countPattern(myb.dna.rev,genomic,max.mismatch=0, fixed=FALSE)

c.4<-countPattern(myb.dna.comp.rev,genomic,max.mismatch=0, fixed=FALSE)
the.sum<-(c.1+c.2+c.3+c.4)
print(paste("counter:",i,"sites:",the.sum,sep=" "))
if( the.sum >
0){myb.consen.count[i,]<-paste("F:",c.1,"FC:",c.2,"FRev:",c.3,"FCRev:",c.4,sep=" 
")}
}

This loop fails randomly ever few hundred iterations with no apparent
reason all the variables are fine , you can restart at the error point
and it goes fine for another few hundred then dies again....

eg

first dies atcounter 103,.......

[1] "counter: 102 sites: 5"
[1] 729

Error in assign(".nextMethod", nextMethod, envir = callEnv) :formal argument "envir" matched by multiple actual argumentsError in isEmpty(xx) :error in evaluating the argument 'x' in selecting a method for

function 'isEmpty'
Error in countPattern(pattern, toXStringViewsOrXString(subject),

algorithm = algorithm, :error in evaluating the argument 'subject' in selecting a method for

function 'countPattern'

genomic

  729-letter "MaskedDNAString" instance (# for masking)
seq:
ACCAGGCTCCGGACGCCTTCTCCCCAGCGTTTTGCA...ATTGGGTTGGAAACTTTGTAGGAAGGAAAGGAAAAT
masks:

#### Genomic is the corect length and is ok..straight away can do:

c.1<-countPattern(myb.dna,genomic,max.mismatch=0, fixed=FALSE)
c.2<-countPattern(myb.dna.comp,genomic,max.mismatch=0, fixed=FALSE)

#SAME matchPattern(myb.dna,complement(genomic),max.mismatch=0,

fixed=FALSE)

c.3<-countPattern(myb.dna.rev,genomic,max.mismatch=0, fixed=FALSE)
c.4<-countPattern(myb.dna.comp.rev,genomic,max.mismatch=0,

fixed=FALSE)

the.sum<-(c.1+c.2+c.3+c.4)
print(paste("counter:",i,"sites:",the.sum,sep=" "))

[1] "counter: 103 sites: 3"
### see no problem what bug???


###### restart at i=103...it dies again at 604

[1] 737
[1] "counter: 604 sites: 3"
[1] 834

Error in assign(".defined", met...@defined, envir = envir) :formal argument "envir" matched by multiple actual arguments



## I will use Herves last post to get about this.....Can run on the ##
development version too if that helps??

sessionInfo()

R version 2.9.0 (2009-04-17)x86_64-pc-linux-gnu

locale:
LC_CTYPE=en_AU.UTF-8;LC_NUMERIC=C;LC_TIME=en_AU.UTF-8;LC_COLLATE=en_AU.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_AU.UTF-8;LC_PAPER=en_AU.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_AU.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods

base

other attached packages:
 [1] BSgenome.Mmusculus.UCSC.mm9_1.3.11

BSgenome_1.12.0[3] Biostrings_2.12.0IRanges_1.2.1[5] KEGG.db_2.2.11RSQLite_0.7-1[7] DBI_0.2-4AnnotationDbi_1.6.0[9] Biobase_2.4.0biomaRt_2.0.0

loaded via a namespace (and not attached):

[1] RCurl_0.94-1 tcltk_2.9.0 tools_2.9.0 XML_1.95-3

-Dr Paul Leo

Bioinformatician
Diamantina Institute for Cancer, Immunology and Metabolic Medicine
University of Queensland
--------------------------------------------------------------------------------------
Research Wing, Bldg 1

Princess Alexandria HospitalWoolloongabba, QLD, 4102

Tel: +61 7 3240 7740  Mob: 041 303 8691  Fax: +61 7 3240 5946
Email: [email protected]   Web: http://www.di.uq.edu.au

-----Original Message-----
From: Hervé Pagès <[email protected]>
To: Ivan Gregoretti <[email protected]>
Cc: [email protected]
Subject: Re: [Bioc-sig-seq] Extracting DNA sequences from
BSgenome.Mmusculus.UCSC.mm9_1.3.11
Date: Thu, 28 May 2009 17:21:15 -0700

Ivan,

With BSgenome 1.12.1 (release) and 1.13.5 (devel) you can now do:

   myseqs <- data.frame(
     chr=c("chrY", "chr1", "chr2", "chr3", "chrY", "chr3", "chr1", "chr1"),
     start=c(NA, -40, 8510201, 4920301, 30001, 9220500, -2804, -30),
     end=c(50, NA, 8510220, 4920330, 30011, 9220555, -2801, -11)
   )

   library(BSgenome.Mmusculus.UCSC.mm9)

   > getSeq(Mmusculus, myseqs$chr, myseqs$start, myseqs$end)
   [1] "GATCCAAAACACATTCTCCCTGGTAGCATGGACAAGCAACATTTTGGGAG"
   [2] "TTCTGTAAAGAATTTGGTATTAAACTTAAAACTGGAATTC"
   [3] "ACGACTATAAAAACCTTTAG"
   [4] "CATACAATAATTGTGGGGGAACTTCAAAAC"
   [5] "ATCTTAATCAC"
   [6] "CAGTAGTGGCGTACACCTTTAATCCCAGCACGTGGTAGGCAGAGGCAGATGGATTT"
   [7] "ATGA"
   [8] "AATTTGGTATTAAACTTAAA"

to extract multiple subsequences from multiple chromosomes at once.
(Note support for NAs and negative start or end.)

It takes about 35 sec. to extract 1 million short subsequences so it's
much faster than the sapply() solution I proposed earlier but there is
still room for improvement. Also, the extracted sequences are currently
returned as a character vector, not a DNAStringSet.
But addressing these 2 issues (speed + DNAStringSet) will require that
we first put in place an efficient implementation for combining several
DNAStringSet objects (this is an important piece of the infrastructure).
So that will take a few weeks.

Hopefully this time you won't get hit by the infamous bug you reported
earlier (BTW anything new on that front? Were you able to reproduce it?
Thanks).

Cheers,
H.

PS: The latest BSgenome software packages will propagate to the public
repos in the next 24 hours.

Ivan Gregoretti wrote:

I didn't notice that I was emailing the individuals rather than the
list!!!!!!

Briefly, THANKS TO EVERYONE.

Second, I managed to call and display all my sequences like this

sapply(1:dim(A)[1], function(i) paste(A$chr[i],' ',A$start[i],' ',A$end[i],'
tags: ',A$tags[i],' ',getSeq(Mmusculus, as.character(A$chr[i]), A$start[i],
A$end[i])))

As you see, it's made up of ideas from all three respondents. (Michael's
solution actually gave me an idea to solve some other problem not mentioned
here. So, nothing gets wasted.)

Sorry for the off-list emailing issue.

Ivan

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [email protected]
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] Extracting DNA sequences from BSgenome.Mmusculus.UCSC.mm9_1.3.11

Reply via email to