Hi Kristoffer,

On 08/26/2014 07:40 AM, Kristoffer Vitting-Seerup wrote:
The problem is best illustrated with an example (with comments in bold):

# Load libraries:
library('GenomicFeatures')
library("BSgenome.Hsapiens.UCSC.hg19�)

# Create two test transcripts, one on each strand
testTranscriptPlus <- GRanges(seqnames='chr1', 
ranges=IRanges(start=1e5+c(1,11), end=1e5+c(3,13)), strand='+')
testTranscriptMinus <- GRanges(seqnames='chr1', 
ranges=IRanges(start=1e5+c(1,11), end=1e5+c(3,13)), strand='-�)

# Use getSeq() to extract sequence

getSeq(Hsapiens, testTranscriptPlus)
   A DNAStringSet instance of length 2
     width seq
[1]     3 ACT
[2]     3 CAG

testTranscriptMinus <- GRanges(seqnames='chr1', 
ranges=IRanges(start=1e5+c(1,11), end=1e5+c(3,13)), strand='-')
getSeq(Hsapiens, testTranscriptMinus)
   A DNAStringSet instance of length 2
     width seq
[1]     3 AGT
[2]     3 CTG

# By comparing the plus and minus transcripts it can be seen that getSeq 
returns the sequences in the 5� to 3� orientation no matter the strand (very 
nice feature by the way)
# Furthermore we would expect the transcript sequence of the 
testTranscriptMinus to be: CTGAGT (since we are on the minus strand).

Nope. You should expect the transcript sequence to be AGTCTG. Because
the convention we use is that exons are ranked according to their
position (i.e. index) in the GRanges object, and this *independently*
of the strand. So the 1st element in the GRanges object is the 1st
exon, the 2nd element is the 2nd exon, etc...


# Now lets extract the sequences of the minus strand transcript using the 
extractTranscriptSeqs() function
extractTranscriptSeqs( Hsapiens, split( testTranscriptMinus, 
f=c('test','test')))
   A DNAStringSet instance of length 1
     width seq                                                                  
                                                           names
[1]     6 AGTCTG                                                                
                                                          test

# From which it can be observed that extractTranscriptSeqs() have concatenated 
the exons without taking the order into account (which works fine on plus 
strand but results in non-exisiting transcript from the minus strand).

Yes, exon rank was taken into account. See above.

Note that extractTranscriptSeqs() is more often used on a GRangesList
object that represents a set of transcripts:

  library(TxDb.Hsapiens.UCSC.hg19.knownGene)
  txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
  transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)

'transcripts' is a named GRangesList object with one list element per
transcript. The names on 'transcripts' are the UCSC transcript ids.
Each list element in 'transcripts' is a small GRanges object that lists
the exons for a particular transcript. Let's look at 3 transcripts, 2
on the + strand, and 1 on the - strand:

  > transcripts[4072:4074]
  GRangesList of length 3:
  $uc009xhd.3
  GRanges with 4 ranges and 3 metadata columns:
seqnames ranges strand | exon_id exon_name exon_rank <Rle> <IRanges> <Rle> | <integer> <character> <integer> [1] chr1 [249200442, 249200541] + | 13958 <NA> 1 [2] chr1 [249208015, 249208078] + | 13959 <NA> 2 [3] chr1 [249208638, 249208758] + | 13960 <NA> 3 [4] chr1 [249210801, 249213345] + | 13961 <NA> 4

  $uc021pmh.1
  GRanges with 1 range and 3 metadata columns:
seqnames ranges strand | exon_id exon_name exon_rank [1] chr1 [249211537, 249212562] + | 13963 <NA> 1

  $uc009vis.3
  GRanges with 4 ranges and 3 metadata columns:
        seqnames         ranges strand | exon_id exon_name exon_rank
    [1]     chr1 [16607, 16765]      - |   13970      <NA>         1
    [2]     chr1 [15796, 15942]      - |   13968      <NA>         2
    [3]     chr1 [14970, 15038]      - |   13966      <NA>         3
    [4]     chr1 [14362, 14829]      - |   13964      <NA>         4

  ---
  seqlengths:
                    chr1                  chr2 ...        chrUn_gl000249
               249250621             243199373 ...                 38502

As you can see, exons are always listed in the order corresponding to
their rank, independently of the strand. This means that, for a normal
transcript on the minus strand (like transcript uc009vis.3), exons
will usually be sorted by descending coordinates. But for some exotic
transcripts (like the very unrealistic one you crafted in
'testTranscriptMinus') this might not be the case.

Using getSeq() on transcript uc009vis.3:

  library(BSgenome.Hsapiens.UCSC.hg19)
  genome <- BSgenome.Hsapiens.UCSC.hg19
  exon_seqs <- getSeq(genome, transcripts$uc009vis.3)

Then:

  > exon_seqs
    A DNAStringSet instance of length 4
      width seq
[1] 159 ACCTACAAGATGGGGTACTAACACCACCCCCACC...AGGACGACAGCAGCAGCAGCGCGTCTCCTTCAG [2] 147 GGAGCTCCCAGGGAAGTGGTTGACCCCTCCGGTG...TGGAGAAGCAGCAGCAGAAGGAGCAGGAGCAAG [3] 69 TGAGAGCCACGAGCCAAGGTGGGCACTTGATGTCGGATCTCTTCAACAAGCTGGTCATGAGGCGCAAGG [4] 468 GCATCTCTGGGAAAGGACCTGGGGCTGGTGAGGG...TGCTTTTAATAAAGGATCTCTAGCTGTGCAGGA

Now with extractTranscriptSeqs():

  > tx_seq <- extractTranscriptSeqs(genome, transcripts["uc009vis.3"])
  > tx_seq
    A DNAStringSet instance of length 1
width seq names
  [1]   843 ACCTACAAGATGGGGTACTAACA...AAAGGATCTCTAGCTGTGCAGGA uc009vis.3

'tx_seq' contains the 4 sequences returned by getSeq() and concatenated:

  > tx_seq == unlist(exon_seqs)
  [1] TRUE

Using UCSC online query tools will give you the same sequence:

  http://genome.ucsc.edu/cgi-bin/hgGene?db=hg19&hgg_gene=uc009vis.3

(Click on Genomic Sequence, then check 5' UTR Exons, CDS Exons, and
3' UTR Exons. No introns, no promoter/upstream, no downstream
sequences.)

Hope this helps clarifying things.

H.


sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] C

attached base packages:
[1] grDevices datasets  grid      parallel  stats     graphics  utils     
methods   base

other attached packages:
  [1] GenomicFeatures_1.17.14            AnnotationDbi_1.27.9               
Biobase_2.25.0                     BSgenome.Hsapiens.UCSC.hg19_1.3.99
  [5] BSgenome_1.33.8                    Biostrings_2.33.13                 
XVector_0.5.7                      spliceR_1.7.1
  [9] plyr_1.8.1                         RColorBrewer_1.0-5                 
VennDiagram_1.6.7                  cummeRbund_2.7.2
[13] Gviz_1.9.11                        rtracklayer_1.25.13                
GenomicRanges_1.17.28              GenomeInfoDb_1.1.18
[17] IRanges_1.99.24                    S4Vectors_0.1.2                    
fastcluster_1.1.13                 reshape2_1.4
[21] ggplot2_1.0.0                      RSQLite_0.11.4                     
DBI_0.2-7                          BiocGenerics_0.11.4




_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to