Hi Michael,

On 07/08/2014 12:11 PM, Michael Love wrote:
> The recent TranscriptDb thread reminded me of a question: are there
> plans (or am I missing the function) to easily get a TranscriptDb out
> of the AnnotationHub objects? It would be great to have a preprocessed
> Ensembl txdb like we have for UCSC.

I think the first thing we should do is add a
makeTranscriptDbFromGRanges() function. It should not be too hard
because we already have the code :) Marc wrote it, but it's currently
part of the makeTranscriptDbFromGFF() function. Roughly speaking, this
function does 2 things: (1) import the GFF or GTF file as a GRanges
object, then (2) turn that GRanges object into a TranscriptDb object.
So we should move the code that does (2) into a separate function,
the makeTranscriptDbFromGRanges() function, and have
makeTranscriptDbFromGFF() call it internally.
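
To make the proposed split concrete, here is a minimal base-R sketch of
the pattern. It is only an illustration, not Marc's code: plain data
frames stand in for the GRanges and TranscriptDb objects, and the names
importGTF(), makeTxTableFromRanges() and makeTxTableFromGTF() are
hypothetical.

```r
## Toy stand-ins only: data frames replace GRanges and TranscriptDb
## objects, and all three function names are hypothetical.

## Step (1): read the 9 tab-separated GTF columns into a data frame
## and pull transcript_id out of the attributes column.
importGTF <- function(file) {
    cols <- c("seqnames", "source", "type", "start", "end",
              "score", "strand", "phase", "attributes")
    df <- read.delim(file, header = FALSE, comment.char = "#", quote = "",
                     col.names = cols, stringsAsFactors = FALSE)
    df$transcript_id <- sub('.*transcript_id "([^"]+)".*', "\\1",
                            df$attributes)
    df
}

## Step (2): turn the imported ranges into one row per transcript,
## spanning from its first exon start to its last exon end.
makeTxTableFromRanges <- function(df) {
    exons <- df[df$type == "exon", ]
    tx_start <- tapply(exons$start, exons$transcript_id, min)
    tx_end <- tapply(exons$end, exons$transcript_id, max)
    data.frame(tx_name = names(tx_start),
               tx_start = as.vector(tx_start),
               tx_end = as.vector(tx_end),
               stringsAsFactors = FALSE)
}

## The file-based entry point only chains (1) and (2).
makeTxTableFromGTF <- function(file) {
    makeTxTableFromRanges(importGTF(file))
}
```

The point of the split is that the step (2) function can be called
directly on ranges that were imported elsewhere, without going through
a file again.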

Then you could call makeTranscriptDbFromGRanges() on any of these
GFF- or GTF-based GRanges objects you get from AnnotationHub.

We'll work on this soon and announce here when it becomes available.

Cheers,
H.


ah <- AnnotationHub()
gr <- ah$ensembl.release.73.gtf.homo_sapiens.Homo_sapiens.GRCh37.73.gtf_0.0.1.RData
gr
GRanges with 2268089 ranges and 12 metadata columns:
             seqnames         ranges strand   |                 source
                <Rle>      <IRanges>  <Rle>   |               <factor>
         [1]        1 [11869, 12227]      +   |   processed_transcript
         [2]        1 [12613, 12721]      +   |   processed_transcript
         [3]        1 [13221, 14409]      +   |   processed_transcript
         [4]        1 [11872, 12227]      +   | unprocessed_pseudogene
         [5]        1 [12613, 12721]      +   | unprocessed_pseudogene
         ...      ...            ...    ... ...                    ...
   [2268085]       MT [14747, 15887]      +   |         protein_coding
   [2268086]       MT [14747, 15887]      +   |         protein_coding
   [2268087]       MT [14747, 14749]      +   |         protein_coding
   [2268088]       MT [15888, 15953]      +   |                Mt_tRNA
   [2268089]       MT [15956, 16023]      -   |                Mt_tRNA
                    type     score     phase         gene_id   transcript_id
                <factor> <numeric> <integer>     <character>     <character>
         [1]        exon      <NA>      <NA> ENSG00000223972 ENST00000456328
         [2]        exon      <NA>      <NA> ENSG00000223972 ENST00000456328
         [3]        exon      <NA>      <NA> ENSG00000223972 ENST00000456328
         [4]        exon      <NA>      <NA> ENSG00000223972 ENST00000515242
         [5]        exon      <NA>      <NA> ENSG00000223972 ENST00000515242
         ...         ...       ...       ...             ...             ...
   [2268085]        exon      <NA>      <NA> ENSG00000198727 ENST00000361789
   [2268086]         CDS      <NA>         0 ENSG00000198727 ENST00000361789
   [2268087] start_codon      <NA>         0 ENSG00000198727 ENST00000361789
   [2268088]        exon      <NA>      <NA> ENSG00000210195 ENST00000387460
   [2268089]        exon      <NA>      <NA> ENSG00000210196 ENST00000387461
             exon_number   gene_name   gene_biotype transcript_name
               <numeric> <character>    <character>     <character>
         [1]           1     DDX11L1     pseudogene     DDX11L1-002
         [2]           2     DDX11L1     pseudogene     DDX11L1-002
         [3]           3     DDX11L1     pseudogene     DDX11L1-002
         [4]           1     DDX11L1     pseudogene     DDX11L1-201
         [5]           2     DDX11L1     pseudogene     DDX11L1-201
         ...         ...         ...            ...             ...
   [2268085]           1      MT-CYB protein_coding      MT-CYB-201
   [2268086]           1      MT-CYB protein_coding      MT-CYB-201
   [2268087]           1      MT-CYB protein_coding      MT-CYB-201
   [2268088]           1       MT-TT        Mt_tRNA       MT-TT-201
   [2268089]           1       MT-TP        Mt_tRNA       MT-TP-201
                     exon_id      protein_id
                 <character>     <character>
         [1] ENSE00002234944            <NA>
         [2] ENSE00003582793            <NA>
         [3] ENSE00002312635            <NA>
         [4] ENSE00002234632            <NA>
         [5] ENSE00003608237            <NA>
         ...             ...             ...
   [2268085] ENSE00001436074            <NA>
   [2268086]            <NA> ENSP00000354554
   [2268087]            <NA>            <NA>
   [2268088] ENSE00001544475            <NA>
   [2268089] ENSE00001544473            <NA>
   ---
   seqlengths:
                      1                   2 ...                  MT
                     NA                  NA ...                  NA

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
