Hi Michael,
On 07/08/2014 12:11 PM, Michael Love wrote:
> The recent TranscriptDb thread reminded me of a question: are there
> plans (or am I missing the function) to easily get a TranscriptDb out
> of the AnnotationHub objects? It would be great to have a preprocessed
> Ensembl txdb like we have for UCSC.
I think the 1st thing we should do is have a
makeTranscriptDbFromGRanges() function. It should not be too hard
because we already have the code :) Marc wrote it. But it's currently
part of the makeTranscriptDbFromGFF() function. Roughly speaking this
function does 2 things: (1) import the GFF or GTF file as a GRanges
object, then (2) turn that GRanges object into a TranscriptDb object.
So we should move the code that does (2) into a separate function,
the makeTranscriptDbFromGRanges() function, and have
makeTranscriptDbFromGFF() call it internally.
Then you could call makeTranscriptDbFromGRanges() on any of these
GFF- or GTF-based GRanges objects you get from AnnotationHub.
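To make the proposed split concrete, here is a minimal sketch in base R. It is only an illustration of the structure, not the GenomicFeatures code: plain data frames stand in for GRanges, a list of tables stands in for TranscriptDb, and the names import_features(), make_db_from_features() and make_db_from_file() are hypothetical. The five example rows are the first five exons from the GRanges shown further down.

```r
# Illustrative stand-ins only: a data frame replaces GRanges and a list of
# tables replaces TranscriptDb, so the sketch runs in base R without
# Bioconductor. All function names below are hypothetical.

# Step (1): read a tab-delimited feature file into a table.
# (The real function imports a GFF or GTF file as a GRanges object.)
import_features <- function(path) {
  read.delim(path, header = TRUE, stringsAsFactors = FALSE)
}

# Step (2): turn the feature table into a transcript-centric structure.
# This is the part that would move into its own function.
make_db_from_features <- function(features) {
  exons <- features[features$type == "exon", , drop = FALSE]
  # One row per transcript, spanning all of its exons.
  starts <- tapply(exons$start, exons$transcript_id, min)
  ends <- tapply(exons$end, exons$transcript_id, max)
  transcripts <- data.frame(
    transcript_id = names(starts),
    start = as.integer(starts),
    end = as.integer(ends),
    stringsAsFactors = FALSE
  )
  list(transcripts = transcripts, exons = exons)
}

# The file-based entry point becomes a thin wrapper around (1) and (2).
make_db_from_file <- function(path) {
  make_db_from_features(import_features(path))
}

# Features that are already in memory can skip the import step entirely.
features <- data.frame(
  seqnames = "1",
  start = c(11869L, 12613L, 13221L, 11872L, 12613L),
  end = c(12227L, 12721L, 14409L, 12227L, 12721L),
  strand = "+",
  type = "exon",
  transcript_id = c(rep("ENST00000456328", 3), rep("ENST00000515242", 2)),
  stringsAsFactors = FALSE
)
db <- make_db_from_features(features)
```

The point of the design is that step (2) no longer cares where the ranges came from, so the same conversion serves both a file on disk and an object that is already loaded.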
We'll work on this soon and announce here when it becomes available.
Cheers,
H.
ah <- AnnotationHub()
gr <- ah$ensembl.release.73.gtf.homo_sapiens.Homo_sapiens.GRCh37.73.gtf_0.0.1.RData
gr
GRanges with 2268089 ranges and 12 metadata columns:
seqnames ranges strand | source
<Rle> <IRanges> <Rle> | <factor>
[1] 1 [11869, 12227] + | processed_transcript
[2] 1 [12613, 12721] + | processed_transcript
[3] 1 [13221, 14409] + | processed_transcript
[4] 1 [11872, 12227] + | unprocessed_pseudogene
[5] 1 [12613, 12721] + | unprocessed_pseudogene
... ... ... ... ... ...
[2268085] MT [14747, 15887] + | protein_coding
[2268086] MT [14747, 15887] + | protein_coding
[2268087] MT [14747, 14749] + | protein_coding
[2268088] MT [15888, 15953] + | Mt_tRNA
[2268089] MT [15956, 16023] - | Mt_tRNA
type score phase gene_id transcript_id
<factor> <numeric> <integer> <character> <character>
[1] exon <NA> <NA> ENSG00000223972 ENST00000456328
[2] exon <NA> <NA> ENSG00000223972 ENST00000456328
[3] exon <NA> <NA> ENSG00000223972 ENST00000456328
[4] exon <NA> <NA> ENSG00000223972 ENST00000515242
[5] exon <NA> <NA> ENSG00000223972 ENST00000515242
... ... ... ... ... ...
[2268085] exon <NA> <NA> ENSG00000198727 ENST00000361789
[2268086] CDS <NA> 0 ENSG00000198727 ENST00000361789
[2268087] start_codon <NA> 0 ENSG00000198727 ENST00000361789
[2268088] exon <NA> <NA> ENSG00000210195 ENST00000387460
[2268089] exon <NA> <NA> ENSG00000210196 ENST00000387461
exon_number gene_name gene_biotype transcript_name
<numeric> <character> <character> <character>
[1] 1 DDX11L1 pseudogene DDX11L1-002
[2] 2 DDX11L1 pseudogene DDX11L1-002
[3] 3 DDX11L1 pseudogene DDX11L1-002
[4] 1 DDX11L1 pseudogene DDX11L1-201
[5] 2 DDX11L1 pseudogene DDX11L1-201
... ... ... ... ...
[2268085] 1 MT-CYB protein_coding MT-CYB-201
[2268086] 1 MT-CYB protein_coding MT-CYB-201
[2268087] 1 MT-CYB protein_coding MT-CYB-201
[2268088] 1 MT-TT Mt_tRNA MT-TT-201
[2268089] 1 MT-TP Mt_tRNA MT-TP-201
exon_id protein_id
<character> <character>
[1] ENSE00002234944 <NA>
[2] ENSE00003582793 <NA>
[3] ENSE00002312635 <NA>
[4] ENSE00002234632 <NA>
[5] ENSE00003608237 <NA>
... ... ...
[2268085] ENSE00001436074 <NA>
[2268086] <NA> ENSP00000354554
[2268087] <NA> <NA>
[2268088] ENSE00001544475 <NA>
[2268089] ENSE00001544473 <NA>
---
seqlengths:
1 2 ... MT
NA NA ... NA
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpa...@fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319