[Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

Robert Castelo Mon, 11 Jan 2016 06:30:05 -0800

hi,

if i'm interpreting this correctly, the news archive of the UCSC Genome Browser, accessible here:


 http://genome.ucsc.edu/goldenPath/newsarch.html

announced on June 29th, 2015, that they are discontinuing the generation of UCSC Known Genes annotations for human and providing the Gencode annotations as the default replacement.

the BioC site provides the UCSC Known Genes track as the default gene annotations for human and currently does not provide the Gencode annotations.

the GenomicFeatures package allows one to build such an annotation package. unfortunately, the current "supported" UCSC tables that can be easily used via 'makeTxDbPackageFromUCSC()' only go up to Gencode version V17:


library(GenomicFeatures)

xx <- supportedUCSCtables()
xx[grep("GENCODE Genes", xx$track), ]
                                             track subtrack
wgEncodeGencodeBasicV17          GENCODE Genes V17     <NA>
wgEncodeGencodeCompV17           GENCODE Genes V17     <NA>
wgEncodeGencodePseudoGeneV17     GENCODE Genes V17     <NA>
wgEncodeGencode2wayConsPseudoV17 GENCODE Genes V17     <NA>
wgEncodeGencodePolyaV17          GENCODE Genes V17     <NA>
wgEncodeGencodeBasicV14          GENCODE Genes V14     <NA>
wgEncodeGencodeCompV14           GENCODE Genes V14     <NA>
wgEncodeGencodePseudoGeneV14     GENCODE Genes V14     <NA>
wgEncodeGencode2wayConsPseudoV14 GENCODE Genes V14     <NA>
wgEncodeGencodePolyaV14          GENCODE Genes V14     <NA>
wgEncodeGencodeBasicV7            GENCODE Genes V7     <NA>
wgEncodeGencodeCompV7             GENCODE Genes V7     <NA>
wgEncodeGencodePseudoGeneV7       GENCODE Genes V7     <NA>
wgEncodeGencode2wayConsPseudoV7   GENCODE Genes V7     <NA>
wgEncodeGencodePolyaV7            GENCODE Genes V7     <NA>

which is about 2 years old. current Gencode gene annotations are V24, and at least V22 was available at:


http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database

before the last BioC release.
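
as a rough sketch of how one could check the newest supported Gencode version programmatically, the snippet below parses the version number out of a few of the table names listed above. it uses only base R and a hard-coded vector copied from that output; in a live session one would presumably take the names from the result of supportedUCSCtables() instead (from its row names, as printed above, which is an assumption about the returned object).

```r
# table names copied from the supportedUCSCtables() output above
tables <- c("wgEncodeGencodeBasicV17", "wgEncodeGencodeCompV17",
            "wgEncodeGencodeBasicV14", "wgEncodeGencodeCompV14",
            "wgEncodeGencodeBasicV7", "wgEncodeGencodeCompV7")

# pull the trailing version number out of each table name
versions <- as.integer(sub("^.*V([0-9]+)$", "\\1", tables))

# newest Gencode version among these tables, and the tables carrying it
latest <- max(versions)
latest_tables <- tables[versions == latest]
```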

according to a recent announcement at the BioC support site:

https://support.bioconductor.org/p/71574

AnnotationHub now seems to be the proper way to import the most recent Gencode annotations into BioC. however, at least in my hands, making the corresponding TxDb object produces an error; see the following example:


library(AnnotationHub)
library(GenomicFeatures)  # provides makeTxDbFromGRanges()

ah <- AnnotationHub()
# list the human Gencode GFF resources available in the hub
human_gff <- query(ah, c("Gencode", "gff", "human"))

# retrieve the Gencode V23 basic annotations
gencodeV23basicGFF <- ah[["AH49556"]]

# metadata describing the provenance of the annotations
rec <- ah["AH49556"]
metadata <- data.frame(name=c("Data source", "Genome", "Organism",
                              "Resource URL", "Full dataset"),
                       value=c(rec$dataprovider, rec$genome, rec$species,
                               rec$sourceurl, "no"))

txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
Error in .merge_transcript_parts(transcripts) :
  The following transcripts have multiple parts that cannot be merged
  because of incompatible seqnames:
  ENST00000244174.9, ENST00000262640.10, ENST00000286448.10, ENST00000302805.6,
  ENST00000313871.7, ENST00000326153.8, ENST00000331035.8, ENST00000334060.7,
  ENST00000334651.9, ENST00000355432.7, ENST00000355805.6, ENST00000359512.7,
  ENST00000369423.6, ENST00000381180.7, ENST00000381184.5, ENST00000381187.7,
  ENST00000381192.7, ENST00000381218.7, ENST00000381222.6, ENST00000381223.8,
  ENST00000381229.8, ENST00000381233.7, ENST00000381241.7, ENST00000381261.7,
  ENST00000381297.8, ENST00000381317.7, ENST00000381333.8, ENST00000381401.9,
  ENST00000381469.6, ENST00000381500.5, ENST00000381509.7, ENST00000381524.7,
  ENST00000381529.7, ENST00000381566.5, ENST00000381567.7, ENST00000381575.5,
  ENST00000381578.5, ENST00000381657.6, ENST00000381663.7, ENST00000390665.7,
  ENST00000391707.6, ENST00000399012.5, ENST00000399966.8, ENST00000400841.6,
  ENST00000411342.5, ENST00000412936
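
the error says that parts of the same transcript ID sit on different seqnames. one possible explanation (an assumption on my side, not something the message states) is that the same transcript IDs are annotated on both chrX and chrY. the base R sketch below shows how such IDs could be found and dropped; the data frame and its IDs are made up for illustration, and on the real object the two columns would presumably come from as.character(seqnames(gr)) and mcols(gr)$transcript_id, assuming the import carries a 'transcript_id' column.

```r
# toy stand-in for the transcript parts of a GFF import; the IDs and
# seqnames are made up for illustration
features <- data.frame(
    transcript_id = c("tx1", "tx1", "tx2", "tx2", "tx3"),
    seqnames      = c("chrX", "chrY", "chr1", "chr1", "chrX"),
    stringsAsFactors = FALSE
)

# number of distinct seqnames on which each transcript ID occurs
n_seq <- tapply(features$seqnames, features$transcript_id,
                function(x) length(unique(x)))

# transcript IDs whose parts sit on more than one sequence
multi_seq_tx <- names(n_seq)[n_seq > 1]

# drop those transcripts before attempting to build the TxDb
clean <- features[!features$transcript_id %in% multi_seq_tx, ]
```

dropping the offending transcripts is only a workaround; whether they should instead be kept on one of the two sequences is a separate decision.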

on top of this, even if it worked, these annotations are anchored at Ensembl Gene identifiers, while the gene-centric annotations in org.Hs.eg.db are anchored at Entrez Gene identifiers. this means that more code would be needed to add the corresponding Entrez IDs (resolving multiplicities, etc.) and produce a TxDb package that can be used across many of the typical BioC pipelines.
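
to illustrate what "resolving multiplicities" could involve, here is a minimal base R sketch that keeps only the unambiguous one-to-one pairs from an Ensembl-to-Entrez mapping. the mapping table and all identifiers in it are hypothetical; a real table could presumably be obtained with select() on org.Hs.eg.db using keytype "ENSEMBL" and column "ENTREZID", after stripping the version suffix from the Gencode gene IDs.

```r
# hypothetical Ensembl-to-Entrez mapping; all identifiers are made up
map <- data.frame(
    ensembl = c("ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C", "ENSG_D"),
    entrez  = c("101",    "102",    "103",    "104",    "104"),
    stringsAsFactors = FALSE
)

# flag Ensembl IDs mapping to several Entrez IDs, and vice versa
dup_ensembl <- map$ensembl %in% map$ensembl[duplicated(map$ensembl)]
dup_entrez  <- map$entrez  %in% map$entrez[duplicated(map$entrez)]

# keep only the unambiguous one-to-one pairs
one2one <- map[!dup_ensembl & !dup_entrez, ]
```

discarding every ambiguous pair is the most conservative choice; other policies (e.g. picking one Entrez ID per Ensembl gene) would need a rule for choosing among the candidates.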

since human gene annotations are at the core of many BioC pipelines, i'd like to suggest that, for the forthcoming release cycles, the BioC core team package Gencode annotations anchored at Entrez IDs, at least what is called the "basic set", similarly to what is done with TxDb.Hsapiens.UCSC.knownGene, to have an easy starting point for the analysis of human data.



cheers,

robert.

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
