Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

Robert Castelo Mon, 11 Jan 2016 14:37:16 -0800

that looks great, thanks Hervé for addressing this quickly.


robert.

On 1/11/16 11:18 PM, Hervé Pagès wrote:

With GenomicFeatures 1.23.16:

> txdb <- makeTxDbFromUCSC("hg38", "knownGene")
Download the knownGene table ... OK
Download the knownToLocusLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
  UCSC data anomaly in 19942 transcript(s): the cds cumulative length is
  not a multiple of 3 for transcripts ‘uc057axw.1’ ‘uc031pko.2’
  ‘uc057ayh.1’ ‘uc057ayr.1’ ‘uc057azl.1’ ‘uc057azn.1’ ‘uc057azp.1’
  ‘uc057azt.1’ ‘uc057azy.1’ ‘uc057bad.1’ ‘uc057bap.1’ ‘uc057bav.1’
  ‘uc057bay.1’ ‘uc057bbh.1’ ‘uc057bbv.1’ ‘uc057bcm.1’ ‘uc057bcp.1’
  ‘uc057bcw.1’ ‘uc057bcx.1’ ‘uc057bdf.1’ ‘uc057bdj.1’ ‘uc057bdk.1’
  ‘uc057bdl.1’ ‘uc057bdp.1’ ‘uc057beb.1’ ‘uc057beq.1’ ‘uc057bfn.1’
  ‘uc057bfs.1’ ‘uc057bgm.1’ ‘uc057bgo.1’ ‘uc057bgt.1’ ‘uc057bhd.1’
  ‘uc057bhi.1’ ‘uc057bio.1’ ‘uc057biu.1’ ‘uc057bji.1’ ‘uc057bjj.1’
  ‘uc057bjk.1’ ‘uc057bkd.1’ ‘uc057bkf.1’ ‘uc057bkj.1’ ‘uc057bkl.1’
  ‘uc057bkt.1’ ‘uc057bld.1’ ‘uc057bli.1’ ‘uc057blj.1’ ‘uc057blz.1’
  ‘uc057bmf.1’ ‘uc057bmr.1’ ‘uc057bmv.1’ ‘uc057bni.1’ ‘u [... truncated]

> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: hg38
# Organism: Homo sapiens
# Taxonomy ID: 9606
# UCSC Table: knownGene
# UCSC Track: GENCODE v22
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: Entrez Gene ID
# Full dataset: yes
# miRBase build ID: NA
# transcript_nrow: 195178
# exon_nrow: 575044
# cds_nrow: 291225
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2016-01-11 14:10:24 -0800 (Mon, 11 Jan 2016)
# GenomicFeatures version at creation time: 1.23.16
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1

Note the new "UCSC Track" field above.
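
The anomaly reported in the warning above (cds cumulative length not a
multiple of 3) can be illustrated with a small base R sketch. The table
below is made-up toy data, not the UCSC table, and the column names are
assumptions chosen for the example; it only shows the kind of check
involved, not the code inside GenomicFeatures.

```r
# Toy CDS table (made-up values, NOT UCSC data): one row per CDS part.
# Column names are hypothetical, chosen for this illustration only.
cds <- data.frame(
  tx_name   = c("txA", "txA", "txB", "txB"),
  cds_start = c(101L, 301L, 501L, 701L),
  cds_end   = c(160L, 330L, 560L, 731L),
  stringsAsFactors = FALSE
)

# Width of each CDS part, assuming 1-based closed intervals.
cds$width <- cds$cds_end - cds$cds_start + 1L

# Cumulative CDS length per transcript.
cum_len <- tapply(cds$width, cds$tx_name, sum)

# Transcripts whose cumulative CDS length is not a multiple of 3.
anomalous <- names(cum_len)[as.vector(cum_len) %% 3L != 0L]
print(anomalous)
```

Here txA has a cumulative length of 90 and passes, while txB has 91 and
is flagged.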

Cheers,
H.


On 01/11/2016 01:12 PM, Hervé Pagès wrote:

Hi Robert and others,

I looked at this and the new situation doesn't seem as disruptive as
it sounds. The bulk of the data for both tracks (i.e. the "UCSC Genes"
track for hg19 and the "GENCODE v22" track for hg38) is stored in the
knownGene table.

The hg19.knownGene table is described here:

https://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene&hgta_doSchema=describe+table+schema



The hg38.knownGene table is described here:

https://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene&hgta_doSchema=describe+table+schema



The 2 pages are very similar. In particular both tables are connected
to the knownToLocusLink table where Entrez Gene IDs are stored.

So from a makeTxDbFromUCSC() point of view everything looks the same
except the name of the track: "UCSC Genes" for hg19 and "GENCODE v22"
for hg38. That means it shouldn't be hard to tweak makeTxDbFromUCSC()
to support:

     txdb <- makeTxDbFromUCSC("hg38", "knownGene")

The returned 'txdb' will contain data from the "GENCODE v22" track
and with transcripts mapped to Entrez Gene IDs.

I'll work on this and will also investigate makeTxDbFromGRanges's
failure on AnnotationHub's GFF files from GENCODE.

H.


On 01/11/2016 06:29 AM, Robert Castelo wrote:

hi,

if i'm interpreting this correctly, the news archive of the UCSC Genome
Browser accessible here:

  http://genome.ucsc.edu/goldenPath/newsarch.html

announced on June 29th, 2015, that they are discontinuing the generation
of UCSC Known Genes annotations for human, and now provide the Gencode
annotations as the default replacement.

the BioC site provides as default gene annotations for human the UCSC
Known Genes track and currently does not provide the Gencode annotations.


the GenomicFeatures package allows one to build such an annotation
package. unfortunately the current "supported" UCSC tables that can be
easily used via 'makeTxDbPackageFromUCSC()' only go up to Gencode
version V17:

library(GenomicFeatures)

xx <- supportedUCSCtables()
xx[grep("GENCODE Genes", xx$track), ]
                                              track subtrack
wgEncodeGencodeBasicV17          GENCODE Genes V17 <NA>
wgEncodeGencodeCompV17           GENCODE Genes V17 <NA>
wgEncodeGencodePseudoGeneV17     GENCODE Genes V17 <NA>
wgEncodeGencode2wayConsPseudoV17 GENCODE Genes V17 <NA>
wgEncodeGencodePolyaV17          GENCODE Genes V17 <NA>
wgEncodeGencodeBasicV14          GENCODE Genes V14 <NA>
wgEncodeGencodeCompV14           GENCODE Genes V14 <NA>
wgEncodeGencodePseudoGeneV14     GENCODE Genes V14 <NA>
wgEncodeGencode2wayConsPseudoV14 GENCODE Genes V14 <NA>
wgEncodeGencodePolyaV14          GENCODE Genes V14 <NA>
wgEncodeGencodeBasicV7            GENCODE Genes V7 <NA>
wgEncodeGencodeCompV7             GENCODE Genes V7 <NA>
wgEncodeGencodePseudoGeneV7       GENCODE Genes V7 <NA>
wgEncodeGencode2wayConsPseudoV7   GENCODE Genes V7 <NA>
wgEncodeGencodePolyaV7            GENCODE Genes V7 <NA>

which is about 2 years old. current Gencode gene annotations are V24 and
at least V22 was available at:

http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database

before the last BioC release.
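
as a rough illustration of how the newest supported version can be read
off such a listing, here is a base R sketch. it works on table names
copied by hand from the output above (a subset), so it does not call
supportedUCSCtables() itself, and the variable names are just for this
example.

```r
# Table names copied by hand from the listing above (a subset).
tables <- c("wgEncodeGencodeBasicV17", "wgEncodeGencodeCompV17",
            "wgEncodeGencodeBasicV14", "wgEncodeGencodeCompV14",
            "wgEncodeGencodeBasicV7",  "wgEncodeGencodeCompV7")

# Pull the trailing version number out of each table name.
version <- as.integer(sub("^.*V([0-9]+)$", "\\1", tables))

# Most recent version and the tables that carry it.
latest <- max(version)
latest_tables <- tables[version == latest]
print(latest)
print(latest_tables)
```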

according to a recent announcement at the BioC support site:

https://support.bioconductor.org/p/71574

AnnotationHub seems to be now the proper way to import the most recent
Gencode annotations into BioC. however, at least in my hands, making the
corresponding TxDb object produces an error; see the following example:

library(AnnotationHub)

ah <- AnnotationHub()
human_gff <- query(ah, c("Gencode", "gff", "human"))

gencodeV23basicGFF <- ah[["AH49556"]]
metadata <- data.frame(name=c("Data source", "Genome", "Organism",
                              "Resource URL", "Full dataset"),
                       value=c(ah["AH49556"]$dataprovider,
                               ah["AH49556"]$genome,
                               ah["AH49556"]$species,
                               ah["AH49556"]$sourceurl, "no"))
txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
Error in .merge_transcript_parts(transcripts) :
  The following transcripts have multiple parts that cannot be merged
  because of incompatible seqnames:
  ENST00000244174.9, ENST00000262640.10, ENST00000286448.10,
  ENST00000302805.6, ENST00000313871.7, ENST00000326153.8,
  ENST00000331035.8, ENST00000334060.7, ENST00000334651.9,
  ENST00000355432.7, ENST00000355805.6, ENST00000359512.7,
  ENST00000369423.6, ENST00000381180.7, ENST00000381184.5,
  ENST00000381187.7, ENST00000381192.7, ENST00000381218.7,
  ENST00000381222.6, ENST00000381223.8, ENST00000381229.8,
  ENST00000381233.7, ENST00000381241.7, ENST00000381261.7,
  ENST00000381297.8, ENST00000381317.7, ENST00000381333.8,
  ENST00000381401.9, ENST00000381469.6, ENST00000381500.5,
  ENST00000381509.7, ENST00000381524.7, ENST00000381529.7,
  ENST00000381566.5, ENST00000381567.7, ENST00000381575.5,
  ENST00000381578.5, ENST00000381657.6, ENST00000381663.7,
  ENST00000390665.7, ENST00000391707.6, ENST00000399012.5,
  ENST00000399966.8, ENST00000400841.6, ENST00000411342.5,
  ENST00000412936
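
to make the error condition concrete, here is a minimal base R sketch of
what "incompatible seqnames" means: the same transcript ID showing up on
more than one sequence. the table is made-up toy data with hypothetical
IDs and column names, not the AnnotationHub GFF, and dropping the
offending IDs is only one possible workaround, not necessarily what
GenomicFeatures should do.

```r
# Toy exon table (made-up IDs and coordinates, NOT GENCODE data).
exons <- data.frame(
  transcript_id = c("tx1", "tx1", "tx2", "tx2", "tx3"),
  seqnames      = c("chrX", "chrY", "chr1", "chr1", "chr2"),
  start         = c(100L, 100L, 500L, 900L, 50L),
  stringsAsFactors = FALSE
)

# Number of distinct seqnames per transcript ID.
n_seq <- tapply(exons$seqnames, exons$transcript_id,
                function(x) length(unique(x)))

# IDs whose parts sit on more than one sequence.
multi_seq <- names(n_seq)[as.vector(n_seq) > 1L]
print(multi_seq)

# One possible workaround (an assumption): drop those IDs up front.
clean <- exons[!exons$transcript_id %in% multi_seq, ]
print(clean)
```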

on top of this, even if it worked, these annotations are anchored at
Ensembl Gene identifiers while the gene-centric annotations at
org.Hs.eg.db are anchored at Entrez Gene identifiers. this means that
more code would have to be involved to add the corresponding Entrez IDs
(resolving multiplicities, etc.) and produce a TxDb package that can be
used across many of the typical BioC pipelines.
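
as a sketch of what "resolving multiplicities" could involve, the base R
example below keeps only the unambiguous one-to-one pairs from a toy
Ensembl-to-Entrez mapping. the IDs are hypothetical and the one-to-one
filter is just one possible policy, assumed here for illustration; a
real mapping would have to come from an annotation resource.

```r
# Toy mapping (hypothetical IDs): Ensembl gene ID to Entrez Gene ID.
map <- data.frame(
  ensembl = c("ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C", "ENSG_D"),
  entrez  = c("101",    "202",    "203",    "304",    "304"),
  stringsAsFactors = FALSE
)

# For each row, count how often its ID appears on each side of the mapping.
n_entrez_per_ensembl <- as.vector(table(map$ensembl)[map$ensembl])
n_ensembl_per_entrez <- as.vector(table(map$entrez)[map$entrez])

# Keep only unambiguous one-to-one pairs.
one_to_one <- map[n_entrez_per_ensembl == 1L & n_ensembl_per_entrez == 1L, ]
print(one_to_one)
```

ENSG_B is dropped because it maps to two Entrez IDs, and ENSG_C and
ENSG_D are dropped because they share one.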

since human gene annotations are at the core of many BioC pipelines, i'd
like to suggest for the forthcoming release cycles, that the BioC core
team packages Gencode annotations anchored at Entrez IDs, at least what
is called the "basic set", similarly to what is done with
TxDb.Hsapiens.UCSC.knownGene to have an easy starting point for the
analysis of human data.


cheers,

robert.

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

