Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

Hervé Pagès Mon, 11 Jan 2016 22:55:22 -0800

Hi Vince, Robert,

On 01/11/2016 07:07 AM, Vincent Carey wrote:

On Mon, Jan 11, 2016 at 9:29 AM, Robert Castelo <robert.cast...@upf.edu>
wrote:

hi,

if i'm interpreting this correctly, the news archive of the UCSC Genome
Browser accessible here:

  http://genome.ucsc.edu/goldenPath/newsarch.html

announced on June 29th, 2015, that they are discontinuing the generation
of UCSC Known Genes annotations for human and providing the Gencode
annotations as the default replacement.

the BioC site provides as default gene annotations for human the UCSC
Known Genes track and currently does not provide the Gencode annotations.

the GenomicFeatures package allows one to build such an annotation
package. unfortunately the "supported" UCSC tables that can be
easily used via 'makeTxDbPackageFromUCSC()' only go up to Gencode version
V17:

library(GenomicFeatures)

xx <- supportedUCSCtables()
xx[grep("GENCODE Genes", xx$track), ]
                                              track subtrack
wgEncodeGencodeBasicV17          GENCODE Genes V17     <NA>
wgEncodeGencodeCompV17           GENCODE Genes V17     <NA>
wgEncodeGencodePseudoGeneV17     GENCODE Genes V17     <NA>
wgEncodeGencode2wayConsPseudoV17 GENCODE Genes V17     <NA>
wgEncodeGencodePolyaV17          GENCODE Genes V17     <NA>
wgEncodeGencodeBasicV14          GENCODE Genes V14     <NA>
wgEncodeGencodeCompV14           GENCODE Genes V14     <NA>
wgEncodeGencodePseudoGeneV14     GENCODE Genes V14     <NA>
wgEncodeGencode2wayConsPseudoV14 GENCODE Genes V14     <NA>
wgEncodeGencodePolyaV14          GENCODE Genes V14     <NA>
wgEncodeGencodeBasicV7            GENCODE Genes V7     <NA>
wgEncodeGencodeCompV7             GENCODE Genes V7     <NA>
wgEncodeGencodePseudoGeneV7       GENCODE Genes V7     <NA>
wgEncodeGencode2wayConsPseudoV7   GENCODE Genes V7     <NA>
wgEncodeGencodePolyaV7            GENCODE Genes V7     <NA>

which is about 2 years old. current Gencode gene annotations are V24 and
at least V22 was available at:

http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database

before the last BioC release.

according to a recent announcement at the BioC support site:

https://support.bioconductor.org/p/71574

AnnotationHub now seems to be the proper way to import the most recent
Gencode annotations into BioC. however, at least in my hands, making the
corresponding TxDb object produces an error; see the following example:

library(AnnotationHub)

ah <- AnnotationHub()
human_gff <- query(ah, c("Gencode", "gff", "human"))

gencodeV23basicGFF <- ah[["AH49556"]]
metadata <- data.frame(name=c("Data source", "Genome", "Organism",
                              "Resource URL", "Full dataset"),
                       value=c(ah["AH49556"]$dataprovider,
                               ah["AH49556"]$genome,
                               ah["AH49556"]$species,
                               ah["AH49556"]$sourceurl,
                               "no"))
txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
Error in .merge_transcript_parts(transcripts) :
   The following transcripts have multiple parts that cannot be merged
because of incompatible seqnames: ENST00000244174.9,


should this be an error, or would a softer landing be more useful here?
warn and exclude the offending elements, perhaps with an option
to retrieve them through some special step (option or new function)?
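
A minimal sketch of the "warn and exclude" behaviour suggested above, in
base R on a toy table of transcript parts. The helper name
drop_multi_seqname_tx() and the transcript names are hypothetical and only
illustrate the idea; this is not how makeTxDbFromGRanges() works
internally.

```r
# Toy table of transcript parts (one row per exon); names are made up.
parts <- data.frame(
  tx_name  = c("tx1", "tx1", "tx2", "tx2", "tx3"),
  seqnames = c("chr1", "chr1", "chrX", "chrY", "chr2"),
  start    = c(100L, 300L, 500L, 500L, 50L),
  end      = c(200L, 400L, 600L, 600L, 90L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop transcripts whose parts sit on more than one
# sequence, warn about them, and keep their names in an attribute so they
# can still be retrieved afterwards.
drop_multi_seqname_tx <- function(parts) {
  n_seq <- tapply(parts$seqnames, parts$tx_name,
                  function(x) length(unique(x)))
  bad <- names(n_seq)[n_seq > 1L]
  if (length(bad) > 0L)
    warning("dropping ", length(bad),
            " transcript(s) with parts on several seqnames: ",
            paste(bad, collapse = ", "))
  kept <- parts[!(parts$tx_name %in% bad), , drop = FALSE]
  attr(kept, "dropped_tx") <- bad
  kept
}

kept <- drop_multi_seqname_tx(parts)   # emits a warning about tx2
attr(kept, "dropped_tx")
```

Keeping the dropped names as an attribute is one way to make the excluded
transcripts available to a later "special step".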


This was actually a bug in makeTxDbFromGRanges(). It's fixed in
GenomicFeatures 1.22.8 (release) and 1.23.17 (devel). With this
fix:

> txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: Gencode
# Genome: GRCh38
# Organism: Homo sapiens
# Resource URL: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_23/gencode.v23.basic.annotation.gff3.gz
# Full dataset: no
# transcript_nrow: 100769
# exon_nrow: 676601
# cds_nrow: 535301
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2016-01-11 22:28:51 -0800 (Mon, 11 Jan 2016)
# GenomicFeatures version at creation time: 1.23.17
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1

> transcripts(txdb)
GRanges object with 100769 ranges and 2 metadata columns:
           seqnames         ranges strand   |     tx_id           tx_name
              <Rle>      <IRanges>  <Rle>   | <integer>       <character>
       [1]     chr1 [11869, 14409]      +   |         1 ENST00000456328.2
       [2]     chr1 [12010, 13670]      +   |         2 ENST00000450305.2
       [3]     chr1 [29554, 31097]      +   |         3 ENST00000473358.1
       [4]     chr1 [30267, 31109]      +   |         4 ENST00000469289.1
       [5]     chr1 [30366, 30503]      +   |         5 ENST00000607096.1
       ...      ...            ...    ... ...       ...               ...
  [100765]     chrM [ 5826,  5891]      -   |    100765 ENST00000387409.1
  [100766]     chrM [ 7446,  7514]      -   |    100766 ENST00000387416.2
  [100767]     chrM [14149, 14673]      -   |    100767 ENST00000361681.2
  [100768]     chrM [14674, 14742]      -   |    100768 ENST00000387459.1
  [100769]     chrM [15956, 16023]      -   |    100769 ENST00000387461.2
  -------
  seqinfo: 25 sequences from GRCh38 genome; no seqlengths
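
To check whether a given GenomicFeatures version already carries this
fix, the version numbers can be compared along these lines. This is only
a sketch: has_granges_fix() is a hypothetical helper, and it assumes that
release versions are 1.22.x and devel versions 1.23.x, as given above.

```r
# Hypothetical helper: TRUE if a GenomicFeatures version string is at or
# past the fixed versions named above (1.22.8 release, 1.23.17 devel).
has_granges_fix <- function(version) {
  v <- package_version(version)
  if (v >= "1.23.0") v >= "1.23.17" else v >= "1.22.8"
}

has_granges_fix("1.22.7")   # FALSE
has_granges_fix("1.23.17")  # TRUE

# With the package installed, pass the installed version instead:
# has_granges_fix(as.character(packageVersion("GenomicFeatures")))
```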

H.

   ENST00000262640.10, ENST00000286448.10, ENST00000302805.6,
ENST00000313871.7, ENST00000326153.8, ENST00000331035.8,
   ENST00000334060.7, ENST00000334651.9, ENST00000355432.7,
ENST00000355805.6, ENST00000359512.7, ENST00000369423.6,
   ENST00000381180.7, ENST00000381184.5, ENST00000381187.7,
ENST00000381192.7, ENST00000381218.7, ENST00000381222.6,
   ENST00000381223.8, ENST00000381229.8, ENST00000381233.7,
ENST00000381241.7, ENST00000381261.7, ENST00000381297.8,
   ENST00000381317.7, ENST00000381333.8, ENST00000381401.9,
ENST00000381469.6, ENST00000381500.5, ENST00000381509.7,
   ENST00000381524.7, ENST00000381529.7, ENST00000381566.5,
ENST00000381567.7, ENST00000381575.5, ENST00000381578.5,
   ENST00000381657.6, ENST00000381663.7, ENST00000390665.7,
ENST00000391707.6, ENST00000399012.5, ENST00000399966.8,
   ENST00000400841.6, ENST00000411342.5, ENST00000412936


on top of this, even if it worked, these annotations are anchored at
Ensembl Gene identifiers, while the gene-centric annotations at org.Hs.eg.db



Is it true that there is an asymmetry between Entrez gene ID and Ensembl
gene ID for querying org.Hs.eg.db (I tend to prefer Homo.sapiens as a
symbol mapping resource)?  Both ENTREZID and ENSEMBL are listed as
keytypes.  My question is whether this "anchor" concept holds in the
current infrastructure.

are anchored at Entrez Gene identifiers. this means that more code would
be needed to add the corresponding Entrez IDs (resolving
multiplicities, etc.) and produce a TxDb package that can be used across
many of the typical BioC pipelines.
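
As an illustration of the "resolving multiplicities" step, here is a
minimal base R sketch on a made-up Ensembl-to-Entrez table. In practice
such a table could come from the ENSEMBL and ENTREZID keytypes of
org.Hs.eg.db. The IDs below, the helper name one_to_one(), and the policy
of keeping only one-to-one pairs are assumptions for the example, not a
recommendation.

```r
# Made-up mapping table; real identifiers would look different.
map <- data.frame(
  ENSEMBL  = c("ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C", "ENSG_D", "ENSG_E"),
  ENTREZID = c("1",      "2",      "3",      "4",      "4",      NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only unambiguous one-to-one pairs.
one_to_one <- function(map) {
  # Drop unmapped Ensembl genes and exact duplicate rows.
  map <- unique(map[!is.na(map$ENTREZID), , drop = FALSE])
  # Count how many rows share each Ensembl ID and each Entrez ID.
  n_per_ensembl <- ave(seq_along(map$ENSEMBL), map$ENSEMBL, FUN = length)
  n_per_entrez  <- ave(seq_along(map$ENTREZID), map$ENTREZID, FUN = length)
  map[n_per_ensembl == 1L & n_per_entrez == 1L, , drop = FALSE]
}

one_to_one(map)   # only the ENSG_A / 1 pair survives
```

Other policies are possible, for example keeping the first Entrez ID per
Ensembl gene; which one is appropriate depends on the downstream pipeline.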

since human gene annotations are at the core of many BioC pipelines, i'd
like to suggest that, for the forthcoming release cycles, the BioC core team
package Gencode annotations anchored at Entrez IDs, at least what is
called the "basic set", similarly to what is done with
TxDb.Hsapiens.UCSC.knownGene, to have an easy starting point for the
analysis of human data.


cheers,

robert.

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
