Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC
Just as an info: EnsDb objects/packages (from the ensembldb package) provide functionality similar to TxDb objects, are tailored to Ensembl annotations, and can be built from the Ensembl GTF files (which can be fetched via AnnotationHub; it’s all described in the ensembldb vignette).

cheers, jo

> On 11 Jan 2016, at 21:40, Paul Grosu wrote:
>
> Tim, you always crack me up! :) I totally agree, and it would probably be
> good to also have the tools enabled to download directly from Ensembl, NCBI,
> cloud-annotation source, etc. and build/update the AnnDbBimap objects. This
> way the annotation sources can maintain the data and us the scripts,
> including the pre-built AnnDbBimap objects just in case.
>
> ~p
>
> -----Original Message-----
> From: Bioc-devel [mailto:bioc-devel-boun...@r-project.org] On Behalf Of Tim
> Triche, Jr.
> Sent: Monday, January 11, 2016 2:02 PM
> To: Vincent Carey
> Cc: bioc-devel@r-project.org
> Subject: Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC
>
> ENSEMBL
>
> knownGene was always a disaster. For extra amusement/horror, be sure to
> check out the sad saga of the TCGA GAF and its disconnection from knownGenes
> as well as reality. Three cheers for rendering transcript-level estimates
> useless (and no this was not Katie's fault)
>
> Rainer and many others have made a herculean effort to bring all the BioC
> annotation infrastructure into the 21st century... having worked with
> Kallisto extensively of late, I see no reason to use a non-ENSEMBL
> "conservative" reference transcriptome (I see plenty of reasons to use
> miTranscriptome, etc. but that is another discussion).
>
> sorry if slighting anyone/everyone, but ENSEMBL is the clear choice IMHO.
>
> $0.02 - transmission costs
>
> --t
>
> On Mon, Jan 11, 2016 at 10:57 AM, Vincent Carey
> wrote:
>
>> I think these are all good observations and we may benefit from a
>> wider discussion on the support site?
>> >> the abandonment of knownGene seems to have clear implications for >> changing our most visible txdb examples. what should we change to? >> can we make a more future-proof design for these annotation >> selections? >> >> On Mon, Jan 11, 2016 at 1:40 PM, Robert Castelo >> >> wrote: >> >>> hi, >>> >>> On 01/11/2016 04:07 PM, Vincent Carey wrote: >>> [...] >>> >>>> Is it true that there is an asymmetry between Entrez gene ID and >>>> Ensembl gene ID for querying org.Hs.eg.db (I tend to prefer >>>> Homo.sapiens as a symbol mapping resource)? Both ENTREZID and >>>> ENSEMBL are listed as keytypes. My question is whether this >>>> "anchor" concept holds in the current infrastructure. >>>> >>> >>> you're right that the infrastructure is probably symmetric at least >>> between Entrez and Ensembl, so maybe i'm not using the term "anchor" >>> correctly here, i'm just referring to the fact that many package >> functions >>> and use cases of BioC are based in, or illustrated, using Entrez IDs. >>> examples are: >>> >>> head(org.Hs.eg.db::keys(org.Hs.eg.db)) >>> [1] "1" "2" "3" "9" "10" "11" >>> >>> i.e., by default the 'keytype' is 'ENTREZID' >>> >>> genefilter::nsFilter() argument 'require.entrez' filters out >>> features without an Entrez Gene ID annotation. >>> >>> Category::categoryToEntrezBuilder() returns a list mapping category >>> ids >> to >>> the Entrez Gene ids annotated at the cateogry id. >>> >>> SummarizedExperiment::geneRangeMapper() takes a 'TxDb' object and a >>> keytype to map ranges to genes. 
By default the keytype is 'ENTREZID' >>> >>> some of the workflows are also based on Entrez IDs, such as: >>> >>> >> http://www.bioconductor.org/help/workflows/annotation/Annotation_Resou >> rces >>> >>> http://www.bioconductor.org/help/workflows/variants >>> >>> so if the user just replaces the txdb object in one of those >>> examples or argument functions by a txdb object that does not have >>> Entrez identifiers as primary gene key, those functions, examples or >>> workflows will require modification. this is not necessarily bad, >>> but may put more burden on the user who is learning with a "
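For readers who want to try the EnsDb route mentioned at the top of this message, here is a minimal sketch. It does not build from GTF as described there; instead it assumes that current AnnotationHub releases list pre-built human EnsDb records, which should be checked against the hub itself. The helper name `latest_human_ensdb` is made up for illustration.

```r
# Sketch only: assumes the Bioconductor packages AnnotationHub and ensembldb
# are installed and that the hub lists pre-built human EnsDb records.
# `latest_human_ensdb` is a hypothetical helper, not part of any package.
latest_human_ensdb <- function() {
  ah <- AnnotationHub::AnnotationHub()
  hits <- AnnotationHub::query(ah, c("EnsDb", "Homo sapiens"))
  # names() gives the AH identifiers. Treating the last hit as the most
  # recent Ensembl release is an assumption about record ordering.
  hits[[utils::tail(names(hits), 1)]]
}

# Usage (needs network access):
# edb <- latest_human_ensdb()
# ensembldb::genes(edb)        # gene ranges keyed by Ensembl gene ID
# ensembldb::transcripts(edb)  # transcript ranges
```

The accessors mirror the TxDb ones, which is what makes an EnsDb a near drop-in replacement in code that does not depend on Entrez gene IDs.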
Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC
Hi Vince, Robert, On 01/11/2016 07:07 AM, Vincent Carey wrote: On Mon, Jan 11, 2016 at 9:29 AM, Robert Castelo wrote: hi, if i'm interpreting this correctly, the news archive of the UCSC Genome Browser accessible here: http://genome.ucsc.edu/goldenPath/newsarch.html announced on June 29th, 2015, that they are discontinuing the generation of UCSC Known Genes annotations for human, and provide the Gencode annotations as default replacement. the BioC site provides as default gene annotations for human the UCSC Known Genes track and currently does not provide the Gencode annotations. the GenomicFeatures package allows one to build such an annotation package. unfortunately the current "supported" UCSC tables that can be easily used via 'makeTxDbPackageFromUCSC()' reports up to Gencode version V17: library(GenomicFeatures) xx <- supportedUCSCtables() xx[grep("GENCODE Genes", xx$track), ] track subtrack wgEncodeGencodeBasicV17 GENCODE Genes V17 wgEncodeGencodeCompV17 GENCODE Genes V17 wgEncodeGencodePseudoGeneV17 GENCODE Genes V17 wgEncodeGencode2wayConsPseudoV17 GENCODE Genes V17 wgEncodeGencodePolyaV17 GENCODE Genes V17 wgEncodeGencodeBasicV14 GENCODE Genes V14 wgEncodeGencodeCompV14 GENCODE Genes V14 wgEncodeGencodePseudoGeneV14 GENCODE Genes V14 wgEncodeGencode2wayConsPseudoV14 GENCODE Genes V14 wgEncodeGencodePolyaV14 GENCODE Genes V14 wgEncodeGencodeBasicV7GENCODE Genes V7 wgEncodeGencodeCompV7 GENCODE Genes V7 wgEncodeGencodePseudoGeneV7 GENCODE Genes V7 wgEncodeGencode2wayConsPseudoV7 GENCODE Genes V7 wgEncodeGencodePolyaV7GENCODE Genes V7 which is about 2 years old. current Gencode gene annotations are V24 and at least V22 was available at: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database before the last BioC release. according to a recent announcement at the BioC support site: https://support.bioconductor.org/p/71574 AnnotationHub seems to be now the proper way to import the most recent Gencode annotations into BioC. 
however, at least in my hands, making the corresponding TxDb object produces an error; see the following example:

  library(AnnotationHub)
  ah <- AnnotationHub()
  human_gff <- query(ah, c("Gencode", "gff", "human"))
  gencodeV23basicGFF <- ah[["AH49556"]]
  metadata <- data.frame(name=c("Data source", "Genome", "Organism",
                                "Resource URL", "Full dataset"),
                         value=c(ah["AH49556"]$dataprovider,
                                 ah["AH49556"]$genome,
                                 ah["AH49556"]$species,
                                 ah["AH49556"]$sourceurl,
                                 "no"))
  txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
  Error in .merge_transcript_parts(transcripts) :
    The following transcripts have multiple parts that cannot be merged
    because of incompatible seqnames: ENST00000244174.9,

should this be an error, or would a softer landing be more useful here? warn and exclude the offensive elements, perhaps with an option to retrieve them through some special step (option or new function)?

This was actually a bug in makeTxDbFromGRanges(). It's fixed in GenomicFeatures 1.22.8 (release) and 1.23.17 (devel).
With this fix:

  > txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
  > txdb
  TxDb object:
  # Db type: TxDb
  # Supporting package: GenomicFeatures
  # Data source: Gencode
  # Genome: GRCh38
  # Organism: Homo sapiens
  # Resource URL: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_23/gencode.v23.basic.annotation.gff3.gz
  # Full dataset: no
  # transcript_nrow: 100769
  # exon_nrow: 676601
  # cds_nrow: 535301
  # Db created by: GenomicFeatures package from Bioconductor
  # Creation time: 2016-01-11 22:28:51 -0800 (Mon, 11 Jan 2016)
  # GenomicFeatures version at creation time: 1.23.17
  # RSQLite version at creation time: 1.0.0
  # DBSCHEMAVERSION: 1.1
  > transcripts(txdb)
  GRanges object with 100769 ranges and 2 metadata columns:
             seqnames         ranges strand |     tx_id           tx_name
                <Rle>      <IRanges>  <Rle> | <integer>       <character>
         [1]     chr1 [11869, 14409]      + |         1 ENST00000456328.2
         [2]     chr1 [12010, 13670]      + |         2 ENST00000450305.2
         [3]     chr1 [29554, 31097]      + |         3 ENST00000473358.1
         [4]     chr1 [30267, 31109]      + |         4 ENST00000469289.1
         [5]     chr1 [30366, 30503]      + |         5 ENST00000607096.1
         ...      ...            ...    ... .       ...               ...
    [100765]     chrM [ 5826,  5891]      - |    100765 ENST00000387409.1
    [100766]     chrM [ 7446,  7514]      - |    100766 ENST00000387416.2
    [100767]     chrM [14149, 14673]      - |    100767 ENST00000361681.2
    [100768]     chrM [14674, 14742]      - |    100768 ENST00000387459.1
    [100769]     chrM [15956, 16023]      - |    100769 ENST00000387461.2
  -
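Since the fix landed in two branches (1.22.8 in release, 1.23.17 in devel), a quick way to tell whether an installation already has it is to compare version numbers. The following is only a sketch in base R; the helper name `has_granges_fix` is invented here, and treating every 1.23.x and later version as the devel line is an assumption.

```r
# Hypothetical helper: TRUE if a GenomicFeatures version includes the
# makeTxDbFromGRanges() fix reported in this message.
# Assumption: versions >= 1.23.0 belong to the devel line (fixed in 1.23.17),
# anything older belongs to the release line (fixed in 1.22.8).
has_granges_fix <- function(v) {
  v <- package_version(v)
  if (v >= "1.23.0") v >= "1.23.17" else v >= "1.22.8"
}

has_granges_fix("1.22.7")   # FALSE
has_granges_fix("1.23.17")  # TRUE

# With the package installed:
# has_granges_fix(utils::packageVersion("GenomicFeatures"))
```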
Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC
that looks great, thanks Hervé for addressing this quickly. robert. On 1/11/16 11:18 PM, Hervé Pagès wrote: With GenomicFeatures 1.23.16: > txdb <- makeTxDbFromUCSC("hg38", "knownGene") Download the knownGene table ... OK Download the knownToLocusLink table ... OK Extract the 'transcripts' data frame ... OK Extract the 'splicings' data frame ... OK Download and preprocess the 'chrominfo' data frame ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK Warning message: In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) : UCSC data anomaly in 19942 transcript(s): the cds cumulative length is not a multiple of 3 for transcripts ‘uc057axw.1’ ‘uc031pko.2’ ‘uc057ayh.1’ ‘uc057ayr.1’ ‘uc057azl.1’ ‘uc057azn.1’ ‘uc057azp.1’ ‘uc057azt.1’ ‘uc057azy.1’ ‘uc057bad.1’ ‘uc057bap.1’ ‘uc057bav.1’ ‘uc057bay.1’ ‘uc057bbh.1’ ‘uc057bbv.1’ ‘uc057bcm.1’ ‘uc057bcp.1’ ‘uc057bcw.1’ ‘uc057bcx.1’ ‘uc057bdf.1’ ‘uc057bdj.1’ ‘uc057bdk.1’ ‘uc057bdl.1’ ‘uc057bdp.1’ ‘uc057beb.1’ ‘uc057beq.1’ ‘uc057bfn.1’ ‘uc057bfs.1’ ‘uc057bgm.1’ ‘uc057bgo.1’ ‘uc057bgt.1’ ‘uc057bhd.1’ ‘uc057bhi.1’ ‘uc057bio.1’ ‘uc057biu.1’ ‘uc057bji.1’ ‘uc057bjj.1’ ‘uc057bjk.1’ ‘uc057bkd.1’ ‘uc057bkf.1’ ‘uc057bkj.1’ ‘uc057bkl.1’ ‘uc057bkt.1’ ‘uc057bld.1’ ‘uc057bli.1’ ‘uc057blj.1’ ‘uc057blz.1’ ‘uc057bmf.1’ ‘uc057bmr.1’ ‘uc057bmv.1’ ‘uc057bni.1’ ‘u [... 
truncated] > txdb TxDb object: # Db type: TxDb # Supporting package: GenomicFeatures # Data source: UCSC # Genome: hg38 # Organism: Homo sapiens # Taxonomy ID: 9606 # UCSC Table: knownGene # UCSC Track: GENCODE v22 # Resource URL: http://genome.ucsc.edu/ # Type of Gene ID: Entrez Gene ID # Full dataset: yes # miRBase build ID: NA # transcript_nrow: 195178 # exon_nrow: 575044 # cds_nrow: 291225 # Db created by: GenomicFeatures package from Bioconductor # Creation time: 2016-01-11 14:10:24 -0800 (Mon, 11 Jan 2016) # GenomicFeatures version at creation time: 1.23.16 # RSQLite version at creation time: 1.0.0 # DBSCHEMAVERSION: 1.1 Note the new "UCSC Track" field above. Cheers, H. On 01/11/2016 01:12 PM, Hervé Pagès wrote: Hi Robert and others, I looked at this and the new situation doesn't seem as disruptive as it sounds. The bulk of the data for both tracks (i.e. the "UCSC Genes" track for hg19 and the "GENCODE v22" track for hg38) is stored in the knownGene table. The hg19.knownGene table is described here: https://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene&hgta_doSchema=describe+table+schema The hg38.knownGene table is described here: https://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene&hgta_doSchema=describe+table+schema The 2 pages are very similar. In particular both tables are connected to the knownToLocusLink table where Entrez Gene IDs are stored. So from a makeTxDbFromUCSC() point of view everything looks the same except the name of the track: "UCSC Genes" for hg19 and "GENCODE v22" for hg38. That means it shouldn't be hard to tweak makeTxDbFromUCSC() to support: txdb <- makeTxDbFromUCSC("hg38", "knownGene") The returned 'txdb' will contain data from the "GENCODE v22" track and with transcripts mapped to Entrez Gene IDs. I'll work on this and will also investigate makeTxDbFromGRanges's failure on AnnotationHub's GFF files from GENCODE. H. 
On 01/11/2016 06:29 AM, Robert Castelo wrote: hi, if i'm interpreting this correctly, the news archive of the UCSC Genome Browser accessible here: http://genome.ucsc.edu/goldenPath/newsarch.html announced on June 29th, 2015, that they are discontinuing the generation of UCSC Known Genes annotations for human, and provide the Gencode annotations as default replacement. the BioC site provides as default gene annotations for human the UCSC Known Genes track and currently does not provide the Gencode annotations. the GenomicFeatures package allows one to build such an annotation package. unfortunately the current "supported" UCSC tables that can be easily used via 'makeTxDbPackageFromUCSC()' reports up to Gencode version V17: library(GenomicFeatures) xx <- supportedUCSCtables() xx[grep("GENCODE Genes", xx$track), ] track subtrack wgEncodeGencodeBasicV17 GENCODE Genes V17 wgEncodeGencodeCompV17 GENCODE Genes V17 wgEncodeGencodePseudoGeneV17 GENCODE Genes V17 wgEncodeGencode2wayConsPseudoV17 GENCODE Genes V17 wgEncodeGencodePolyaV17 GENCODE Genes V17 wgEncodeGencodeBasicV14 GENCODE Genes V14 wgEncodeGencodeCompV14 GENCODE Genes V14 wgEncodeGencodePseudoGeneV14 GENCODE Genes V14 wgEncodeGencode2wayConsPseudoV14 GENCODE Genes V14 wgEncodeGencodePolyaV14 GENCODE Genes V14 wgEncodeGencodeBasicV7GENCODE Genes V7 wgEncodeGencodeCompV7 GENCODE Genes V7 wgEncodeGencodePseudoGeneV7 GENCODE Genes V7 wgEncodeGencode2wayConsPseudoV7 GENCO
Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC
With GenomicFeatures 1.23.16:

  > txdb <- makeTxDbFromUCSC("hg38", "knownGene")
  Download the knownGene table ... OK
  Download the knownToLocusLink table ... OK
  Extract the 'transcripts' data frame ... OK
  Extract the 'splicings' data frame ... OK
  Download and preprocess the 'chrominfo' data frame ... OK
  Prepare the 'metadata' data frame ... OK
  Make the TxDb object ... OK
  Warning message:
  In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
    UCSC data anomaly in 19942 transcript(s): the cds cumulative length is
    not a multiple of 3 for transcripts
    ‘uc057axw.1’ ‘uc031pko.2’ ‘uc057ayh.1’ ‘uc057ayr.1’ ‘uc057azl.1’ ‘uc057azn.1’
    ‘uc057azp.1’ ‘uc057azt.1’ ‘uc057azy.1’ ‘uc057bad.1’ ‘uc057bap.1’ ‘uc057bav.1’
    ‘uc057bay.1’ ‘uc057bbh.1’ ‘uc057bbv.1’ ‘uc057bcm.1’ ‘uc057bcp.1’ ‘uc057bcw.1’
    ‘uc057bcx.1’ ‘uc057bdf.1’ ‘uc057bdj.1’ ‘uc057bdk.1’ ‘uc057bdl.1’ ‘uc057bdp.1’
    ‘uc057beb.1’ ‘uc057beq.1’ ‘uc057bfn.1’ ‘uc057bfs.1’ ‘uc057bgm.1’ ‘uc057bgo.1’
    ‘uc057bgt.1’ ‘uc057bhd.1’ ‘uc057bhi.1’ ‘uc057bio.1’ ‘uc057biu.1’ ‘uc057bji.1’
    ‘uc057bjj.1’ ‘uc057bjk.1’ ‘uc057bkd.1’ ‘uc057bkf.1’ ‘uc057bkj.1’ ‘uc057bkl.1’
    ‘uc057bkt.1’ ‘uc057bld.1’ ‘uc057bli.1’ ‘uc057blj.1’ ‘uc057blz.1’ ‘uc057bmf.1’
    ‘uc057bmr.1’ ‘uc057bmv.1’ ‘uc057bni.1’ ‘u [... truncated]
  > txdb
  TxDb object:
  # Db type: TxDb
  # Supporting package: GenomicFeatures
  # Data source: UCSC
  # Genome: hg38
  # Organism: Homo sapiens
  # Taxonomy ID: 9606
  # UCSC Table: knownGene
  # UCSC Track: GENCODE v22
  # Resource URL: http://genome.ucsc.edu/
  # Type of Gene ID: Entrez Gene ID
  # Full dataset: yes
  # miRBase build ID: NA
  # transcript_nrow: 195178
  # exon_nrow: 575044
  # cds_nrow: 291225
  # Db created by: GenomicFeatures package from Bioconductor
  # Creation time: 2016-01-11 14:10:24 -0800 (Mon, 11 Jan 2016)
  # GenomicFeatures version at creation time: 1.23.16
  # RSQLite version at creation time: 1.0.0
  # DBSCHEMAVERSION: 1.1

Note the new "UCSC Track" field above.

Cheers,
H.
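The warning in the output above flags transcripts whose summed CDS length is not a multiple of 3. As an illustration of what that check amounts to, here is a small base R sketch on a made-up CDS table; the transcript names and coordinates are invented, and this is not the code GenomicFeatures runs internally.

```r
# Toy CDS table, one row per CDS part (all values are made up).
cds <- data.frame(
  tx_name = c("txA", "txA", "txB", "txB"),
  start   = c(101L, 201L, 501L, 601L),
  end     = c(130L, 260L, 520L, 650L),
  stringsAsFactors = FALSE
)

# Cumulative CDS length per transcript (coordinates are 1-based, inclusive).
cds_len <- tapply(cds$end - cds$start + 1L, cds$tx_name, sum)

# Transcripts whose CDS cannot be split into whole codons.
anomalous <- names(cds_len)[cds_len %% 3L != 0L]
anomalous  # "txB" (20 + 50 = 70 bases; txA has 30 + 60 = 90)
```

On a real TxDb, a comparable per-transcript length could presumably be obtained from `cdsBy(txdb, by="tx")`, though that route is not shown in the thread.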
On 01/11/2016 01:12 PM, Hervé Pagès wrote: Hi Robert and others, I looked at this and the new situation doesn't seem as disruptive as it sounds. The bulk of the data for both tracks (i.e. the "UCSC Genes" track for hg19 and the "GENCODE v22" track for hg38) is stored in the knownGene table. The hg19.knownGene table is described here: https://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene&hgta_doSchema=describe+table+schema The hg38.knownGene table is described here: https://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene&hgta_doSchema=describe+table+schema The 2 pages are very similar. In particular both tables are connected to the knownToLocusLink table where Entrez Gene IDs are stored. So from a makeTxDbFromUCSC() point of view everything looks the same except the name of the track: "UCSC Genes" for hg19 and "GENCODE v22" for hg38. That means it shouldn't be hard to tweak makeTxDbFromUCSC() to support: txdb <- makeTxDbFromUCSC("hg38", "knownGene") The returned 'txdb' will contain data from the "GENCODE v22" track and with transcripts mapped to Entrez Gene IDs. I'll work on this and will also investigate makeTxDbFromGRanges's failure on AnnotationHub's GFF files from GENCODE. H. On 01/11/2016 06:29 AM, Robert Castelo wrote: hi, if i'm interpreting this correctly, the news archive of the UCSC Genome Browser accessible here: http://genome.ucsc.edu/goldenPath/newsarch.html announced on June 29th, 2015, that they are discontinuing the generation of UCSC Known Genes annotations for human, and provide the Gencode annotations as default replacement. the BioC site provides as default gene annotations for human the UCSC Known Genes track and currently does not provide the Gencode annotations. the GenomicFeatures package allows one to build such an annotation package. 
unfortunately the current "supported" UCSC tables that can be easily used via 'makeTxDbPackageFromUCSC()' reports up to Gencode version V17: library(GenomicFeatures) xx <- supportedUCSCtables() xx[grep("GENCODE Genes", xx$track), ] track subtrack wgEncodeGencodeBasicV17 GENCODE Genes V17 wgEncodeGencodeCompV17 GENCODE Genes V17 wgEncodeGencodePseudoGeneV17 GENCODE Genes V17 wgEncodeGencode2wayConsPseudoV17 GENCODE Genes V17 wgEncodeGencodePolyaV17 GENCODE Genes V17 wgEncodeGencodeBasicV14 GENCODE Genes V14 wgEncodeGencodeCompV14 GENCODE Genes V14 wgEncodeGencodePseudoGeneV14 GENCODE Genes V14 wgEncodeGencode2wayConsPseudoV14 GENCODE Genes V14 wgEncodeGencodePolyaV14 GENCODE Genes V14 wgEncodeGencodeBasicV7GENCODE Genes V7 wgEncodeGencodeCompV7 GENCODE Genes V7 wgEncodeGencodePseudoGeneV7 GENCODE Genes V7 wgEncodeGencode2wayConsPseudoV7 GENCODE Genes V7 wgEncodeGencodePolyaV7GENCODE Genes V7
Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC
Hi Robert and others, I looked at this and the new situation doesn't seem as disruptive as it sounds. The bulk of the data for both tracks (i.e. the "UCSC Genes" track for hg19 and the "GENCODE v22" track for hg38) is stored in the knownGene table. The hg19.knownGene table is described here: https://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene&hgta_doSchema=describe+table+schema The hg38.knownGene table is described here: https://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene&hgta_doSchema=describe+table+schema The 2 pages are very similar. In particular both tables are connected to the knownToLocusLink table where Entrez Gene IDs are stored. So from a makeTxDbFromUCSC() point of view everything looks the same except the name of the track: "UCSC Genes" for hg19 and "GENCODE v22" for hg38. That means it shouldn't be hard to tweak makeTxDbFromUCSC() to support: txdb <- makeTxDbFromUCSC("hg38", "knownGene") The returned 'txdb' will contain data from the "GENCODE v22" track and with transcripts mapped to Entrez Gene IDs. I'll work on this and will also investigate makeTxDbFromGRanges's failure on AnnotationHub's GFF files from GENCODE. H. On 01/11/2016 06:29 AM, Robert Castelo wrote: hi, if i'm interpreting this correctly, the news archive of the UCSC Genome Browser accessible here: http://genome.ucsc.edu/goldenPath/newsarch.html announced on June 29th, 2015, that they are discontinuing the generation of UCSC Known Genes annotations for human, and provide the Gencode annotations as default replacement. the BioC site provides as default gene annotations for human the UCSC Known Genes track and currently does not provide the Gencode annotations. the GenomicFeatures package allows one to build such an annotation package. 
unfortunately the current "supported" UCSC tables that can be easily used via 'makeTxDbPackageFromUCSC()' reports up to Gencode version V17: library(GenomicFeatures) xx <- supportedUCSCtables() xx[grep("GENCODE Genes", xx$track), ] track subtrack wgEncodeGencodeBasicV17 GENCODE Genes V17 wgEncodeGencodeCompV17 GENCODE Genes V17 wgEncodeGencodePseudoGeneV17 GENCODE Genes V17 wgEncodeGencode2wayConsPseudoV17 GENCODE Genes V17 wgEncodeGencodePolyaV17 GENCODE Genes V17 wgEncodeGencodeBasicV14 GENCODE Genes V14 wgEncodeGencodeCompV14 GENCODE Genes V14 wgEncodeGencodePseudoGeneV14 GENCODE Genes V14 wgEncodeGencode2wayConsPseudoV14 GENCODE Genes V14 wgEncodeGencodePolyaV14 GENCODE Genes V14 wgEncodeGencodeBasicV7GENCODE Genes V7 wgEncodeGencodeCompV7 GENCODE Genes V7 wgEncodeGencodePseudoGeneV7 GENCODE Genes V7 wgEncodeGencode2wayConsPseudoV7 GENCODE Genes V7 wgEncodeGencodePolyaV7GENCODE Genes V7 which is about 2 years old. current Gencode gene annotations are V24 and at least V22 was available at: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database before the last BioC release. according to a recent announcement at the BioC support site: https://support.bioconductor.org/p/71574 AnnotationHub seems to be now the proper way to import the most recent Gencode annotations into BioC. 
however, at least in my hands, making the corresponding TxDb object produces an error; see the following example: library(AnnotationHub) ah <- AnnotationHub() human_gff <- query(ah, c("Gencode", "gff", "human")) gencodeV23basicGFF <- ah[["AH49556"]] metadata <- data.frame(name=c("Data source", "Genome", "Organism", "Resource URL", "Full dataset"), value=c(ah["AH49556"]$dataprovider, ah["AH49556"]$genome, ah["AH49556"]$species, ah["AH49556"]$sourceurl, "no")) txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata) Error in .merge_transcript_parts(transcripts) : The following transcripts have multiple parts that cannot be merged because of incompatible seqnames: ENST0244174.9, ENST0262640.10, ENST0286448.10, ENST0302805.6, ENST0313871.7, ENST0326153.8, ENST0331035.8, ENST0334060.7, ENST0334651.9, ENST0355432.7, ENST0355805.6, ENST0359512.7, ENST0369423.6, ENST0381180.7, ENST0381184.5, ENST0381187.7, ENST0381192.7, ENST0381218.7, ENST0381222.6, ENST0381223.8, ENST0381229.8, ENST0381233.7, ENST0381241.7, ENST0381261.7, ENST0381297.8, ENST0381317.7, ENST0381333.8, ENST0381401.9, ENST0381469.6, ENST0381500.5, ENST0381509.7, ENST0381524.7, ENST0381529.7, ENST0381566.5, ENST0381567.7, ENST0381575.5, ENST0381578.5, ENST0381657.6, ENST0381663.7, ENST0390665.7, ENST0391707.6, ENST0
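The error quoted above comes from transcripts whose parts sit on more than one seqname. The "warn and exclude" idea floated elsewhere in this thread could be approximated by a pre-filter like the one below. It is a base R sketch on a toy table with invented IDs; the function name `drop_multi_seqname_tx` is hypothetical, and this is not how the actual fix in GenomicFeatures works.

```r
# Hypothetical pre-filter: drop transcripts whose parts are spread over more
# than one seqname, and warn about what was dropped.
# `parts` needs the columns `transcript_id` and `seqnames`.
drop_multi_seqname_tx <- function(parts) {
  n_seq <- tapply(parts$seqnames, parts$transcript_id,
                  function(x) length(unique(x)))
  offending <- names(n_seq)[n_seq > 1L]
  if (length(offending) > 0L) {
    warning("dropping ", length(offending),
            " transcript(s) found on multiple seqnames: ",
            paste(offending, collapse = ", "))
  }
  parts[!parts$transcript_id %in% offending, , drop = FALSE]
}

# Toy input: tx2 has one part on chrX and one on chrY.
parts <- data.frame(
  transcript_id = c("tx1", "tx1", "tx2", "tx2", "tx3"),
  seqnames      = c("chr1", "chr1", "chrX", "chrY", "chr2"),
  stringsAsFactors = FALSE
)
kept <- suppressWarnings(drop_multi_seqname_tx(parts))
```

Excluding such transcripts loses data, so keeping the dropped IDs available for inspection would be worth considering in any real implementation.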
Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC
Tim, you always crack me up! :) I totally agree, and it would probably be good to also have the tools enabled to download directly from Ensembl, NCBI, cloud-annotation source, etc. and build/update the AnnDbBimap objects. This way the annotation sources can maintain the data and us the scripts, including the pre-built AnnDbBimap objects just in case. ~p -Original Message- From: Bioc-devel [mailto:bioc-devel-boun...@r-project.org] On Behalf Of Tim Triche, Jr. Sent: Monday, January 11, 2016 2:02 PM To: Vincent Carey Cc: bioc-devel@r-project.org Subject: Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC ENSEMBL knownGene was always a disaster. For extra amusement/horror, be sure to check out the sad saga of the TCGA GAF and its disconnection from knownGenes as well as reality. Three cheers for rendering transcript-level estimates useless (and no this was not Katie's fault) Rainer and many others have made a herculean effort to bring all the BioC annotation infrastructure into the 21st century... having worked with Kallisto extensively of late, I see no reason to use a non-ENSEMBL "conservative" reference transcriptome (I see plenty of reasons to use miTranscriptome, etc. but that is another discussion). sorry if slighting anyone/everyone, but ENSEMBL is the clear choice IMHO. $0.02 - transmission costs --t On Mon, Jan 11, 2016 at 10:57 AM, Vincent Carey wrote: > I think these are all good observations and we may benefit from a > wider discussion on the support site? > > the abandonment of knownGene seems to have clear implications for > changing our most visible txdb examples. what should we change to? > can we make a more future-proof design for these annotation > selections? > > On Mon, Jan 11, 2016 at 1:40 PM, Robert Castelo > > wrote: > > > hi, > > > > On 01/11/2016 04:07 PM, Vincent Carey wrote: > > [...] 
> > > >> Is it true that there is an asymmetry between Entrez gene ID and > >> Ensembl gene ID for querying org.Hs.eg.db (I tend to prefer > >> Homo.sapiens as a symbol mapping resource)? Both ENTREZID and > >> ENSEMBL are listed as keytypes. My question is whether this > >> "anchor" concept holds in the current infrastructure. > >> > > > > you're right that the infrastructure is probably symmetric at least > > between Entrez and Ensembl, so maybe i'm not using the term "anchor" > > correctly here, i'm just referring to the fact that many package > functions > > and use cases of BioC are based in, or illustrated, using Entrez IDs. > > examples are: > > > > head(org.Hs.eg.db::keys(org.Hs.eg.db)) > > [1] "1" "2" "3" "9" "10" "11" > > > > i.e., by default the 'keytype' is 'ENTREZID' > > > > genefilter::nsFilter() argument 'require.entrez' filters out > > features without an Entrez Gene ID annotation. > > > > Category::categoryToEntrezBuilder() returns a list mapping category > > ids > to > > the Entrez Gene ids annotated at the cateogry id. > > > > SummarizedExperiment::geneRangeMapper() takes a 'TxDb' object and a > > keytype to map ranges to genes. By default the keytype is 'ENTREZID' > > > > some of the workflows are also based on Entrez IDs, such as: > > > > > http://www.bioconductor.org/help/workflows/annotation/Annotation_Resou > rces > > > > http://www.bioconductor.org/help/workflows/variants > > > > so if the user just replaces the txdb object in one of those > > examples or argument functions by a txdb object that does not have > > Entrez identifiers as primary gene key, those functions, examples or > > workflows will require modification. this is not necessarily bad, > > but may put more burden on the user who is learning with a "default" TxDb human gene annotation package. > > this has been so far the *.UCSC.knownGene using Entrez as gene > identifiers. 
> > given the apparent discontinuity of UCSC with the known gene track, i would
> > suggest to put available at the BioC site another default gene
> > annotation package, but then one based on Entrez identifiers given
> > the amount of legacy code and documentation using Entrez in one way or another.
> >
> > an alternative to translating the default Ensembl Gencode identifiers into
> > Entrez would be to just take the NCBI RefSeq annotations as human
> > gene annotation package available by default, i.e., replacing
> > current *.UCSC.knownGene by *.UCSC.refGene
> >
> > robert.
Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC
ENSEMBL knownGene was always a disaster. For extra amusement/horror, be sure to check out the sad saga of the TCGA GAF and its disconnection from knownGenes as well as reality. Three cheers for rendering transcript-level estimates useless (and no this was not Katie's fault) Rainer and many others have made a herculean effort to bring all the BioC annotation infrastructure into the 21st century... having worked with Kallisto extensively of late, I see no reason to use a non-ENSEMBL "conservative" reference transcriptome (I see plenty of reasons to use miTranscriptome, etc. but that is another discussion). sorry if slighting anyone/everyone, but ENSEMBL is the clear choice IMHO. $0.02 - transmission costs --t On Mon, Jan 11, 2016 at 10:57 AM, Vincent Carey wrote: > I think these are all good observations and we may benefit from a wider > discussion on the support site? > > the abandonment of knownGene seems to have clear implications for changing > our most visible txdb > examples. what should we change to? can we make a more future-proof > design for these annotation selections? > > On Mon, Jan 11, 2016 at 1:40 PM, Robert Castelo > wrote: > > > hi, > > > > On 01/11/2016 04:07 PM, Vincent Carey wrote: > > [...] > > > >> Is it true that there is an asymmetry between Entrez gene ID and Ensembl > >> gene ID for querying org.Hs.eg.db (I tend to prefer Homo.sapiens > >> as a symbol mapping resource)? Both ENTREZID and ENSEMBL are listed as > >> keytypes. My question is whether this "anchor" concept > >> holds in the current infrastructure. > >> > > > > you're right that the infrastructure is probably symmetric at least > > between Entrez and Ensembl, so maybe i'm not using the term "anchor" > > correctly here, i'm just referring to the fact that many package > functions > > and use cases of BioC are based in, or illustrated, using Entrez IDs. 
> > examples are: > > > > head(org.Hs.eg.db::keys(org.Hs.eg.db)) > > [1] "1" "2" "3" "9" "10" "11" > > > > i.e., by default the 'keytype' is 'ENTREZID' > > > > genefilter::nsFilter() argument 'require.entrez' filters out features > > without an Entrez Gene ID annotation. > > > > Category::categoryToEntrezBuilder() returns a list mapping category ids > to > > the Entrez Gene ids annotated at the cateogry id. > > > > SummarizedExperiment::geneRangeMapper() takes a 'TxDb' object and a > > keytype to map ranges to genes. By default the keytype is 'ENTREZID' > > > > some of the workflows are also based on Entrez IDs, such as: > > > > > http://www.bioconductor.org/help/workflows/annotation/Annotation_Resources > > > > http://www.bioconductor.org/help/workflows/variants > > > > so if the user just replaces the txdb object in one of those examples or > > argument functions by a txdb object that does not have Entrez identifiers > > as primary gene key, those functions, examples or workflows will require > > modification. this is not necessarily bad, but may put more burden on the > > user who is learning with a "default" TxDb human gene annotation package. > > this has been so far the *.UCSC.knownGene using Entrez as gene > identifiers. > > given the apparent discontinuity of UCSC with the known gene track, i > would > > suggest to put available at the BioC site another default gene annotation > > package, but then one based on Entrez identifiers given the amount of > > legacy code and documentation using Entrez in one way or another. > > > > an alternative to translating the default Ensembl Gencode identifiers > into > > Entrez would be to just take the NCBI RefSeq annotations as human gene > > annotation package available by default, i.e., replacing current > > *.UCSC.knownGene by *.UCSC.refGene > > > > > > > > robert. 
Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC
I think these are all good observations and we may benefit from a wider discussion on the support site?

the abandonment of knownGene seems to have clear implications for changing our most visible txdb examples. what should we change to? can we make a more future-proof design for these annotation selections?

On Mon, Jan 11, 2016 at 1:40 PM, Robert Castelo wrote:

> hi,
>
> On 01/11/2016 04:07 PM, Vincent Carey wrote:
> [...]
>
>> Is it true that there is an asymmetry between Entrez gene ID and Ensembl
>> gene ID for querying org.Hs.eg.db (I tend to prefer Homo.sapiens
>> as a symbol mapping resource)? Both ENTREZID and ENSEMBL are listed as
>> keytypes. My question is whether this "anchor" concept
>> holds in the current infrastructure.
>
> you're right that the infrastructure is probably symmetric at least
> between Entrez and Ensembl, so maybe i'm not using the term "anchor"
> correctly here, i'm just referring to the fact that many package functions
> and use cases of BioC are based on, or illustrated using, Entrez IDs.
> examples are:
>
> head(org.Hs.eg.db::keys(org.Hs.eg.db))
> [1] "1"  "2"  "3"  "9"  "10" "11"
>
> i.e., by default the 'keytype' is 'ENTREZID'
>
> genefilter::nsFilter() argument 'require.entrez' filters out features
> without an Entrez Gene ID annotation.
>
> Category::categoryToEntrezBuilder() returns a list mapping category ids to
> the Entrez Gene ids annotated at the category id.
>
> SummarizedExperiment::geneRangeMapper() takes a 'TxDb' object and a
> keytype to map ranges to genes. By default the keytype is 'ENTREZID'
>
> some of the workflows are also based on Entrez IDs, such as:
>
> http://www.bioconductor.org/help/workflows/annotation/Annotation_Resources
>
> http://www.bioconductor.org/help/workflows/variants
>
> so if the user just replaces the txdb object in one of those examples or
> function arguments by a txdb object that does not have Entrez identifiers
> as primary gene key, those functions, examples or workflows will require
> modification. this is not necessarily bad, but may put more burden on the
> user who is learning with a "default" TxDb human gene annotation package.
> this has been so far the *.UCSC.knownGene using Entrez as gene identifiers.
> given that UCSC has apparently discontinued the Known Genes track, i would
> suggest making another default gene annotation package available at the
> BioC site, but one based on Entrez identifiers given the amount of
> legacy code and documentation using Entrez in one way or another.
>
> an alternative to translating the default Ensembl Gencode identifiers into
> Entrez would be to just take the NCBI RefSeq annotations as human gene
> annotation package available by default, i.e., replacing current
> *.UCSC.knownGene by *.UCSC.refGene
>
> robert.
Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC
hi,

On 01/11/2016 04:07 PM, Vincent Carey wrote:
[...]

> Is it true that there is an asymmetry between Entrez gene ID and Ensembl
> gene ID for querying org.Hs.eg.db (I tend to prefer Homo.sapiens
> as a symbol mapping resource)? Both ENTREZID and ENSEMBL are listed as
> keytypes. My question is whether this "anchor" concept
> holds in the current infrastructure.

you're right that the infrastructure is probably symmetric at least between Entrez and Ensembl, so maybe i'm not using the term "anchor" correctly here, i'm just referring to the fact that many package functions and use cases of BioC are based on, or illustrated using, Entrez IDs. examples are:

head(org.Hs.eg.db::keys(org.Hs.eg.db))
[1] "1"  "2"  "3"  "9"  "10" "11"

i.e., by default the 'keytype' is 'ENTREZID'

genefilter::nsFilter() argument 'require.entrez' filters out features without an Entrez Gene ID annotation.

Category::categoryToEntrezBuilder() returns a list mapping category ids to the Entrez Gene ids annotated at the category id.

SummarizedExperiment::geneRangeMapper() takes a 'TxDb' object and a keytype to map ranges to genes. By default the keytype is 'ENTREZID'

some of the workflows are also based on Entrez IDs, such as:

http://www.bioconductor.org/help/workflows/annotation/Annotation_Resources

http://www.bioconductor.org/help/workflows/variants

so if the user just replaces the txdb object in one of those examples or function arguments by a txdb object that does not have Entrez identifiers as primary gene key, those functions, examples or workflows will require modification. this is not necessarily bad, but may put more burden on the user who is learning with a "default" TxDb human gene annotation package. this has been so far the *.UCSC.knownGene using Entrez as gene identifiers.

given that UCSC has apparently discontinued the Known Genes track, i would suggest making another default gene annotation package available at the BioC site, but one based on Entrez identifiers given the amount of legacy code and documentation using Entrez in one way or another.

an alternative to translating the default Ensembl Gencode identifiers into Entrez would be to just take the NCBI RefSeq annotations as human gene annotation package available by default, i.e., replacing current *.UCSC.knownGene by *.UCSC.refGene

robert.
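For illustration only, here is a minimal base-R sketch of what translating Ensembl gene identifiers into Entrez identifiers can involve once one-to-many mappings show up. The mapping table and every identifier in it are made up, and the two policies shown are just two possible choices, not what any BioC package does.

```r
# Toy Ensembl-to-Entrez mapping table; all identifiers below are made up.
map <- data.frame(
  ensembl = c("ENSG_A", "ENSG_A", "ENSG_B", "ENSG_C"),
  entrez  = c("101", "102", "201", NA),
  stringsAsFactors = FALSE
)

# Drop Ensembl genes that have no Entrez ID at all.
map <- map[!is.na(map$entrez), ]

# Count how many Entrez IDs each remaining Ensembl gene maps to.
n_hits <- table(map$ensembl)

# Policy 1: keep only Ensembl genes with exactly one Entrez ID.
unique_only <- map[map$ensembl %in% names(n_hits)[n_hits == 1], ]

# Policy 2: keep the first Entrez ID listed for every Ensembl gene.
first_hit <- map[!duplicated(map$ensembl), ]
```

With real data the table would presumably come from something like AnnotationDbi::select() on org.Hs.eg.db with keytype "ENSEMBL" and column "ENTREZID"; which policy is appropriate depends on the downstream pipeline.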
Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC
On Mon, Jan 11, 2016 at 9:29 AM, Robert Castelo wrote:

> hi,
>
> if i'm interpreting this correctly, the news archive of the UCSC Genome
> Browser accessible here:
>
> http://genome.ucsc.edu/goldenPath/newsarch.html
>
> announced on June 29th, 2015, that they are discontinuing the generation
> of UCSC Known Genes annotations for human, and provide the Gencode
> annotations as default replacement.
>
> the BioC site provides as default gene annotations for human the UCSC
> Known Genes track and currently does not provide the Gencode annotations.
>
> the GenomicFeatures package allows one to build such an annotation
> package. unfortunately the current "supported" UCSC tables that can be
> easily used via 'makeTxDbPackageFromUCSC()' only go up to Gencode version
> V17:
>
> library(GenomicFeatures)
>
> xx <- supportedUCSCtables()
> xx[grep("GENCODE Genes", xx$track), ]
>                                              track subtrack
> wgEncodeGencodeBasicV17          GENCODE Genes V17
> wgEncodeGencodeCompV17           GENCODE Genes V17
> wgEncodeGencodePseudoGeneV17     GENCODE Genes V17
> wgEncodeGencode2wayConsPseudoV17 GENCODE Genes V17
> wgEncodeGencodePolyaV17          GENCODE Genes V17
> wgEncodeGencodeBasicV14          GENCODE Genes V14
> wgEncodeGencodeCompV14           GENCODE Genes V14
> wgEncodeGencodePseudoGeneV14     GENCODE Genes V14
> wgEncodeGencode2wayConsPseudoV14 GENCODE Genes V14
> wgEncodeGencodePolyaV14          GENCODE Genes V14
> wgEncodeGencodeBasicV7           GENCODE Genes V7
> wgEncodeGencodeCompV7            GENCODE Genes V7
> wgEncodeGencodePseudoGeneV7      GENCODE Genes V7
> wgEncodeGencode2wayConsPseudoV7  GENCODE Genes V7
> wgEncodeGencodePolyaV7           GENCODE Genes V7
>
> which is about 2 years old. current Gencode gene annotations are V24 and
> at least V22 was available at:
>
> http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database
>
> before the last BioC release.
>
> according to a recent announcement at the BioC support site:
>
> https://support.bioconductor.org/p/71574
>
> AnnotationHub now seems to be the proper way to import the most recent
> Gencode annotations into BioC. however, at least in my hands, making the
> corresponding TxDb object produces an error; see the following example:
>
> library(AnnotationHub)
>
> ah <- AnnotationHub()
> human_gff <- query(ah, c("Gencode", "gff", "human"))
>
> gencodeV23basicGFF <- ah[["AH49556"]]
> metadata <- data.frame(name=c("Data source", "Genome", "Organism",
>                               "Resource URL", "Full dataset"),
>                        value=c(ah["AH49556"]$dataprovider,
>                                ah["AH49556"]$genome,
>                                ah["AH49556"]$species,
>                                ah["AH49556"]$sourceurl, "no"))
> txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
> Error in .merge_transcript_parts(transcripts) :
>   The following transcripts have multiple parts that cannot be merged
>   because of incompatible seqnames: ENST00000244174.9,

should this be an error, or would a softer landing be more useful here? warn and exclude the offending elements, perhaps with an option to retrieve them through some special step (option or new function)?
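A softer landing of that kind could look roughly like the base-R sketch below. It does not touch GenomicFeatures internals: the exon table, the transcript names and the helper split_by_seqname_consistency() are all hypothetical, and the sketch only shows the "warn, exclude, but keep them retrievable" idea on toy data.

```r
# Toy exon table; transcript IDs and coordinates are made up.
exons <- data.frame(
  tx      = c("tx1", "tx1", "tx2", "tx2", "tx3"),
  seqname = c("chr1", "chr1", "chrX", "chrY", "chr2"),
  start   = c(100, 300, 500, 500, 900),
  stringsAsFactors = FALSE
)

# Hypothetical helper: separate transcripts confined to a single seqname
# from those spread over several, warning instead of stopping.
split_by_seqname_consistency <- function(exons) {
  # Number of distinct seqnames per transcript.
  n_seq <- tapply(exons$seqname, exons$tx, function(s) length(unique(s)))
  bad <- names(n_seq)[n_seq > 1]
  if (length(bad) > 0) {
    warning("excluding ", length(bad),
            " transcript(s) with parts on several seqnames: ",
            paste(bad, collapse = ", "))
  }
  # Return both sets so the excluded transcripts stay retrievable.
  list(kept    = exons[!exons$tx %in% bad, ],
       dropped = exons[exons$tx %in% bad, ])
}

res <- split_by_seqname_consistency(exons)
```

Returning the dropped rows alongside the kept ones is one way to offer the "special step" for getting at the excluded transcripts; a separate accessor function would be another.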
>   ENST00000262640.10, ENST00000286448.10, ENST00000302805.6,
>   ENST00000313871.7, ENST00000326153.8, ENST00000331035.8,
>   ENST00000334060.7, ENST00000334651.9, ENST00000355432.7,
>   ENST00000355805.6, ENST00000359512.7, ENST00000369423.6,
>   ENST00000381180.7, ENST00000381184.5, ENST00000381187.7,
>   ENST00000381192.7, ENST00000381218.7, ENST00000381222.6,
>   ENST00000381223.8, ENST00000381229.8, ENST00000381233.7,
>   ENST00000381241.7, ENST00000381261.7, ENST00000381297.8,
>   ENST00000381317.7, ENST00000381333.8, ENST00000381401.9,
>   ENST00000381469.6, ENST00000381500.5, ENST00000381509.7,
>   ENST00000381524.7, ENST00000381529.7, ENST00000381566.5,
>   ENST00000381567.7, ENST00000381575.5, ENST00000381578.5,
>   ENST00000381657.6, ENST00000381663.7, ENST00000390665.7,
>   ENST00000391707.6, ENST00000399012.5, ENST00000399966.8,
>   ENST00000400841.6, ENST00000411342.5, ENST00000412936
>
> on top of this, even if it worked, these annotations are anchored at
> Ensembl Gene identifiers while the gene-centric annotations at org.Hs.eg.db

Is it true that there is an asymmetry between Entrez gene ID and Ensembl gene ID for querying org.Hs.eg.db (I tend to prefer Homo.sapiens as a symbol mapping resource)? Both ENTREZID and ENSEMBL are listed as keytypes. My question is whether this "anchor" concept holds in the current infrastructure.

> are anchored at Entrez Gene identifiers. this means that more code would
> have to be involved to add the corresponding Entrez IDs (resolving
> multiplicities, etc.) and produce a TxDb package that can be used across
> many of the typical BioC pipelines.
>
> since human gene annotations are at the core of many BioC pipelines, i'd
> like to suggest for the forthcoming