Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

2016-01-11 Thread Rainer Johannes

Just as an FYI: EnsDb objects/packages (from the ensembldb package) provide
functionality similar to that of TxDb objects, are tailored to Ensembl
annotations, and can be built from the GTF files from Ensembl (which can be
fetched via AnnotationHub; it's all described in the ensembldb vignette).

cheers, jo
 

Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

2016-01-11 Thread Hervé Pagès

Hi Vince, Robert,

On 01/11/2016 07:07 AM, Vincent Carey wrote:

On Mon, Jan 11, 2016 at 9:29 AM, Robert Castelo 
wrote:

hi,

if i'm interpreting this correctly, the news archive of the UCSC Genome
Browser accessible here:

  http://genome.ucsc.edu/goldenPath/newsarch.html

announced on June 29th, 2015, that they are discontinuing the generation
of UCSC Known Genes annotations for human, and now provide the Gencode
annotations as the default replacement.

the BioC site provides as default gene annotations for human the UCSC
Known Genes track and currently does not provide the Gencode annotations.

the GenomicFeatures package allows one to build such an annotation
package. unfortunately the current "supported" UCSC tables that can be
easily used via 'makeTxDbPackageFromUCSC()' only go up to Gencode version
V17:

library(GenomicFeatures)

xx <- supportedUCSCtables()
xx[grep("GENCODE Genes", xx$track), ]
                                             track subtrack
wgEncodeGencodeBasicV17          GENCODE Genes V17
wgEncodeGencodeCompV17           GENCODE Genes V17
wgEncodeGencodePseudoGeneV17     GENCODE Genes V17
wgEncodeGencode2wayConsPseudoV17 GENCODE Genes V17
wgEncodeGencodePolyaV17          GENCODE Genes V17
wgEncodeGencodeBasicV14          GENCODE Genes V14
wgEncodeGencodeCompV14           GENCODE Genes V14
wgEncodeGencodePseudoGeneV14     GENCODE Genes V14
wgEncodeGencode2wayConsPseudoV14 GENCODE Genes V14
wgEncodeGencodePolyaV14          GENCODE Genes V14
wgEncodeGencodeBasicV7            GENCODE Genes V7
wgEncodeGencodeCompV7             GENCODE Genes V7
wgEncodeGencodePseudoGeneV7       GENCODE Genes V7
wgEncodeGencode2wayConsPseudoV7   GENCODE Genes V7
wgEncodeGencodePolyaV7            GENCODE Genes V7

which is about 2 years old. current Gencode gene annotations are V24 and
at least V22 was available at:

http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database

before the last BioC release.
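
How far the supported tables lag behind can be read off the table names themselves. The following is only a base-R sketch: it uses a hand-copied subset of the table names from the listing and the V24 figure quoted in this message, and it does not call GenomicFeatures at all.

```r
# Table names copied by hand from the supportedUCSCtables() listing (a subset).
tables <- c("wgEncodeGencodeBasicV17", "wgEncodeGencodeCompV17",
            "wgEncodeGencodeBasicV14", "wgEncodeGencodeBasicV7")

# Pull the trailing release number out of each table name.
version <- as.integer(sub("^.*V([0-9]+)$", "\\1", tables))

# Most recent release among the listed tables, and its distance from V24
# (the current release according to the message).
latest <- max(version)
lag <- 24L - latest
```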

according to a recent announcement at the BioC support site:

https://support.bioconductor.org/p/71574

AnnotationHub seems to be now the proper way to import the most recent
Gencode annotations into BioC. however, at least in my hands, making the
corresponding TxDb object produces an error; see the following example:

library(AnnotationHub)

ah <- AnnotationHub()
human_gff <- query(ah, c("Gencode", "gff", "human"))

gencodeV23basicGFF <- ah[["AH49556"]]
metadata <- data.frame(name=c("Data source", "Genome", "Organism",
                              "Resource URL", "Full dataset"),
                       value=c(ah["AH49556"]$dataprovider,
                               ah["AH49556"]$genome,
                               ah["AH49556"]$species,
                               ah["AH49556"]$sourceurl, "no"))
txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
Error in .merge_transcript_parts(transcripts) :
   The following transcripts have multiple parts that cannot be merged
   because of incompatible seqnames: ENST00000244174.9,

should this be an error, or would a softer landing be more useful here?
warn and exclude the offending elements, perhaps with an option
to retrieve them through some special step (option or new function)?
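
The "softer landing" asked about here might look roughly like the following base-R sketch. It is not how makeTxDbFromGRanges() works internally; the toy data frame, its column names and the helper drop_multi_seqname_tx() are all invented for illustration. The only assumption is that each transcript part carries a transcript name and a seqname.

```r
# Toy stand-in for transcript parts read from a GFF file; IDs, column names
# and coordinates are invented for illustration.
parts <- data.frame(
  tx_name  = c("txA", "txA", "txB", "txB", "txC"),
  seqnames = c("chr1", "chr1", "chrX", "chrY", "chr2"),
  start    = c(100L, 500L, 1000L, 1000L, 300L),
  end      = c(200L, 600L, 1200L, 1200L, 400L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: warn about transcripts whose parts sit on more than
# one sequence, drop them, and keep the dropped IDs in an attribute so they
# can still be retrieved afterwards.
drop_multi_seqname_tx <- function(parts) {
  n_seq <- tapply(parts$seqnames, parts$tx_name,
                  function(x) length(unique(x)))
  bad <- names(n_seq)[n_seq > 1L]
  if (length(bad) > 0L) {
    warning(length(bad), " transcript(s) dropped because their parts have ",
            "incompatible seqnames: ", paste(bad, collapse = ", "))
  }
  kept <- parts[!(parts$tx_name %in% bad), , drop = FALSE]
  attr(kept, "dropped") <- bad
  kept
}

# Warning suppressed here only to keep the example output quiet.
clean <- suppressWarnings(drop_multi_seqname_tx(parts))
unique(clean$tx_name)
attr(clean, "dropped")
```

Keeping the dropped IDs on the result is one way to offer the "special step" for retrieving them; a separate accessor function would work just as well.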

This was actually a bug in makeTxDbFromGRanges(). It's fixed in
GenomicFeatures 1.22.8 (release) and 1.23.17 (devel). With this
fix:

> txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: Gencode
# Genome: GRCh38
# Organism: Homo sapiens
# Resource URL: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_23/gencode.v23.basic.annotation.gff3.gz
# Full dataset: no
# transcript_nrow: 100769
# exon_nrow: 676601
# cds_nrow: 535301
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2016-01-11 22:28:51 -0800 (Mon, 11 Jan 2016)
# GenomicFeatures version at creation time: 1.23.17
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1

> transcripts(txdb)
GRanges object with 100769 ranges and 2 metadata columns:
           seqnames         ranges strand |     tx_id           tx_name
              <Rle>      <IRanges>  <Rle> | <integer>       <character>
       [1]     chr1 [11869, 14409]      + |         1 ENST00000456328.2
       [2]     chr1 [12010, 13670]      + |         2 ENST00000450305.2
       [3]     chr1 [29554, 31097]      + |         3 ENST00000473358.1
       [4]     chr1 [30267, 31109]      + |         4 ENST00000469289.1
       [5]     chr1 [30366, 30503]      + |         5 ENST00000607096.1
       ...      ...            ...    ... .       ...               ...
  [100765]     chrM [ 5826,  5891]      - |    100765 ENST00000387409.1
  [100766]     chrM [ 7446,  7514]      - |    100766 ENST00000387416.2
  [100767]     chrM [14149, 14673]      - |    100767 ENST00000361681.2
  [100768]     chrM [14674, 14742]      - |    100768 ENST00000387459.1
  [100769]     chrM [15956, 16023]      - |    100769 ENST00000387461.2
  -------

Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

2016-01-11 Thread Robert Castelo

that looks great, thanks Hervé for addressing this quickly.

robert.


Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

2016-01-11 Thread Hervé Pagès

With GenomicFeatures 1.23.16:

> txdb <- makeTxDbFromUCSC("hg38", "knownGene")
Download the knownGene table ... OK
Download the knownToLocusLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
  UCSC data anomaly in 19942 transcript(s): the cds cumulative length is
  not a multiple of 3 for transcripts ‘uc057axw.1’ ‘uc031pko.2’
  ‘uc057ayh.1’ ‘uc057ayr.1’ ‘uc057azl.1’ ‘uc057azn.1’ ‘uc057azp.1’
  ‘uc057azt.1’ ‘uc057azy.1’ ‘uc057bad.1’ ‘uc057bap.1’ ‘uc057bav.1’
  ‘uc057bay.1’ ‘uc057bbh.1’ ‘uc057bbv.1’ ‘uc057bcm.1’ ‘uc057bcp.1’
  ‘uc057bcw.1’ ‘uc057bcx.1’ ‘uc057bdf.1’ ‘uc057bdj.1’ ‘uc057bdk.1’
  ‘uc057bdl.1’ ‘uc057bdp.1’ ‘uc057beb.1’ ‘uc057beq.1’ ‘uc057bfn.1’
  ‘uc057bfs.1’ ‘uc057bgm.1’ ‘uc057bgo.1’ ‘uc057bgt.1’ ‘uc057bhd.1’
  ‘uc057bhi.1’ ‘uc057bio.1’ ‘uc057biu.1’ ‘uc057bji.1’ ‘uc057bjj.1’
  ‘uc057bjk.1’ ‘uc057bkd.1’ ‘uc057bkf.1’ ‘uc057bkj.1’ ‘uc057bkl.1’
  ‘uc057bkt.1’ ‘uc057bld.1’ ‘uc057bli.1’ ‘uc057blj.1’ ‘uc057blz.1’
  ‘uc057bmf.1’ ‘uc057bmr.1’ ‘uc057bmv.1’ ‘uc057bni.1’ ‘u [... truncated]

> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: hg38
# Organism: Homo sapiens
# Taxonomy ID: 9606
# UCSC Table: knownGene
# UCSC Track: GENCODE v22
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: Entrez Gene ID
# Full dataset: yes
# miRBase build ID: NA
# transcript_nrow: 195178
# exon_nrow: 575044
# cds_nrow: 291225
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2016-01-11 14:10:24 -0800 (Mon, 11 Jan 2016)
# GenomicFeatures version at creation time: 1.23.16
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1

Note the new "UCSC Track" field above.
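
The anomaly reported in the warning (a CDS whose cumulative length is not a multiple of 3) can be illustrated with a small base-R sketch. The data frame and its column names are made up for this example and are not the internals of makeTxDbFromUCSC(); coordinates are assumed to be 1-based closed intervals.

```r
# Toy CDS segments per transcript; names and coordinates are invented.
cds <- data.frame(
  tx_name   = c("tx1", "tx1", "tx2", "tx2"),
  cds_start = c(101L, 301L, 1001L, 1501L),
  cds_end   = c(190L, 360L, 1100L, 1551L),
  stringsAsFactors = FALSE
)

# Width of each segment, then the cumulative CDS length per transcript.
cds$width <- cds$cds_end - cds$cds_start + 1L
cum_len <- tapply(cds$width, cds$tx_name, sum)

# Transcripts whose cumulative CDS length is not a multiple of 3.
anomalous <- names(cum_len)[cum_len %% 3L != 0L]
anomalous
```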

Cheers,
H.


Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

2016-01-11 Thread Hervé Pagès


Hi Robert and others,

I looked at this and the new situation doesn't seem as disruptive as
it sounds. The bulk of the data for both tracks (i.e. the "UCSC Genes"
track for hg19 and the "GENCODE v22" track for hg38) is stored in the
knownGene table.

The hg19.knownGene table is described here:


https://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene&hgta_doSchema=describe+table+schema

The hg38.knownGene table is described here:


https://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene&hgta_doSchema=describe+table+schema

The 2 pages are very similar. In particular both tables are connected
to the knownToLocusLink table where Entrez Gene IDs are stored.
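
Conceptually, the link between the two tables is a join on the transcript name. The base-R sketch below uses invented rows; the column names 'name' and 'value' for knownToLocusLink are an assumption about the UCSC schema, and only a few knownGene columns are shown.

```r
# Toy rows mimicking the two UCSC tables; all values are invented.
knownGene <- data.frame(
  name   = c("uc_a.1", "uc_b.1", "uc_c.1"),
  chrom  = c("chr1", "chr1", "chr2"),
  strand = c("+", "-", "+"),
  stringsAsFactors = FALSE
)
knownToLocusLink <- data.frame(
  name  = c("uc_a.1", "uc_c.1"),
  value = c("1001", "2002"),   # Entrez Gene IDs (made up)
  stringsAsFactors = FALSE
)

# Left join on the transcript name: transcripts without an Entrez Gene ID
# are kept and get NA as their gene ID.
tx2gene <- merge(knownGene, knownToLocusLink, by = "name", all.x = TRUE)
names(tx2gene)[names(tx2gene) == "value"] <- "gene_id"
tx2gene
```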

So from a makeTxDbFromUCSC() point of view everything looks the same
except the name of the track: "UCSC Genes" for hg19 and "GENCODE v22"
for hg38. That means it shouldn't be hard to tweak makeTxDbFromUCSC()
to support:

txdb <- makeTxDbFromUCSC("hg38", "knownGene")

The returned 'txdb' will contain data from the "GENCODE v22" track
and with transcripts mapped to Entrez Gene IDs.

I'll work on this and will also investigate makeTxDbFromGRanges's
failure on AnnotationHub's GFF files from GENCODE.

H.


On 01/11/2016 06:29 AM, Robert Castelo wrote:

hi,

if i'm interpreting this correctly, the news archive of the UCSC Genome
Browser accessible here:

  http://genome.ucsc.edu/goldenPath/newsarch.html

announced on June 29th, 2015, that they are discontinuing the generation
of UCSC Known Genes annotations for human, and now provide the Gencode
annotations as the default replacement.

the BioC site provides as default gene annotations for human the UCSC
Known Genes track and currently does not provide the Gencode annotations.

the GenomicFeatures package allows one to build such an annotation
package. unfortunately the current "supported" UCSC tables that can be
easily used via 'makeTxDbPackageFromUCSC()' only go up to Gencode
version V17:

library(GenomicFeatures)

xx <- supportedUCSCtables()
xx[grep("GENCODE Genes", xx$track), ]
                                             track subtrack
wgEncodeGencodeBasicV17          GENCODE Genes V17
wgEncodeGencodeCompV17           GENCODE Genes V17
wgEncodeGencodePseudoGeneV17     GENCODE Genes V17
wgEncodeGencode2wayConsPseudoV17 GENCODE Genes V17
wgEncodeGencodePolyaV17          GENCODE Genes V17
wgEncodeGencodeBasicV14          GENCODE Genes V14
wgEncodeGencodeCompV14           GENCODE Genes V14
wgEncodeGencodePseudoGeneV14     GENCODE Genes V14
wgEncodeGencode2wayConsPseudoV14 GENCODE Genes V14
wgEncodeGencodePolyaV14          GENCODE Genes V14
wgEncodeGencodeBasicV7            GENCODE Genes V7
wgEncodeGencodeCompV7             GENCODE Genes V7
wgEncodeGencodePseudoGeneV7       GENCODE Genes V7
wgEncodeGencode2wayConsPseudoV7   GENCODE Genes V7
wgEncodeGencodePolyaV7            GENCODE Genes V7

which is about 2 years old. current Gencode gene annotations are V24 and
at least V22 was available at:

http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database

before the last BioC release.

according to a recent announcement at the BioC support site:

https://support.bioconductor.org/p/71574

AnnotationHub seems to be now the proper way to import the most recent
Gencode annotations into BioC. however, at least in my hands, making the
corresponding TxDb object produces an error; see the following example:

library(AnnotationHub)

ah <- AnnotationHub()
human_gff <- query(ah, c("Gencode", "gff", "human"))

gencodeV23basicGFF <- ah[["AH49556"]]
metadata <- data.frame(name=c("Data source", "Genome", "Organism",
                              "Resource URL", "Full dataset"),
                       value=c(ah["AH49556"]$dataprovider,
                               ah["AH49556"]$genome,
                               ah["AH49556"]$species,
                               ah["AH49556"]$sourceurl, "no"))
txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
Error in .merge_transcript_parts(transcripts) :
   The following transcripts have multiple parts that cannot be merged
   because of incompatible seqnames: ENST00000244174.9,
   ENST00000262640.10, ENST00000286448.10, ENST00000302805.6,
   ENST00000313871.7, ENST00000326153.8, ENST00000331035.8,
   ENST00000334060.7, ENST00000334651.9, ENST00000355432.7,
   ENST00000355805.6, ENST00000359512.7, ENST00000369423.6,
   ENST00000381180.7, ENST00000381184.5, ENST00000381187.7,
   ENST00000381192.7, ENST00000381218.7, ENST00000381222.6,
   ENST00000381223.8, ENST00000381229.8, ENST00000381233.7,
   ENST00000381241.7, ENST00000381261.7, ENST00000381297.8,
   ENST00000381317.7, ENST00000381333.8, ENST00000381401.9,
   ENST00000381469.6, ENST00000381500.5, ENST00000381509.7,
   ENST00000381524.7, ENST00000381529.7, ENST00000381566.5,
   ENST00000381567.7, ENST00000381575.5, ENST00000381578.5,
   ENST00000381657.6, ENST00000381663.7, ENST00000390665.7,
   ENST00000391707.6, ENST0

Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

2016-01-11 Thread Paul Grosu


Tim, you always crack me up! :)  I totally agree, and it would probably be
good to also have the tools enabled to download directly from Ensembl, NCBI,
cloud-annotation source, etc. and build/update the AnnDbBimap objects.  This
way the annotation sources can maintain the data and we can maintain the
scripts, including the pre-built AnnDbBimap objects just in case.

~p

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

2016-01-11 Thread Tim Triche, Jr.

ENSEMBL

knownGene was always a disaster.  For extra amusement/horror, be sure to
check out the sad saga of the TCGA GAF and its disconnection from
knownGenes as well as reality.  Three cheers for rendering transcript-level
estimates useless (and no this was not Katie's fault)

Rainer and many others have made a herculean effort to bring all the BioC
annotation infrastructure into the 21st century... having worked with
Kallisto extensively of late, I see no reason to use a non-ENSEMBL
"conservative" reference transcriptome (I see plenty of reasons to use
miTranscriptome, etc. but that is another discussion).

sorry if slighting anyone/everyone, but ENSEMBL is the clear choice IMHO.

$0.02 - transmission costs


--t

On Mon, Jan 11, 2016 at 10:57 AM, Vincent Carey 
wrote:

> I think these are all good observations and we may benefit from a wider
> discussion on the support site?
>
> the abandonment of knownGene seems to have clear implications for changing
> our most visible txdb examples.  what should we change to?  can we make a
> more future-proof design for these annotation selections?
>
> On Mon, Jan 11, 2016 at 1:40 PM, Robert Castelo 
> wrote:
>
> > hi,
> >
> > On 01/11/2016 04:07 PM, Vincent Carey wrote:
> > [...]
> >
> >> Is it true that there is an asymmetry between Entrez gene ID and Ensembl
> >> gene ID for querying org.Hs.eg.db (I tend to prefer Homo.sapiens
> >> as a symbol mapping resource)?  Both ENTREZID and ENSEMBL are listed as
> >> keytypes.  My question is whether this "anchor" concept
> >> holds in the current infrastructure.
> >>
> >
> > you're right that the infrastructure is probably symmetric at least
> > between Entrez and Ensembl, so maybe i'm not using the term "anchor"
> > correctly here, i'm just referring to the fact that many package functions
> > and use cases of BioC are based on, or illustrated using, Entrez IDs.
> > examples are:
> >
> > head(org.Hs.eg.db::keys(org.Hs.eg.db))
> > [1] "1"  "2"  "3"  "9"  "10" "11"
> >
> > i.e., by default the 'keytype' is 'ENTREZID'
> >
> > genefilter::nsFilter() argument 'require.entrez' filters out features
> > without an Entrez Gene ID annotation.
> >
> > Category::categoryToEntrezBuilder() returns a list mapping category ids to
> > the Entrez Gene ids annotated at the category id.
> >
> > SummarizedExperiment::geneRangeMapper() takes a 'TxDb' object and a
> > keytype to map ranges to genes. By default the keytype is 'ENTREZID'
> >
> > some of the workflows are also based on Entrez IDs, such as:
> >
> >
> > http://www.bioconductor.org/help/workflows/annotation/Annotation_Resources
> >
> > http://www.bioconductor.org/help/workflows/variants
> >
> > so if the user just replaces the txdb object in one of those examples or
> > function arguments by a txdb object that does not have Entrez identifiers
> > as primary gene key, those functions, examples or workflows will require
> > modification. this is not necessarily bad, but may put more burden on the
> > user who is learning with a "default" TxDb human gene annotation package.
> > this has been so far the *.UCSC.knownGene using Entrez as gene identifiers.
> > given the apparent discontinuation of the known gene track by UCSC, i would
> > suggest making another default gene annotation package available at the
> > BioC site, but one based on Entrez identifiers, given the amount of
> > legacy code and documentation using Entrez in one way or another.
> >
> > an alternative to translating the default Ensembl Gencode identifiers into
> > Entrez would be to just take the NCBI RefSeq annotations as the human gene
> > annotation package available by default, i.e., replacing the current
> > *.UCSC.knownGene by *.UCSC.refGene
> >
> >
> >
> > robert.
> >

Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

2016-01-11 Thread Vincent Carey

I think these are all good observations and we may benefit from a wider
discussion on the support site?

the abandonment of knownGene seems to have clear implications for changing
our most visible txdb
examples.  what should we change to?  can we make a more future-proof
design for these annotation selections?

On Mon, Jan 11, 2016 at 1:40 PM, Robert Castelo 
wrote:

> hi,
>
> On 01/11/2016 04:07 PM, Vincent Carey wrote:
> [...]
>
>> Is it true that there is an asymmetry between Entrez gene ID and Ensembl
>> gene ID for querying org.Hs.eg.db (I tend to prefer Homo.sapiens
>> as a symbol mapping resource)?  Both ENTREZID and ENSEMBL are listed as
>> keytypes.  My question is whether this "anchor" concept
>> holds in the current infrastructure.
>>
>
> you're right that the infrastructure is probably symmetric at least
> between Entrez and Ensembl, so maybe i'm not using the term "anchor"
> correctly here, i'm just referring to the fact that many package functions
> and use cases of BioC are based on, or illustrated with, Entrez IDs.
> examples are:
>
> head(org.Hs.eg.db::keys(org.Hs.eg.db))
> [1] "1"  "2"  "3"  "9"  "10" "11"
>
> i.e., by default the 'keytype' is 'ENTREZID'
>
> genefilter::nsFilter() argument 'require.entrez' filters out features
> without an Entrez Gene ID annotation.
>
> Category::categoryToEntrezBuilder() returns a list mapping category ids to
> the Entrez Gene ids annotated at the category id.
>
> SummarizedExperiment::geneRangeMapper() takes a 'TxDb' object and a
> keytype to map ranges to genes. By default the keytype is 'ENTREZID'
>
> some of the workflows are also based on Entrez IDs, such as:
>
> http://www.bioconductor.org/help/workflows/annotation/Annotation_Resources
>
> http://www.bioconductor.org/help/workflows/variants
>
> so if the user just replaces the txdb object in one of those examples or
> argument functions by a txdb object that does not have Entrez identifiers
> as primary gene key, those functions, examples or workflows will require
> modification. this is not necessarily bad, but may put more burden on the
> user who is learning with a "default" TxDb human gene annotation package.
> this has been so far the *.UCSC.knownGene using Entrez as gene identifiers.
> given the apparent discontinuation of the known gene track at UCSC, i would
> suggest making available at the BioC site another default gene annotation
> package, but then one based on Entrez identifiers given the amount of
> legacy code and documentation using Entrez in one way or another.
>
> an alternative to translating the default Ensembl Gencode identifiers into
> Entrez would be to just take the NCBI RefSeq annotations as human gene
> annotation package available by default, i.e., replacing current
> *.UCSC.knownGene by *.UCSC.refGene
>
>
>
> robert.
>


Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

2016-01-11 Thread Robert Castelo


hi,

On 01/11/2016 04:07 PM, Vincent Carey wrote:
[...]

Is it true that there is an asymmetry between Entrez gene ID and Ensembl
gene ID for querying org.Hs.eg.db (I tend to prefer Homo.sapiens
as a symbol mapping resource)?  Both ENTREZID and ENSEMBL are listed as
keytypes.  My question is whether this "anchor" concept
holds in the current infrastructure.


you're right that the infrastructure is probably symmetric at least 
between Entrez and Ensembl, so maybe i'm not using the term "anchor" 
correctly here, i'm just referring to the fact that many package
functions and use cases of BioC are based on, or illustrated with,
Entrez IDs. examples are:


head(org.Hs.eg.db::keys(org.Hs.eg.db))
[1] "1"  "2"  "3"  "9"  "10" "11"

i.e., by default the 'keytype' is 'ENTREZID'

genefilter::nsFilter() argument 'require.entrez' filters out features 
without an Entrez Gene ID annotation.


Category::categoryToEntrezBuilder() returns a list mapping category ids
to the Entrez Gene ids annotated at the category id.


SummarizedExperiment::geneRangeMapper() takes a 'TxDb' object and a 
keytype to map ranges to genes. By default the keytype is 'ENTREZID'


some of the workflows are also based on Entrez IDs, such as:

http://www.bioconductor.org/help/workflows/annotation/Annotation_Resources

http://www.bioconductor.org/help/workflows/variants

so if the user just replaces the txdb object in one of those examples or 
argument functions by a txdb object that does not have Entrez 
identifiers as primary gene key, those functions, examples or workflows 
will require modification. this is not necessarily bad, but may put more 
burden on the user who is learning with a "default" TxDb human gene 
annotation package. this has been so far the *.UCSC.knownGene using
Entrez as gene identifiers. given the apparent discontinuation of the
known gene track at UCSC, i would suggest making available at the BioC
site another default gene annotation package, but then one based on
Entrez identifiers given the amount of legacy code and documentation
using Entrez in one way or another.


an alternative to translating the default Ensembl Gencode identifiers 
into Entrez would be to just take the NCBI RefSeq annotations as human 
gene annotation package available by default, i.e., replacing current 
*.UCSC.knownGene by *.UCSC.refGene




robert.


Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

2016-01-11 Thread Vincent Carey

On Mon, Jan 11, 2016 at 9:29 AM, Robert Castelo 
wrote:

> hi,
>
> if i'm interpreting this correctly, the news archive of the UCSC Genome
> Browser accessible here:
>
>  http://genome.ucsc.edu/goldenPath/newsarch.html
>
> announced on June 29th, 2015, that they are discontinuing the generation
> of UCSC Known Genes annotations for human, and provide the Gencode
> annotations as default replacement.
>
> the BioC site provides as default gene annotations for human the UCSC
> Known Genes track and currently does not provide the Gencode annotations.
>
> the GenomicFeatures package allows one to build such an annotation
> package. unfortunately the current "supported" UCSC tables that can be
> easily used via 'makeTxDbPackageFromUCSC()' reports up to Gencode version
> V17:
>
> library(GenomicFeatures)
>
> xx <- supportedUCSCtables()
> xx[grep("GENCODE Genes", xx$track), ]
>                                              track subtrack
> wgEncodeGencodeBasicV17          GENCODE Genes V17
> wgEncodeGencodeCompV17           GENCODE Genes V17
> wgEncodeGencodePseudoGeneV17     GENCODE Genes V17
> wgEncodeGencode2wayConsPseudoV17 GENCODE Genes V17
> wgEncodeGencodePolyaV17          GENCODE Genes V17
> wgEncodeGencodeBasicV14          GENCODE Genes V14
> wgEncodeGencodeCompV14           GENCODE Genes V14
> wgEncodeGencodePseudoGeneV14     GENCODE Genes V14
> wgEncodeGencode2wayConsPseudoV14 GENCODE Genes V14
> wgEncodeGencodePolyaV14          GENCODE Genes V14
> wgEncodeGencodeBasicV7            GENCODE Genes V7
> wgEncodeGencodeCompV7             GENCODE Genes V7
> wgEncodeGencodePseudoGeneV7       GENCODE Genes V7
> wgEncodeGencode2wayConsPseudoV7   GENCODE Genes V7
> wgEncodeGencodePolyaV7            GENCODE Genes V7
>
> which is about 2 years old. current Gencode gene annotations are V24 and
> at least V22 was available at:
>
> http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database
>
> before the last BioC release.
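
A small aside on checking which GENCODE releases such a listing covers: the version can be parsed from the table names. This is only a sketch in base R. The helper name gencode_versions() is made up for illustration, and the naming pattern "wgEncodeGencode...V<number>" is assumed from the listing quoted above.

```r
# Hypothetical helper: extract GENCODE release numbers from UCSC table names.
# Names that do not follow the assumed wgEncodeGencode...V<number> pattern
# are ignored.
gencode_versions <- function(table_names) {
  pattern <- "^wgEncodeGencode.*V([0-9]+)$"
  hits <- grepl(pattern, table_names)
  versions <- as.integer(sub(pattern, "\\1", table_names[hits]))
  sort(unique(versions))
}

# Toy input taken from the listing above, plus one non-GENCODE name.
tables <- c("wgEncodeGencodeBasicV17", "wgEncodeGencodeCompV14",
            "wgEncodeGencodePolyaV7", "knownGene")
versions <- gencode_versions(tables)
```

With the quoted listing the table names were the row names of the supportedUCSCtables() result, so rownames(xx) could be passed in; that layout may differ in other GenomicFeatures versions.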
>
> according to a recent announcement at the BioC support site:
>
> https://support.bioconductor.org/p/71574
>
> AnnotationHub seems to be now the proper way to import the most recent
> Gencode annotations into BioC. however, at least in my hands, making the
> corresponding TxDb object produces an error; see the following example:
>
> library(AnnotationHub)
>
> ah <- AnnotationHub()
> human_gff <- query(ah, c("Gencode", "gff", "human"))
>
> gencodeV23basicGFF <- ah[["AH49556"]]
> metadata <- data.frame(name=c("Data source", "Genome", "Organism",
>                               "Resource URL", "Full dataset"),
>                        value=c(ah["AH49556"]$dataprovider,
>                                ah["AH49556"]$genome,
>                                ah["AH49556"]$species,
>                                ah["AH49556"]$sourceurl, "no"))
> txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
> Error in .merge_transcript_parts(transcripts) :
>   The following transcripts have multiple parts that cannot be merged
> because of incompatible seqnames: ENST00000244174.9,
>

should this be an error, or would a softer landing be more useful here?
warn and exclude the offending elements, perhaps with an option
to retrieve them through some special step (option or new function)?


>   ENST00000262640.10, ENST00000286448.10, ENST00000302805.6,
> ENST00000313871.7, ENST00000326153.8, ENST00000331035.8,
>   ENST00000334060.7, ENST00000334651.9, ENST00000355432.7,
> ENST00000355805.6, ENST00000359512.7, ENST00000369423.6,
>   ENST00000381180.7, ENST00000381184.5, ENST00000381187.7,
> ENST00000381192.7, ENST00000381218.7, ENST00000381222.6,
>   ENST00000381223.8, ENST00000381229.8, ENST00000381233.7,
> ENST00000381241.7, ENST00000381261.7, ENST00000381297.8,
>   ENST00000381317.7, ENST00000381333.8, ENST00000381401.9,
> ENST00000381469.6, ENST00000381500.5, ENST00000381509.7,
>   ENST00000381524.7, ENST00000381529.7, ENST00000381566.5,
> ENST00000381567.7, ENST00000381575.5, ENST00000381578.5,
>   ENST00000381657.6, ENST00000381663.7, ENST00000390665.7,
> ENST00000391707.6, ENST00000399012.5, ENST00000399966.8,
>   ENST00000400841.6, ENST00000411342.5, ENST00000412936
>
>
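
To make the "warn and exclude" idea concrete, here is a rough sketch of how the offending transcripts could be identified before the TxDb is built. It is base R only and works on two plain vectors. The helper name multi_seqname_ids() is hypothetical, and it is an assumption that the imported GRanges carries a transcript_id metadata column. Whether makeTxDbFromGRanges() then succeeds on the filtered object has not been checked here.

```r
# Hypothetical helper: IDs of transcripts whose features sit on more than one
# sequence (for example the same ID reported on both chrX and chrY).
multi_seqname_ids <- function(seqname, tx_id) {
  ok <- !is.na(tx_id)                      # rows without a transcript ID are skipped
  groups <- split(seqname[ok], tx_id[ok])  # seqnames grouped by transcript ID
  n_seq <- vapply(groups, function(s) length(unique(s)), integer(1))
  as.character(names(n_seq)[n_seq > 1L])
}

# Toy data: "tx1" appears on two sequences, "tx2" on one, one row has no ID.
seqname <- c("chrX", "chrY", "chr1", "chr1", "chr2")
tx_id   <- c("tx1",  "tx1",  "tx2",  "tx2",  NA)

bad <- multi_seqname_ids(seqname, tx_id)
if (length(bad) > 0L)
  message(length(bad), " transcript(s) on several seqnames would be excluded")
keep <- is.na(tx_id) | !(tx_id %in% bad)

# With a GRanges 'gr' (assumed to have a transcript_id column) this would be:
#   bad <- multi_seqname_ids(as.character(seqnames(gr)), gr$transcript_id)
#   gr_ok <- gr[is.na(gr$transcript_id) | !(gr$transcript_id %in% bad)]
```

Returning the excluded IDs separately leaves room for the "special step" mentioned above, since the caller can still inspect or recover them.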
> on top of this, even if it would work, these annotations are anchored at
> Ensembl Gene identifiers while the gene-centric annotations at org.Hs.eg.db


Is it true that there is an asymmetry between Entrez gene ID and Ensembl
gene ID for querying org.Hs.eg.db (I tend to prefer Homo.sapiens
as a symbol mapping resource)?  Both ENTREZID and ENSEMBL are listed as
keytypes.  My question is whether this "anchor" concept
holds in the current infrastructure.

> are anchored at Entrez Gene identifiers. this means that more code would
> have to be involved to add the corresponding Entrez IDs (resolving
> multiplicities, etc.) and produce a TxDb package that can be used across
> many of the typical BioC pipelines.
>
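
One possible reading of "resolving multiplicities" is to keep only the Ensembl/Entrez pairs that are unique in both directions. The sketch below does that on a plain data frame with ENSEMBL and ENTREZID columns. The helper name one_to_one() and the toy IDs are made up for illustration, and dropping ambiguous pairs is just one policy among several.

```r
# Hypothetical helper: keep only Ensembl <-> Entrez pairs that are unique in
# both directions; unmapped and ambiguous rows are dropped.
one_to_one <- function(map) {
  map <- unique(map[!is.na(map$ENSEMBL) & !is.na(map$ENTREZID), , drop = FALSE])
  is_dup <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)
  map[!is_dup(map$ENSEMBL) & !is_dup(map$ENTREZID), , drop = FALSE]
}

# Toy mapping (IDs are placeholders, not real genes):
#   ENSG_B maps to two Entrez IDs, ENSG_C and ENSG_D share one, ENSG_E has none.
map <- data.frame(
  ENSEMBL  = c("ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C", "ENSG_D", "ENSG_E"),
  ENTREZID = c("1",      "2",      "3",      "4",      "4",      NA),
  stringsAsFactors = FALSE
)
clean <- one_to_one(map)

# In practice the mapping could come from something like
#   AnnotationDbi::select(org.Hs.eg.db, keys = ens_ids,
#                         keytype = "ENSEMBL", columns = "ENTREZID")
# where 'ens_ids' are gene IDs with the GENCODE version suffix removed.
```
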
> since human gene annotations are at the core of many BioC pipelines, i'd
> like to suggest for the forthcoming
