Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

2015-06-11 Thread Rainer Johannes
update: this approach bases on Martin's suggestion and I think is ideal for 
EnsDb or any Ensembl based annotation packages. It's not required to install an 
explicit genome package, but load (and eventually cache) dynamically the 
correct genomic sequence from Ensembl via AnnotationHub. That way chromosome 
names, variants etc are perfectly aligned. Also, this guarantees that the 
genome version and gene definitions, both from Ensembl match.

Example code (is now also included in the ensembldb vignette):

  library(AnnotationHub)
  library(EnsDb.Hsapiens.v75)
  library(Rsamtools)
  ah - AnnotationHub()

  edb - EnsDb.Hsapiens.v75

  ## get the Ensembl version
  eVersion - metadata(edb)[metadata(edb)$name==ensembl_version, value]
  ## query all available files for the Ensembl version
  eData - query(ah, c(organism(edb), paste0(release-, eVersion)))
  eData

  ## retrieve the *dna.toplevel.fa file; this might take some time.
  Dna - ah[[AH20439]]
  ## generate an index if none is available
  if(is.na(index(Dna))){
  indexFa(Dna)
  Dna - FaFile(path(Dna))
  }

  ## get start/end coordinates of all genes
  genes - genes(edb)
  ## subset to all genes that are encoded on chromosomes for which
  ## we do have DNA sequence available.
  genes - genes[seqnames(genes) %in% seqnames(seqinfo(Dna))]
  ## get the gene sequences, i.e. the sequence including the sequence of
  ## all of the gene's exons and introns
  geneSeqs - getSeq(Dna, genes)

cheers, jo

On 09 Jun 2015, at 11:38, Rainer Johannes 
johannes.rai...@eurac.edumailto:johannes.rai...@eurac.edu wrote:

dear Ludwig,

On 09 Jun 2015, at 10:46, Ludwig Geistlinger 
ludwig.geistlin...@bio.ifi.lmu.demailto:ludwig.geistlin...@bio.ifi.lmu.de 
wrote:

Dear Johannes,

Thx for providing the great EnsDb packages!

One question:

As of now, I am able to choose between TxDb and EnsDb for genomic
coordinates of genomic features such as genes, transcripts, and exons.
For the sequences themselves I need the corresponding BSgenome package.

While it is easy to automatically map from a specific TxDb package (eg
TxDb.Hsapiens.UCSC.hg38.knownGene) to the corresponding BSgenome package
(here: BSgenome.Hsapiens.UCSC.hg38), I wonder how to do that for an
EnsDb package as the package name (eg EnsDb.Hsapiens.v79) contains no
information about the genome build.

A cumbersome option would be to extract the genome_build from the metadata
of the EnsDb package (which would give me for EnsDb.Hsapiens.v79:
'GRCh38') and then ask all existing BSgenome.Hsapiens packages for their
metadata release name (eg 'GRCh38' for BSgenome.Hsapiens.UCSC.hg38).

This however needs all BSgenome.Hsapiens packages installed and takes thus
too much time and space for a programmatic access.

Can you suggest a better way to map from coordinates to sequence (within
the BioC annotation functionality)?


agree, there's no easy mapping (yet). I'll implement a method 
suggestGenomePackage in the ensembldb package. In the long run I hope that 
also NCBI BSgenome packages (like the BSgenome.Hsapiens.NCBI.GRCh38) will 
become available for all species... that would make the mapping much easier...

cheers, jo

Thanks  Best,
Ludwig





dear Robert and Ludwig,

the EnsDb packages provide all the gene/transcript etc annotations for all
genes defined in the Ensembl database (for a given species and Ensembl
release). Except the column/attribute entrezid that is stored in the
internal database there is however no link to NCBI or UCSC annotations.
So, basically, if you want to use pure Ensembl based annotations: use
EnsDb, if you want to have the UCSC annotations: use the TxDb packages.

In case you need EnsDbs of other species or Ensembl versions, the
ensembldb package provides functionality to generate such packages either
using the Ensembl Perl API or using GTF files provided by Ensembl. If you
have problems building the packages, just drop me a line and I'll do
that.

cheers, jo

On 03 Jun 2015, at 15:56, Robert M. Flight 
rfligh...@gmail.commailto:rfligh...@gmail.com wrote:

Ludwig,

If you do this search on the UCSC genome browser (which this annotation
package is built from), you will see that the longest variant is what
is
shown

http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammalorg=Humandb=hg38position=brca1hgt.positionInput=brca1hgt.suggestTrack=knownGeneSubmit=submithgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAzpix=1885

If instead of genes you do transcripts, you will see 20 different
transcripts for this gene, including the one listed by NCBI.

I havent tried it yet (haven't upgraded R or bioconductor to latest
version), but there is now an Ensembl based annotation package as well,
that may work better??
http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html

-Robert



On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger 
ludwig.geistlin...@bio.ifi.lmu.demailto:ludwig.geistlin...@bio.ifi.lmu.de 
wrote:

Dear Bioc annotation team,

Querying TxDb.Hsapiens.UCSC.hg38.knownGene for 

Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

2015-06-10 Thread Ludwig Geistlinger
Dear Johannes,

one follow-up question/comment on the EnsDb packages:

The reason they escaped my notice (and thus potentially will also others)
is that I expected such packages to be named ^TxDb

What actually argues against sticking to existing Bioc vocabulary and
naming eg EnsDb.Hsapiens.v79

TxDb.Hsapiens.Ensembl.hg38.ensGene

(or alternatively, if packages like BSgenome.Hsapiens.NCBI.GRCh38 will
indeed make it in the long run:  TxDb.Hsapiens.Ensembl.GRCh38.ensGene)

This would also have the advantage that genome build and idType could be
inferred right from the package name.

Best,
Ludwig


 dear Robert and Ludwig,

 the EnsDb packages provide all the gene/transcript etc annotations for all
 genes defined in the Ensembl database (for a given species and Ensembl
 release). Except the column/attribute entrezid that is stored in the
 internal database there is however no link to NCBI or UCSC annotations.
 So, basically, if you want to use pure Ensembl based annotations: use
 EnsDb, if you want to have the UCSC annotations: use the TxDb packages.

 In case you need EnsDbs of other species or Ensembl versions, the
 ensembldb package provides functionality to generate such packages either
 using the Ensembl Perl API or using GTF files provided by Ensembl. If you
 have problems building the packages, just drop me a line and I'll do
 that.

 cheers, jo

 On 03 Jun 2015, at 15:56, Robert M. Flight rfligh...@gmail.com wrote:

 Ludwig,

 If you do this search on the UCSC genome browser (which this annotation
 package is built from), you will see that the longest variant is what
 is
 shown

 http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammalorg=Humandb=hg38position=brca1hgt.positionInput=brca1hgt.suggestTrack=knownGeneSubmit=submithgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAzpix=1885

 If instead of genes you do transcripts, you will see 20 different
 transcripts for this gene, including the one listed by NCBI.

 I havent tried it yet (haven't upgraded R or bioconductor to latest
 version), but there is now an Ensembl based annotation package as well,
 that may work better??
 http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html

 -Robert



 On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger 
 ludwig.geistlin...@bio.ifi.lmu.de wrote:

 Dear Bioc annotation team,

 Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g.
 for

 BRCA1; ENSG0012048; entrez:672

 via

 genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id=672))

 gives me:

 GRanges object with 1 range and 1 metadata column:
  seqnames   ranges strand | gene_id
 RleIRanges  Rle | character
  672chr17 [43044295, 43170403]  - | 672
  ---
  seqinfo: 455 sequences (1 circular) from hg38 genome


 However, querying Ensembl and NCBI Gene
 http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG0012048
 http://www.ncbi.nlm.nih.gov/gene/672

 the gene is located at (note the difference in the end position)

 Chromosome 17: 43,044,295-43,125,483 reverse strand


 How is the inconsistency explained and how to extract an ENSEMBL/NCBI
 conform annotation from the TxDb object?
 (I am aware of biomaRt, but I want to explicitely use the Bioc
 annotation
 functionality).

 Thanks!
 Ludwig


 --
 Dipl.-Bioinf. Ludwig Geistlinger

 Lehr- und Forschungseinheit für Bioinformatik
 Institut für Informatik
 Ludwig-Maximilians-Universität München
 Amalienstrasse 17, 2. Stock, Büro A201
 80333 München

 Tel.: 089-2180-4067
 eMail: ludwig.geistlin...@bio.ifi.lmu.de

 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel


  [[alternative HTML version deleted]]

 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

2015-06-10 Thread Rainer Johannes
update:

seems something is strange with the index... when I load the ensembl-77 DNA for 
mouse I get two files, the fasta file and the index and I can extract the 
sequences, while in the example below I just got the fasta file

On 10 Jun 2015, at 10:11, Rainer Johannes 
johannes.rai...@eurac.edumailto:johannes.rai...@eurac.edu wrote:

Dear Martin,

the AnnotationHub approach looks awesome! However, somehow it does not work for 
me, I always get an error:

library(AnnotationHub)
library(Rsamtools)
library(GenomicFeatures)
ah - AnnotationHub()

## somehow I don't see DNA sequences for release-80... thus using 75
query(ah, c(Takifugu, release-75))

## load the gtf and the dna.toplevel.fa
gtf - ah[[AH10717]]
dna - ah[[AH20637]]

## create txdb and get exons:
txdb - makeTxDbFromGRanges(gtf)
exons - exons(txdb)

## getting the sequences
getSeq(dna, exons)

and I get the following error:
Error in value[[3L]](cond) : 'open' index failed
  file: /Users/jo/~/.AnnotationHub/24732
In addition: Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
  [fai_load] fail to open FASTA index.

## creating the index
indexFa(dna)

## trying again:
getSeq(dna, exons)
Error in value[[3L]](cond) :  record 1 (MT:1-68) was truncated
  file: /Users/jo/~/.AnnotationHub/24732

This one is from the current R-devel, but I get the same error with the stable 
version... any idea?

cheers, jo


my session info:
 sessionInfo()
R Under development (unstable) (2015-06-06 r68485)
Platform: x86_64-apple-darwin14.4.0/x86_64 (64-bit)
Running under: OS X 10.10.4 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4stats graphics  grDevices utils datasets
[8] methods   base

other attached packages:
 [1] GenomicFeatures_1.20.1 AnnotationDbi_1.30.1   Biobase_2.28.0
 [4] Rsamtools_1.20.4   Biostrings_2.36.1  XVector_0.8.0
 [7] GenomicRanges_1.20.4   GenomeInfoDb_1.4.0 IRanges_2.2.3
[10] S4Vectors_0.6.0BiocGenerics_0.14.0AnnotationHub_2.0.2

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.6  magrittr_1.5
 [3] GenomicAlignments_1.4.1  zlibbioc_1.14.0
 [5] BiocParallel_1.2.2   xtable_1.7-4
 [7] R6_2.0.1 stringr_1.0.0
 [9] httr_0.6.1   tools_3.3.0
[11] DBI_0.3.1lambda.r_1.1.7
[13] futile.logger_1.4.1  htmltools_0.2.6
[15] digest_0.6.8 interactiveDisplayBase_1.6.0
[17] shiny_0.12.0 rtracklayer_1.28.4
[19] futile.options_1.0.0 bitops_1.0-6
[21] biomaRt_2.24.0   RCurl_1.95-4.6
[23] RSQLite_1.0.0mime_0.3
[25] stringi_0.4-1BiocInstaller_1.18.2
[27] XML_3.98-1.2 httpuv_1.3.2


On 09 Jun 2015, at 16:25, Martin Morgan 
mtmor...@fredhutch.orgmailto:mtmor...@fredhutch.org wrote:

On 06/08/2015 11:43 PM, Rainer Johannes wrote:
dear Robert and Ludwig,

the EnsDb packages provide all the gene/transcript etc annotations for all
genes defined in the Ensembl database (for a given species and Ensembl
release). Except the column/attribute entrezid that is stored in the
internal database there is however no link to NCBI or UCSC annotations. So,
basically, if you want to use pure Ensembl based annotations: use EnsDb, if
you want to have the UCSC annotations: use the TxDb packages.

In case you need EnsDbs of other species or Ensembl versions, the ensembldb
package provides functionality to generate such packages either using the
Ensembl Perl API or using GTF files provided by Ensembl. If you have problems
building the packages, just drop me a line and I'll do that.

Two other sources of Ensembl TxDb's are GenomicFeatures::makeTxDbFromBiomart() 
and AnnotationHub. For the latter, I'll add a variant of the following to the 
AnnotationHub HOWTO vignette 
http://bioconductor.org/packages/devel/bioc/html/AnnotationHub.html later today.

## Gene models

_Bioconductor_ represents gene models using 'transcript'
databases. These are available via packages such as
[TxDb.Hsapiens.UCSC.hg38.knownGene](http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.knownGene.html),
or can be constructed using functions such as
`[GenomicFeatures](http://bioconductor.org/packages/GenomicFeatures.html)::makeTxDbFromBiomart()`
 or `GenomicFeatures::makeTxDbFromGRanges()`.

_AnnotationHub_ provides an easy way to work with gene models
published by Ensembl. Here we discover the Ensemble release 80 r
esources for pufferfish,_Takifugu rubripes_

```{r takifugu-gene-models}
query(ah, c(Takifugu, release-80))
```

We see that there is a GTF file, as well as various DNA
sequences. Let's retrieve the GTF and top-level sequence files. The
GTF file is imported as a _GRanges_ instance, the DNA sequence as a
compressed, indexed Fasta file


```{r takifugi-data}
gtf - ah[[AH47101]]
dna - ah[[AH47477]]

head(gtf, 3)
dna
head(seqlevels(dna))
```

It is trivial to make a TxDb instance


Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

2015-06-09 Thread Rainer Johannes
dear Ludwig,

On 09 Jun 2015, at 10:46, Ludwig Geistlinger 
ludwig.geistlin...@bio.ifi.lmu.demailto:ludwig.geistlin...@bio.ifi.lmu.de 
wrote:

Dear Johannes,

Thx for providing the great EnsDb packages!

One question:

As of now, I am able to choose between TxDb and EnsDb for genomic
coordinates of genomic features such as genes, transcripts, and exons.
For the sequences themselves I need the corresponding BSgenome package.

While it is easy to automatically map from a specific TxDb package (eg
TxDb.Hsapiens.UCSC.hg38.knownGene) to the corresponding BSgenome package
(here: BSgenome.Hsapiens.UCSC.hg38), I wonder how to do that for an
EnsDb package as the package name (eg EnsDb.Hsapiens.v79) contains no
information about the genome build.

A cumbersome option would be to extract the genome_build from the metadata
of the EnsDb package (which would give me for EnsDb.Hsapiens.v79:
'GRCh38') and then ask all existing BSgenome.Hsapiens packages for their
metadata release name (eg 'GRCh38' for BSgenome.Hsapiens.UCSC.hg38).

This however needs all BSgenome.Hsapiens packages installed and takes thus
too much time and space for a programmatic access.

Can you suggest a better way to map from coordinates to sequence (within
the BioC annotation functionality)?


agree, there's no easy mapping (yet). I'll implement a method 
suggestGenomePackage in the ensembldb package. In the long run I hope that 
also NCBI BSgenome packages (like the BSgenome.Hsapiens.NCBI.GRCh38) will 
become available for all species... that would make the mapping much easier...

cheers, jo

Thanks  Best,
Ludwig





dear Robert and Ludwig,

the EnsDb packages provide all the gene/transcript etc annotations for all
genes defined in the Ensembl database (for a given species and Ensembl
release). Except the column/attribute entrezid that is stored in the
internal database there is however no link to NCBI or UCSC annotations.
So, basically, if you want to use pure Ensembl based annotations: use
EnsDb, if you want to have the UCSC annotations: use the TxDb packages.

In case you need EnsDbs of other species or Ensembl versions, the
ensembldb package provides functionality to generate such packages either
using the Ensembl Perl API or using GTF files provided by Ensembl. If you
have problems building the packages, just drop me a line and I'll do
that.

cheers, jo

On 03 Jun 2015, at 15:56, Robert M. Flight 
rfligh...@gmail.commailto:rfligh...@gmail.com wrote:

Ludwig,

If you do this search on the UCSC genome browser (which this annotation
package is built from), you will see that the longest variant is what
is
shown

http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammalorg=Humandb=hg38position=brca1hgt.positionInput=brca1hgt.suggestTrack=knownGeneSubmit=submithgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAzpix=1885

If instead of genes you do transcripts, you will see 20 different
transcripts for this gene, including the one listed by NCBI.

I havent tried it yet (haven't upgraded R or bioconductor to latest
version), but there is now an Ensembl based annotation package as well,
that may work better??
http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html

-Robert



On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger 
ludwig.geistlin...@bio.ifi.lmu.de wrote:

Dear Bioc annotation team,

Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g.
for

BRCA1; ENSG0012048; entrez:672

via

genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id=672))

gives me:

GRanges object with 1 range and 1 metadata column:
seqnames   ranges strand | gene_id
   RleIRanges  Rle | character
672chr17 [43044295, 43170403]  - | 672
---
seqinfo: 455 sequences (1 circular) from hg38 genome


However, querying Ensembl and NCBI Gene
http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG0012048
http://www.ncbi.nlm.nih.gov/gene/672

the gene is located at (note the difference in the end position)

Chromosome 17: 43,044,295-43,125,483 reverse strand


How is the inconsistency explained and how to extract an ENSEMBL/NCBI
conform annotation from the TxDb object?
(I am aware of biomaRt, but I want to explicitely use the Bioc
annotation
functionality).

Thanks!
Ludwig


--
Dipl.-Bioinf. Ludwig Geistlinger

Lehr- und Forschungseinheit für Bioinformatik
Institut für Informatik
Ludwig-Maximilians-Universität München
Amalienstrasse 17, 2. Stock, Büro A201
80333 München

Tel.: 089-2180-4067
eMail: ludwig.geistlin...@bio.ifi.lmu.de

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list

Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

2015-06-09 Thread Ludwig Geistlinger
Dear Johannes,

Thx for providing the great EnsDb packages!

One question:

As of now, I am able to choose between TxDb and EnsDb for genomic
coordinates of genomic features such as genes, transcripts, and exons.
For the sequences themselves I need the corresponding BSgenome package.

While it is easy to automatically map from a specific TxDb package (eg
TxDb.Hsapiens.UCSC.hg38.knownGene) to the corresponding BSgenome package
(here: BSgenome.Hsapiens.UCSC.hg38), I wonder how to do that for an
EnsDb package as the package name (eg EnsDb.Hsapiens.v79) contains no
information about the genome build.

A cumbersome option would be to extract the genome_build from the metadata
of the EnsDb package (which would give me for EnsDb.Hsapiens.v79:
'GRCh38') and then ask all existing BSgenome.Hsapiens packages for their
metadata release name (eg 'GRCh38' for BSgenome.Hsapiens.UCSC.hg38).

This however needs all BSgenome.Hsapiens packages installed and takes thus
too much time and space for a programmatic access.

Can you suggest a better way to map from coordinates to sequence (within
the BioC annotation functionality)?

Thanks  Best,
Ludwig





 dear Robert and Ludwig,

 the EnsDb packages provide all the gene/transcript etc annotations for all
 genes defined in the Ensembl database (for a given species and Ensembl
 release). Except the column/attribute entrezid that is stored in the
 internal database there is however no link to NCBI or UCSC annotations.
 So, basically, if you want to use pure Ensembl based annotations: use
 EnsDb, if you want to have the UCSC annotations: use the TxDb packages.

 In case you need EnsDbs of other species or Ensembl versions, the
 ensembldb package provides functionality to generate such packages either
 using the Ensembl Perl API or using GTF files provided by Ensembl. If you
 have problems building the packages, just drop me a line and I'll do
 that.

 cheers, jo

 On 03 Jun 2015, at 15:56, Robert M. Flight rfligh...@gmail.com wrote:

 Ludwig,

 If you do this search on the UCSC genome browser (which this annotation
 package is built from), you will see that the longest variant is what
 is
 shown

 http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammalorg=Humandb=hg38position=brca1hgt.positionInput=brca1hgt.suggestTrack=knownGeneSubmit=submithgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAzpix=1885

 If instead of genes you do transcripts, you will see 20 different
 transcripts for this gene, including the one listed by NCBI.

 I havent tried it yet (haven't upgraded R or bioconductor to latest
 version), but there is now an Ensembl based annotation package as well,
 that may work better??
 http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html

 -Robert



 On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger 
 ludwig.geistlin...@bio.ifi.lmu.de wrote:

 Dear Bioc annotation team,

 Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g.
 for

 BRCA1; ENSG0012048; entrez:672

 via

 genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id=672))

 gives me:

 GRanges object with 1 range and 1 metadata column:
  seqnames   ranges strand | gene_id
 RleIRanges  Rle | character
  672chr17 [43044295, 43170403]  - | 672
  ---
  seqinfo: 455 sequences (1 circular) from hg38 genome


 However, querying Ensembl and NCBI Gene
 http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG0012048
 http://www.ncbi.nlm.nih.gov/gene/672

 the gene is located at (note the difference in the end position)

 Chromosome 17: 43,044,295-43,125,483 reverse strand


 How is the inconsistency explained and how to extract an ENSEMBL/NCBI
 conform annotation from the TxDb object?
 (I am aware of biomaRt, but I want to explicitely use the Bioc
 annotation
 functionality).

 Thanks!
 Ludwig


 --
 Dipl.-Bioinf. Ludwig Geistlinger

 Lehr- und Forschungseinheit für Bioinformatik
 Institut für Informatik
 Ludwig-Maximilians-Universität München
 Amalienstrasse 17, 2. Stock, Büro A201
 80333 München

 Tel.: 089-2180-4067
 eMail: ludwig.geistlin...@bio.ifi.lmu.de

 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel


  [[alternative HTML version deleted]]

 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

2015-06-09 Thread Rainer Johannes
dear Robert and Ludwig,

the EnsDb packages provide all the gene/transcript etc annotations for all 
genes defined in the Ensembl database (for a given species and Ensembl 
release). Except the column/attribute entrezid that is stored in the internal 
database there is however no link to NCBI or UCSC annotations.
So, basically, if you want to use pure Ensembl based annotations: use EnsDb, 
if you want to have the UCSC annotations: use the TxDb packages.

In case you need EnsDbs of other species or Ensembl versions, the ensembldb 
package provides functionality to generate such packages either using the 
Ensembl Perl API or using GTF files provided by Ensembl. If you have problems 
building the packages, just drop me a line and I'll do that.

cheers, jo

 On 03 Jun 2015, at 15:56, Robert M. Flight rfligh...@gmail.com wrote:
 
 Ludwig,
 
 If you do this search on the UCSC genome browser (which this annotation
 package is built from), you will see that the longest variant is what is
 shown
 
 http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammalorg=Humandb=hg38position=brca1hgt.positionInput=brca1hgt.suggestTrack=knownGeneSubmit=submithgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAzpix=1885
 
 If instead of genes you do transcripts, you will see 20 different
 transcripts for this gene, including the one listed by NCBI.
 
 I havent tried it yet (haven't upgraded R or bioconductor to latest
 version), but there is now an Ensembl based annotation package as well,
 that may work better??
 http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html
 
 -Robert
 
 
 
 On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger 
 ludwig.geistlin...@bio.ifi.lmu.de wrote:
 
 Dear Bioc annotation team,
 
 Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g. for
 
 BRCA1; ENSG0012048; entrez:672
 
 via
 
 genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id=672))
 
 gives me:
 
 GRanges object with 1 range and 1 metadata column:
  seqnames   ranges strand | gene_id
 RleIRanges  Rle | character
  672chr17 [43044295, 43170403]  - | 672
  ---
  seqinfo: 455 sequences (1 circular) from hg38 genome
 
 
 However, querying Ensembl and NCBI Gene
 http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG0012048
 http://www.ncbi.nlm.nih.gov/gene/672
 
 the gene is located at (note the difference in the end position)
 
 Chromosome 17: 43,044,295-43,125,483 reverse strand
 
 
 How is the inconsistency explained and how to extract an ENSEMBL/NCBI
 conform annotation from the TxDb object?
 (I am aware of biomaRt, but I want to explicitely use the Bioc annotation
 functionality).
 
 Thanks!
 Ludwig
 
 
 --
 Dipl.-Bioinf. Ludwig Geistlinger
 
 Lehr- und Forschungseinheit für Bioinformatik
 Institut für Informatik
 Ludwig-Maximilians-Universität München
 Amalienstrasse 17, 2. Stock, Büro A201
 80333 München
 
 Tel.: 089-2180-4067
 eMail: ludwig.geistlin...@bio.ifi.lmu.de
 
 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel
 
 
   [[alternative HTML version deleted]]
 
 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

2015-06-09 Thread Martin Morgan

On 06/08/2015 11:43 PM, Rainer Johannes wrote:

dear Robert and Ludwig,

the EnsDb packages provide all the gene/transcript etc annotations for all
genes defined in the Ensembl database (for a given species and Ensembl
release). Except the column/attribute entrezid that is stored in the
internal database there is however no link to NCBI or UCSC annotations. So,
basically, if you want to use pure Ensembl based annotations: use EnsDb, if
you want to have the UCSC annotations: use the TxDb packages.

In case you need EnsDbs of other species or Ensembl versions, the ensembldb
package provides functionality to generate such packages either using the
Ensembl Perl API or using GTF files provided by Ensembl. If you have problems
building the packages, just drop me a line and I'll do that.


Two other sources of Ensembl TxDb's are GenomicFeatures::makeTxDbFromBiomart() 
and AnnotationHub. For the latter, I'll add a variant of the following to the 
AnnotationHub HOWTO vignette 
http://bioconductor.org/packages/devel/bioc/html/AnnotationHub.html later today.


## Gene models

_Bioconductor_ represents gene models using 'transcript'
databases. These are available via packages such as
[TxDb.Hsapiens.UCSC.hg38.knownGene](http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.knownGene.html),
or can be constructed using functions such as
`[GenomicFeatures](http://bioconductor.org/packages/GenomicFeatures.html)::makeTxDbFromBiomart()` 
or `GenomicFeatures::makeTxDbFromGRanges()`.


_AnnotationHub_ provides an easy way to work with gene models
published by Ensembl. Here we discover the Ensemble release 80 r
esources for pufferfish,_Takifugu rubripes_

```{r takifugu-gene-models}
query(ah, c(Takifugu, release-80))
```

We see that there is a GTF file, as well as various DNA
sequences. Let's retrieve the GTF and top-level sequence files. The
GTF file is imported as a _GRanges_ instance, the DNA sequence as a
compressed, indexed Fasta file


```{r takifugi-data}
gtf - ah[[AH47101]]
dna - ah[[AH47477]]

head(gtf, 3)
dna
head(seqlevels(dna))
```

It is trivial to make a TxDb instance

```{r takifugi-txdb}
library(GenomicFeatures)
txdb - makeTxDbFromGRanges(gtf)


and to use that in conjunction with the DNA sequence, e.g., to find
exon sequences of all annotated genes.

```{r takifugi-exons}
library(Rsamtools) # for getSeq,FaFile-method
exons - exons(txdb)
getSeq(dna, exons)
```

Some difficulties arise when working with this partly assembled genome
that require more advanced GenomicRanges skills, see the
[GenomicRanges](http://bioconductor.org/packages/GenomicRanges.html)
vignettes, especially GenomicRanges HOWTOs and An Introduction to
GenomicRanges.




cheers, jo


On 03 Jun 2015, at 15:56, Robert M. Flight rfligh...@gmail.com wrote:

Ludwig,

If you do this search on the UCSC genome browser (which this annotation
package is built from), you will see that the longest variant is what is
shown

http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammalorg=Humandb=hg38position=brca1hgt.positionInput=brca1hgt.suggestTrack=knownGeneSubmit=submithgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAzpix=1885




If instead of genes you do transcripts, you will see 20 different

transcripts for this gene, including the one listed by NCBI.

I havent tried it yet (haven't upgraded R or bioconductor to latest
version), but there is now an Ensembl based annotation package as well,
that may work better??
http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html




-Robert




On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger 
ludwig.geistlin...@bio.ifi.lmu.de wrote:


Dear Bioc annotation team,

Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g.
for

BRCA1; ENSG0012048; entrez:672

via


genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id=672))


gives me:

GRanges object with 1 range and 1 metadata column: seqnames
ranges strand | gene_id RleIRanges  Rle |
character 672chr17 [43044295, 43170403]  - | 672
--- seqinfo: 455 sequences (1 circular) from hg38 genome


However, querying Ensembl and NCBI Gene
http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG0012048



http://www.ncbi.nlm.nih.gov/gene/672


the gene is located at (note the difference in the end position)

Chromosome 17: 43,044,295-43,125,483 reverse strand


How is the inconsistency explained and how to extract an ENSEMBL/NCBI
conform annotation from the TxDb object? (I am aware of biomaRt, but I
want to explicitely use the Bioc annotation functionality).

Thanks! Ludwig


-- Dipl.-Bioinf. Ludwig Geistlinger

Lehr- und Forschungseinheit für Bioinformatik Institut für Informatik
Ludwig-Maximilians-Universität München Amalienstrasse 17, 2. Stock, Büro
A201 80333 München

Tel.: 089-2180-4067 eMail: ludwig.geistlin...@bio.ifi.lmu.de

___ Bioc-devel@r-project.org
mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel




[Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

2015-06-03 Thread Ludwig Geistlinger
Dear Bioc annotation team,

Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g. for

BRCA1; ENSG0012048; entrez:672

via

 genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id=672))

gives me:

GRanges object with 1 range and 1 metadata column:
  seqnames   ranges strand | gene_id
 RleIRanges  Rle | character
  672chr17 [43044295, 43170403]  - | 672
  ---
  seqinfo: 455 sequences (1 circular) from hg38 genome


However, querying Ensembl and NCBI Gene
http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG0012048
http://www.ncbi.nlm.nih.gov/gene/672

the gene is located at (note the difference in the end position)

Chromosome 17: 43,044,295-43,125,483 reverse strand


How is the inconsistency explained and how to extract an ENSEMBL/NCBI
conform annotation from the TxDb object?
(I am aware of biomaRt, but I want to explicitely use the Bioc annotation
functionality).

Thanks!
Ludwig


-- 
Dipl.-Bioinf. Ludwig Geistlinger

Lehr- und Forschungseinheit für Bioinformatik
Institut für Informatik
Ludwig-Maximilians-Universität München
Amalienstrasse 17, 2. Stock, Büro A201
80333 München

Tel.: 089-2180-4067
eMail: ludwig.geistlin...@bio.ifi.lmu.de

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

2015-06-03 Thread Robert M. Flight
Ludwig,

If you do this search on the UCSC genome browser (which this annotation
package is built from), you will see that the longest variant is what is
shown

http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammalorg=Humandb=hg38position=brca1hgt.positionInput=brca1hgt.suggestTrack=knownGeneSubmit=submithgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAzpix=1885

If instead of genes you do transcripts, you will see 20 different
transcripts for this gene, including the one listed by NCBI.

I havent tried it yet (haven't upgraded R or bioconductor to latest
version), but there is now an Ensembl based annotation package as well,
that may work better??
http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html

-Robert



On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger 
ludwig.geistlin...@bio.ifi.lmu.de wrote:

 Dear Bioc annotation team,

 Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g. for

 BRCA1; ENSG0012048; entrez:672

 via

  genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id=672))

 gives me:

 GRanges object with 1 range and 1 metadata column:
   seqnames   ranges strand | gene_id
  RleIRanges  Rle | character
   672chr17 [43044295, 43170403]  - | 672
   ---
   seqinfo: 455 sequences (1 circular) from hg38 genome


 However, querying Ensembl and NCBI Gene
 http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG0012048
 http://www.ncbi.nlm.nih.gov/gene/672

 the gene is located at (note the difference in the end position)

 Chromosome 17: 43,044,295-43,125,483 reverse strand


 How is the inconsistency explained and how to extract an ENSEMBL/NCBI
 conform annotation from the TxDb object?
 (I am aware of biomaRt, but I want to explicitely use the Bioc annotation
 functionality).

 Thanks!
 Ludwig


 --
 Dipl.-Bioinf. Ludwig Geistlinger

 Lehr- und Forschungseinheit für Bioinformatik
 Institut für Informatik
 Ludwig-Maximilians-Universität München
 Amalienstrasse 17, 2. Stock, Büro A201
 80333 München

 Tel.: 089-2180-4067
 eMail: ludwig.geistlin...@bio.ifi.lmu.de

 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel


[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel