date:20150609

Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

2015-06-09 Thread Rainer Johannes

dear Ludwig,

On 09 Jun 2015, at 10:46, Ludwig Geistlinger
<ludwig.geistlin...@bio.ifi.lmu.de> wrote:

Dear Johannes,

Thx for providing the great EnsDb packages!

One question:

As of now, I am able to choose between TxDb and EnsDb for genomic
coordinates of genomic features such as genes, transcripts, and exons.
For the sequences themselves I need the corresponding BSgenome package.

While it is easy to automatically map from a specific TxDb package (eg
TxDb.Hsapiens.UCSC.hg38.knownGene) to the corresponding BSgenome package
(here: BSgenome.Hsapiens.UCSC.hg38), I wonder how to do that for an
EnsDb package as the package name (eg EnsDb.Hsapiens.v79) contains no
information about the genome build.

A cumbersome option would be to extract the genome_build from the metadata
of the EnsDb package (which would give me for EnsDb.Hsapiens.v79:
'GRCh38') and then ask all existing BSgenome.Hsapiens packages for their
metadata release name (eg 'GRCh38' for BSgenome.Hsapiens.UCSC.hg38).

This however needs all BSgenome.Hsapiens packages installed and takes thus
too much time and space for a programmatic access.

Can you suggest a better way to map from coordinates to sequence (within
the BioC annotation functionality)?


agree, there's no easy mapping (yet). I'll implement a method 
suggestGenomePackage in the ensembldb package. In the long run I hope that 
also NCBI BSgenome packages (like the BSgenome.Hsapiens.NCBI.GRCh38) will 
become available for all species... that would make the mapping much easier...
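Until such a method exists, one stopgap is a small hand-maintained lookup from the genome_build string to a BSgenome package name. The following is only a minimal base R sketch, assuming the build string has already been read from the EnsDb metadata. The helper name suggest_bsgenome and the lookup table are illustrative assumptions, not part of ensembldb:

```r
# Hypothetical lookup table: genome build -> BSgenome package name.
# Only the GRCh38 entry is taken from this thread; extend by hand as needed.
build_to_bsgenome <- c(
    GRCh38 = "BSgenome.Hsapiens.NCBI.GRCh38"
)

# Hypothetical helper (not the planned ensembldb method): return the package
# name for a build, or NA when the build is not in the table.
suggest_bsgenome <- function(build, table = build_to_bsgenome) {
    unname(table[build])
}

suggest_bsgenome("GRCh38")   # known build
suggest_bsgenome("GRCm38")   # build missing from the table gives NA
```

Returning NA rather than raising an error lets the caller decide whether an unknown build is fatal.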

cheers, jo

Thanks & Best,
Ludwig





dear Robert and Ludwig,

the EnsDb packages provide all the gene/transcript etc annotations for all
genes defined in the Ensembl database (for a given species and Ensembl
release). Except the column/attribute entrezid that is stored in the
internal database there is however no link to NCBI or UCSC annotations.
So, basically, if you want to use pure Ensembl based annotations: use
EnsDb, if you want to have the UCSC annotations: use the TxDb packages.

In case you need EnsDbs of other species or Ensembl versions, the
ensembldb package provides functionality to generate such packages either
using the Ensembl Perl API or using GTF files provided by Ensembl. If you
have problems building the packages, just drop me a line and I'll do
that.

cheers, jo

On 03 Jun 2015, at 15:56, Robert M. Flight <rfligh...@gmail.com> wrote:

Ludwig,

If you do this search on the UCSC genome browser (which this annotation
package is built from), you will see that the longest variant is what is
shown

http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Human&db=hg38&position=brca1&hgt.positionInput=brca1&hgt.suggestTrack=knownGene&Submit=submit&hgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAz&pix=1885

If instead of genes you do transcripts, you will see 20 different
transcripts for this gene, including the one listed by NCBI.
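The effect can be sketched in base R: if the gene-level range is the span across all of a gene's transcripts, it ends at the farthest transcript end, even when another source reports a shorter one. The transcript table below is made up for illustration. Only the three coordinates come from this thread; the transcript names and the pairing of coordinates are assumptions:

```r
# Toy transcript table; names are hypothetical, coordinates from the thread.
tx <- data.frame(
    tx_name = c("tx_short", "tx_long"),
    start   = c(43044295, 43044295),
    end     = c(43125483, 43170403)
)

# Gene-level extent as the span over all transcripts of the gene.
gene_extent <- c(start = min(tx$start), end = max(tx$end))
gene_extent
```

With the real packages, something like transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, by = "gene")[["672"]] should list the individual transcripts behind the gene-level range.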

I haven't tried it yet (haven't upgraded R or bioconductor to latest
version), but there is now an Ensembl based annotation package as well,
that may work better??
http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html

-Robert



On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger 
ludwig.geistlin...@bio.ifi.lmu.de wrote:

Dear Bioc annotation team,

Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g. for

BRCA1; ENSG00000012048; entrez:672

via

genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id=672))

gives me:

GRanges object with 1 range and 1 metadata column:
      seqnames               ranges strand |     gene_id
         <Rle>            <IRanges>  <Rle> | <character>
  672    chr17 [43044295, 43170403]      - |         672
  -------
  seqinfo: 455 sequences (1 circular) from hg38 genome


However, querying Ensembl and NCBI Gene
http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000012048
http://www.ncbi.nlm.nih.gov/gene/672

the gene is located at (note the difference in the end position)

Chromosome 17: 43,044,295-43,125,483 reverse strand


How is the inconsistency explained, and how can I extract an
Ensembl/NCBI-conformant annotation from the TxDb object?
(I am aware of biomaRt, but I want to explicitly use the Bioc annotation
functionality).

Thanks!
Ludwig


--
Dipl.-Bioinf. Ludwig Geistlinger

Lehr- und Forschungseinheit für Bioinformatik
Institut für Informatik
Ludwig-Maximilians-Universität München
Amalienstrasse 17, 2. Stock, Büro A201
80333 München

Tel.: 089-2180-4067
eMail: ludwig.geistlin...@bio.ifi.lmu.de

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



Re: [Bioc-devel] Changes in AnnotationDbi

2015-06-09 Thread Simon Anders


Hi

My two cents:

On 04/06/15 19:50, James W. MacDonald wrote:

In other words, for me it is a common practice to do something like this:

fit <- lmFit(eset, design)
fit2 <- eBayes(fit)
gns <- select(chippackage, featureNames(eset), c("ENTREZID","SYMBOL"))
gns <- gns[!duplicated(gns[,1]),]
fit2$genes <- gns

I add in the step where dups are removed because I already know they are
there. But a naive user might instead do

fit2$genes <- select(chippackage, featureNames(eset),
c("ENTREZID","SYMBOL"))


I'm not even that happy with James' first solution, as it relies on the 
order being correct after removing the duplicates. I'd feel safer to use 
'match' to ensure that. (What if an EntrezId is not found in the 
Annotation DB? Will we have a line with NA, or is the line simply 
missing? The latter would break James' code.)
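To make the 'match' idea concrete, here is a small base R sketch on toy data (the probe and Entrez IDs are invented; `gns` merely stands in for what 'select' might return). It keeps exactly one row per feature, in the order of the features, regardless of duplicates or missing keys in the lookup result:

```r
ids <- c("p1", "p2", "p3", "p4")   # feature order that must be preserved

# Toy lookup result: "p1" occurs twice, "p3" is missing, rows are unordered.
gns <- data.frame(
    PROBEID  = c("p2", "p1", "p1", "p4"),
    ENTREZID = c("20", "10", "11", "40"),
    stringsAsFactors = FALSE
)

# match() gives the first matching row for each id, or NA if there is none;
# indexing with NA yields a row of NAs, so the result always has one row per id.
aligned <- gns[match(ids, gns$PROBEID), ]
rownames(aligned) <- ids
aligned
```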


What users really want here is a way to get the preferred symbol for 
an entrezId, and for lack of this, they accept simply a random one or 
the first one (in some unspecified collation). So, we should have a 
function, maybe 'select1', to select one and only one hit for each query 
value.


  select1(x, keys, columns, keytype, requireUnique=FALSE, ... )

This would query the AnnotationDbi object 'x' as does 'select', but 
return a data frame with the columns specified in 'columns', and the 
vector that was passed as 'keys' as row names, thus guaranteeing that 
each line in the data frame corresponds to one query key. If there were 
multiple records for a key, the first one is used, unless 
'requireUnique' is set, in which case an error is issued. And if no 
record is present for a key, the data frame contains a row of NAs for 
this key.
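A rough sketch of these semantics in base R follows. It is not an AnnotationDbi function: the name select1_df is hypothetical, and it works on a plain data frame (such as one returned by 'select') rather than on the annotation object itself. Duplicated keys are made unique for the row names, which is an extra assumption beyond the proposal:

```r
# Hypothetical helper illustrating the proposed 'select1' behaviour on a
# data frame 'tbl' that has one column named by 'keytype'.
select1_df <- function(tbl, keys, columns, keytype, requireUnique = FALSE) {
    keys <- as.character(keys)
    tbl_keys <- as.character(tbl[[keytype]])
    if (requireUnique) {
        # keys that were asked for and occur more than once in the table
        dup <- intersect(unique(tbl_keys[duplicated(tbl_keys)]), keys)
        if (length(dup) > 0)
            stop("multiple records for key(s): ", paste(dup, collapse = ", "))
    }
    # first hit per key; NA (and hence a row of NAs) when there is no record
    out <- tbl[match(keys, tbl_keys), columns, drop = FALSE]
    rownames(out) <- make.unique(keys)
    out
}

tbl <- data.frame(
    ENTREZID = c("1", "1", "2"),
    ALIAS    = c("A1B", "ABG", "A2MD"),
    stringsAsFactors = FALSE
)
res <- select1_df(tbl, c("2", "1", "999"), "ALIAS", "ENTREZID")
res
```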


This would be quite convenient for any kind of ID conversion issues.

  Simon


Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

2015-06-09 Thread Ludwig Geistlinger

Dear Johannes,

Thx for providing the great EnsDb packages!

One question:

As of now, I am able to choose between TxDb and EnsDb for genomic
coordinates of genomic features such as genes, transcripts, and exons.
For the sequences themselves I need the corresponding BSgenome package.

While it is easy to automatically map from a specific TxDb package (eg
TxDb.Hsapiens.UCSC.hg38.knownGene) to the corresponding BSgenome package
(here: BSgenome.Hsapiens.UCSC.hg38), I wonder how to do that for an
EnsDb package as the package name (eg EnsDb.Hsapiens.v79) contains no
information about the genome build.

A cumbersome option would be to extract the genome_build from the metadata
of the EnsDb package (which would give me for EnsDb.Hsapiens.v79:
'GRCh38') and then ask all existing BSgenome.Hsapiens packages for their
metadata release name (eg 'GRCh38' for BSgenome.Hsapiens.UCSC.hg38).

This however needs all BSgenome.Hsapiens packages installed and takes thus
too much time and space for a programmatic access.

Can you suggest a better way to map from coordinates to sequence (within
the BioC annotation functionality)?

Thanks & Best,
Ludwig






Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

2015-06-09 Thread Rainer Johannes

dear Robert and Ludwig,

the EnsDb packages provide all the gene/transcript etc annotations for all 
genes defined in the Ensembl database (for a given species and Ensembl 
release). Except the column/attribute entrezid that is stored in the internal 
database there is however no link to NCBI or UCSC annotations.
So, basically, if you want to use pure Ensembl based annotations: use EnsDb, 
if you want to have the UCSC annotations: use the TxDb packages.

In case you need EnsDbs of other species or Ensembl versions, the ensembldb 
package provides functionality to generate such packages either using the 
Ensembl Perl API or using GTF files provided by Ensembl. If you have problems 
building the packages, just drop me a line and I'll do that.

cheers, jo


Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

2015-06-09 Thread Martin Morgan


On 06/08/2015 11:43 PM, Rainer Johannes wrote:

dear Robert and Ludwig,

the EnsDb packages provide all the gene/transcript etc annotations for all
genes defined in the Ensembl database (for a given species and Ensembl
release). Except the column/attribute entrezid that is stored in the
internal database there is however no link to NCBI or UCSC annotations. So,
basically, if you want to use pure Ensembl based annotations: use EnsDb, if
you want to have the UCSC annotations: use the TxDb packages.

In case you need EnsDbs of other species or Ensembl versions, the ensembldb
package provides functionality to generate such packages either using the
Ensembl Perl API or using GTF files provided by Ensembl. If you have problems
building the packages, just drop me a line and I'll do that.


Two other sources of Ensembl TxDb's are GenomicFeatures::makeTxDbFromBiomart() 
and AnnotationHub. For the latter, I'll add a variant of the following to the 
AnnotationHub HOWTO vignette 
http://bioconductor.org/packages/devel/bioc/html/AnnotationHub.html later today.


## Gene models

_Bioconductor_ represents gene models using 'transcript'
databases. These are available via packages such as
[TxDb.Hsapiens.UCSC.hg38.knownGene](http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg38.knownGene.html),
or can be constructed using functions such as
`[GenomicFeatures](http://bioconductor.org/packages/GenomicFeatures.html)::makeTxDbFromBiomart()` 
or `GenomicFeatures::makeTxDbFromGRanges()`.


_AnnotationHub_ provides an easy way to work with gene models
published by Ensembl. Here we discover the Ensembl release 80
resources for pufferfish, _Takifugu rubripes_:

```{r takifugu-gene-models}
# `ah` is assumed to be an AnnotationHub() instance created earlier
query(ah, c("Takifugu", "release-80"))
```

We see that there is a GTF file, as well as various DNA
sequences. Let's retrieve the GTF and top-level sequence files. The
GTF file is imported as a _GRanges_ instance, the DNA sequence as a
compressed, indexed Fasta file


```{r takifugi-data}
gtf <- ah[["AH47101"]]
dna <- ah[["AH47477"]]

head(gtf, 3)
dna
head(seqlevels(dna))
```

It is trivial to make a TxDb instance

```{r takifugi-txdb}
library(GenomicFeatures)
txdb <- makeTxDbFromGRanges(gtf)
```

and to use that in conjunction with the DNA sequence, e.g., to find
exon sequences of all annotated genes.

```{r takifugi-exons}
library(Rsamtools) # for getSeq,FaFile-method
exons <- exons(txdb)
getSeq(dna, exons)
```

Some difficulties arise when working with this partly assembled genome
that require more advanced GenomicRanges skills, see the
[GenomicRanges](http://bioconductor.org/packages/GenomicRanges.html)
vignettes, especially "GenomicRanges HOWTOs" and "An Introduction to
GenomicRanges".





Re: [Bioc-devel] Changes in AnnotationDbi

2015-06-09 Thread Simon Anders


Hi Martin

On 09/06/15 15:35, Martin Morgan wrote:

In case you missed it in Marc's reply, and acknowledging that this is
different from your suggestion, there is mapIds() for doing this on a
single column basis, which is the common use case where one doesn't care
too much about multiple mapping ids


I have indeed missed this point in Marc's reply -- and you are right, 
the single column case is the only one where it is common that one does 
not care for multiple mapping. So, sorry for the noise.


How come I never knew 'mapIds' even though it is clearly mentioned in 
the AnnotationDb help page? Maybe the page is too long, or --more 
likely-- I'm too impatient when browsing through help pages.


  Simon


Re: [Bioc-devel] Changes in AnnotationDbi

2015-06-09 Thread Martin Morgan


On 06/09/2015 02:52 AM, Simon Anders wrote:

Hi

My two cents:

On 04/06/15 19:50, James W. MacDonald wrote:

In other words, for me it is a common practice to do something like this:

fit <- lmFit(eset, design)
fit2 <- eBayes(fit)
gns <- select(chippackage, featureNames(eset), c("ENTREZID","SYMBOL"))
gns <- gns[!duplicated(gns[,1]),]
fit2$genes <- gns

I add in the step where dups are removed because I already know they are
there. But a naive user might instead do

fit2$genes <- select(chippackage, featureNames(eset),
c("ENTREZID","SYMBOL"))


I'm not even that happy with James' first solution, as it relies on the order
being correct after removing the duplicates. I'd feel safer to use 'match' to
ensure that. (What if an EntrezId is not found in the Annotation DB? Will we
have a line with NA, or is the line simply missing? The latter would break
James' code.)

What users really want here is a way to get the preferred symbol for an
entrezId, and for lack of this, they accept simply a random one or the first one
(in some unspecified collation). So, we should have a function, maybe 'select1',
to select one and only one hit for each query value.

   select1(x, keys, columns, keytype, requireUnique=FALSE, ... )

This would query the AnnotationDbi object 'x' as does 'select', but return a
data frame with the columns specified in 'columns', and the vector that was
passed as 'keys' as row names, thus guaranteeing that each line in the data
frame corresponds to one query key. If there were multiple records for a key,
the first one is used, unless 'requireUnique' is set, in which case an error is
issued. And if no record is present for a key, the data frame contains a row of
NAs for this key.

This would be quite convenient for any kind of ID conversion issues.


In case you missed it in Marc's reply, and acknowledging that this is different 
from your suggestion, there is mapIds() for doing this on a single column basis, 
which is the common use case where one doesn't care too much about multiple 
mapping ids


> org = org.Hs.eg.db
> head(select(org, keys(org), "ALIAS"))
  ENTREZID    ALIAS
1        1      A1B
2        1      ABG
3        1      GAB
4        1 HYST2477
5        1     A1BG
6        2     A2MD
> head(mapIds(org, keys(org), "ALIAS", "ENTREZID"))
     1      2      3      9     10     11
 "A1B" "A2MD" "A2MP" "AAC1" "AAC2" "AACP"
> head(mapIds(org, keys(org), "ALIAS", "ENTREZID", multiVals="CharacterList"))
CharacterList of length 6
[["1"]] A1B ABG GAB HYST2477 A1BG
[["2"]] A2MD CPAMD5 FWP007 S863-7 A2M
[["3"]] A2MP A2MP1
[["9"]] AAC1 MNAT NAT-1 NATI NAT1
[["10"]] AAC2 NAT-2 PNAT NAT2
[["11"]] AACP NATP1 NATP
> str(head(mapIds(org, keys(org), "ALIAS", "ENTREZID", multiVals="list")))
List of 6
 $ 1 : chr [1:5] "A1B" "ABG" "GAB" "HYST2477" ...
 $ 2 : chr [1:5] "A2MD" "CPAMD5" "FWP007" "S863-7" ...
 $ 3 : chr [1:2] "A2MP" "A2MP1"
 $ 9 : chr [1:5] "AAC1" "MNAT" "NAT-1" "NATI" ...
 $ 10: chr [1:4] "AAC2" "NAT-2" "PNAT" "NAT2"
 $ 11: chr [1:3] "AACP" "NATP1" "NATP"
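
For readers without the annotation package at hand, the difference between the default (first hit) and the list-valued result can be emulated in base R on a toy key-to-alias table. This only mimics the shape of the results; it does not call mapIds(), and the table is a hand-copied subset of the output above:

```r
# Toy one-to-many table, as 'select' would return it.
tbl <- data.frame(
    ENTREZID = c("1", "1", "1", "2", "2", "3"),
    ALIAS    = c("A1B", "ABG", "GAB", "A2MD", "CPAMD5", "A2MP"),
    stringsAsFactors = FALSE
)
keys <- unique(tbl$ENTREZID)

# All aliases per key, keeping the original key order (list-valued result).
by_key <- split(tbl$ALIAS, factor(tbl$ENTREZID, levels = keys))

# One alias per key: the first in table order (single-valued result).
first <- vapply(by_key, function(x) x[1], character(1))

first
str(by_key)
```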

Also since this is the devel list, there is

> library(dplyr)
> d = src_sqlite(org.Hs.eg_dbfile())
> d
src:  sqlite 3.8.6 [/home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.2-BiocDevel/org.Hs.eg.db/extdata/org.Hs.eg.sqlite]
tbls: accessions, alias, chrlengths, chromosome_locations, chromosomes,
  cytogenetic_locations, ec, ensembl, ensembl_prot, ensembl_trans,
  ensembl2ncbi, gene_info, genes, go, go_all, go_bp, go_bp_all, go_cc,
  go_cc_all, go_mf, go_mf_all, kegg, map_counts, map_metadata, metadata,
  ncbi2ensembl, omim, pfam, prosite, pubmed, refseq, sqlite_stat1, ucsc,
  unigene, uniprot
> d %>% tbl("alias") %>% group_by(`_id`) %>% summarize(alias_symbol)
Source: sqlite 3.8.6 [/home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.2-BiocDevel/org.Hs.eg.db/extdata/org.Hs.eg.sqlite]
From: <derived table> [?? x 2]

   _id alias_symbol
1    1         A1BG
2    2          A2M
3    3        A2MP1
4    4         NAT1
5    5         NAT2
6    6         NATP
7    7     SERPINA3
8    8        AADAC
9    9         AAMP
10  10        AANAT
.. ...          ...

(with lots of nice confusion there, including extensive masking of symbols 
between dplyr / AnnotationDbi, need for knowledge of the schema (basically a 
central id, ENTREZID for org packages, and tables of mappings from the central 
id to other ids), and the more-or-less arbitrary choice of alias_symbol).


Martin






--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
