Looks like the ensembl table of the human.db0 package got polluted with *Pan troglodytes* genes:

> con <- dbConnect(SQLite(), "/R-devel/lib64/R/library/human.db0/extdata/chipsrc_human.sqlite")
> dbGetQuery(con, "select count(*) from ensembl where ensid like 'ENSPTR%';")
  count(*)
1    16207
> dbGetQuery(con, "select count(*) from ensembl where ensid like 'ENSG%';")
  count(*)
1    28973

On Mon, Apr 22, 2019 at 11:54 PM Aaron Lun <infinite.monkeys.with.keyboa...@gmail.com> wrote:

> Playing around with org.Hs.eg.db 3.8.0. What on earth is ENSPTRG0000...?
>
> > library(org.Hs.eg.db)
> > mapIds(org.Hs.eg.db, key="GCG", keytype="SYMBOL", column="ENSEMBL")
> 'select()' returned 1:many mapping between keys and columns
>                  GCG
> "ENSPTRG00000000777"
>
> Well, at least it still recovers the right identifier... eventually.
>
> > select(org.Hs.eg.db, key="GCG", keytype="SYMBOL", columns="ENSEMBL")
> 'select()' returned 1:many mapping between keys and columns
>   SYMBOL            ENSEMBL
> 1    GCG ENSPTRG00000000777
> 2    GCG    ENSG00000115263
>
> The SYMBOL->Entrez ID relational table seems to be okay:
>
> > Y <- toTable(org.Hs.egSYMBOL)
> > Y[which(Y[,2]=="GCG"),]
>      gene_id symbol
> 2152    2641    GCG
>
> So the cause is the Ensembl->Entrez mappings:
>
> > Z <- toTable(org.Hs.egENSEMBL2EG)
> > Z[Z[,1]==2641,]
>      gene_id         ensembl_id
> 3028    2641 ENSPTRG00000000777
> 3029    2641    ENSG00000115263
>
> Googling suggests that ENSPTRG00000000777 is an identifier for some
> other gene in one of the other monkeys. Hardly "Hs" stuff.
>
> Session info (not technically R 3.6, but I didn't think that would have
> been the cause):
>
> > R Under development (unstable) (2019-04-11 r76379)
> > Platform: x86_64-pc-linux-gnu (64-bit)
> > Running under: Ubuntu 18.04.2 LTS
> >
> > Matrix products: default
> > BLAS:   /home/luna/Software/R/trunk/lib/libRblas.so
> > LAPACK: /home/luna/Software/R/trunk/lib/libRlapack.so
> >
> > locale:
> >  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> >  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> >  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] parallel  stats4    stats     graphics  grDevices utils     datasets
> > [8] methods   base
> >
> > other attached packages:
> > [1] org.Hs.eg.db_3.8.0   AnnotationDbi_1.45.1 IRanges_2.17.5
> > [4] S4Vectors_0.21.23    Biobase_2.43.1       BiocGenerics_0.29.2
> >
> > loaded via a namespace (and not attached):
> >  [1] Rcpp_1.0.1      digest_0.6.18   DBI_1.0.0       RSQLite_2.1.1
> >  [5] blob_1.1.1      bit64_0.9-7     bit_1.1-14      compiler_3.7.0
> >  [9] pkgconfig_2.0.2 memoise_1.1.0
>
> _______________________________________________
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099