Hi Michael,

On 06/17/2015 12:35 AM, Michael Love wrote:
Background:

With previous approaches that I would recommend to users for building
txdb along the way of making count tables, it was desirable that the
GTF release information would *automatically* be passed into the
metadata of the rowRanges of the SummarizedExperiment.

for example, in parathyroidSE, using makeTxDbFromBiomart:

...
BioMart database version : chr "ENSEMBL GENES 72 (SANGER UK)"
BioMart dataset : chr "hsapiens_gene_ensembl"
BioMart dataset description : chr "Homo sapiens genes (GRCh37.p11)"

or, alternatively, using makeTxDbFromGFF, it would be present in the
name of the GTF file:

...
Data source: chr "Homo_sapiens.GRCh37.75.gtf"


Question:

I'm now interested in switching over to using GTF files available
through AnnotationHub, and wondering how we can maintain this
automatic propagation of metadata.

here's some example code:

library(GenomicFeatures)
library(AnnotationHub)
ah <- AnnotationHub()
z <- query(ah, c("Ensembl","gtf","Caenorhabditis elegans","release-80"))
stopifnot(length(z) == 1)
z$title

  [1] "Caenorhabditis_elegans.WBcel235.80.gtf"

gtf <- ah[[names(z)]]
metadata(gtf)

  $AnnotationHubName
  [1] "AH47045"

A first (and hopefully easy) improvement would be that the GRanges and
other objects on AnnotationHub had more information in their metadata().

In the mean time, here is a workaround (in 2 steps):

1) Add some useful metadata to 'gtf' (grabbed from the AnnotationHub
   object):

  addMetadataFromAnnotationHub <- function(x, ah)
  {
    stopifnot(length(ah) == 1L)
    metadata0 <- list(
        `Data source`=ah$sourceurl,
        `Provider`=ah$dataprovider,
        `Organism`=ah$species,
        `Taxonomy ID`=ah$taxonomyid
    )
    metadata(x) <- c(metadata0, metadata(x))
    x
  }

  gtf <- addMetadataFromAnnotationHub(gtf, z)

  metadata(gtf)
  # $`Data source`
# [1] "ftp://ftp.ensembl.org/pub/release-80/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.80.gtf.gz";
  #
  # $Provider
  # [1] "Ensembl"
  #
  # $Organism
  # [1] "Caenorhabditis elegans"
  #
  # $`Taxonomy ID`
  # [1] 6239
  #
  # $AnnotationHubName
  # [1] "AH47045"

2) Pass the metadata to makeTxDbFromGRanges():

  txdb <- makeTxDbFromGRanges(gtf,
              metadata=data.frame(
                  name=names(metadata(gtf)),
                  value=as.character(metadata(gtf))))
  txdb
  # TxDb object:
  # Db type: TxDb
  # Supporting package: GenomicFeatures
# Data source: ftp://ftp.ensembl.org/pub/release-80/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.80.gtf.gz
  # Provider: Ensembl
  # Organism: Caenorhabditis elegans
  # Taxonomy ID: 6239
  # AnnotationHubName: AH47045
  # Genome: WBcel235
  # transcript_nrow: 57834
  # exon_nrow: 173506
  # cds_nrow: 131562
  # Db created by: GenomicFeatures package from Bioconductor
  # Creation time: 2015-06-17 12:24:51 -0700 (Wed, 17 Jun 2015)
  # GenomicFeatures version at creation time: 1.21.13
  # RSQLite version at creation time: 1.0.0
  # DBSCHEMAVERSION: 1.1

  organism(txdb)
  # [1] "Caenorhabditis elegans"
  taxonomyId(txdb)
  # [1] 6239

Step 2) should be made easier because the metadata is already in 'gtf'
so there is no reason why the user would need to pass it again thru the
'metadata' argument. I'll made that change to makeTxDbFromGRanges().

H.


txdb <- makeTxDbFromGRanges(gtf)
metadata(txdb)

                                         name
  1                                   Db type
  2                        Supporting package
  3                                    Genome
  4                           transcript_nrow
  5                                 exon_nrow
  6                                  cds_nrow
  7                             Db created by
  8                             Creation time
  9  GenomicFeatures version at creation time
  10         RSQLite version at creation time
  11                          DBSCHEMAVERSION
                                            value
  1                                          TxDb
  2                               GenomicFeatures
  3                                      WBcel235
  4                                         57834
  5                                        173506
  6                                        131562
  7     GenomicFeatures package from Bioconductor
  8  2015-06-17 09:31:10 +0200 (Wed, 17 Jun 2015)
  9                                       1.21.13
  10                                        1.0.0
  11                                          1.1


sessionInfo()

  R Under development (unstable) (2015-04-29 r68278)
  Platform: x86_64-apple-darwin12.5.0 (64-bit)
  Running under: OS X 10.10.3 (Yosemite)

  locale:
  [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

  attached base packages:
  [1] stats4    parallel  stats     graphics  grDevices datasets  utils
  [8] methods   base

  other attached packages:
   [1] AnnotationHub_2.1.26    GenomicFeatures_1.21.13 AnnotationDbi_1.31.16
   [4] Biobase_2.29.1          GenomicRanges_1.21.15   GenomeInfoDb_1.5.7
   [7] IRanges_2.3.11          S4Vectors_0.7.5         BiocGenerics_0.15.2
  [10] testthat_0.10.0         knitr_1.10              BiocInstaller_1.19.6

  loaded via a namespace (and not attached):
   [1] Rcpp_0.11.6                  compiler_3.3.0
   [3] futile.logger_1.4.1          XVector_0.9.1
   [5] bitops_1.0-6                 futile.options_1.0.0
   [7] tools_3.3.0                  zlibbioc_1.15.0
   [9] biomaRt_2.25.1               digest_0.6.8
  [11] RSQLite_1.0.0                memoise_0.2.1
  [13] shiny_0.12.1                 DBI_0.3.1
  [15] rtracklayer_1.29.10          httr_0.6.1
  [17] stringr_1.0.0                Biostrings_2.37.2
  [19] R6_2.0.1                     XML_3.98-1.2
  [21] BiocParallel_1.3.26          lambda.r_1.1.7
  [23] magrittr_1.5                 Rsamtools_1.21.8
  [25] htmltools_0.2.6              GenomicAlignments_1.5.9
  [27] SummarizedExperiment_0.1.5   xtable_1.7-4
  [29] mime_0.3                     interactiveDisplayBase_1.7.0
  [31] httpuv_1.3.2                 stringi_0.4-1
  [33] RCurl_1.95-4.6               crayon_1.3.0

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to