[Bioc-devel] VariantAnnotation::readVcf(fl, seqinfo(scanVcfHeader(fl)) problem
hi there, in my package VariantFiltering i have an example VCF file from a Hapmap CEU trio including three chromosomes only to illustrate its vignette. i've come across a problem with the function readVcf() in VariantAnnotation that may be specific of the situation of a VCF file not having all chromosomes, but which it will be great for me if this could be addressed. the problem is reproduced as follows: === library(VariantAnnotation) fl - file.path(system.file(extdata, package=VariantFiltering), CEUtrio.vcf.bgz) vcf - readVcf(fl, seqinfo(scanVcfHeader(fl))) Error in GenomeInfoDb:::makeNewSeqnames(x, new2old = new2old, seqlevels(value)) : when 'new2old' is NULL, the first elements in the supplied 'seqlevels' must be identical to 'seqlevels(x)' this is caused because although i'm providing the Seqinfo object derived from the header of the VCF file itself, at some point the ordering of the seqlevels between the header and the rest of the VCF file differs due to the smaller subset of chromosomes in the VCF file. This can be easily fixed by replacing the line: if (length(newsi) length(oldsi)) { within the .scanVcfToVCF() function in methods-readVcf.R, by if (length(newsi) = length(oldsi)) { this is happening both in release and devel. i'm pasting below my sessionInfo() for the release. let me know if you think this fix is feasible or i'm wrongly using the function readVcf(). i'm basically trying to use readVcf() without having to figure out the appropriate value for the argument 'genome', i.e., without knowing beforehand what version of the genome was used to produce the VCF file. thanks!! robert. sessionInfo() R version 3.1.1 Patched (2014-10-13 r66751) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF8LC_COLLATE=en_US.UTF8 [5] LC_MONETARY=en_US.UTF8LC_MESSAGES=en_US.UTF8 [7] LC_PAPER=en_US.UTF8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C attached base packages: [1] stats4parallel stats graphics grDevices [6] utils datasets methods base other attached packages: [1] VariantAnnotation_1.12.0 Rsamtools_1.18.0 [3] Biostrings_2.34.0XVector_0.6.0 [5] GenomicRanges_1.18.0 GenomeInfoDb_1.2.0 [7] IRanges_2.0.0S4Vectors_0.4.0 [9] BiocGenerics_0.12.0 vimcom_1.0-0 [11] setwidth_1.0-3 colorout_1.0-3 loaded via a namespace (and not attached): [1] AnnotationDbi_1.28.0base64enc_0.1-2 [3] BatchJobs_1.4 BBmisc_1.7 [5] Biobase_2.26.0 BiocParallel_1.0.0 [7] biomaRt_2.22.0 bitops_1.0-6 [9] brew_1.0-6 BSgenome_1.34.0 [11] checkmate_1.4 codetools_0.2-9 [13] DBI_0.3.1 digest_0.6.4 [15] fail_1.2foreach_1.4.2 [17] GenomicAlignments_1.2.0 GenomicFeatures_1.18.0 [19] iterators_1.0.7 RCurl_1.95-4.3 [21] RSQLite_0.11.4 rtracklayer_1.26.0 [23] sendmailR_1.2-1 stringr_0.6.2 [25] tools_3.1.1 XML_3.98-1.1 [27] zlibbioc_1.12.0 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] VariantAnnotation::readVcf(fl, seqinfo(scanVcfHeader(fl)) problem
Hi Robert, Thanks for the bug report and reproducible example. Now fixed in release 1.12.2 and devel 1.13.4. I've also updated the docs to better explain how the Seqinfo objects are propagated / merged when supplied as 'genome'. Valerie On 10/23/2014 06:45 AM, Robert Castelo wrote: hi there, in my package VariantFiltering i have an example VCF file from a Hapmap CEU trio including three chromosomes only to illustrate its vignette. i've come across a problem with the function readVcf() in VariantAnnotation that may be specific of the situation of a VCF file not having all chromosomes, but which it will be great for me if this could be addressed. the problem is reproduced as follows: === library(VariantAnnotation) fl - file.path(system.file(extdata, package=VariantFiltering), CEUtrio.vcf.bgz) vcf - readVcf(fl, seqinfo(scanVcfHeader(fl))) Error in GenomeInfoDb:::makeNewSeqnames(x, new2old = new2old, seqlevels(value)) : when 'new2old' is NULL, the first elements in the supplied 'seqlevels' must be identical to 'seqlevels(x)' this is caused because although i'm providing the Seqinfo object derived from the header of the VCF file itself, at some point the ordering of the seqlevels between the header and the rest of the VCF file differs due to the smaller subset of chromosomes in the VCF file. This can be easily fixed by replacing the line: if (length(newsi) length(oldsi)) { within the .scanVcfToVCF() function in methods-readVcf.R, by if (length(newsi) = length(oldsi)) { this is happening both in release and devel. i'm pasting below my sessionInfo() for the release. let me know if you think this fix is feasible or i'm wrongly using the function readVcf(). i'm basically trying to use readVcf() without having to figure out the appropriate value for the argument 'genome', i.e., without knowing beforehand what version of the genome was used to produce the VCF file. thanks!! robert. sessionInfo() R version 3.1.1 Patched (2014-10-13 r66751) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF8LC_COLLATE=en_US.UTF8 [5] LC_MONETARY=en_US.UTF8LC_MESSAGES=en_US.UTF8 [7] LC_PAPER=en_US.UTF8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C attached base packages: [1] stats4parallel stats graphics grDevices [6] utils datasets methods base other attached packages: [1] VariantAnnotation_1.12.0 Rsamtools_1.18.0 [3] Biostrings_2.34.0XVector_0.6.0 [5] GenomicRanges_1.18.0 GenomeInfoDb_1.2.0 [7] IRanges_2.0.0S4Vectors_0.4.0 [9] BiocGenerics_0.12.0 vimcom_1.0-0 [11] setwidth_1.0-3 colorout_1.0-3 loaded via a namespace (and not attached): [1] AnnotationDbi_1.28.0base64enc_0.1-2 [3] BatchJobs_1.4 BBmisc_1.7 [5] Biobase_2.26.0 BiocParallel_1.0.0 [7] biomaRt_2.22.0 bitops_1.0-6 [9] brew_1.0-6 BSgenome_1.34.0 [11] checkmate_1.4 codetools_0.2-9 [13] DBI_0.3.1 digest_0.6.4 [15] fail_1.2foreach_1.4.2 [17] GenomicAlignments_1.2.0 GenomicFeatures_1.18.0 [19] iterators_1.0.7 RCurl_1.95-4.3 [21] RSQLite_0.11.4 rtracklayer_1.26.0 [23] sendmailR_1.2-1 stringr_0.6.2 [25] tools_3.1.1 XML_3.98-1.1 [27] zlibbioc_1.12.0 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] [devteam-bioc] A more flexible GenomeInfoDb::mapSeqlevels(): used supported info but don't break with new organisms/toy examples
Hi, I just found a tiny bug in .guessSpeciesStyle(). Basically, if the style name has a dot, this function incorrectly returns the name of the style. See below: ## Lets guess the species and style for 'T' GenomeInfoDb:::.guessSpeciesStyle('T') $species [1] Populus trichocarpa $style [1] JGI2 ## Same style with the higher lvl function seqlevelsStyle('T') [1] JGI2 ## Ok, style 'JGI2' ## In more dtail, this matches 'LGI' in NCBI style for this organism GenomeInfoDb:::.supportedSeqnameMappings()[Populus_trichocarpa] $Populus_trichocarpa NCBI JGI2.F 1 LGI T 2LGII 2 3 LGIII 3 4LGIV 4 5 LGV 5 6LGVI 6 7 LGVII 7 8 LGVIII 8 9LGIX 9 10LGX 10 11 LGXI 11 12 LGXII 12 13 LGXIII 13 14 LGIV 14 15 LGXV 15 16 LGVI 16 17 LGVII 17 18 LGVIII 18 19 LGXIX 19 20 Pltd NA ## Guessing 'LGI' with the higher lvl function works seqlevelsStyle('LGI') [1] NCBI ## Mapping from 'LGI' in 'NCBI' style to 'JGI2' style doesn't work mapSeqlevels('LGI', 'JGI2') Error in mapSeqlevels(LGI, JGI2) : supplied seqname style JGI2 is not supported ## Using the correct 'style' name works. ## However a user might have expected it to work with 'JGI2' given ## that this was the output from seqlevelsStyle('T') mapSeqlevels('LGI', 'JGI2.F') [1] T ## Although a user could find the correct name head(genomeStyles('Populus trichocarpa'), 1) circular auto sex NCBI JGI2.F 1FALSE TRUE FALSE LGI T packageVersion('GenomeInfoDb') [1] ‘1.3.3’ Cheers, Leo On Wed, Oct 22, 2014 at 7:03 PM, Leonardo Collado Torres lcoll...@jhu.edu wrote: On Wed, Oct 22, 2014 at 6:59 PM, Leonardo Collado Torres lcoll...@jhu.edu wrote: Hi Sonali, The recent changes in GenomeInfoDb (1.3.3) look great! On Wed, Oct 22, 2014 at 1:53 PM, Sonali Arora sar...@fhcrc.org wrote: Hi Leo, To summarize, These are the concerns raised in your previous email : a) mapSeqlevels() should return current style along with other mapped styles? look at extendedMapSeqlevels() I am looking into this. I have a few thoughts/ questions for you. The main idea of mapSeqlevels() is to map the current seqlevelStyle to other seqlevelStyles. Totally! I was also expecting mapSeqlevels() to return the same input if the specified 'style' was the same as the one currently being used. Returning the existing seqlevelStyle along with the mapped Seqlevel style sounds a little redundant. (given that the user is already supplying it while calling mapSeqlevels() so he already knows it!) For example, if I was working with Homo sapiens NCBI style and attempted to map to NCBI, I was expecting the same output as the input. why would you want to map it back to NCBI ? Please explain your use case more concretely. I agree with you that my example doesn't make sense if a user is manually calling mapSeqlevels(). In `derfinder`, several functions split the work by sequence name (chromosomes). Others take two GRanges objects and match them. Others create GRanges objects. Others load data from BAM or BigWig files. In my use case, I wanted to make sure things would work even if the user didn't make sure that (in the 2 GRanges case) that the objects had the same sequence naming style. That is why the user doesn't necessarily specify the `style` and I was using GenomeInfoDb to guess it via `seqlevelsStyle()` as well as mapping the sequence names to the default `style` to make sure the playing field was the same before finding overlaps, etc via `mapSeqlevels()`. This means that sometimes I am working with a single sequence name. Hence the `seqlevelsStyle('2')` example which is a different scenario from working with multiple sequence names. seqlevelsStyle('2') warning! Multiple seqnameStyles found. [1] NCBI TAIR10 MSU6 JGI2 AGPvF seqlevelsStyle(as.character(1:13)) [1] NCBI I'm aware that I could simply dodge using mapSeqlevels() when the default `style` is the same as the `style` currently used in the input as shown below: foo - function(seqnames, style = 'UCSC') { + if(seqlevelsStyle(seqnames) != style) { + res - mapSeqlevels(seqnames, style) + } else { + res - seqnames + } + return(res) + } ## Works with multiple seqnames foo(as.character(1:22)) 1 2 3 4 5 6 7 8 chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 9 10 11 12 13 14 15 16 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 17 18 19 20 21 22 chr17 chr18 chr19 chr20 chr21 chr22 ## You also have to guess the species in using a small number of sequence names ## Or use only the first result: seqlevelsStyle(seqnames)[1] foo('2') warning! Multiple seqnameStyles found. [1] chr2 Warning message: In if (seqlevelsStyle(seqnames) != style) { : the condition has length 1
[Bioc-devel] Documentation on how to update a BioC experimental package?
It's been a while since I worked with experimental packages. Where can I find documentation on how to (Subversion) update our AffymetrixDataTestFiles package with additional data files? All I know is that the SVN repository only contains a stub of the package and http://www.bioconductor.org/developers/package-guidelines/#package-types provides little information and basically only point to the devel mailing list. Thanks, Henrik ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Documentation on how to update a BioC experimental package?
There is a README.txt in the pkgs folder. I will attach it. I think this is accurate, but there may be something else on the site. On Thu, Oct 23, 2014 at 10:23 PM, Henrik Bengtsson h...@biostat.ucsf.edu wrote: It's been a while since I worked with experimental packages. Where can I find documentation on how to (Subversion) update our AffymetrixDataTestFiles package with additional data files? All I know is that the SVN repository only contains a stub of the package and http://www.bioconductor.org/developers/package-guidelines/#package-types provides little information and basically only point to the devel mailing list. Thanks, Henrik ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel BioC Experiment Data Package SVN Repos :Date: 2006-08-10 :Author: S. Falcon :svn URL: https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs Background == This svn directory contains BioC experiment data packages. Data packages contain potentially large binary files that do not change often. Most updates to these packages involve the package infrastructure files. Obtaining a working copy of a data package over a slow connection can be frustrating, especially when all that is needed is the infrastructure files and not the actual data. We have implemented a scheme that allows separate checkout of infrastructure files and data files. This document describes the scheme and provides instructions for checkout and update of existing data packages as well as for adding new packages. How to Create an Infrastructure-Only Workingcopy You can obtain a checkout of all experiment data package infrastructure files as follows:: svn checkout \ https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs To obtain the files for a particular package, say ``ALL``:: svn checkout \ https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs/ALL If you want to preview what is available, you might try the following:: # get the top-level scripts, but don't recurse into subdirs svn checkout -N \ https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs # see what is there svn ls https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs # get a particular package's infrastructure files cd pkgs svn up ALL # see next section for getting complete working copy w/ data How to Create a Complete Workingcopy First create a workingcopy of the infrastructure files as described above. Next use the helper script ``add_data.py`` (you will need Python). It is located here: https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs/add_data.py Here's a complete example for ``ALL``:: svn checkout \ https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs/ALL python add_data.py ALL This will add the big data directories (usually data/, but sometimes also dirs under inst/) to your working copy. Usually, the svn:ignore property has been set so that you won't accidentally add these dirs when working with the package, but please take care anyway. A note about committing changes to the data --- If you want to modify the actual data, cd into the appropriate dir after having run add_data.py and do your commit from there. The script adds a full working copy inside the infrastructure working copy. How to Add a New Data Package = 1. Add the infrastructure files under ``pkgs``. 2. Add any large data directories to ../data_store/PKGNAME/. For example, if there is large data in PKGNAME/data and PKGNAME/inst/extdata, you would add PKGNAME/data and PKGNAME/inst/extdata to ../data_store. 3. Create a file 'external_data_store.txt' listing each dir that is stored externally (each on a separate line). Contining the example above, the file would contain:: data inst/extdata This should go in the top-level of the package dir. 4. Add svn ignore properties. Continuing the example:: cd PKGNAME svn propset svn:ignore '*' ./data/ ## property 'svn:ignore' set on '.' ## in the data folder svn propset svn:ignore '*' ./inst/ or (this might not work anymore) svn propedit svn:ignore . ## add 'data' here svn propedit svn:ignore inst ## add 'extdata' here 5. Commit. Details of Storage Scheme = Experiment data package infrastructure files live in ``experiment/pkgs``. Package subdirectories that contain large files are stored under ``experiment/data_store``. There is no mechanism to support separate storage of individual files. Here is an example of how data for the
Re: [Bioc-devel] Documentation on how to update a BioC experimental package?
- Original Message - From: Vincent Carey st...@channing.harvard.edu To: Henrik Bengtsson h...@biostat.ucsf.edu Cc: bioC-devel bioc-de...@stat.math.ethz.ch Sent: Thursday, October 23, 2014 7:45:13 PM Subject: Re: [Bioc-devel] Documentation on how to update a BioC experimental package? There is a README.txt in the pkgs folder. I will attach it. I think this is accurate, but there may be something else on the site. See the Experiment Data Packages section of http://www.bioconductor.org/developers/how-to/source-control/ Dan On Thu, Oct 23, 2014 at 10:23 PM, Henrik Bengtsson h...@biostat.ucsf.edu wrote: It's been a while since I worked with experimental packages. Where can I find documentation on how to (Subversion) update our AffymetrixDataTestFiles package with additional data files? All I know is that the SVN repository only contains a stub of the package and http://www.bioconductor.org/developers/package-guidelines/#package-types provides little information and basically only point to the devel mailing list. Thanks, Henrik ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Documentation on how to update a BioC experimental package?
Hi Henrik, FWIW there is some additional information here: https://stat.ethz.ch/pipermail/bioc-devel/2014-January/005156.html In particular the section starting at One important difference with the svn software repository provides some useful hints on how to commit changes to the data (see step 4.). H. On 10/23/2014 07:46 PM, Dan Tenenbaum wrote: - Original Message - From: Vincent Carey st...@channing.harvard.edu To: Henrik Bengtsson h...@biostat.ucsf.edu Cc: bioC-devel bioc-de...@stat.math.ethz.ch Sent: Thursday, October 23, 2014 7:45:13 PM Subject: Re: [Bioc-devel] Documentation on how to update a BioC experimental package? There is a README.txt in the pkgs folder. I will attach it. I think this is accurate, but there may be something else on the site. See the Experiment Data Packages section of http://www.bioconductor.org/developers/how-to/source-control/ Dan On Thu, Oct 23, 2014 at 10:23 PM, Henrik Bengtsson h...@biostat.ucsf.edu wrote: It's been a while since I worked with experimental packages. Where can I find documentation on how to (Subversion) update our AffymetrixDataTestFiles package with additional data files? All I know is that the SVN repository only contains a stub of the package and http://www.bioconductor.org/developers/package-guidelines/#package-types provides little information and basically only point to the devel mailing list. Thanks, Henrik ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel