[Bioc-devel] VariantAnnotation::readVcf(fl, seqinfo(scanVcfHeader(fl)) problem

2014-10-23 Thread Robert Castelo

hi there,

in my package VariantFiltering i have an example VCF file from a Hapmap 
CEU trio including three chromosomes only to illustrate its vignette. 
i've come across a problem with the function readVcf() in 
VariantAnnotation that may be specific of the situation of a VCF file 
not having all chromosomes, but which it will be great for me if this 
could be addressed.


the problem is reproduced as follows:

===
library(VariantAnnotation)

fl - file.path(system.file(extdata, package=VariantFiltering), 
CEUtrio.vcf.bgz)


vcf - readVcf(fl, seqinfo(scanVcfHeader(fl)))
Error in GenomeInfoDb:::makeNewSeqnames(x, new2old = new2old, 
seqlevels(value)) :

  when 'new2old' is NULL, the first elements in the
  supplied 'seqlevels' must be identical to 'seqlevels(x)'


this is caused because although i'm providing the Seqinfo object derived 
from the header of the VCF file itself, at some point the ordering of 
the seqlevels between the header and the rest of the VCF file differs 
due to the smaller subset of chromosomes in the VCF file.


This can be easily fixed by replacing the line:

if (length(newsi)  length(oldsi)) {

within the .scanVcfToVCF() function in methods-readVcf.R, by

if (length(newsi) = length(oldsi)) {

this is happening both in release and devel. i'm pasting below my 
sessionInfo() for the release.


let me know if you think this fix is feasible or i'm wrongly using the 
function readVcf(). i'm basically trying to use readVcf() without having 
to figure out the appropriate value for the argument 'genome', i.e., 
without knowing beforehand what version of the genome was used to 
produce the VCF file.


thanks!!
robert.


sessionInfo()
R version 3.1.1 Patched (2014-10-13 r66751)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF8   LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF8LC_COLLATE=en_US.UTF8
 [5] LC_MONETARY=en_US.UTF8LC_MESSAGES=en_US.UTF8
 [7] LC_PAPER=en_US.UTF8   LC_NAME=C
 [9] LC_ADDRESS=C  LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4parallel  stats graphics  grDevices
[6] utils datasets  methods   base

other attached packages:
 [1] VariantAnnotation_1.12.0 Rsamtools_1.18.0
 [3] Biostrings_2.34.0XVector_0.6.0
 [5] GenomicRanges_1.18.0 GenomeInfoDb_1.2.0
 [7] IRanges_2.0.0S4Vectors_0.4.0
 [9] BiocGenerics_0.12.0  vimcom_1.0-0
[11] setwidth_1.0-3   colorout_1.0-3

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.28.0base64enc_0.1-2
 [3] BatchJobs_1.4   BBmisc_1.7
 [5] Biobase_2.26.0  BiocParallel_1.0.0
 [7] biomaRt_2.22.0  bitops_1.0-6
 [9] brew_1.0-6  BSgenome_1.34.0
[11] checkmate_1.4   codetools_0.2-9
[13] DBI_0.3.1   digest_0.6.4
[15] fail_1.2foreach_1.4.2
[17] GenomicAlignments_1.2.0 GenomicFeatures_1.18.0
[19] iterators_1.0.7 RCurl_1.95-4.3
[21] RSQLite_0.11.4  rtracklayer_1.26.0
[23] sendmailR_1.2-1 stringr_0.6.2
[25] tools_3.1.1 XML_3.98-1.1
[27] zlibbioc_1.12.0

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] VariantAnnotation::readVcf(fl, seqinfo(scanVcfHeader(fl)) problem

2014-10-23 Thread Valerie Obenchain

Hi Robert,

Thanks for the bug report and reproducible example. Now fixed in release 
1.12.2 and devel 1.13.4.


I've also updated the docs to better explain how the Seqinfo objects are 
propagated / merged when supplied as 'genome'.


Valerie


On 10/23/2014 06:45 AM, Robert Castelo wrote:

hi there,

in my package VariantFiltering i have an example VCF file from a Hapmap
CEU trio including three chromosomes only to illustrate its vignette.
i've come across a problem with the function readVcf() in
VariantAnnotation that may be specific of the situation of a VCF file
not having all chromosomes, but which it will be great for me if this
could be addressed.

the problem is reproduced as follows:

===
library(VariantAnnotation)

fl - file.path(system.file(extdata, package=VariantFiltering),
CEUtrio.vcf.bgz)

vcf - readVcf(fl, seqinfo(scanVcfHeader(fl)))
Error in GenomeInfoDb:::makeNewSeqnames(x, new2old = new2old,
seqlevels(value)) :
   when 'new2old' is NULL, the first elements in the
   supplied 'seqlevels' must be identical to 'seqlevels(x)'


this is caused because although i'm providing the Seqinfo object derived
from the header of the VCF file itself, at some point the ordering of
the seqlevels between the header and the rest of the VCF file differs
due to the smaller subset of chromosomes in the VCF file.

This can be easily fixed by replacing the line:

 if (length(newsi)  length(oldsi)) {

within the .scanVcfToVCF() function in methods-readVcf.R, by

 if (length(newsi) = length(oldsi)) {

this is happening both in release and devel. i'm pasting below my
sessionInfo() for the release.

let me know if you think this fix is feasible or i'm wrongly using the
function readVcf(). i'm basically trying to use readVcf() without having
to figure out the appropriate value for the argument 'genome', i.e.,
without knowing beforehand what version of the genome was used to
produce the VCF file.

thanks!!
robert.


sessionInfo()
R version 3.1.1 Patched (2014-10-13 r66751)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF8   LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF8LC_COLLATE=en_US.UTF8
  [5] LC_MONETARY=en_US.UTF8LC_MESSAGES=en_US.UTF8
  [7] LC_PAPER=en_US.UTF8   LC_NAME=C
  [9] LC_ADDRESS=C  LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4parallel  stats graphics  grDevices
[6] utils datasets  methods   base

other attached packages:
  [1] VariantAnnotation_1.12.0 Rsamtools_1.18.0
  [3] Biostrings_2.34.0XVector_0.6.0
  [5] GenomicRanges_1.18.0 GenomeInfoDb_1.2.0
  [7] IRanges_2.0.0S4Vectors_0.4.0
  [9] BiocGenerics_0.12.0  vimcom_1.0-0
[11] setwidth_1.0-3   colorout_1.0-3

loaded via a namespace (and not attached):
  [1] AnnotationDbi_1.28.0base64enc_0.1-2
  [3] BatchJobs_1.4   BBmisc_1.7
  [5] Biobase_2.26.0  BiocParallel_1.0.0
  [7] biomaRt_2.22.0  bitops_1.0-6
  [9] brew_1.0-6  BSgenome_1.34.0
[11] checkmate_1.4   codetools_0.2-9
[13] DBI_0.3.1   digest_0.6.4
[15] fail_1.2foreach_1.4.2
[17] GenomicAlignments_1.2.0 GenomicFeatures_1.18.0
[19] iterators_1.0.7 RCurl_1.95-4.3
[21] RSQLite_0.11.4  rtracklayer_1.26.0
[23] sendmailR_1.2-1 stringr_0.6.2
[25] tools_3.1.1 XML_3.98-1.1
[27] zlibbioc_1.12.0

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] [devteam-bioc] A more flexible GenomeInfoDb::mapSeqlevels(): used supported info but don't break with new organisms/toy examples

2014-10-23 Thread Leonardo Collado Torres
Hi,

I just found a tiny bug in .guessSpeciesStyle(). Basically, if the
style name has a dot, this function incorrectly returns the name of
the style.

See below:

## Lets guess the species and style for 'T'
 GenomeInfoDb:::.guessSpeciesStyle('T')
$species
[1] Populus trichocarpa

$style
[1] JGI2

## Same style with the higher lvl function
 seqlevelsStyle('T')
[1] JGI2

## Ok, style 'JGI2'

## In more dtail, this matches 'LGI' in NCBI style for this organism

 GenomeInfoDb:::.supportedSeqnameMappings()[Populus_trichocarpa]
$Populus_trichocarpa
 NCBI JGI2.F
1 LGI  T
2LGII  2
3   LGIII  3
4LGIV  4
5 LGV  5
6LGVI  6
7   LGVII  7
8  LGVIII  8
9LGIX  9
10LGX 10
11   LGXI 11
12  LGXII 12
13 LGXIII 13
14   LGIV 14
15   LGXV 15
16   LGVI 16
17  LGVII 17
18 LGVIII 18
19  LGXIX 19
20   Pltd   NA

## Guessing 'LGI' with the higher lvl function works
 seqlevelsStyle('LGI')
[1] NCBI

## Mapping from 'LGI' in 'NCBI' style to 'JGI2' style doesn't work
 mapSeqlevels('LGI', 'JGI2')
Error in mapSeqlevels(LGI, JGI2) :
  supplied seqname style JGI2 is not supported

## Using the correct 'style' name works.
## However a user might have expected it to work with 'JGI2' given
## that this was the output from seqlevelsStyle('T')
 mapSeqlevels('LGI', 'JGI2.F')
[1] T

## Although a user could find the correct name
 head(genomeStyles('Populus trichocarpa'), 1)
  circular auto   sex NCBI JGI2.F
1FALSE TRUE FALSE  LGI  T

 packageVersion('GenomeInfoDb')
[1] ‘1.3.3’

Cheers,
Leo

On Wed, Oct 22, 2014 at 7:03 PM, Leonardo Collado Torres
lcoll...@jhu.edu wrote:
 On Wed, Oct 22, 2014 at 6:59 PM, Leonardo Collado Torres
 lcoll...@jhu.edu wrote:
 Hi Sonali,

 The recent changes in GenomeInfoDb (1.3.3) look great!



 On Wed, Oct 22, 2014 at 1:53 PM, Sonali Arora sar...@fhcrc.org wrote:

 Hi Leo,

 To summarize, These are the concerns raised in your previous email :

 a) mapSeqlevels() should return current style along with other mapped 
 styles?
 look at extendedMapSeqlevels()

 I am looking into this. I have a few thoughts/ questions for you.

 The main idea of mapSeqlevels() is to map the current seqlevelStyle to 
 other seqlevelStyles.


 Totally!

 I was also expecting mapSeqlevels() to return the same input if the
 specified 'style' was the same as the one currently being used.


 Returning the existing seqlevelStyle along with the mapped Seqlevel style 
 sounds a
 little redundant. (given that the user is already supplying it while 
 calling mapSeqlevels()
 so he already knows it!)

  For
 example, if I was working with Homo sapiens NCBI style and attempted
 to map to NCBI, I was expecting the same output as the input.


 why would you want to map it back to NCBI ? Please explain your use case 
 more
 concretely.

 I agree with you that my example doesn't make sense if a user is
 manually calling mapSeqlevels().

 In `derfinder`, several functions split the work by sequence name
 (chromosomes). Others take two GRanges objects and match them. Others
 create GRanges objects. Others load data from BAM or BigWig files.

 In my use case, I wanted to make sure things would work even if the
 user didn't make sure that (in the 2 GRanges case) that the objects
 had the same sequence naming style. That is why the user doesn't
 necessarily specify the `style` and I was using GenomeInfoDb to guess
 it via `seqlevelsStyle()` as well as mapping the sequence names to the
 default `style` to make sure the playing field was the same before
 finding overlaps, etc via `mapSeqlevels()`.

 This means that sometimes I am working with a single sequence name.
 Hence the `seqlevelsStyle('2')` example which is a different scenario
 from working with multiple sequence names.

 seqlevelsStyle('2')
 warning! Multiple seqnameStyles found.
 [1] NCBI   TAIR10 MSU6   JGI2   AGPvF
 seqlevelsStyle(as.character(1:13))
 [1] NCBI

 I'm aware that I could simply dodge using mapSeqlevels() when the
 default `style` is the same as the `style` currently used in the input
 as shown below:

 foo - function(seqnames, style = 'UCSC') {
 + if(seqlevelsStyle(seqnames) != style) {
 + res - mapSeqlevels(seqnames, style)
 + } else {
 + res - seqnames
 + }
 + return(res)
 + }

 ## Works with multiple seqnames
 foo(as.character(1:22))
   1   2   3   4   5   6   7   8
  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8
   9  10  11  12  13  14  15  16
  chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16
  17  18  19  20  21  22
 chr17 chr18 chr19 chr20 chr21 chr22

 ## You also have to guess the species in using a small number of sequence 
 names
 ## Or use only the first result: seqlevelsStyle(seqnames)[1]
 foo('2')
 warning! Multiple seqnameStyles found.
 [1] chr2
 Warning message:
 In if (seqlevelsStyle(seqnames) != style) { :
   the condition has length  1 

[Bioc-devel] Documentation on how to update a BioC experimental package?

2014-10-23 Thread Henrik Bengtsson
It's been a while since I worked with experimental packages.  Where
can I find documentation on how to (Subversion) update our
AffymetrixDataTestFiles package with additional data files?  All I
know is that the SVN repository only contains a stub of the package
and http://www.bioconductor.org/developers/package-guidelines/#package-types
provides little information and basically only point to the devel
mailing list.

Thanks,

Henrik

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



Re: [Bioc-devel] Documentation on how to update a BioC experimental package?

2014-10-23 Thread Vincent Carey
There is a README.txt in the pkgs folder.

I will attach it.  I think this is accurate, but there may be something
else on the site.



On Thu, Oct 23, 2014 at 10:23 PM, Henrik Bengtsson h...@biostat.ucsf.edu
wrote:

 It's been a while since I worked with experimental packages.  Where
 can I find documentation on how to (Subversion) update our
 AffymetrixDataTestFiles package with additional data files?  All I
 know is that the SVN repository only contains a stub of the package
 and
 http://www.bioconductor.org/developers/package-guidelines/#package-types
 provides little information and basically only point to the devel
 mailing list.

 Thanks,

 Henrik

 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel


 BioC Experiment Data Package SVN Repos


:Date: 2006-08-10
:Author: S. Falcon
:svn URL: https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs


Background
==

This svn directory contains BioC experiment data packages.  Data
packages contain potentially large binary files that do not change
often.  Most updates to these packages involve the package
infrastructure files.

Obtaining a working copy of a data package over a slow connection can
be frustrating, especially when all that is needed is the
infrastructure files and not the actual data.  We have implemented a
scheme that allows separate checkout of infrastructure files and data
files.  This document describes the scheme and provides instructions
for checkout and update of existing data packages as well as for
adding new packages.


How to Create an Infrastructure-Only Workingcopy


You can obtain a checkout of all experiment data package
infrastructure files as follows::

svn checkout \
  https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs

To obtain the files for a particular package, say ``ALL``::

svn checkout \
  https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs/ALL

If you want to preview what is available, you might try the
following::

# get the top-level scripts, but don't recurse into subdirs
svn checkout -N \
  https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs

# see what is there
svn ls https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs

# get a particular package's infrastructure files
cd pkgs
svn up ALL
# see next section for getting complete working copy w/ data


How to Create a Complete Workingcopy


First create a workingcopy of the infrastructure files as described
above.

Next use the helper script ``add_data.py`` (you will need Python).  It
is located here:

https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs/add_data.py

Here's a complete example for ``ALL``::

svn checkout \
  https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs/ALL

python add_data.py ALL

This will add the big data directories (usually data/, but sometimes
also dirs under inst/) to your working copy.  Usually, the svn:ignore
property has been set so that you won't accidentally add these dirs
when working with the package, but please take care anyway.


A note about committing changes to the data
---

If you want to modify the actual data, cd into the appropriate dir
after having run add_data.py and do your commit from there.  The
script adds a full working copy inside the infrastructure working
copy.


How to Add a New Data Package
=

1. Add the infrastructure files under ``pkgs``.

2. Add any large data directories to ../data_store/PKGNAME/.  For
   example, if there is large data in PKGNAME/data and
   PKGNAME/inst/extdata, you would add PKGNAME/data and
   PKGNAME/inst/extdata to ../data_store.

3. Create a file 'external_data_store.txt' listing each dir that is
   stored externally (each on a separate line).  Contining the example
   above, the file would contain::

data
inst/extdata

   This should go in the top-level of the package dir.

4. Add svn ignore properties.  Continuing the example::

   cd PKGNAME
   svn propset svn:ignore '*' ./data/ ## property 'svn:ignore' set on '.'
  ## in the data folder 
   svn propset svn:ignore '*' ./inst/

   or (this might not work anymore)
   svn propedit svn:ignore .   ## add 'data' here
   svn propedit svn:ignore inst  ## add 'extdata' here

5. Commit.


Details of Storage Scheme
=

Experiment data package infrastructure files live in
``experiment/pkgs``.  Package subdirectories that contain large files
are stored under ``experiment/data_store``.  There is no mechanism to
support separate storage of individual files.

Here is an example of how data for the 

Re: [Bioc-devel] Documentation on how to update a BioC experimental package?

2014-10-23 Thread Dan Tenenbaum


- Original Message -
 From: Vincent Carey st...@channing.harvard.edu
 To: Henrik Bengtsson h...@biostat.ucsf.edu
 Cc: bioC-devel bioc-de...@stat.math.ethz.ch
 Sent: Thursday, October 23, 2014 7:45:13 PM
 Subject: Re: [Bioc-devel] Documentation on how to update a BioC experimental  
 package?
 
 There is a README.txt in the pkgs folder.
 
 I will attach it.  I think this is accurate, but there may be
 something
 else on the site.
 

See the Experiment Data Packages section of

http://www.bioconductor.org/developers/how-to/source-control/

Dan


 
 
 On Thu, Oct 23, 2014 at 10:23 PM, Henrik Bengtsson
 h...@biostat.ucsf.edu
 wrote:
 
  It's been a while since I worked with experimental packages.  Where
  can I find documentation on how to (Subversion) update our
  AffymetrixDataTestFiles package with additional data files?  All I
  know is that the SVN repository only contains a stub of the package
  and
  http://www.bioconductor.org/developers/package-guidelines/#package-types
  provides little information and basically only point to the devel
  mailing list.
 
  Thanks,
 
  Henrik
 
  ___
  Bioc-devel@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/bioc-devel
 
 
 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Documentation on how to update a BioC experimental package?

2014-10-23 Thread Hervé Pagès

Hi Henrik,

FWIW there is some additional information here:

  https://stat.ethz.ch/pipermail/bioc-devel/2014-January/005156.html

In particular the section starting at One important difference with
the svn software repository provides some useful hints on how to
commit changes to the data (see step 4.).

H.

On 10/23/2014 07:46 PM, Dan Tenenbaum wrote:



- Original Message -

From: Vincent Carey st...@channing.harvard.edu
To: Henrik Bengtsson h...@biostat.ucsf.edu
Cc: bioC-devel bioc-de...@stat.math.ethz.ch
Sent: Thursday, October 23, 2014 7:45:13 PM
Subject: Re: [Bioc-devel] Documentation on how to update a BioC experimental
package?

There is a README.txt in the pkgs folder.

I will attach it.  I think this is accurate, but there may be
something
else on the site.



See the Experiment Data Packages section of

http://www.bioconductor.org/developers/how-to/source-control/

Dan





On Thu, Oct 23, 2014 at 10:23 PM, Henrik Bengtsson
h...@biostat.ucsf.edu
wrote:


It's been a while since I worked with experimental packages.  Where
can I find documentation on how to (Subversion) update our
AffymetrixDataTestFiles package with additional data files?  All I
know is that the SVN repository only contains a stub of the package
and
http://www.bioconductor.org/developers/package-guidelines/#package-types
provides little information and basically only point to the devel
mailing list.

Thanks,

Henrik

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel