Re: [Bioc-devel] Importing classes into NAMESPACE

2015-02-25 Thread Gabe Becker
Karolis,

Do you really not need any of the methods for GRanges and ExpressionSet
objects? import(GenomicRanges) might be better, even though the package
isn't exactly small.

~G

On Wed, Feb 25, 2015 at 6:27 AM, Thomas Sandmann sandmann.tho...@gene.com
wrote:

 Hi Karolis,

 These classes have constructor functions of the same name as the class. For
 example, the constructor function for GRanges is called GRanges().

 If you use the constructors you need to import them separately, e.g.

 importFrom GenomicRanges GRanges

 Best,
 Thomas

 [[alternative HTML version deleted]]

 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel




-- 
Gabriel Becker, Ph.D
Computational Biologist
Genentech Research

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Importing classes into NAMESPACE

2015-02-25 Thread Michael Lawrence
On the surface, it sounds like you just need to import() the symbols for
those functions.

On Wed, Feb 25, 2015 at 3:00 AM, Karolis Uziela 
karolis.uzi...@scilifelab.se wrote:

 Hi,

 R 3.2 check results give this NOTE for my package (prebs):
 granges_from_cdf: no visible global function definition for ‘GRanges’
 granges_from_cdf: no visible global function definition for ‘Rle’
 granges_from_cdf: no visible global function definition for ‘IRanges’
 perform_rpa: no visible global function definition for ‘ExpressionSet’

 I get this note for all the classes that I use in my package. Does it mean
 that I incorrectly import classes into NAMESPACE file?

 In the NAMESPACE file I have:

 importClassesFrom(Biobase,ExpressionSet)

 importClassesFrom(GenomicRanges,GRanges)

 importClassesFrom(IRanges,IRanges)

 importClassesFrom(S4Vectors,Rle)
 Regards,
 Karolis

 --

 Karolis Uziela

 PhD student
 Science for Life Laboratory
 Box 1031
 17121 Solna, Sweden
 Mob. +46 729 120 395

 [[alternative HTML version deleted]]

 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel


[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] BamTallyParam argument 'which'

2015-02-25 Thread Gabe Becker
I think we need to be a little careful of trying to know the users
intentions better than they do here.  Reduce is a (very) easy operation to
do on a GRanges, so if the user didn't, are we really safe assuming they
meant to when passing the GRanges as a which?

I would argue for the samtools way, not because samtools does it (though
consistency is good) but because it allows the user to do more things,
while not making it that painful to do the thing that they might want most
often.

I agree with Michael that an additional argument might be a good middle
ground.

~G

On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres lcoll...@jhu.edu
wrote:

 Related to my post on a separate thread
 (https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
 I think that if 'which' is not being reduced by default, a simple
 example showing the effects of this could be included in the functions
 that have such an argument. Also note that 'reducing' could lead to
 unintended results.

 For example, in the help page for GenomicAlignments::readGAlignments,
 after the 'gal4' example it would be nice to add something like this:


 ## Note that if overlapping ranges are provided in 'which'
 ## reads could be selected more than once. This would artificually
 ## increase the coverage or affect other downstream results.
 ## If you 'reduce' the ranges, reads that originally overlapped
 ## two disjoint segments will be included.

 which_dups - RangesList(seq1=rep(IRanges(1000, 2000), 2),
 seq2=IRanges(c(100, 1000), c(1000, 2000)))
 param_dups - ScanBamParam(which=which_dups)
 param_reduced - ScanBamParam(which=reduce(which_dups))
 gal4_dups - readGAlignments(bamfile, param=param_dups)
 gal4_reduced - readGAlignments(bamfile, param=param_reduced)


 length(gal4)

 ## Duplicates some reads. In this case, all the ones between
 ## bases 1000 and 2000 on seq1.
 length(gal4_dups)

 ## Includes some reads that mapped around base 1000 in seq2
 ## that were excluded in gal4.
 length(gal4_reduced)








 Here's the output:



  library('GenomicAlignments')
 
  ## Code already included in ?readGAlignments
  bamfile - system.file(extdata, ex1.bam, package=Rsamtools,
 +mustWork=TRUE)
  which - RangesList(seq1=IRanges(1000, 2000),
 + seq2=IRanges(c(100, 1000), c(1000, 2000)))
  param - ScanBamParam(which=which)
  gal4 - readGAlignments(bamfile, param=param)
  gal4
 GAlignments object with 2404 alignments and 0 metadata columns:
  seqnames strand   cigarqwidth start   end
 width njunc
 Rle  Rle character integer integer integer
 integer integer
  [1] seq1  + 35M35   970  1004
35 0
  [2] seq1  + 35M35   971  1005
35 0
  [3] seq1  + 35M35   972  1006
35 0
  [4] seq1  + 35M35   973  1007
35 0
  [5] seq1  + 35M35   974  1008
35 0
  ...  ...... ...   ...   ...   ...
   ...   ...
   [2400] seq2  + 35M35  1524  1558
35 0
   [2401] seq2  + 35M35  1524  1558
35 0
   [2402] seq2  - 35M35  1528  1562
35 0
   [2403] seq2  - 35M35  1532  1566
35 0
   [2404] seq2  - 35M35  1533  1567
35 0
   ---
   seqinfo: 2 sequences from an unspecified genome
 
  ## Note that if overlapping ranges are provided in 'which'
  ## reads could be selected more than once. This would artificually
  ## increase the coverage or affect other downstream results.
  ## If you 'reduce' the ranges, reads that originally overlapped
  ## two disjoint segments will be included.
 
  which_dups - RangesList(seq1=rep(IRanges(1000, 2000), 2),
 + seq2=IRanges(c(100, 1000), c(1000, 2000)))
  param_dups - ScanBamParam(which=which_dups)
  param_reduced - ScanBamParam(which=reduce(which_dups))
  gal4_dups - readGAlignments(bamfile, param=param_dups)
  gal4_reduced - readGAlignments(bamfile, param=param_reduced)
 
 
  length(gal4)
 [1] 2404
 
  ## Duplicates some reads. In this case, all the ones between
  ## bases 1000 and 2000 on seq1.
  length(gal4_dups)
 [1] 3014
 
  ## Includes some reads that mapped around base 1000 in seq2
  ## that were excluded in gal4.
  length(gal4_reduced)
 [1] 2343
 
 
 
 
 
  options(width = 120)
  devtools::session_info()
 Session
 info---
  setting  value
  version  R Under development (unstable) (2014-11-01 r66923)
  system   x86_64, darwin10.8.0
  ui   AQUA
  language (EN)
  collate  en_US.UTF-8
  tz   America/New_York


 

Re: [Bioc-devel] show method for CompressedVRangesList-class

2015-02-25 Thread Robert Castelo
i see you point, the logic i was thinking about is to use a list of 
VRanges objects to hold separately the variants of multiple individuals, 
with one VRanges object per individual.


if i type the name of such a list object on the R shell, having the 
GRangesList show method, i feel i do not see much information because 
the screen just scrolls up tens or hundreds of lines specifiying 
variants per individual. however, the concise appearance of something 
like a VRangesList:


 vrl
VRangesList of length 10
names(32): S1 S2 S3 S4 ... S7 S8 S9 S10

at least suggests the user that the object holding the variants has 
information for 10 samples and belongs to the class 'VRangesList'.


i thought this made general sense but i'm fine if you feel this 
interpretation does not warrant such a change.


cheers,

robert.

On 02/25/2015 01:25 AM, Michael Lawrence wrote:

Why not have the SimpleVRangesList be shown like CompressedVRangesList,
for consistency with GRangesList? In other words, the opposite of what
you propose. A strong argument could also be made that a
SimpleGenomicRangesList should be shown like a GRangesList. Unless there
is some aversion to the more verbose output

On Tue, Feb 24, 2015 at 2:36 PM, Robert Castelo robert.cast...@upf.edu
mailto:robert.cast...@upf.edu wrote:

so, yes, but IMO rather than inheriting the show method from a
GRangesList, i think that the show method for CompressedVRangesList
objects should be inherited from a VRangesList object. right now
this is the situation:

library(VariantAnnotation)

example(VRangesList)
vrl
VRangesList of length 2
names(2): sampleA sampleB

cvrl - new(CompressedVRangesList, split(vr, sampleNames(vr)))
cvrl
CompressedVRangesList object of length 2:
$a
VRanges object with 1 range and 1 metadata column:
   seqnamesranges strand ref  alt
totalDepth   refDepth   altDepth
Rle IRanges Rle character characterOrRle integerOrRle
integerOrRle integerOrRle
   [1] chr1[1, 5]  +   T
C 12  5  7
 sampleNames softFilterMatrix | tumorSpecific
factorOrRle matrix | logical
   [1] a TRUE | FALSE

$b
VRanges object with 1 range and 1 metadata column:
   seqnames   ranges strand ref alt totalDepth refDepth altDepth
sampleNames softFilterMatrix |
   [1] chr2 [10, 20]  +   A   T 17   10
6   bFALSE |
   tumorSpecific
   [1]  TRUE

---
seqinfo: 2 sequences from an unspecified genome; no seqlengths

would it be possible to have the VRangesList show method for
CompressedVRangesList objects?

robert.



On 2/24/15 7:24 PM, Michael Lawrence wrote:

I think you might be missing an import. It should inherit the
method for GRangesList.

On Tue, Feb 24, 2015 at 9:53 AM, Robert Castelo
robert.cast...@upf.edu mailto:robert.cast...@upf.edu wrote:

hi,

i'm using the CompressedVRangesList class in VariantFiltering
to hold variants and their annotations across multiple samples
and found that there was no show method for this class (unless
i'm missing the right import here) so i made one within
VariantFiltering by copyingpasting from other similar classes:

setMethod(show, signature(object=CompressedVRangesList),
  function(object) {
lo - length(object)
cat(classNameForDisplay(object),  of length ,
lo, \n,
sep = )
if (!is.null(names(object)))
  cat(BiocGenerics:::labeledLine(names,
names(object)))
  })

i guess, however, that the right home for this would be
VariantAnnotation. let me know if you consider adding it there
(or somewhere else) and i'll remove it from VariantFiltering.

thanks,

robert.

___
Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org
mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel







--
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] show method for CompressedVRangesList-class

2015-02-25 Thread Robert Castelo
my current reason to prefer a CompressedVRangesList object over a 
SimpleVRangesList object is that i find one order of magnitude 
difference in creation time in each of these classes of objects:


library(VariantAnnotation)

fl - system.file(extdata, CEUtrio.vcf.bgz,
  package=VariantFiltering)

vcf - readVcf(fl, genome=hg19)
vr - as(vcf, VRanges)
length(vr)
[1] 15000

## create a VRangesList object
system.time(vrl - do.call(VRangesList, split(vr, sampleNames(vr
   user  system elapsed
  0.247   0.004   0.252

## create a CompressedVRangesList object
system.time(cvrl - new(CompressedVRangesList, split(vr, 
sampleNames(vr

   user  system elapsed
  0.019   0.000   0.019

0.252/0.019
[1] 13.26316

with a larger vcf differences increase:

[... load vcf, coerce to VRanges ...]
length(vr)
[1] 25916

system.time(vrl - do.call(VRangesList, split(vr, sampleNames(vr
   user  system elapsed
  2.672   0.000   2.676

system.time(cvrl - new(CompressedVRangesList, split(vr, 
sampleNames(vr

   user  system elapsed
  0.014   0.000   0.014

2.676 / 0.014
[1] 191.1429


so maybe i'm using the wrong way to construct a VRangesList object, but 
according to our last conversation about this, there was no obvious 
default fast way to do it, starting from a VRanges object:


https://stat.ethz.ch/pipermail/bioc-devel/2015-January/006905.html

it would be great if there's a fast way to do this kind of construction.

thanks,

robert.

On 02/25/2015 04:42 PM, Michael Lawrence wrote:

If you're storing data on a relatively small number of individuals (say,
hundreds), you should use SimpleVRangesList, not CompressedVRangesList.

On Wed, Feb 25, 2015 at 7:10 AM, Robert Castelo robert.cast...@upf.edu
mailto:robert.cast...@upf.edu wrote:

i see you point, the logic i was thinking about is to use a list of
VRanges objects to hold separately the variants of multiple
individuals, with one VRanges object per individual.

if i type the name of such a list object on the R shell, having the
GRangesList show method, i feel i do not see much information
because the screen just scrolls up tens or hundreds of lines
specifiying variants per individual. however, the concise appearance
of something like a VRangesList:

  vrl
VRangesList of length 10
names(32): S1 S2 S3 S4 ... S7 S8 S9 S10

at least suggests the user that the object holding the variants has
information for 10 samples and belongs to the class 'VRangesList'.

i thought this made general sense but i'm fine if you feel this
interpretation does not warrant such a change.

cheers,

robert.

On 02/25/2015 01:25 AM, Michael Lawrence wrote:

Why not have the SimpleVRangesList be shown like
CompressedVRangesList,
for consistency with GRangesList? In other words, the opposite
of what
you propose. A strong argument could also be made that a
SimpleGenomicRangesList should be shown like a GRangesList.
Unless there
is some aversion to the more verbose output

On Tue, Feb 24, 2015 at 2:36 PM, Robert Castelo
robert.cast...@upf.edu mailto:robert.cast...@upf.edu
mailto:robert.cast...@upf.edu
mailto:robert.cast...@upf.edu__ wrote:

 so, yes, but IMO rather than inheriting the show method from a
 GRangesList, i think that the show method for
CompressedVRangesList
 objects should be inherited from a VRangesList object.
right now
 this is the situation:

 library(VariantAnnotation)

 example(VRangesList)
 vrl
 VRangesList of length 2
 names(2): sampleA sampleB

 cvrl - new(CompressedVRangesList, split(vr,
sampleNames(vr)))
 cvrl
 CompressedVRangesList object of length 2:
 $a
 VRanges object with 1 range and 1 metadata column:
seqnamesranges strand ref  alt
 totalDepth   refDepth   altDepth
Rle IRanges Rle character characterOrRle integerOrRle
integerOrRle integerOrRle
[1] chr1[1, 5]  +   T
 C 12  5  7
  sampleNames softFilterMatrix | tumorSpecific
factorOrRle matrix | logical
[1] a TRUE | FALSE

 $b
 VRanges object with 1 range and 1 metadata column:
seqnames   ranges strand ref alt totalDepth refDepth
altDepth
 sampleNames softFilterMatrix |
[1] chr2 [10, 20]  +   A   T 17   10
 6   bFALSE |
tumorSpecific
[1]  TRUE

 ---
 seqinfo: 2 sequences from an unspecified genome; no 

Re: [Bioc-devel] show method for CompressedVRangesList-class

2015-02-25 Thread Michael Lawrence
Yea, I know, just need to get around to that one. Technically, it works,
but it's obviously not ideal.

On Wed, Feb 25, 2015 at 8:52 AM, Gabe Becker becker.g...@gene.com wrote:

 Why does splitting a VRanges give a GRangesList with VRanges objects as
 elements? Seems like it should return a VRangesList.

  spl = split(vr, sampleNames(vr))
  class(spl)
 [1] GRangesList
 attr(,package)
 [1] GenomicRanges
  class(spl[[1]])
 [1] VRanges
 attr(,package)
 [1] VariantAnnotation


 ~G

 On Wed, Feb 25, 2015 at 8:39 AM, Michael Lawrence 
 lawrence.mich...@gene.com wrote:

 Construction will take longer; the savings are in the accessing of the
 elements. But this seems like too much longer, so I will look into it.

 On Wed, Feb 25, 2015 at 8:12 AM, Robert Castelo robert.cast...@upf.edu
 wrote:

  my current reason to prefer a CompressedVRangesList object over a
  SimpleVRangesList object is that i find one order of magnitude
 difference
  in creation time in each of these classes of objects:
 
  library(VariantAnnotation)
 
  fl - system.file(extdata, CEUtrio.vcf.bgz,
package=VariantFiltering)
 
  vcf - readVcf(fl, genome=hg19)
  vr - as(vcf, VRanges)
  length(vr)
  [1] 15000
 
  ## create a VRangesList object
  system.time(vrl - do.call(VRangesList, split(vr, sampleNames(vr
 user  system elapsed
0.247   0.004   0.252
 
  ## create a CompressedVRangesList object
  system.time(cvrl - new(CompressedVRangesList, split(vr,
  sampleNames(vr
 user  system elapsed
0.019   0.000   0.019
 
  0.252/0.019
  [1] 13.26316
 
  with a larger vcf differences increase:
 
  [... load vcf, coerce to VRanges ...]
  length(vr)
  [1] 25916
 
  system.time(vrl - do.call(VRangesList, split(vr, sampleNames(vr
 user  system elapsed
2.672   0.000   2.676
 
  system.time(cvrl - new(CompressedVRangesList, split(vr,
  sampleNames(vr
 user  system elapsed
0.014   0.000   0.014
 
  2.676 / 0.014
  [1] 191.1429
 
 
  so maybe i'm using the wrong way to construct a VRangesList object, but
  according to our last conversation about this, there was no obvious
 default
  fast way to do it, starting from a VRanges object:
 
  https://stat.ethz.ch/pipermail/bioc-devel/2015-January/006905.html
 
  it would be great if there's a fast way to do this kind of construction.
 
  thanks,
 
  robert.
 
  On 02/25/2015 04:42 PM, Michael Lawrence wrote:
 
  If you're storing data on a relatively small number of individuals
 (say,
  hundreds), you should use SimpleVRangesList, not CompressedVRangesList.
 
  On Wed, Feb 25, 2015 at 7:10 AM, Robert Castelo 
 robert.cast...@upf.edu
  mailto:robert.cast...@upf.edu wrote:
 
  i see you point, the logic i was thinking about is to use a list of
  VRanges objects to hold separately the variants of multiple
  individuals, with one VRanges object per individual.
 
  if i type the name of such a list object on the R shell, having the
  GRangesList show method, i feel i do not see much information
  because the screen just scrolls up tens or hundreds of lines
  specifiying variants per individual. however, the concise
 appearance
  of something like a VRangesList:
 
vrl
  VRangesList of length 10
  names(32): S1 S2 S3 S4 ... S7 S8 S9 S10
 
  at least suggests the user that the object holding the variants has
  information for 10 samples and belongs to the class 'VRangesList'.
 
  i thought this made general sense but i'm fine if you feel this
  interpretation does not warrant such a change.
 
  cheers,
 
  robert.
 
  On 02/25/2015 01:25 AM, Michael Lawrence wrote:
 
  Why not have the SimpleVRangesList be shown like
  CompressedVRangesList,
  for consistency with GRangesList? In other words, the opposite
  of what
  you propose. A strong argument could also be made that a
  SimpleGenomicRangesList should be shown like a GRangesList.
  Unless there
  is some aversion to the more verbose output
 
  On Tue, Feb 24, 2015 at 2:36 PM, Robert Castelo
  robert.cast...@upf.edu mailto:robert.cast...@upf.edu
  mailto:robert.cast...@upf.edu
 
  mailto:robert.cast...@upf.edu__ wrote:
 
   so, yes, but IMO rather than inheriting the show method
 from
  a
   GRangesList, i think that the show method for
  CompressedVRangesList
   objects should be inherited from a VRangesList object.
  right now
   this is the situation:
 
   library(VariantAnnotation)
 
   example(VRangesList)
   vrl
   VRangesList of length 2
   names(2): sampleA sampleB
 
   cvrl - new(CompressedVRangesList, split(vr,
  sampleNames(vr)))
   cvrl
   CompressedVRangesList object of length 2:
   $a
   VRanges object with 1 range and 1 metadata 

Re: [Bioc-devel] show method for CompressedVRangesList-class

2015-02-25 Thread Michael Lawrence
Construction will take longer; the savings are in the accessing of the
elements. But this seems like too much longer, so I will look into it.

On Wed, Feb 25, 2015 at 8:12 AM, Robert Castelo robert.cast...@upf.edu
wrote:

 my current reason to prefer a CompressedVRangesList object over a
 SimpleVRangesList object is that i find one order of magnitude difference
 in creation time in each of these classes of objects:

 library(VariantAnnotation)

 fl - system.file(extdata, CEUtrio.vcf.bgz,
   package=VariantFiltering)

 vcf - readVcf(fl, genome=hg19)
 vr - as(vcf, VRanges)
 length(vr)
 [1] 15000

 ## create a VRangesList object
 system.time(vrl - do.call(VRangesList, split(vr, sampleNames(vr
user  system elapsed
   0.247   0.004   0.252

 ## create a CompressedVRangesList object
 system.time(cvrl - new(CompressedVRangesList, split(vr,
 sampleNames(vr
user  system elapsed
   0.019   0.000   0.019

 0.252/0.019
 [1] 13.26316

 with a larger vcf differences increase:

 [... load vcf, coerce to VRanges ...]
 length(vr)
 [1] 25916

 system.time(vrl - do.call(VRangesList, split(vr, sampleNames(vr
user  system elapsed
   2.672   0.000   2.676

 system.time(cvrl - new(CompressedVRangesList, split(vr,
 sampleNames(vr
user  system elapsed
   0.014   0.000   0.014

 2.676 / 0.014
 [1] 191.1429


 so maybe i'm using the wrong way to construct a VRangesList object, but
 according to our last conversation about this, there was no obvious default
 fast way to do it, starting from a VRanges object:

 https://stat.ethz.ch/pipermail/bioc-devel/2015-January/006905.html

 it would be great if there's a fast way to do this kind of construction.

 thanks,

 robert.

 On 02/25/2015 04:42 PM, Michael Lawrence wrote:

 If you're storing data on a relatively small number of individuals (say,
 hundreds), you should use SimpleVRangesList, not CompressedVRangesList.

 On Wed, Feb 25, 2015 at 7:10 AM, Robert Castelo robert.cast...@upf.edu
 mailto:robert.cast...@upf.edu wrote:

 i see you point, the logic i was thinking about is to use a list of
 VRanges objects to hold separately the variants of multiple
 individuals, with one VRanges object per individual.

 if i type the name of such a list object on the R shell, having the
 GRangesList show method, i feel i do not see much information
 because the screen just scrolls up tens or hundreds of lines
 specifiying variants per individual. however, the concise appearance
 of something like a VRangesList:

   vrl
 VRangesList of length 10
 names(32): S1 S2 S3 S4 ... S7 S8 S9 S10

 at least suggests the user that the object holding the variants has
 information for 10 samples and belongs to the class 'VRangesList'.

 i thought this made general sense but i'm fine if you feel this
 interpretation does not warrant such a change.

 cheers,

 robert.

 On 02/25/2015 01:25 AM, Michael Lawrence wrote:

 Why not have the SimpleVRangesList be shown like
 CompressedVRangesList,
 for consistency with GRangesList? In other words, the opposite
 of what
 you propose. A strong argument could also be made that a
 SimpleGenomicRangesList should be shown like a GRangesList.
 Unless there
 is some aversion to the more verbose output

 On Tue, Feb 24, 2015 at 2:36 PM, Robert Castelo
 robert.cast...@upf.edu mailto:robert.cast...@upf.edu
 mailto:robert.cast...@upf.edu

 mailto:robert.cast...@upf.edu__ wrote:

  so, yes, but IMO rather than inheriting the show method from
 a
  GRangesList, i think that the show method for
 CompressedVRangesList
  objects should be inherited from a VRangesList object.
 right now
  this is the situation:

  library(VariantAnnotation)

  example(VRangesList)
  vrl
  VRangesList of length 2
  names(2): sampleA sampleB

  cvrl - new(CompressedVRangesList, split(vr,
 sampleNames(vr)))
  cvrl
  CompressedVRangesList object of length 2:
  $a
  VRanges object with 1 range and 1 metadata column:
 seqnamesranges strand ref  alt
  totalDepth   refDepth   altDepth
 Rle IRanges Rle character characterOrRle integerOrRle
 integerOrRle integerOrRle
 [1] chr1[1, 5]  +   T
  C 12  5  7
   sampleNames softFilterMatrix | tumorSpecific
 factorOrRle matrix | logical
 [1] a TRUE | FALSE

  $b
  VRanges object with 1 range and 1 metadata column:
 seqnames   ranges strand ref alt totalDepth refDepth
 altDepth
   

Re: [Bioc-devel] Importing classes into NAMESPACE

2015-02-25 Thread Michael Lawrence
On Wed, Feb 25, 2015 at 9:13 AM, Gabe Becker becker.g...@gene.com wrote:

 On Wed, Feb 25, 2015 at 9:02 AM, Karolis Uziela 
 karolis.uzi...@scilifelab.se wrote:

  Thank you for your help everyone! Importing constructors separately (as
  Thomas suggested), has solved the problem.
 
  @Gabe: I am not using any methods from Biobase, GenomicRanges
  and S4Vectors. I am only using their constructors. Does it mean that I
 can
  skip importing the classes and only import the constructors? Or did you
  have something else in your mind?
 

 So the code in your package never uses the objects, only returning them
 directly to the user for their use? If so, I think that's actually a case
 where Depends is called for, since that functionality is not useful to the
 user if they don't have those classes and methods available.


Have them listed in Depends, but still import in the NAMESPACE, of course.




  I thought it is better to avoid importing whole packages, especially if
  they are large. Or do you have some arguments, why I should do that?
 


If GenomicRanges forms the foundation of your package, then it's more
practical to import(GenomicRanges). If you're only cherry-picking a few key
functions from a package, then individual importFrom() declarations better
describe the specifics of that dependency.


 I haven't seen timings that tell me that targeted importing is worth the
 extra maintenance cost when methods move around between packages, functions
 change names, etc. It's largely a preference thing AFAIK, but I like to
 make sure I get everything I will need without worrying about when I do an
 import, rather than needing to be sure of the exact subset of symbols I
 want to use before hand. This, of course, is less compelling when you
 really do only need one thing, but I don't find that to be particularly
 common situation myself.



 ~G


 
  Regards,
  Karolis
 
  On Wed, Feb 25, 2015 at 3:31 PM, Gabe Becker becker.g...@gene.com
 wrote:
 
  Karolis,
 
  Do you really not need any of the methods for GRanges and ExpressionSet
  objects? import(GenomicRanges) might be better, even though the package
  isn't exactly small.
 
  ~G
 
  On Wed, Feb 25, 2015 at 6:27 AM, Thomas Sandmann 
  sandmann.tho...@gene.com wrote:
 
  Hi Karolis,
 
  These classes have constructor functions of the same name as the class.
  For
  example, the constructor function for GRanges is called GRanges().
 
  If you use the constructors you need to import them separately, e.g.
 
  importFrom GenomicRanges GRanges
 
  Best,
  Thomas
 
  [[alternative HTML version deleted]]
 
  ___
  Bioc-devel@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/bioc-devel
 
 
 
 
  --
  Gabriel Becker, Ph.D
  Computational Biologist
  Genentech Research
 
 
 
 
  --
 
  Karolis Uziela
 
  PhD student
  Science for Life Laboratory
  Box 1031
  17121 Solna, Sweden
  Mob. +46 729 120 395
 



 --
 Gabriel Becker, Ph.D
 Computational Biologist
 Genentech Research

 [[alternative HTML version deleted]]

 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel


[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] show method for CompressedVRangesList-class

2015-02-25 Thread Michael Lawrence
I checked in a fix for the splitting to CompressedVRangesList. The slowness
of creating a SimpleVRangesList is due to the cost of extracting a VRanges
for each sample. Depending your exact use case, it might be better to pay
that cost up-front, instead of deferring it to when the user wants to
extract an element, which happens with the compressed list. As long as the
number of samples is small, the memory overhead should be minimal.

Michael

On Wed, Feb 25, 2015 at 9:59 AM, Michael Lawrence micha...@gene.com wrote:

 Yea, I know, just need to get around to that one. Technically, it works,
 but it's obviously not ideal.

 On Wed, Feb 25, 2015 at 8:52 AM, Gabe Becker becker.g...@gene.com wrote:

 Why does splitting a VRanges give a GRangesList with VRanges objects as
 elements? Seems like it should return a VRangesList.

  spl = split(vr, sampleNames(vr))
  class(spl)
 [1] GRangesList
 attr(,package)
 [1] GenomicRanges
  class(spl[[1]])
 [1] VRanges
 attr(,package)
 [1] VariantAnnotation


 ~G

 On Wed, Feb 25, 2015 at 8:39 AM, Michael Lawrence 
 lawrence.mich...@gene.com wrote:

 Construction will take longer; the savings are in the accessing of the
 elements. But this seems like too much longer, so I will look into it.

 On Wed, Feb 25, 2015 at 8:12 AM, Robert Castelo robert.cast...@upf.edu
 wrote:

  my current reason to prefer a CompressedVRangesList object over a
  SimpleVRangesList object is that i find one order of magnitude
 difference
  in creation time in each of these classes of objects:
 
  library(VariantAnnotation)
 
  fl - system.file(extdata, CEUtrio.vcf.bgz,
package=VariantFiltering)
 
  vcf - readVcf(fl, genome=hg19)
  vr - as(vcf, VRanges)
  length(vr)
  [1] 15000
 
  ## create a VRangesList object
  system.time(vrl - do.call(VRangesList, split(vr, sampleNames(vr
 user  system elapsed
0.247   0.004   0.252
 
  ## create a CompressedVRangesList object
  system.time(cvrl - new(CompressedVRangesList, split(vr,
  sampleNames(vr
 user  system elapsed
0.019   0.000   0.019
 
  0.252/0.019
  [1] 13.26316
 
  with a larger vcf differences increase:
 
  [... load vcf, coerce to VRanges ...]
  length(vr)
  [1] 25916
 
  system.time(vrl - do.call(VRangesList, split(vr, sampleNames(vr
 user  system elapsed
2.672   0.000   2.676
 
  system.time(cvrl - new(CompressedVRangesList, split(vr,
  sampleNames(vr
 user  system elapsed
0.014   0.000   0.014
 
  2.676 / 0.014
  [1] 191.1429
 
 
  so maybe i'm using the wrong way to construct a VRangesList object, but
  according to our last conversation about this, there was no obvious
 default
  fast way to do it, starting from a VRanges object:
 
  https://stat.ethz.ch/pipermail/bioc-devel/2015-January/006905.html
 
  it would be great if there's a fast way to do this kind of
 construction.
 
  thanks,
 
  robert.
 
  On 02/25/2015 04:42 PM, Michael Lawrence wrote:
 
  If you're storing data on a relatively small number of individuals
 (say,
  hundreds), you should use SimpleVRangesList, not
 CompressedVRangesList.
 
  On Wed, Feb 25, 2015 at 7:10 AM, Robert Castelo 
 robert.cast...@upf.edu
  mailto:robert.cast...@upf.edu wrote:
 
  i see you point, the logic i was thinking about is to use a list
 of
  VRanges objects to hold separately the variants of multiple
  individuals, with one VRanges object per individual.
 
  if i type the name of such a list object on the R shell, having
 the
  GRangesList show method, i feel i do not see much information
  because the screen just scrolls up tens or hundreds of lines
  specifiying variants per individual. however, the concise
 appearance
  of something like a VRangesList:
 
vrl
  VRangesList of length 10
  names(32): S1 S2 S3 S4 ... S7 S8 S9 S10
 
  at least suggests the user that the object holding the variants
 has
  information for 10 samples and belongs to the class 'VRangesList'.
 
  i thought this made general sense but i'm fine if you feel this
  interpretation does not warrant such a change.
 
  cheers,
 
  robert.
 
  On 02/25/2015 01:25 AM, Michael Lawrence wrote:
 
  Why not have the SimpleVRangesList be shown like
  CompressedVRangesList,
  for consistency with GRangesList? In other words, the opposite
  of what
  you propose. A strong argument could also be made that a
  SimpleGenomicRangesList should be shown like a GRangesList.
  Unless there
  is some aversion to the more verbose output
 
  On Tue, Feb 24, 2015 at 2:36 PM, Robert Castelo
  robert.cast...@upf.edu mailto:robert.cast...@upf.edu
  mailto:robert.cast...@upf.edu
 
  mailto:robert.cast...@upf.edu__ wrote:
 
   so, yes, but IMO rather than inheriting the show method
 from
  a
   GRangesList, i think that the show method for
  CompressedVRangesList
   objects 

Re: [Bioc-devel] BamTallyParam argument 'which'

2015-02-25 Thread Hervé Pagès

FWIW, and after checking how pileup() handles this, I thought I should
report:

  library(GenomicAlignments)
  bamfile - system.file(extdata, ex1.bam, package=Rsamtools)

Then:

   stackStringsFromBam(bamfile, param=seq1:1-6)
A DNAStringSet instance of length 4
  width seq
  [1] 6 CACTAG
  [2] 6 ++CTAG
  [3] 6 AG
  [4] 6 +G

   which - GRanges(seq1, IRanges(c(1, 5, 1), c(2, 6, 3)))
   pileup(bamfile, scanBamParam=ScanBamParam(which=which))
seqnames pos strand nucleotide count which_label
  1 seq1   1  +  C 1seq1:1-2
  2 seq1   2  +  A 1seq1:1-2
  3 seq1   5  +  A 3seq1:5-6
  4 seq1   6  +  G 4seq1:5-6
  5 seq1   1  +  C 1seq1:1-3
  6 seq1   2  +  A 1seq1:1-3
  7 seq1   3  +  C 2seq1:1-3

This output looks clean to me: (a) Nucleotide positions that belong
to more than 1 range in 'which' are treated as distinct positions,
and (b) reads that overlap with 2 non-overlapping ranges (e.g.
read #1 and ranges #1 and #2) only contribute at most once at each
position covered by these ranges.

H.


On 02/25/2015 01:21 PM, Hervé Pagès wrote:

Hi,

On 02/25/2015 06:37 AM, Gabe Becker wrote:

I think we need to be a little careful of trying to know the users
intentions better than they do here.  Reduce is a (very) easy operation
to do on a GRanges, so if the user didn't, are we really safe assuming
they meant to when passing the GRanges as a which?

I would argue for the samtools way, not because samtools does it
(though consistency is good) but because it allows the user to do more
things, while not making it that painful to do the thing that they might
want most often.

I agree with Michael that an additional argument might be a good middle
ground.


Maybe another good middle ground is to issue a warning when 'which'
contains overlapping ranges. The warning could suggest to the user
to reduce() the ranges first. Maybe the warning should also point
out that reducing doesn't completely eliminate the risk of selecting
the same record more than once (at least that's the case for the
readGAlignment* functions, but I don't know if it holds for
BamTallyParam). The risk is actually higher with paired-end reads
where the same pair is selected once for each range in 'which' that
overlaps with at least one of the 2 mates in the pair.

But as already discussed here (or maybe it was on the old bioconductor
list, don't remember), better solutions to the duplicated selection
problem could be achieved via one of the following:

   (1) Expose some sort of unique id for the records in a BAM file.
   I agree with Michael that duplicated selection is incompatible
   with summarization tools like BamTallyParam or pileup(). Having
   access to a unique id would completely solve the problem.

   (2) Introduce an argument similar to 'which' but with a slightly
   different semantic e.g. select records that *start* in one of
   the specified ranges (for single-end reads), or select pairs of
   records for which the *first* mate starts in one of the specified
   ranges.
   Advantages of this semantic:

 (a) If the ranges don't overlap, then no duplicates.

 (b) It can be used in the context of tiling the genome and
 processing the reads by tile. This plays well with
 parallelization (the semantic of 'which' does not).

   Inconvenient: the arbitrary nature of the selection criteria
   and its lack of symmetry is incompatible with summarization
   tools like BamTallyParam or pileup().

Note that (1) and (2) are not exclusive.

Cheers,
H.



~G

On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres
lcoll...@jhu.edu mailto:lcoll...@jhu.edu wrote:

Related to my post on a separate thread

(https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
I think that if 'which' is not being reduced by default, a simple
example showing the effects of this could be included in the
functions
that have such an argument. Also note that 'reducing' could lead to
unintended results.

For example, in the help page for GenomicAlignments::readGAlignments,
after the 'gal4' example it would be nice to add something like this:


## Note that if overlapping ranges are provided in 'which'
## reads could be selected more than once. This would artificually
## increase the coverage or affect other downstream results.
## If you 'reduce' the ranges, reads that originally overlapped
## two disjoint segments will be included.

which_dups - RangesList(seq1=rep(IRanges(1000, 2000), 2),
 seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups - ScanBamParam(which=which_dups)
param_reduced - ScanBamParam(which=reduce(which_dups))
gal4_dups - readGAlignments(bamfile, param=param_dups)
gal4_reduced - 

Re: [Bioc-devel] BamTallyParam argument 'which'

2015-02-25 Thread Hervé Pagès

Hi,

On 02/25/2015 06:37 AM, Gabe Becker wrote:

I think we need to be a little careful of trying to know the users
intentions better than they do here.  Reduce is a (very) easy operation
to do on a GRanges, so if the user didn't, are we really safe assuming
they meant to when passing the GRanges as a which?

I would argue for the samtools way, not because samtools does it
(though consistency is good) but because it allows the user to do more
things, while not making it that painful to do the thing that they might
want most often.

I agree with Michael that an additional argument might be a good middle
ground.


Maybe another good middle ground is to issue a warning when 'which'
contains overlapping ranges. The warning could suggest to the user
to reduce() the ranges first. Maybe the warning should also point
out that reducing doesn't completely eliminate the risk of selecting
the same record more than once (at least that's the case for the
readGAlignment* functions, but I don't know if it holds for
BamTallyParam). The risk is actually higher with paired-end reads
where the same pair is selected once for each range in 'which' that
overlaps with at least one of the 2 mates in the pair.

But as already discussed here (or maybe it was on the old bioconductor
list, don't remember), better solutions to the duplicated selection
problem could be achieved via one of the following:

  (1) Expose some sort of unique id for the records in a BAM file.
  I agree with Michael that duplicated selection is incompatible
  with summarization tools like BamTallyParam or pileup(). Having
  access to a unique id would completely solve the problem.

  (2) Introduce an argument similar to 'which' but with a slightly
  different semantic e.g. select records that *start* in one of
  the specified ranges (for single-end reads), or select pairs of
  records for which the *first* mate starts in one of the specified
  ranges.
  Advantages of this semantic:

(a) If the ranges don't overlap, then no duplicates.

(b) It can be used in the context of tiling the genome and
processing the reads by tile. This plays well with
parallelization (the semantic of 'which' does not).

  Inconvenient: the arbitrary nature of the selection criteria
  and its lack of symmetry is incompatible with summarization
  tools like BamTallyParam or pileup().

Note that (1) and (2) are not exclusive.

Cheers,
H.



~G

On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres
lcoll...@jhu.edu mailto:lcoll...@jhu.edu wrote:

Related to my post on a separate thread
(https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
I think that if 'which' is not being reduced by default, a simple
example showing the effects of this could be included in the functions
that have such an argument. Also note that 'reducing' could lead to
unintended results.

For example, in the help page for GenomicAlignments::readGAlignments,
after the 'gal4' example it would be nice to add something like this:


## Note that if overlapping ranges are provided in 'which'
## reads could be selected more than once. This would artificually
## increase the coverage or affect other downstream results.
## If you 'reduce' the ranges, reads that originally overlapped
## two disjoint segments will be included.

which_dups - RangesList(seq1=rep(IRanges(1000, 2000), 2),
 seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups - ScanBamParam(which=which_dups)
param_reduced - ScanBamParam(which=reduce(which_dups))
gal4_dups - readGAlignments(bamfile, param=param_dups)
gal4_reduced - readGAlignments(bamfile, param=param_reduced)


length(gal4)

## Duplicates some reads. In this case, all the ones between
## bases 1000 and 2000 on seq1.
length(gal4_dups)

## Includes some reads that mapped around base 1000 in seq2
## that were excluded in gal4.
length(gal4_reduced)








Here's the output:



  library('GenomicAlignments')
 
  ## Code already included in ?readGAlignments
  bamfile - system.file(extdata, ex1.bam, package=Rsamtools,
+mustWork=TRUE)
  which - RangesList(seq1=IRanges(1000, 2000),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
  param - ScanBamParam(which=which)
  gal4 - readGAlignments(bamfile, param=param)
  gal4
GAlignments object with 2404 alignments and 0 metadata columns:
  seqnames strand   cigarqwidth start   end
width njunc
 Rle  Rle character integer integer integer
integer integer
  [1] seq1  + 35M35   970  1004
35 0
  [2] seq1  + 35M35   971  1005
35 0
  [3] seq1  + 35M35   972  1006