Re: [Bioc-devel] Importing classes into NAMESPACE
Karolis, Do you really not need any of the methods for GRanges and ExpressionSet objects? import(GenomicRanges) might be better, even though the package isn't exactly small. ~G On Wed, Feb 25, 2015 at 6:27 AM, Thomas Sandmann sandmann.tho...@gene.com wrote: Hi Karolis, These classes have constructor functions of the same name as the class. For example, the constructor function for GRanges is called GRanges(). If you use the constructors you need to import them separately, e.g. importFrom GenomicRanges GRanges Best, Thomas [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Gabriel Becker, Ph.D Computational Biologist Genentech Research [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Importing classes into NAMESPACE
On the surface, it sounds like you just need to import() the symbols for those functions. On Wed, Feb 25, 2015 at 3:00 AM, Karolis Uziela karolis.uzi...@scilifelab.se wrote: Hi, R 3.2 check results give this NOTE for my package (prebs): granges_from_cdf: no visible global function definition for ‘GRanges’ granges_from_cdf: no visible global function definition for ‘Rle’ granges_from_cdf: no visible global function definition for ‘IRanges’ perform_rpa: no visible global function definition for ‘ExpressionSet’ I get this note for all the classes that I use in my package. Does it mean that I incorrectly import classes into NAMESPACE file? In the NAMESPACE file I have: importClassesFrom(Biobase,ExpressionSet) importClassesFrom(GenomicRanges,GRanges) importClassesFrom(IRanges,IRanges) importClassesFrom(S4Vectors,Rle) Regards, Karolis -- Karolis Uziela PhD student Science for Life Laboratory Box 1031 17121 Solna, Sweden Mob. +46 729 120 395 [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] BamTallyParam argument 'which'
I think we need to be a little careful of trying to know the users intentions better than they do here. Reduce is a (very) easy operation to do on a GRanges, so if the user didn't, are we really safe assuming they meant to when passing the GRanges as a which? I would argue for the samtools way, not because samtools does it (though consistency is good) but because it allows the user to do more things, while not making it that painful to do the thing that they might want most often. I agree with Michael that an additional argument might be a good middle ground. ~G On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres lcoll...@jhu.edu wrote: Related to my post on a separate thread (https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html), I think that if 'which' is not being reduced by default, a simple example showing the effects of this could be included in the functions that have such an argument. Also note that 'reducing' could lead to unintended results. For example, in the help page for GenomicAlignments::readGAlignments, after the 'gal4' example it would be nice to add something like this: ## Note that if overlapping ranges are provided in 'which' ## reads could be selected more than once. This would artificually ## increase the coverage or affect other downstream results. ## If you 'reduce' the ranges, reads that originally overlapped ## two disjoint segments will be included. which_dups - RangesList(seq1=rep(IRanges(1000, 2000), 2), seq2=IRanges(c(100, 1000), c(1000, 2000))) param_dups - ScanBamParam(which=which_dups) param_reduced - ScanBamParam(which=reduce(which_dups)) gal4_dups - readGAlignments(bamfile, param=param_dups) gal4_reduced - readGAlignments(bamfile, param=param_reduced) length(gal4) ## Duplicates some reads. In this case, all the ones between ## bases 1000 and 2000 on seq1. length(gal4_dups) ## Includes some reads that mapped around base 1000 in seq2 ## that were excluded in gal4. length(gal4_reduced) Here's the output: library('GenomicAlignments') ## Code already included in ?readGAlignments bamfile - system.file(extdata, ex1.bam, package=Rsamtools, +mustWork=TRUE) which - RangesList(seq1=IRanges(1000, 2000), + seq2=IRanges(c(100, 1000), c(1000, 2000))) param - ScanBamParam(which=which) gal4 - readGAlignments(bamfile, param=param) gal4 GAlignments object with 2404 alignments and 0 metadata columns: seqnames strand cigarqwidth start end width njunc Rle Rle character integer integer integer integer integer [1] seq1 + 35M35 970 1004 35 0 [2] seq1 + 35M35 971 1005 35 0 [3] seq1 + 35M35 972 1006 35 0 [4] seq1 + 35M35 973 1007 35 0 [5] seq1 + 35M35 974 1008 35 0 ... ...... ... ... ... ... ... ... [2400] seq2 + 35M35 1524 1558 35 0 [2401] seq2 + 35M35 1524 1558 35 0 [2402] seq2 - 35M35 1528 1562 35 0 [2403] seq2 - 35M35 1532 1566 35 0 [2404] seq2 - 35M35 1533 1567 35 0 --- seqinfo: 2 sequences from an unspecified genome ## Note that if overlapping ranges are provided in 'which' ## reads could be selected more than once. This would artificually ## increase the coverage or affect other downstream results. ## If you 'reduce' the ranges, reads that originally overlapped ## two disjoint segments will be included. which_dups - RangesList(seq1=rep(IRanges(1000, 2000), 2), + seq2=IRanges(c(100, 1000), c(1000, 2000))) param_dups - ScanBamParam(which=which_dups) param_reduced - ScanBamParam(which=reduce(which_dups)) gal4_dups - readGAlignments(bamfile, param=param_dups) gal4_reduced - readGAlignments(bamfile, param=param_reduced) length(gal4) [1] 2404 ## Duplicates some reads. In this case, all the ones between ## bases 1000 and 2000 on seq1. length(gal4_dups) [1] 3014 ## Includes some reads that mapped around base 1000 in seq2 ## that were excluded in gal4. length(gal4_reduced) [1] 2343 options(width = 120) devtools::session_info() Session info--- setting value version R Under development (unstable) (2014-11-01 r66923) system x86_64, darwin10.8.0 ui AQUA language (EN) collate en_US.UTF-8 tz America/New_York
Re: [Bioc-devel] show method for CompressedVRangesList-class
i see you point, the logic i was thinking about is to use a list of VRanges objects to hold separately the variants of multiple individuals, with one VRanges object per individual. if i type the name of such a list object on the R shell, having the GRangesList show method, i feel i do not see much information because the screen just scrolls up tens or hundreds of lines specifiying variants per individual. however, the concise appearance of something like a VRangesList: vrl VRangesList of length 10 names(32): S1 S2 S3 S4 ... S7 S8 S9 S10 at least suggests the user that the object holding the variants has information for 10 samples and belongs to the class 'VRangesList'. i thought this made general sense but i'm fine if you feel this interpretation does not warrant such a change. cheers, robert. On 02/25/2015 01:25 AM, Michael Lawrence wrote: Why not have the SimpleVRangesList be shown like CompressedVRangesList, for consistency with GRangesList? In other words, the opposite of what you propose. A strong argument could also be made that a SimpleGenomicRangesList should be shown like a GRangesList. Unless there is some aversion to the more verbose output On Tue, Feb 24, 2015 at 2:36 PM, Robert Castelo robert.cast...@upf.edu mailto:robert.cast...@upf.edu wrote: so, yes, but IMO rather than inheriting the show method from a GRangesList, i think that the show method for CompressedVRangesList objects should be inherited from a VRangesList object. right now this is the situation: library(VariantAnnotation) example(VRangesList) vrl VRangesList of length 2 names(2): sampleA sampleB cvrl - new(CompressedVRangesList, split(vr, sampleNames(vr))) cvrl CompressedVRangesList object of length 2: $a VRanges object with 1 range and 1 metadata column: seqnamesranges strand ref alt totalDepth refDepth altDepth Rle IRanges Rle character characterOrRle integerOrRle integerOrRle integerOrRle [1] chr1[1, 5] + T C 12 5 7 sampleNames softFilterMatrix | tumorSpecific factorOrRle matrix | logical [1] a TRUE | FALSE $b VRanges object with 1 range and 1 metadata column: seqnames ranges strand ref alt totalDepth refDepth altDepth sampleNames softFilterMatrix | [1] chr2 [10, 20] + A T 17 10 6 bFALSE | tumorSpecific [1] TRUE --- seqinfo: 2 sequences from an unspecified genome; no seqlengths would it be possible to have the VRangesList show method for CompressedVRangesList objects? robert. On 2/24/15 7:24 PM, Michael Lawrence wrote: I think you might be missing an import. It should inherit the method for GRangesList. On Tue, Feb 24, 2015 at 9:53 AM, Robert Castelo robert.cast...@upf.edu mailto:robert.cast...@upf.edu wrote: hi, i'm using the CompressedVRangesList class in VariantFiltering to hold variants and their annotations across multiple samples and found that there was no show method for this class (unless i'm missing the right import here) so i made one within VariantFiltering by copyingpasting from other similar classes: setMethod(show, signature(object=CompressedVRangesList), function(object) { lo - length(object) cat(classNameForDisplay(object), of length , lo, \n, sep = ) if (!is.null(names(object))) cat(BiocGenerics:::labeledLine(names, names(object))) }) i guess, however, that the right home for this would be VariantAnnotation. let me know if you consider adding it there (or somewhere else) and i'll remove it from VariantFiltering. thanks, robert. ___ Bioc-devel@r-project.org mailto:Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Robert Castelo, PhD Associate Professor Dept. of Experimental and Health Sciences Universitat Pompeu Fabra (UPF) Barcelona Biomedical Research Park (PRBB) Dr Aiguader 88 E-08003 Barcelona, Spain telf: +34.933.160.514 fax: +34.933.160.550 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] show method for CompressedVRangesList-class
my current reason to prefer a CompressedVRangesList object over a SimpleVRangesList object is that i find one order of magnitude difference in creation time in each of these classes of objects: library(VariantAnnotation) fl - system.file(extdata, CEUtrio.vcf.bgz, package=VariantFiltering) vcf - readVcf(fl, genome=hg19) vr - as(vcf, VRanges) length(vr) [1] 15000 ## create a VRangesList object system.time(vrl - do.call(VRangesList, split(vr, sampleNames(vr user system elapsed 0.247 0.004 0.252 ## create a CompressedVRangesList object system.time(cvrl - new(CompressedVRangesList, split(vr, sampleNames(vr user system elapsed 0.019 0.000 0.019 0.252/0.019 [1] 13.26316 with a larger vcf differences increase: [... load vcf, coerce to VRanges ...] length(vr) [1] 25916 system.time(vrl - do.call(VRangesList, split(vr, sampleNames(vr user system elapsed 2.672 0.000 2.676 system.time(cvrl - new(CompressedVRangesList, split(vr, sampleNames(vr user system elapsed 0.014 0.000 0.014 2.676 / 0.014 [1] 191.1429 so maybe i'm using the wrong way to construct a VRangesList object, but according to our last conversation about this, there was no obvious default fast way to do it, starting from a VRanges object: https://stat.ethz.ch/pipermail/bioc-devel/2015-January/006905.html it would be great if there's a fast way to do this kind of construction. thanks, robert. On 02/25/2015 04:42 PM, Michael Lawrence wrote: If you're storing data on a relatively small number of individuals (say, hundreds), you should use SimpleVRangesList, not CompressedVRangesList. On Wed, Feb 25, 2015 at 7:10 AM, Robert Castelo robert.cast...@upf.edu mailto:robert.cast...@upf.edu wrote: i see you point, the logic i was thinking about is to use a list of VRanges objects to hold separately the variants of multiple individuals, with one VRanges object per individual. if i type the name of such a list object on the R shell, having the GRangesList show method, i feel i do not see much information because the screen just scrolls up tens or hundreds of lines specifiying variants per individual. however, the concise appearance of something like a VRangesList: vrl VRangesList of length 10 names(32): S1 S2 S3 S4 ... S7 S8 S9 S10 at least suggests the user that the object holding the variants has information for 10 samples and belongs to the class 'VRangesList'. i thought this made general sense but i'm fine if you feel this interpretation does not warrant such a change. cheers, robert. On 02/25/2015 01:25 AM, Michael Lawrence wrote: Why not have the SimpleVRangesList be shown like CompressedVRangesList, for consistency with GRangesList? In other words, the opposite of what you propose. A strong argument could also be made that a SimpleGenomicRangesList should be shown like a GRangesList. Unless there is some aversion to the more verbose output On Tue, Feb 24, 2015 at 2:36 PM, Robert Castelo robert.cast...@upf.edu mailto:robert.cast...@upf.edu mailto:robert.cast...@upf.edu mailto:robert.cast...@upf.edu__ wrote: so, yes, but IMO rather than inheriting the show method from a GRangesList, i think that the show method for CompressedVRangesList objects should be inherited from a VRangesList object. right now this is the situation: library(VariantAnnotation) example(VRangesList) vrl VRangesList of length 2 names(2): sampleA sampleB cvrl - new(CompressedVRangesList, split(vr, sampleNames(vr))) cvrl CompressedVRangesList object of length 2: $a VRanges object with 1 range and 1 metadata column: seqnamesranges strand ref alt totalDepth refDepth altDepth Rle IRanges Rle character characterOrRle integerOrRle integerOrRle integerOrRle [1] chr1[1, 5] + T C 12 5 7 sampleNames softFilterMatrix | tumorSpecific factorOrRle matrix | logical [1] a TRUE | FALSE $b VRanges object with 1 range and 1 metadata column: seqnames ranges strand ref alt totalDepth refDepth altDepth sampleNames softFilterMatrix | [1] chr2 [10, 20] + A T 17 10 6 bFALSE | tumorSpecific [1] TRUE --- seqinfo: 2 sequences from an unspecified genome; no
Re: [Bioc-devel] show method for CompressedVRangesList-class
Yea, I know, just need to get around to that one. Technically, it works, but it's obviously not ideal. On Wed, Feb 25, 2015 at 8:52 AM, Gabe Becker becker.g...@gene.com wrote: Why does splitting a VRanges give a GRangesList with VRanges objects as elements? Seems like it should return a VRangesList. spl = split(vr, sampleNames(vr)) class(spl) [1] GRangesList attr(,package) [1] GenomicRanges class(spl[[1]]) [1] VRanges attr(,package) [1] VariantAnnotation ~G On Wed, Feb 25, 2015 at 8:39 AM, Michael Lawrence lawrence.mich...@gene.com wrote: Construction will take longer; the savings are in the accessing of the elements. But this seems like too much longer, so I will look into it. On Wed, Feb 25, 2015 at 8:12 AM, Robert Castelo robert.cast...@upf.edu wrote: my current reason to prefer a CompressedVRangesList object over a SimpleVRangesList object is that i find one order of magnitude difference in creation time in each of these classes of objects: library(VariantAnnotation) fl - system.file(extdata, CEUtrio.vcf.bgz, package=VariantFiltering) vcf - readVcf(fl, genome=hg19) vr - as(vcf, VRanges) length(vr) [1] 15000 ## create a VRangesList object system.time(vrl - do.call(VRangesList, split(vr, sampleNames(vr user system elapsed 0.247 0.004 0.252 ## create a CompressedVRangesList object system.time(cvrl - new(CompressedVRangesList, split(vr, sampleNames(vr user system elapsed 0.019 0.000 0.019 0.252/0.019 [1] 13.26316 with a larger vcf differences increase: [... load vcf, coerce to VRanges ...] length(vr) [1] 25916 system.time(vrl - do.call(VRangesList, split(vr, sampleNames(vr user system elapsed 2.672 0.000 2.676 system.time(cvrl - new(CompressedVRangesList, split(vr, sampleNames(vr user system elapsed 0.014 0.000 0.014 2.676 / 0.014 [1] 191.1429 so maybe i'm using the wrong way to construct a VRangesList object, but according to our last conversation about this, there was no obvious default fast way to do it, starting from a VRanges object: https://stat.ethz.ch/pipermail/bioc-devel/2015-January/006905.html it would be great if there's a fast way to do this kind of construction. thanks, robert. On 02/25/2015 04:42 PM, Michael Lawrence wrote: If you're storing data on a relatively small number of individuals (say, hundreds), you should use SimpleVRangesList, not CompressedVRangesList. On Wed, Feb 25, 2015 at 7:10 AM, Robert Castelo robert.cast...@upf.edu mailto:robert.cast...@upf.edu wrote: i see you point, the logic i was thinking about is to use a list of VRanges objects to hold separately the variants of multiple individuals, with one VRanges object per individual. if i type the name of such a list object on the R shell, having the GRangesList show method, i feel i do not see much information because the screen just scrolls up tens or hundreds of lines specifiying variants per individual. however, the concise appearance of something like a VRangesList: vrl VRangesList of length 10 names(32): S1 S2 S3 S4 ... S7 S8 S9 S10 at least suggests the user that the object holding the variants has information for 10 samples and belongs to the class 'VRangesList'. i thought this made general sense but i'm fine if you feel this interpretation does not warrant such a change. cheers, robert. On 02/25/2015 01:25 AM, Michael Lawrence wrote: Why not have the SimpleVRangesList be shown like CompressedVRangesList, for consistency with GRangesList? In other words, the opposite of what you propose. A strong argument could also be made that a SimpleGenomicRangesList should be shown like a GRangesList. Unless there is some aversion to the more verbose output On Tue, Feb 24, 2015 at 2:36 PM, Robert Castelo robert.cast...@upf.edu mailto:robert.cast...@upf.edu mailto:robert.cast...@upf.edu mailto:robert.cast...@upf.edu__ wrote: so, yes, but IMO rather than inheriting the show method from a GRangesList, i think that the show method for CompressedVRangesList objects should be inherited from a VRangesList object. right now this is the situation: library(VariantAnnotation) example(VRangesList) vrl VRangesList of length 2 names(2): sampleA sampleB cvrl - new(CompressedVRangesList, split(vr, sampleNames(vr))) cvrl CompressedVRangesList object of length 2: $a VRanges object with 1 range and 1 metadata
Re: [Bioc-devel] show method for CompressedVRangesList-class
Construction will take longer; the savings are in the accessing of the elements. But this seems like too much longer, so I will look into it. On Wed, Feb 25, 2015 at 8:12 AM, Robert Castelo robert.cast...@upf.edu wrote: my current reason to prefer a CompressedVRangesList object over a SimpleVRangesList object is that i find one order of magnitude difference in creation time in each of these classes of objects: library(VariantAnnotation) fl - system.file(extdata, CEUtrio.vcf.bgz, package=VariantFiltering) vcf - readVcf(fl, genome=hg19) vr - as(vcf, VRanges) length(vr) [1] 15000 ## create a VRangesList object system.time(vrl - do.call(VRangesList, split(vr, sampleNames(vr user system elapsed 0.247 0.004 0.252 ## create a CompressedVRangesList object system.time(cvrl - new(CompressedVRangesList, split(vr, sampleNames(vr user system elapsed 0.019 0.000 0.019 0.252/0.019 [1] 13.26316 with a larger vcf differences increase: [... load vcf, coerce to VRanges ...] length(vr) [1] 25916 system.time(vrl - do.call(VRangesList, split(vr, sampleNames(vr user system elapsed 2.672 0.000 2.676 system.time(cvrl - new(CompressedVRangesList, split(vr, sampleNames(vr user system elapsed 0.014 0.000 0.014 2.676 / 0.014 [1] 191.1429 so maybe i'm using the wrong way to construct a VRangesList object, but according to our last conversation about this, there was no obvious default fast way to do it, starting from a VRanges object: https://stat.ethz.ch/pipermail/bioc-devel/2015-January/006905.html it would be great if there's a fast way to do this kind of construction. thanks, robert. On 02/25/2015 04:42 PM, Michael Lawrence wrote: If you're storing data on a relatively small number of individuals (say, hundreds), you should use SimpleVRangesList, not CompressedVRangesList. On Wed, Feb 25, 2015 at 7:10 AM, Robert Castelo robert.cast...@upf.edu mailto:robert.cast...@upf.edu wrote: i see you point, the logic i was thinking about is to use a list of VRanges objects to hold separately the variants of multiple individuals, with one VRanges object per individual. if i type the name of such a list object on the R shell, having the GRangesList show method, i feel i do not see much information because the screen just scrolls up tens or hundreds of lines specifiying variants per individual. however, the concise appearance of something like a VRangesList: vrl VRangesList of length 10 names(32): S1 S2 S3 S4 ... S7 S8 S9 S10 at least suggests the user that the object holding the variants has information for 10 samples and belongs to the class 'VRangesList'. i thought this made general sense but i'm fine if you feel this interpretation does not warrant such a change. cheers, robert. On 02/25/2015 01:25 AM, Michael Lawrence wrote: Why not have the SimpleVRangesList be shown like CompressedVRangesList, for consistency with GRangesList? In other words, the opposite of what you propose. A strong argument could also be made that a SimpleGenomicRangesList should be shown like a GRangesList. Unless there is some aversion to the more verbose output On Tue, Feb 24, 2015 at 2:36 PM, Robert Castelo robert.cast...@upf.edu mailto:robert.cast...@upf.edu mailto:robert.cast...@upf.edu mailto:robert.cast...@upf.edu__ wrote: so, yes, but IMO rather than inheriting the show method from a GRangesList, i think that the show method for CompressedVRangesList objects should be inherited from a VRangesList object. right now this is the situation: library(VariantAnnotation) example(VRangesList) vrl VRangesList of length 2 names(2): sampleA sampleB cvrl - new(CompressedVRangesList, split(vr, sampleNames(vr))) cvrl CompressedVRangesList object of length 2: $a VRanges object with 1 range and 1 metadata column: seqnamesranges strand ref alt totalDepth refDepth altDepth Rle IRanges Rle character characterOrRle integerOrRle integerOrRle integerOrRle [1] chr1[1, 5] + T C 12 5 7 sampleNames softFilterMatrix | tumorSpecific factorOrRle matrix | logical [1] a TRUE | FALSE $b VRanges object with 1 range and 1 metadata column: seqnames ranges strand ref alt totalDepth refDepth altDepth
Re: [Bioc-devel] Importing classes into NAMESPACE
On Wed, Feb 25, 2015 at 9:13 AM, Gabe Becker becker.g...@gene.com wrote: On Wed, Feb 25, 2015 at 9:02 AM, Karolis Uziela karolis.uzi...@scilifelab.se wrote: Thank you for your help everyone! Importing constructors separately (as Thomas suggested), has solved the problem. @Gabe: I am not using any methods from Biobase, GenomicRanges and S4Vectors. I am only using their constructors. Does it mean that I can skip importing the classes and only import the constructors? Or did you have something else in your mind? So the code in your package never uses the objects, only returning them directly to the user for their use? If so, I think that's actually a case where Depends is called for, since that functionality is not useful to the user if they don't have those classes and methods available. Have them listed in Depends, but still import in the NAMESPACE, of course. I thought it is better to avoid importing whole packages, especially if they are large. Or do you have some arguments, why I should do that? If GenomicRanges forms the foundation of your package, then it's more practical to import(GenomicRanges). If you're only cherry-picking a few key functions from a package, then individual importFrom() declarations better describe the specifics of that dependency. I haven't seen timings that tell me that targeted importing is worth the extra maintenance cost when methods move around between packages, functions change names, etc. It's largely a preference thing AFAIK, but I like to make sure I get everything I will need without worrying about when I do an import, rather than needing to be sure of the exact subset of symbols I want to use before hand. This, of course, is less compelling when you really do only need one thing, but I don't find that to be particularly common situation myself. ~G Regards, Karolis On Wed, Feb 25, 2015 at 3:31 PM, Gabe Becker becker.g...@gene.com wrote: Karolis, Do you really not need any of the methods for GRanges and ExpressionSet objects? import(GenomicRanges) might be better, even though the package isn't exactly small. ~G On Wed, Feb 25, 2015 at 6:27 AM, Thomas Sandmann sandmann.tho...@gene.com wrote: Hi Karolis, These classes have constructor functions of the same name as the class. For example, the constructor function for GRanges is called GRanges(). If you use the constructors you need to import them separately, e.g. importFrom GenomicRanges GRanges Best, Thomas [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Gabriel Becker, Ph.D Computational Biologist Genentech Research -- Karolis Uziela PhD student Science for Life Laboratory Box 1031 17121 Solna, Sweden Mob. +46 729 120 395 -- Gabriel Becker, Ph.D Computational Biologist Genentech Research [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel [[alternative HTML version deleted]] ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] show method for CompressedVRangesList-class
I checked in a fix for the splitting to CompressedVRangesList. The slowness of creating a SimpleVRangesList is due to the cost of extracting a VRanges for each sample. Depending your exact use case, it might be better to pay that cost up-front, instead of deferring it to when the user wants to extract an element, which happens with the compressed list. As long as the number of samples is small, the memory overhead should be minimal. Michael On Wed, Feb 25, 2015 at 9:59 AM, Michael Lawrence micha...@gene.com wrote: Yea, I know, just need to get around to that one. Technically, it works, but it's obviously not ideal. On Wed, Feb 25, 2015 at 8:52 AM, Gabe Becker becker.g...@gene.com wrote: Why does splitting a VRanges give a GRangesList with VRanges objects as elements? Seems like it should return a VRangesList. spl = split(vr, sampleNames(vr)) class(spl) [1] GRangesList attr(,package) [1] GenomicRanges class(spl[[1]]) [1] VRanges attr(,package) [1] VariantAnnotation ~G On Wed, Feb 25, 2015 at 8:39 AM, Michael Lawrence lawrence.mich...@gene.com wrote: Construction will take longer; the savings are in the accessing of the elements. But this seems like too much longer, so I will look into it. On Wed, Feb 25, 2015 at 8:12 AM, Robert Castelo robert.cast...@upf.edu wrote: my current reason to prefer a CompressedVRangesList object over a SimpleVRangesList object is that i find one order of magnitude difference in creation time in each of these classes of objects: library(VariantAnnotation) fl - system.file(extdata, CEUtrio.vcf.bgz, package=VariantFiltering) vcf - readVcf(fl, genome=hg19) vr - as(vcf, VRanges) length(vr) [1] 15000 ## create a VRangesList object system.time(vrl - do.call(VRangesList, split(vr, sampleNames(vr user system elapsed 0.247 0.004 0.252 ## create a CompressedVRangesList object system.time(cvrl - new(CompressedVRangesList, split(vr, sampleNames(vr user system elapsed 0.019 0.000 0.019 0.252/0.019 [1] 13.26316 with a larger vcf differences increase: [... load vcf, coerce to VRanges ...] length(vr) [1] 25916 system.time(vrl - do.call(VRangesList, split(vr, sampleNames(vr user system elapsed 2.672 0.000 2.676 system.time(cvrl - new(CompressedVRangesList, split(vr, sampleNames(vr user system elapsed 0.014 0.000 0.014 2.676 / 0.014 [1] 191.1429 so maybe i'm using the wrong way to construct a VRangesList object, but according to our last conversation about this, there was no obvious default fast way to do it, starting from a VRanges object: https://stat.ethz.ch/pipermail/bioc-devel/2015-January/006905.html it would be great if there's a fast way to do this kind of construction. thanks, robert. On 02/25/2015 04:42 PM, Michael Lawrence wrote: If you're storing data on a relatively small number of individuals (say, hundreds), you should use SimpleVRangesList, not CompressedVRangesList. On Wed, Feb 25, 2015 at 7:10 AM, Robert Castelo robert.cast...@upf.edu mailto:robert.cast...@upf.edu wrote: i see you point, the logic i was thinking about is to use a list of VRanges objects to hold separately the variants of multiple individuals, with one VRanges object per individual. if i type the name of such a list object on the R shell, having the GRangesList show method, i feel i do not see much information because the screen just scrolls up tens or hundreds of lines specifiying variants per individual. however, the concise appearance of something like a VRangesList: vrl VRangesList of length 10 names(32): S1 S2 S3 S4 ... S7 S8 S9 S10 at least suggests the user that the object holding the variants has information for 10 samples and belongs to the class 'VRangesList'. i thought this made general sense but i'm fine if you feel this interpretation does not warrant such a change. cheers, robert. On 02/25/2015 01:25 AM, Michael Lawrence wrote: Why not have the SimpleVRangesList be shown like CompressedVRangesList, for consistency with GRangesList? In other words, the opposite of what you propose. A strong argument could also be made that a SimpleGenomicRangesList should be shown like a GRangesList. Unless there is some aversion to the more verbose output On Tue, Feb 24, 2015 at 2:36 PM, Robert Castelo robert.cast...@upf.edu mailto:robert.cast...@upf.edu mailto:robert.cast...@upf.edu mailto:robert.cast...@upf.edu__ wrote: so, yes, but IMO rather than inheriting the show method from a GRangesList, i think that the show method for CompressedVRangesList objects
Re: [Bioc-devel] BamTallyParam argument 'which'
FWIW, and after checking how pileup() handles this, I thought I should report: library(GenomicAlignments) bamfile - system.file(extdata, ex1.bam, package=Rsamtools) Then: stackStringsFromBam(bamfile, param=seq1:1-6) A DNAStringSet instance of length 4 width seq [1] 6 CACTAG [2] 6 ++CTAG [3] 6 AG [4] 6 +G which - GRanges(seq1, IRanges(c(1, 5, 1), c(2, 6, 3))) pileup(bamfile, scanBamParam=ScanBamParam(which=which)) seqnames pos strand nucleotide count which_label 1 seq1 1 + C 1seq1:1-2 2 seq1 2 + A 1seq1:1-2 3 seq1 5 + A 3seq1:5-6 4 seq1 6 + G 4seq1:5-6 5 seq1 1 + C 1seq1:1-3 6 seq1 2 + A 1seq1:1-3 7 seq1 3 + C 2seq1:1-3 This output looks clean to me: (a) Nucleotide positions that belong to more than 1 range in 'which' are treated as distinct positions, and (b) reads that overlap with 2 non-overlapping ranges (e.g. read #1 and ranges #1 and #2) only contribute at most once at each position covered by these ranges. H. On 02/25/2015 01:21 PM, Hervé Pagès wrote: Hi, On 02/25/2015 06:37 AM, Gabe Becker wrote: I think we need to be a little careful of trying to know the users intentions better than they do here. Reduce is a (very) easy operation to do on a GRanges, so if the user didn't, are we really safe assuming they meant to when passing the GRanges as a which? I would argue for the samtools way, not because samtools does it (though consistency is good) but because it allows the user to do more things, while not making it that painful to do the thing that they might want most often. I agree with Michael that an additional argument might be a good middle ground. Maybe another good middle ground is to issue a warning when 'which' contains overlapping ranges. The warning could suggest to the user to reduce() the ranges first. Maybe the warning should also point out that reducing doesn't completely eliminate the risk of selecting the same record more than once (at least that's the case for the readGAlignment* functions, but I don't know if it holds for BamTallyParam). The risk is actually higher with paired-end reads where the same pair is selected once for each range in 'which' that overlaps with at least one of the 2 mates in the pair. But as already discussed here (or maybe it was on the old bioconductor list, don't remember), better solutions to the duplicated selection problem could be achieved via one of the following: (1) Expose some sort of unique id for the records in a BAM file. I agree with Michael that duplicated selection is incompatible with summarization tools like BamTallyParam or pileup(). Having access to a unique id would completely solve the problem. (2) Introduce an argument similar to 'which' but with a slightly different semantic e.g. select records that *start* in one of the specified ranges (for single-end reads), or select pairs of records for which the *first* mate starts in one of the specified ranges. Advantages of this semantic: (a) If the ranges don't overlap, then no duplicates. (b) It can be used in the context of tiling the genome and processing the reads by tile. This plays well with parallelization (the semantic of 'which' does not). Inconvenient: the arbitrary nature of the selection criteria and its lack of symmetry is incompatible with summarization tools like BamTallyParam or pileup(). Note that (1) and (2) are not exclusive. Cheers, H. ~G On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres lcoll...@jhu.edu mailto:lcoll...@jhu.edu wrote: Related to my post on a separate thread (https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html), I think that if 'which' is not being reduced by default, a simple example showing the effects of this could be included in the functions that have such an argument. Also note that 'reducing' could lead to unintended results. For example, in the help page for GenomicAlignments::readGAlignments, after the 'gal4' example it would be nice to add something like this: ## Note that if overlapping ranges are provided in 'which' ## reads could be selected more than once. This would artificually ## increase the coverage or affect other downstream results. ## If you 'reduce' the ranges, reads that originally overlapped ## two disjoint segments will be included. which_dups - RangesList(seq1=rep(IRanges(1000, 2000), 2), seq2=IRanges(c(100, 1000), c(1000, 2000))) param_dups - ScanBamParam(which=which_dups) param_reduced - ScanBamParam(which=reduce(which_dups)) gal4_dups - readGAlignments(bamfile, param=param_dups) gal4_reduced -
Re: [Bioc-devel] BamTallyParam argument 'which'
Hi, On 02/25/2015 06:37 AM, Gabe Becker wrote: I think we need to be a little careful of trying to know the users intentions better than they do here. Reduce is a (very) easy operation to do on a GRanges, so if the user didn't, are we really safe assuming they meant to when passing the GRanges as a which? I would argue for the samtools way, not because samtools does it (though consistency is good) but because it allows the user to do more things, while not making it that painful to do the thing that they might want most often. I agree with Michael that an additional argument might be a good middle ground. Maybe another good middle ground is to issue a warning when 'which' contains overlapping ranges. The warning could suggest to the user to reduce() the ranges first. Maybe the warning should also point out that reducing doesn't completely eliminate the risk of selecting the same record more than once (at least that's the case for the readGAlignment* functions, but I don't know if it holds for BamTallyParam). The risk is actually higher with paired-end reads where the same pair is selected once for each range in 'which' that overlaps with at least one of the 2 mates in the pair. But as already discussed here (or maybe it was on the old bioconductor list, don't remember), better solutions to the duplicated selection problem could be achieved via one of the following: (1) Expose some sort of unique id for the records in a BAM file. I agree with Michael that duplicated selection is incompatible with summarization tools like BamTallyParam or pileup(). Having access to a unique id would completely solve the problem. (2) Introduce an argument similar to 'which' but with a slightly different semantic e.g. select records that *start* in one of the specified ranges (for single-end reads), or select pairs of records for which the *first* mate starts in one of the specified ranges. Advantages of this semantic: (a) If the ranges don't overlap, then no duplicates. (b) It can be used in the context of tiling the genome and processing the reads by tile. This plays well with parallelization (the semantic of 'which' does not). Inconvenient: the arbitrary nature of the selection criteria and its lack of symmetry is incompatible with summarization tools like BamTallyParam or pileup(). Note that (1) and (2) are not exclusive. Cheers, H. ~G On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres lcoll...@jhu.edu mailto:lcoll...@jhu.edu wrote: Related to my post on a separate thread (https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html), I think that if 'which' is not being reduced by default, a simple example showing the effects of this could be included in the functions that have such an argument. Also note that 'reducing' could lead to unintended results. For example, in the help page for GenomicAlignments::readGAlignments, after the 'gal4' example it would be nice to add something like this: ## Note that if overlapping ranges are provided in 'which' ## reads could be selected more than once. This would artificually ## increase the coverage or affect other downstream results. ## If you 'reduce' the ranges, reads that originally overlapped ## two disjoint segments will be included. which_dups - RangesList(seq1=rep(IRanges(1000, 2000), 2), seq2=IRanges(c(100, 1000), c(1000, 2000))) param_dups - ScanBamParam(which=which_dups) param_reduced - ScanBamParam(which=reduce(which_dups)) gal4_dups - readGAlignments(bamfile, param=param_dups) gal4_reduced - readGAlignments(bamfile, param=param_reduced) length(gal4) ## Duplicates some reads. In this case, all the ones between ## bases 1000 and 2000 on seq1. length(gal4_dups) ## Includes some reads that mapped around base 1000 in seq2 ## that were excluded in gal4. length(gal4_reduced) Here's the output: library('GenomicAlignments') ## Code already included in ?readGAlignments bamfile - system.file(extdata, ex1.bam, package=Rsamtools, +mustWork=TRUE) which - RangesList(seq1=IRanges(1000, 2000), + seq2=IRanges(c(100, 1000), c(1000, 2000))) param - ScanBamParam(which=which) gal4 - readGAlignments(bamfile, param=param) gal4 GAlignments object with 2404 alignments and 0 metadata columns: seqnames strand cigarqwidth start end width njunc Rle Rle character integer integer integer integer integer [1] seq1 + 35M35 970 1004 35 0 [2] seq1 + 35M35 971 1005 35 0 [3] seq1 + 35M35 972 1006