Re: [Bioc-devel] plotPCA for BiocGenerics
While I tend to agree with you that PCA is too big an operation to be hidden within a plotting function (MDS is an edge-case I would say), I can’t see how we can ever reach a point where there is only one generic plot function. In the case of PCA there is a number of different plot-types that can all lay claim to the plot function of a PCA class, for instance scoreplot, scatterplot matrix of all scores, biplot, screeplot, accumulated R^2 barplot, leverage vs. distance-to-model… (you get the idea). So while having some very well-thought out classes for very common result types such as PCA, this class would still need a lot of different plot methods such as plotScores, plotScree etc (or plot(…, type=‘score’), but I don’t find that very appealing). Expanding beyond PCA only muddles the water even more - there are very few interesting data structures that only have one visual representation to-rule-them-all… just my 2c best Thomas Date: Mon, 20 Oct 2014 18:50:48 -0400 From: Kevin Coombes kevin.r.coom...@gmail.com Well. I have two responses to that. First, I think it would be a lot better/easier for users if (most) developers could make use of the same plot function for basic classes like PCA. Second, if you think the basic PCA plotting routine needs enhancements, you still have two options. On the one hand, you could (as you said) try to convince the maintainer of PCA to add what you want. If it's generally valuable, then he'd probably do it --- and other classes that use it would benefit. On the other hand, if it really is a special enhancement that only makes sense for your class, then you can derive a class from the basic PCA class setClass(mySpecialPCA, contains=c(PCA), *other stuff here*) and implement your own version of the plot generic for this class. And you could tweak the as.PCA function so it returns an object of the mySpecialPCA class. And the user could still just plot the result without hacving to care what's happening behind the scenes. On 10/20/2014 5:59 PM, Michael Love wrote: Ah, I see now. Personally, I don't think Bioconductor developers should have to agree on single plotting functions for basic classes like 'PCA' (because this logic applies equally to the situation of all Bioconductor developers agreeing on single MA-plot, a single variance-mean plot, etc). I think letting developers define their plotPCA makes contributions easier (I don't have to ask the owner of plot.PCA to incorporate something), even though it means we have a growing list of generics. Still you have a good point about splitting computation and plotting. In practice, we subset the rows so PCA is not laborious. On Mon, Oct 20, 2014 at 5:38 PM, Kevin Coombes kevin.r.coom...@gmail.com mailto:kevin.r.coom...@gmail.com wrote: Hi, I don't see how it needs more functions (as long as you can get developers to agree). Suppose that someone can define a reusable PCA class. This will contain a single plot generic function, defined once and reused by other classes. The existing plotPCA interface can also be implemented just once, in this class, as plotPCA - function(object, ...) plot(as.PCA(object), ...) This can be exposed to users of your class through namespaces. Then the only thing a developer needs to implement in his own class is the single as.PCA function. And he/she would have already been rquired to implement this as part of the old plotPCA function. So it can be extracted from that, and the developer doesn't have to reimplement the visualization code from the PCA class. Best, Kevin On 10/20/2014 5:15 PM, davide risso wrote: Hi Kevin, I see your points and I agree (especially for the specific case of plotPCA that involves some non trivial computations). On the other hand, having a wrapper function that starting from the raw data gives you a pretty picture (with virtually zero effort by the user) using a sensible choice of parameters that are more or less OK for RNA-seq data is useful for practitioners that just want to look for patterns in the data. I guess it would be the same to have a PCA method for each of the objects and then using the plot method on those new objects, but that would just create a lot more objects and functions than the current approach (like Mike was saying). Your as.pca or performPCA approach would be definitely better if all the different methods would create objects of the *same* PCA class, but since we are talking about different packages, I don't know how easy it would be to coordinate. But perhaps this is the way we should go. Best, davide On Mon, Oct 20, 2014 at 1:26 PM, Kevin Coombes kevin.r.coom...@gmail.com mailto:kevin.r.coom...@gmail.com wrote: Hi, It depends. The traditional R
Re: [Bioc-devel] Why be default 'D' is not dropped from coverage?
Thank you Hervé! I failed to realized that this was included in the docs. Anyhow, we've been discussing internally what default value to use. We also noted that the `genomecov` tool from BED in it's split mode ignores the D's: http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html Cheers, Leo On Mon, Oct 20, 2014 at 5:45 PM, Hervé Pagès hpa...@fhcrc.org wrote: Hi Leo, On 10/18/2014 10:50 AM, Leonardo Collado Torres wrote: Hi, A collaborator of mine is working on a new software and we while we were doing a sanity check to compare the base-level coverage from BAM files and bigWig files generated from his software we realized that by default bases corresponding to a 'D' on the CIGAR string get counted when reading coverage from BAM files. That is: getMethod('coverage', 'GAlignments') Method Definition: function (x, shift = 0L, width = NULL, weight = 1L, ...) { .local - function (x, shift = 0L, width = NULL, weight = 1L, method = c(auto, sort, hash), drop.D.ranges = FALSE) coverage(grglist(x, drop.D.ranges = drop.D.ranges), shift = shift, width = width, weight = weight, method = method) .local(x, shift, width, weight, ...) } environment: namespace:GenomicAlignments Signatures: x target GAlignments defined GAlignments packageVersion('GenomicAlignments') [1] ‘1.2.0’ You can see that this is default elsewhere, for example: extractAlignmentRangesOnReference function (cigar, pos = 1L, drop.D.ranges = FALSE, f = NULL) { if (!isTRUEorFALSE(drop.D.ranges)) stop('drop.D.ranges' must be TRUE or FALSE) if (drop.D.ranges) { ops - c(M, =, X, I) } else { ops - c(M, =, X, I, D) } cigarRangesAlongReferenceSpace(cigar, flag = NULL, pos = pos, f = f, ops = ops, drop.empty.ranges = FALSE, reduce.ranges = TRUE) } environment: namespace:GenomicAlignments What is the rationale for setting `drop.D.ranges` by default to FALSE? Because last time I checked (but that was 3 or 4 years ago), that's what Samtools pileup was doing. Cheers, H. Thanks, Leo -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
[Bioc-devel] A more flexible GenomeInfoDb::mapSeqlevels(): used supported info but don't break with new organisms/toy examples
Hello, In my `derfinder` package I have been relying on `GenomeInfoDb` to correctly name the sequence names. Given https://support.bioconductor.org/p/62136/ I now realize that I was expecting too much from `GenomeInfoDb`. For example, in some cases I was expecting it to guess that a single '2' meant that the naming style was NCBI. This is true for Homo sapiens but now I realize that it's not necessary true for Arabidopsis thaliana. GenomeInfoDb:::.guessSpeciesStyle('2') [1] Arabidopsis thaliana NCBI ## Could have been Homo sapiens (NCBI) or Arabidopbis thaliana TAIR10 or NCBI GenomeInfoDb:::.supportedSeqnameMappings()[['Homo_sapiens']][2, ] NCBI UCSC dbSNP 22 chr2 ch2 GenomeInfoDb:::.supportedSeqnameMappings()[['Arabidopsis_thaliana']][2, ] NCBI TAIR10 22 2 I was also expecting mapSeqlevels() to return the same input if the specified 'style' was the same as the one currently being used. For example, if I was working with Homo sapiens NCBI style and attempted to map to NCBI, I was expecting the same output as the input. ## More than 1 output mapSeqlevels('2', 'NCBI') 2 [1,] 2 [2,] LGII ## Turns out there is another organism that would make this mapping valid GenomeInfoDb:::.supportedSeqnameMappings()[['Populus_trichocarpa']][2, ] NCBI JGI2.F 2 LGII 2 I also noticed that `GenomeInfoDb` couldn't work with some fictional examples. For example: ## Expected error, but this meant that `derfinder` couldn't use toy examples from other ## packages. seqlevelsStyle('seq1') Error in seqlevelsStyle(seq1) : The style does not have a compatible entry for the species supported by Seqname. Please see genomeStyles() for supported species/style ## Using toy data from Rsamtools fails as expected fl - system.file(extdata, ex1.bam, package=Rsamtools, + mustWork=TRUE) library(GenomicAlignments) names(scanBamHeader(BamFile(fl))$targets) [1] seq1 seq2 seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets)) Error in seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets)) : The style does not have a compatible entry for the species supported by Seqname. Please see genomeStyles() for supported species/style So, I wanted a way to use `GenomeInfoDb` to make the correct naming style maps for supported genomes, and simply return the original sequence names if using an unsupported species/mapping or if the original and target styles are the same. That's what I basically did when I wrote 'extendedMapSeqlevels' (aye, long name) which you can see at https://gist.github.com/3e6f0e2e2bb67ee910da If you think that others might be interested in such a function, then by all means please add it to `GenomeInfoDb` and I'll import it. You'll notice how it tries to guess the species and/or current naming style when that information is not provided. When it's used with verbose = TRUE it will show a message explaining why the mapping failed (if it did), it'll show the link to the `GenomeInfoDb` vignette for adding an organism, and then return the original sequence names. I have a slightly modified version in derfinder https://github.com/lcolladotor/derfinder/blob/master/R/extendedMapSeqlevels.R or https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/derfinder/R/extendedMapSeqlevels.R where I simply assume some other default values and guarantee continuity with how I was doing things previously. Naturally, R CMD check now shows a NOTE because I'm using the ::: syntax for un-exported functions from `GenomeInfoDb`. Cheers, Leo Leonardo Collado Torres, PhD Candidate Department of Biostatistics Johns Hopkins University Bloomberg School of Public Health Website: http://www.biostat.jhsph.edu/~lcollado/ Blog: http://lcolladotor.github.io/ ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel