Re: [Bioc-devel] plotPCA for BiocGenerics

2014-10-21 Thread Thomas Dybdal Pedersen
While I tend to agree with you that PCA is too big an operation to be hidden 
within a plotting function (MDS is an edge-case I would say), I can’t see how 
we can ever reach a point where there is only one generic plot function. In the 
case of PCA there is a number of different plot-types that can all lay claim to 
the plot function of a PCA class, for instance scoreplot, scatterplot matrix of 
all scores, biplot, screeplot, accumulated R^2 barplot, leverage vs. 
distance-to-model… (you get the idea). So while having some very well-thought 
out classes for very common result types such as PCA, this class would still 
need a lot of different plot methods such as plotScores, plotScree etc (or 
plot(…, type=‘score’), but I don’t find that very appealing). Expanding beyond 
PCA only muddles the water even more - there are very few interesting data 
structures that only have one visual representation to-rule-them-all…

just my 2c

best
Thomas


 Date: Mon, 20 Oct 2014 18:50:48 -0400
 From: Kevin Coombes kevin.r.coom...@gmail.com
 
 Well. I have two responses to that.
 
 First, I think it would be a lot better/easier for users if (most) 
 developers could make use of the same plot function for basic classes 
 like PCA.
 
 Second, if you think the basic PCA plotting routine needs enhancements, 
 you still have two options.  On the one hand, you could (as you said) 
 try to convince the maintainer of PCA to add what you want.  If it's 
 generally valuable, then he'd probably do it --- and other classes that 
 use it would benefit.  On the other hand, if it really is a special 
 enhancement that only makes sense for your class, then you can derive a 
 class from the basic PCA class
 setClass(mySpecialPCA, contains=c(PCA), *other stuff here*)
  and implement your own version of the plot generic for this class.  
 And you could tweak the as.PCA function so it returns an object of the 
 mySpecialPCA class. And the user could still just plot the result 
 without hacving to care what's happening behind the scenes.
 
 On 10/20/2014 5:59 PM, Michael Love wrote:
 Ah, I see now. Personally, I don't think Bioconductor developers 
 should have to agree on single plotting functions for basic classes 
 like 'PCA' (because this logic applies equally to the situation of all 
 Bioconductor developers agreeing on single MA-plot, a single 
 variance-mean plot, etc). I think letting developers define their 
 plotPCA makes contributions easier (I don't have to ask the owner of 
 plot.PCA to incorporate something), even though it means we have a 
 growing list of generics.
 
 Still you have a good point about splitting computation and plotting. 
 In practice, we subset the rows so PCA is not laborious.
 
 
 On Mon, Oct 20, 2014 at 5:38 PM, Kevin Coombes 
 kevin.r.coom...@gmail.com mailto:kevin.r.coom...@gmail.com wrote:
 
Hi,
 
I don't see how it needs more functions (as long as you can get
developers to agree).  Suppose that someone can define a reusable
PCA class.  This will contain a single plot generic function,
defined once and reused by other classes. The existing plotPCA
interface can also be implemented just once, in this class, as
 
plotPCA - function(object, ...) plot(as.PCA(object), ...)
 
This can be exposed to users of your class through namespaces. 
Then the only thing a developer needs to implement in his own
class is the single as.PCA function.  And he/she would have
already been rquired to implement this as part of the old
plotPCA function.  So it can be extracted from that, and the
developer doesn't have to reimplement the visualization code from
the PCA class.
 
Best,
  Kevin
 
 
On 10/20/2014 5:15 PM, davide risso wrote:
Hi Kevin,
 
I see your points and I agree (especially for the specific case
of plotPCA that involves some non trivial computations).
 
On the other hand, having a wrapper function that starting from
the raw data gives you a pretty picture (with virtually zero
effort by the user) using a sensible choice of parameters that
are more or less OK for RNA-seq data is useful for practitioners
that just want to look for patterns in the data.
 
I guess it would be the same to have a PCA method for each of the
objects and then using the plot method on those new objects, but
that would just create a lot more objects and functions than the
current approach (like Mike was saying).
 
Your as.pca or performPCA approach would be definitely better
if all the different methods would create objects of the *same*
PCA class, but since we are talking about different packages, I
don't know how easy it would be to coordinate. But perhaps this
is the way we should go.
 
Best,
davide
 
 
 
On Mon, Oct 20, 2014 at 1:26 PM, Kevin Coombes
kevin.r.coom...@gmail.com mailto:kevin.r.coom...@gmail.com wrote:
 
Hi,
 
It depends.
 
The traditional R 

Re: [Bioc-devel] Why be default 'D' is not dropped from coverage?

2014-10-21 Thread Leonardo Collado Torres
Thank you Hervé! I failed to realized that this was included in the docs.

Anyhow, we've been discussing internally what default value to use. We
also noted that the `genomecov` tool from BED in it's split mode
ignores the D's:
http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html

Cheers,
Leo

On Mon, Oct 20, 2014 at 5:45 PM, Hervé Pagès hpa...@fhcrc.org wrote:
 Hi Leo,


 On 10/18/2014 10:50 AM, Leonardo Collado Torres wrote:

 Hi,

 A collaborator of mine is working on a new software and we while we were
 doing a sanity check to compare the base-level coverage from BAM files
 and bigWig files generated from his software we realized that by default
 bases corresponding to a 'D' on the CIGAR string get counted when
 reading coverage from BAM files.

 That is:

   getMethod('coverage', 'GAlignments')
 Method Definition:

 function (x, shift = 0L, width = NULL, weight = 1L, ...)
 {
  .local - function (x, shift = 0L, width = NULL, weight = 1L,
  method = c(auto, sort, hash), drop.D.ranges = FALSE)
  coverage(grglist(x, drop.D.ranges = drop.D.ranges), shift = shift,
  width = width, weight = weight, method = method)
  .local(x, shift, width, weight, ...)
 }
 environment: namespace:GenomicAlignments

 Signatures:
  x
 target  GAlignments
 defined GAlignments
   packageVersion('GenomicAlignments')
 [1] ‘1.2.0’


 You can see that this is default elsewhere, for example:

   extractAlignmentRangesOnReference
 function (cigar, pos = 1L, drop.D.ranges = FALSE, f = NULL)
 {
  if (!isTRUEorFALSE(drop.D.ranges))
  stop('drop.D.ranges' must be TRUE or FALSE)
  if (drop.D.ranges) {
  ops - c(M, =, X, I)
  }
  else {
  ops - c(M, =, X, I, D)
  }
  cigarRangesAlongReferenceSpace(cigar, flag = NULL, pos = pos,
  f = f, ops = ops, drop.empty.ranges = FALSE, reduce.ranges =
 TRUE)
 }
 environment: namespace:GenomicAlignments


 What is the rationale for setting `drop.D.ranges` by default to FALSE?


 Because last time I checked (but that was 3 or 4 years ago), that's what
 Samtools pileup was doing.

 Cheers,
 H.


 Thanks,
 Leo



 --
 Hervé Pagès

 Program in Computational Biology
 Division of Public Health Sciences
 Fred Hutchinson Cancer Research Center
 1100 Fairview Ave. N, M1-B514
 P.O. Box 19024
 Seattle, WA 98109-1024

 E-mail: hpa...@fhcrc.org
 Phone:  (206) 667-5791
 Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] A more flexible GenomeInfoDb::mapSeqlevels(): used supported info but don't break with new organisms/toy examples

2014-10-21 Thread Leonardo Collado Torres
Hello,

In my `derfinder` package I have been relying on `GenomeInfoDb` to
correctly name the sequence names. Given
https://support.bioconductor.org/p/62136/ I now realize that I was
expecting too much from `GenomeInfoDb`.

For example, in some cases I was expecting it to guess that a single
'2' meant that the naming style was NCBI. This is true for Homo
sapiens but now I realize that it's not necessary true for Arabidopsis
thaliana.

 GenomeInfoDb:::.guessSpeciesStyle('2')
[1] Arabidopsis thaliana NCBI

## Could have been Homo sapiens (NCBI) or Arabidopbis thaliana TAIR10 or NCBI
 GenomeInfoDb:::.supportedSeqnameMappings()[['Homo_sapiens']][2, ]
  NCBI UCSC dbSNP
22 chr2   ch2
 GenomeInfoDb:::.supportedSeqnameMappings()[['Arabidopsis_thaliana']][2, ]
  NCBI TAIR10
22  2

I was also expecting mapSeqlevels() to return the same input if the
specified 'style' was the same as the one currently being used. For
example, if I was working with Homo sapiens NCBI style and attempted
to map to NCBI, I was expecting the same output as the input.

## More than 1 output
 mapSeqlevels('2', 'NCBI')
 2
[1,] 2
[2,] LGII
## Turns out there is another organism that would make this mapping valid
 GenomeInfoDb:::.supportedSeqnameMappings()[['Populus_trichocarpa']][2, ]
  NCBI JGI2.F
2 LGII  2


I also noticed that `GenomeInfoDb` couldn't work with some fictional
examples. For example:

## Expected error, but this meant that `derfinder` couldn't use toy
examples from other
## packages.
 seqlevelsStyle('seq1')
Error in seqlevelsStyle(seq1) :
  The style does not have a compatible entry for the species supported
by Seqname. Please see genomeStyles() for supported
  species/style

## Using toy data from Rsamtools fails as expected
 fl - system.file(extdata, ex1.bam, package=Rsamtools,
+   mustWork=TRUE)
 library(GenomicAlignments)
 names(scanBamHeader(BamFile(fl))$targets)
[1] seq1 seq2
 seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets))
Error in seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets)) :
  The style does not have a compatible entry for the species supported
by Seqname. Please see genomeStyles() for supported
  species/style


So, I wanted a way to use `GenomeInfoDb` to make the correct naming
style maps for supported genomes, and simply return the original
sequence names if using an unsupported species/mapping or if the
original and target styles are the same.


That's what I basically did when I wrote 'extendedMapSeqlevels' (aye,
long name) which you can see at
https://gist.github.com/3e6f0e2e2bb67ee910da If you think that others
might be interested in such a function, then by all means please add
it to `GenomeInfoDb` and I'll import it. You'll notice how it tries to
guess the species and/or current naming style when that information is
not provided. When it's used with verbose = TRUE it will show a
message explaining why the mapping failed (if it did), it'll show the
link to the `GenomeInfoDb` vignette for adding an organism, and then
return the original sequence names.

I have a slightly modified version in derfinder
https://github.com/lcolladotor/derfinder/blob/master/R/extendedMapSeqlevels.R
 or 
https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/derfinder/R/extendedMapSeqlevels.R
where I simply assume some other default values and guarantee
continuity with how I was doing things previously.

Naturally, R CMD check now shows a NOTE because I'm using the :::
syntax for un-exported functions from `GenomeInfoDb`.

Cheers,
Leo


Leonardo Collado Torres, PhD Candidate
Department of Biostatistics
Johns Hopkins University
Bloomberg School of Public Health
Website: http://www.biostat.jhsph.edu/~lcollado/
Blog: http://lcolladotor.github.io/

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel