Re: [Bioc-devel] plotPCA for BiocGenerics

2014-10-20 Thread Michael Love
hi Kevin,

that would imply there is only one way to plot an object of a given class.
Additionally, it would break a lot of code.​

best,

Mike

On Mon, Oct 20, 2014 at 12:50 PM, Kevin Coombes kevin.r.coom...@gmail.com
wrote:

 But shouldn't they all really just be named plot for the appropriate
 objects?  In which case, there would already be a perfectly good generic
 On Oct 20, 2014 10:27 AM, Michael Love michaelisaiahl...@gmail.com
 wrote:

 I noticed that 'plotPCA' functions are defined in EDASeq, DESeq2, DESeq,
 affycoretools, Rcade, facopy, CopyNumber450k, netresponse, MAIT (maybe
 more).

 Sounds like a case for BiocGenerics.

 best,

 Mike

 [[alternative HTML version deleted]]

 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel



[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] plotPCA for BiocGenerics

2014-10-20 Thread Michael Love
I agree it depends on programming style. But to continue with the example,
it would have to be:

plot(as.mySpecialObjectPCA(mySpecialObject))

because we have many packages here with their own functions.

Wouldn't this just proliferate the number of classes instead of functions?
For every type of plot in my package, I would have to add a special class
branching off of mySpecialObject. Users would have to remember to convert
from mySpecialObject to mySpecialObjectPCA or mySpecialObjectMAPlot or
mySpecialObjectVarianceMeanPlot, etc.

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] plotPCA for BiocGenerics

2014-10-20 Thread Kevin Coombes
Hi,

It depends.

The traditional R approach to these matters is that you (a) first 
perform some sort of an analysis and save the results as an object and 
then (b) show or plot what you got.  It is part (b) that tends to be 
really generic, and (in my opinion) should have really generic names -- 
like show or plot or hist or image.

With PCA in particular, you usually have to perform a bunch of 
computations in order to get the principal components from some part of 
the data.  As I understand it now, these computations are performed 
along the way as part of the various plotPCA functions. The R way to 
do this would be something like
 pca - performPCA(mySpecialObject)  # or as.PCA(mySpecialObject)
 plot(pca) # to get the scatter plot
This apporach has the user-friendly advantage that you can tweak the 
plot (in terms of colors, symbols, ranges, titles, etc) without having 
to recompute the principal components every time. (I often find myself 
re-plotting the same PCA several times, with different colors or symbols 
for different factrors associated with the samples.) In addition, you 
could then also do something like
 screeplot(pca)
to get a plot of the percentages of variance explained.

My own feeling is that if the object doesn't know what to do when you 
tell it to plot itself, then you haven't got the right abstraction.

You may still end up needing generics for each kind of computation you 
want to perform (PCA, RLE, MA, etc), which is why I suggested an 
as.PCA function.  After all, as is already pretty generic.  In the 
long run, l this would herlp BioConductor developers, since they 
wouldn't all have to reimplement the visualization code; they would just 
have to figure out how to convert their own object into a PCA or RLE or 
MA object.

And I know that this plotWhatever approach is used elsewhere in 
BioConductor, and it has always bothered me. It just seemed that a post 
suggesting a new generic function provided a reasonable opportunity to 
point out that there might be a better way.

Best,
   Kevin

PS: My own ClassDicsovery package, which is available from RForge via
**|install.packages(ClassDiscovery, 
repos=http://R-Forge.R-project.org;)|**
includes a SamplePCA class that does something roughly similar to this 
for microarrays.

PPS (off-topic): The worst offender in base R -- because it doesn't use 
this typical approch -- is the heatmap function.  Having tried to 
teach this function in several different classes, I have come to the 
conclusion that it is basically unusable by mortals. And I think the 
problem is that it tries to combine too many steps -- clustering rows, 
clustering columns, scaling, visualization -- all in a single fiunction

On 10/20/2014 3:47 PM, davide risso wrote:
 Hi Kevin,

 I don't agree. In the case of EDASeq (as I suppose it is the case for 
 DESeq/DESeq2) plotting the principal components of the count matrix is 
 only one of possible exploratory plots (RLE plots, MA plots, etc.).
 So, in my opinion, it makes more sense from an object oriented point 
 of view to have multiple plotting methods for a single RNA-seq 
 experiment object.

 In addition, this is the same strategy adopted elsewhere in 
 Bioconductor, e.g., for the plotMA method.

 Just my two cents.

 Best,
 davide

 On Mon, Oct 20, 2014 at 11:30 AM, Kevin Coombes 
 kevin.r.coom...@gmail.com mailto:kevin.r.coom...@gmail.com wrote:

 I understand that breaking code is a problem, and that is
 admittedly the main reason not to immediately adopt my suggestion.

 But as a purely logical exercise, creating a PCA object X or
 something similar and using either
 plot(X)
 or
 plot(as.PCA(mySpecialObject))
 is a much more sensible use of object-oriented programming/design.
 This requires no new generics (to write or to learn).

 And you could use it to transition away from the current system by
 convincing the various package maintainers to re-implement plotPCA
 as follows:

 plotPCA - function(object, ...) {
   plot(as.PCA(object), ...)
 }

 This would be relatively easy to eventually deprecate and teach
 users to switch to the alternative.


 On 10/20/2014 1:07 PM, Michael Love wrote:
 hi Kevin,

 that would imply there is only one way to plot an object of a
 given class. Additionally, it would break a lot of code.​

 best,

 Mike

 On Mon, Oct 20, 2014 at 12:50 PM, Kevin Coombes
 kevin.r.coom...@gmail.com mailto:kevin.r.coom...@gmail.com wrote:

 But shouldn't they all really just be named plot for the
 appropriate objects?  In which case, there would already be a
 perfectly good generic

 On Oct 20, 2014 10:27 AM, Michael Love
 michaelisaiahl...@gmail.com
 mailto:michaelisaiahl...@gmail.com wrote:

 I noticed that 'plotPCA' functions are defined in EDASeq,
 DESeq2, DESeq,
 affycoretools, Rcade, 

Re: [Bioc-devel] plotPCA for BiocGenerics

2014-10-20 Thread Kevin Coombes
Hi,

I don't see how it needs more functions (as long as you can get 
developers to agree).  Suppose that someone can define a reusable PCA 
class.  This will contain a single plot generic function, defined once 
and reused by other classes. The existing plotPCA interface can also 
be implemented just once, in this class, as

 plotPCA - function(object, ...) plot(as.PCA(object), ...)

This can be exposed to users of your class through namespaces.  Then the 
only thing a developer needs to implement in his own class is the single 
as.PCA function.  And he/she would have already been rquired to 
implement this as part of the old plotPCA function.  So it can be 
extracted from that, and the developer doesn't have to reimplement the 
visualization code from the PCA class.

Best,
   Kevin

On 10/20/2014 5:15 PM, davide risso wrote:
 Hi Kevin,

 I see your points and I agree (especially for the specific case of 
 plotPCA that involves some non trivial computations).

 On the other hand, having a wrapper function that starting from the 
 raw data gives you a pretty picture (with virtually zero effort by 
 the user) using a sensible choice of parameters that are more or less 
 OK for RNA-seq data is useful for practitioners that just want to look 
 for patterns in the data.

 I guess it would be the same to have a PCA method for each of the 
 objects and then using the plot method on those new objects, but that 
 would just create a lot more objects and functions than the current 
 approach (like Mike was saying).

 Your as.pca or performPCA approach would be definitely better if 
 all the different methods would create objects of the *same* PCA 
 class, but since we are talking about different packages, I don't know 
 how easy it would be to coordinate. But perhaps this is the way we 
 should go.

 Best,
 davide



 On Mon, Oct 20, 2014 at 1:26 PM, Kevin Coombes 
 kevin.r.coom...@gmail.com mailto:kevin.r.coom...@gmail.com wrote:

 Hi,

 It depends.

 The traditional R approach to these matters is that you (a)
 first perform some sort of an analysis and save the results as an
 object and then (b) show or plot what you got.  It is part (b)
 that tends to be really generic, and (in my opinion) should have
 really generic names -- like show or plot or hist or image.

 With PCA in particular, you usually have to perform a bunch of
 computations in order to get the principal components from some
 part of the data.  As I understand it now, these computations are
 performed along the way as part of the various plotPCA
 functions.  The R way to do this would be something like
 pca - performPCA(mySpecialObject)# or as.PCA(mySpecialObject)
 plot(pca) # to get the scatter plot
 This apporach has the user-friendly advantage that you can tweak
 the plot (in terms of colors, symbols, ranges, titles, etc)
 without having to recompute the principal components every time.
 (I often find myself re-plotting the same PCA several times, with
 different colors or symbols for different factrors associated with
 the samples.) In addition, you could then also do something like
 screeplot(pca)
 to get a plot of the percentages of variance explained.

 My own feeling is that if the object doesn't know what to do when
 you tell it to plot itself, then you haven't got the right
 abstraction.

 You may still end up needing generics for each kind of computation
 you want to perform (PCA, RLE, MA, etc), which is why I suggested
 an as.PCA function.  After all, as is already pretty generic. 
 In the long run, l this would herlp BioConductor developers, since
 they wouldn't all have to reimplement the visualization code; they
 would just have to figure out how to convert their own object into
 a PCA or RLE or MA object.

 And I know that this plotWhatever approach is used elsewhere in
 BioConductor, and it has always bothered me. It just seemed that a
 post suggesting a new generic function provided a reasonable
 opportunity to point out that there might be a better way.

 Best,
   Kevin

 PS: My own ClassDicsovery package, which is available from
 RForge via
 **|install.packages(ClassDiscovery,
 repos=http://R-Forge.R-project.org;
 http://R-Forge.R-project.org)|**
 includes a SamplePCA class that does something roughly similar
 to this for microarrays.

 PPS (off-topic): The worst offender in base R -- because it
 doesn't use this typical approch -- is the heatmap function. 
 Having tried to teach this function in several different classes,
 I have come to the conclusion that it is basically unusable by
 mortals. And I think the problem is that it tries to combine too
 many steps -- clustering rows, clustering columns, scaling,
 visualization -- all in a single fiunction


 On 10/20/2014 3:47 PM, davide risso wrote:
 

Re: [Bioc-devel] Why be default 'D' is not dropped from coverage?

2014-10-20 Thread Hervé Pagès

Hi Leo,

On 10/18/2014 10:50 AM, Leonardo Collado Torres wrote:

Hi,

A collaborator of mine is working on a new software and we while we were
doing a sanity check to compare the base-level coverage from BAM files
and bigWig files generated from his software we realized that by default
bases corresponding to a 'D' on the CIGAR string get counted when
reading coverage from BAM files.

That is:

  getMethod('coverage', 'GAlignments')
Method Definition:

function (x, shift = 0L, width = NULL, weight = 1L, ...)
{
 .local - function (x, shift = 0L, width = NULL, weight = 1L,
 method = c(auto, sort, hash), drop.D.ranges = FALSE)
 coverage(grglist(x, drop.D.ranges = drop.D.ranges), shift = shift,
 width = width, weight = weight, method = method)
 .local(x, shift, width, weight, ...)
}
environment: namespace:GenomicAlignments

Signatures:
 x
target  GAlignments
defined GAlignments
  packageVersion('GenomicAlignments')
[1] ‘1.2.0’


You can see that this is default elsewhere, for example:

  extractAlignmentRangesOnReference
function (cigar, pos = 1L, drop.D.ranges = FALSE, f = NULL)
{
 if (!isTRUEorFALSE(drop.D.ranges))
 stop('drop.D.ranges' must be TRUE or FALSE)
 if (drop.D.ranges) {
 ops - c(M, =, X, I)
 }
 else {
 ops - c(M, =, X, I, D)
 }
 cigarRangesAlongReferenceSpace(cigar, flag = NULL, pos = pos,
 f = f, ops = ops, drop.empty.ranges = FALSE, reduce.ranges = TRUE)
}
environment: namespace:GenomicAlignments


What is the rationale for setting `drop.D.ranges` by default to FALSE?


Because last time I checked (but that was 3 or 4 years ago), that's what
Samtools pileup was doing.

Cheers,
H.



Thanks,
Leo




--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] plotPCA for BiocGenerics

2014-10-20 Thread Michael Love
Ah, I see now. Personally, I don't think Bioconductor developers should
have to agree on single plotting functions for basic classes like 'PCA'
(because this logic applies equally to the situation of all Bioconductor
developers agreeing on single MA-plot, a single variance-mean plot, etc). I
think letting developers define their plotPCA makes contributions easier (I
don't have to ask the owner of plot.PCA to incorporate something), even
though it means we have a growing list of generics.

Still you have a good point about splitting computation and plotting. In
practice, we subset the rows so PCA is not laborious.


On Mon, Oct 20, 2014 at 5:38 PM, Kevin Coombes kevin.r.coom...@gmail.com
wrote:

  Hi,

 I don't see how it needs more functions (as long as you can get developers
 to agree).  Suppose that someone can define a reusable PCA class.  This
 will contain a single plot generic function, defined once and reused by
 other classes. The existing plotPCA interface can also be implemented
 just once, in this class, as

 plotPCA - function(object, ...) plot(as.PCA(object), ...)

 This can be exposed to users of your class through namespaces.  Then the
 only thing a developer needs to implement in his own class is the single
 as.PCA function.  And he/she would have already been rquired to implement
 this as part of the old plotPCA function.  So it can be extracted from
 that, and the developer doesn't have to reimplement the visualization code
 from the PCA class.

 Best,
   Kevin


 On 10/20/2014 5:15 PM, davide risso wrote:

 Hi Kevin,

  I see your points and I agree (especially for the specific case of
 plotPCA that involves some non trivial computations).

  On the other hand, having a wrapper function that starting from the
 raw data gives you a pretty picture (with virtually zero effort by the
 user) using a sensible choice of parameters that are more or less OK for
 RNA-seq data is useful for practitioners that just want to look for
 patterns in the data.

  I guess it would be the same to have a PCA method for each of the
 objects and then using the plot method on those new objects, but that would
 just create a lot more objects and functions than the current approach
 (like Mike was saying).

  Your as.pca or performPCA approach would be definitely better if all
 the different methods would create objects of the *same* PCA class, but
 since we are talking about different packages, I don't know how easy it
 would be to coordinate. But perhaps this is the way we should go.

  Best,
 davide



 On Mon, Oct 20, 2014 at 1:26 PM, Kevin Coombes kevin.r.coom...@gmail.com
 wrote:

  Hi,

 It depends.

 The traditional R approach to these matters is that you (a) first
 perform some sort of an analysis and save the results as an object and then
 (b) show or plot what you got.  It is part (b) that tends to be really
 generic, and (in my opinion) should have really generic names -- like
 show or plot or hist or image.

 With PCA in particular, you usually have to perform a bunch of
 computations in order to get the principal components from some part of the
 data.  As I understand it now, these computations are performed along the
 way as part of the various plotPCA functions.  The R way to do this
 would be something like
 pca - performPCA(mySpecialObject)  # or as.PCA(mySpecialObject)
 plot(pca) # to get the scatter plot
 This apporach has the user-friendly advantage that you can tweak the plot
 (in terms of colors, symbols, ranges, titles, etc) without having to
 recompute the principal components every time. (I often find myself
 re-plotting the same PCA several times, with different colors or symbols
 for different factrors associated with the samples.) In addition, you could
 then also do something like
 screeplot(pca)
 to get a plot of the percentages of variance explained.

 My own feeling is that if the object doesn't know what to do when you
 tell it to plot itself, then you haven't got the right abstraction.

 You may still end up needing generics for each kind of computation you
 want to perform (PCA, RLE, MA, etc), which is why I suggested an as.PCA
 function.  After all, as is already pretty generic.  In the long run, l
 this would herlp BioConductor developers, since they wouldn't all have to
 reimplement the visualization code; they would just have to figure out how
 to convert their own object into a PCA or RLE or MA object.

 And I know that this plotWhatever approach is used elsewhere in
 BioConductor, and it has always bothered me. It just seemed that a post
 suggesting a new generic function provided a reasonable opportunity to
 point out that there might be a better way.

 Best,
   Kevin

 PS: My own ClassDicsovery package, which is available from RForge via
 *install.packages(ClassDiscovery,
 repos=http://R-Forge.R-project.org; http://R-Forge.R-project.org)*
 includes a SamplePCA class that does something roughly similar to this
 for microarrays.

 PPS (off-topic): 

Re: [Bioc-devel] plotPCA for BiocGenerics

2014-10-20 Thread Kevin Coombes
Well. I have two responses to that.

First, I think it would be a lot better/easier for users if (most) 
developers could make use of the same plot function for basic classes 
like PCA.

Second, if you think the basic PCA plotting routine needs enhancements, 
you still have two options.  On the one hand, you could (as you said) 
try to convince the maintainer of PCA to add what you want.  If it's 
generally valuable, then he'd probably do it --- and other classes that 
use it would benefit.  On the other hand, if it really is a special 
enhancement that only makes sense for your class, then you can derive a 
class from the basic PCA class
 setClass(mySpecialPCA, contains=c(PCA), *other stuff here*)
  and implement your own version of the plot generic for this class.  
And you could tweak the as.PCA function so it returns an object of the 
mySpecialPCA class. And the user could still just plot the result 
without hacving to care what's happening behind the scenes.

On 10/20/2014 5:59 PM, Michael Love wrote:
 Ah, I see now. Personally, I don't think Bioconductor developers 
 should have to agree on single plotting functions for basic classes 
 like 'PCA' (because this logic applies equally to the situation of all 
 Bioconductor developers agreeing on single MA-plot, a single 
 variance-mean plot, etc). I think letting developers define their 
 plotPCA makes contributions easier (I don't have to ask the owner of 
 plot.PCA to incorporate something), even though it means we have a 
 growing list of generics.

 Still you have a good point about splitting computation and plotting. 
 In practice, we subset the rows so PCA is not laborious.


 On Mon, Oct 20, 2014 at 5:38 PM, Kevin Coombes 
 kevin.r.coom...@gmail.com mailto:kevin.r.coom...@gmail.com wrote:

 Hi,

 I don't see how it needs more functions (as long as you can get
 developers to agree).  Suppose that someone can define a reusable
 PCA class.  This will contain a single plot generic function,
 defined once and reused by other classes. The existing plotPCA
 interface can also be implemented just once, in this class, as

 plotPCA - function(object, ...) plot(as.PCA(object), ...)

 This can be exposed to users of your class through namespaces. 
 Then the only thing a developer needs to implement in his own
 class is the single as.PCA function.  And he/she would have
 already been rquired to implement this as part of the old
 plotPCA function.  So it can be extracted from that, and the
 developer doesn't have to reimplement the visualization code from
 the PCA class.

 Best,
   Kevin


 On 10/20/2014 5:15 PM, davide risso wrote:
 Hi Kevin,

 I see your points and I agree (especially for the specific case
 of plotPCA that involves some non trivial computations).

 On the other hand, having a wrapper function that starting from
 the raw data gives you a pretty picture (with virtually zero
 effort by the user) using a sensible choice of parameters that
 are more or less OK for RNA-seq data is useful for practitioners
 that just want to look for patterns in the data.

 I guess it would be the same to have a PCA method for each of the
 objects and then using the plot method on those new objects, but
 that would just create a lot more objects and functions than the
 current approach (like Mike was saying).

 Your as.pca or performPCA approach would be definitely better
 if all the different methods would create objects of the *same*
 PCA class, but since we are talking about different packages, I
 don't know how easy it would be to coordinate. But perhaps this
 is the way we should go.

 Best,
 davide



 On Mon, Oct 20, 2014 at 1:26 PM, Kevin Coombes
 kevin.r.coom...@gmail.com mailto:kevin.r.coom...@gmail.com wrote:

 Hi,

 It depends.

 The traditional R approach to these matters is that you (a)
 first perform some sort of an analysis and save the results
 as an object and then (b) show or plot what you got.  It is
 part (b) that tends to be really generic, and (in my opinion)
 should have really generic names -- like show or plot or
 hist or image.

 With PCA in particular, you usually have to perform a bunch
 of computations in order to get the principal components from
 some part of the data.  As I understand it now, these
 computations are performed along the way as part of the
 various plotPCA functions.  The R way to do this would be
 something like
 pca - performPCA(mySpecialObject)  # or
 as.PCA(mySpecialObject)
 plot(pca) # to get the scatter plot
 This apporach has the user-friendly advantage that you can
 tweak the plot (in terms of colors, symbols, ranges, titles,
 etc) without having to recompute the principal components