Re: [Bioc-devel] Changes to the SummarizedExperiment Class

Robert Castelo Wed, 04 Mar 2015 10:59:14 -0800

some of the goals behind this discussion are IMO similar to the ones for
biocMultiAssay:


https://github.com/vjcitn/biocMultiAssay


maybe Vince can confirm.

robert.

On 03/04/2015 05:16 PM, Tim Triche, Jr. wrote:

Oh, I don't disagree.  Perhaps the two problems can be addressed
simultaneously by

1) deciding on what contracts a multi-assay container can/would demand to
be useful
2) calling it something besides SummarizedExperiment, say,
ExperimentCollection

Then the SE API could stay the same as it is (which is already very useful)
and progress could be sought in the offshoot (ExperimentCollection or
whatever) without breaking things that rely on SE.

Just off the top of my head, a most generically useful container for DNA
methylation & CNV data (which can of course be called from the same assay)
is Kasper & JP's GenomicRatioSet, which already has some weird quirks for
eSet backwards compatibility.  (e.g. sampleNames(x) works, but
sampleNames(x)<- does not work; pData(x) calls colData(x); fData(x) calls
rowData(x))  There are little niggles that I should probably just send in a
patch for, but a cleaner overall container would be better, if for no other
reason than the aforementioned ability to easily experiment with
imputation. An approach that I've been using is to stuff the SNPs, CNV (as
GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE).  This is...
somewhat less than optimal, especially when subsetting.

But it does suggest that I could define a coercion from the current
rambling wreck into a nice clean new class/API (ExperimentCollection or
whatever) and I'll bet other package authors could, too.  The presence of a
GRangesFrame would then be handy for returning a given assay's results, so
that the user could be blissfully ignorant of the storage backing (ff,
BigMatrix, Matrix, matrix, Rle, whatever) but not lose the data management
advantages of a SummarizedExperiment.

JMHO







Statistics is the grammar of science.
Karl Pearson<http://en.wikipedia.org/wiki/The_Grammar_of_Science>

On Wed, Mar 4, 2015 at 6:40 AM, Vincent Carey<st...@channing.harvard.edu>
wrote:

  I am a bit concerned about any major alterations to the
SummarizedExperiment API.  We have two papers and plenty of working code
that use it in meaningful ways.  Effort required to keep new formulations
back-compatible as well as bug-free has to be weighed seriously.

  I agree that the name is not ideal.  We are learning as we go.

  Seems to make sense to start with the contracts we want the instances of
a class to satisfy.  I have long felt that the X[i, j] idiom is one users
and developers should be comfortable with, even insist on, and for
consistency with the matrix operations idiom, it should work in a natural
way for numeric indexing.  This seems like an important constraint.
subsetBy* is a useful idiom, but it is conceivable that we would adopt
filter() for row-oriented selections and select() for column-oriented
selections.  Do we have to make any special design considerations to allow
very smooth interoperation with out-of-memory resources for certain
components for developers who want to allow this?
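
To make the X[i, j] contract concrete, here is a minimal sketch in base R
(methods package only). The ToyExperiment class and its slot names are
hypothetical stand-ins, not the SummarizedExperiment implementation; the
sketch only assumes that numeric i and j should subset the assay and its
row and column annotations together.

```r
library(methods)

# Hypothetical toy container: one assay matrix plus row and column annotations.
setClass("ToyExperiment",
         slots = c(assay = "matrix",
                   rowData = "data.frame",
                   colData = "data.frame"))

# Two-dimensional "[": i selects rows (features), j selects columns (samples).
# A missing index means "keep everything" along that dimension.
setMethod("[", "ToyExperiment", function(x, i, j, ..., drop = TRUE) {
  if (missing(i)) i <- seq_len(nrow(x@assay))
  if (missing(j)) j <- seq_len(ncol(x@assay))
  new("ToyExperiment",
      assay   = x@assay[i, j, drop = FALSE],
      rowData = x@rowData[i, , drop = FALSE],
      colData = x@colData[j, , drop = FALSE])
})

toy <- new("ToyExperiment",
           assay = matrix(1:6, nrow = 3,
                          dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))),
           rowData = data.frame(gene = c("g1", "g2", "g3"),
                                stringsAsFactors = FALSE),
           colData = data.frame(sample = c("s1", "s2"),
                                stringsAsFactors = FALSE))

# Numeric indexing keeps the assay and both annotation tables in sync.
sub <- toy[2:3, 1]
dim(sub@assay)
sub@rowData$gene
sub@colData$sample
```

A filter()/select() style interface could be layered on top of the same
method; the sketch deliberately leaves that out.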

  We should have a reasonable way to get data on what is out there, what
is used, how it is most effectively used.  What's the SE API?  Is it
well-adapted to requirements of DESeq2?  Other killer packages that
use/don't use it?  Even getting data on the formal API for a class is not
all that familiar.  And if folks are writing non-S4 interfaces (i.e., naked
functions) we have no way of identifying them.  See below for one way of
discovering the API for SummarizedExperiment.

  In summary, I think we have to be careful about overdesigning too
early.  Getting clear on contracts seems the best way to ensure reuse, and
we really want that so that reliability is continually assessed.  My sense
is that it is good to give developers something they'll gladly extend, not
necessarily reuse directly.  So we don't have to have broad consensus on
class details, but on the minimal abstraction and on obligatory tests on
its basic implementation.

methods(class="SummarizedExperiment")  # perhaps an obsolete version of methods cataloguer by MTM

DataFrame with 76 rows and 3 columns
     generic       signature                                                                 package
     <character>   <character>                                                               <character>
1    [             x="SummarizedExperiment", i="ANY", j="ANY", drop="ANY"                    base
2    [             x="SummarizedExperiment", i="ANY", j="missing", value="ANY"               base
3    [             x="SummarizedExperiment", i="ANY", j="missing"                            base
4    [<-           x="SummarizedExperiment", i="ANY", j="ANY", value="SummarizedExperiment"  base
5    assay         x="SummarizedExperiment", i="character"                                   GenomicRanges
...  ...           ...                                                                       ...
72   updateObject  object="SummarizedExperiment"                                             BiocGenerics
73   values        x="SummarizedExperiment"                                                  S4Vectors
74   values<-      x="SummarizedExperiment"                                                  S4Vectors
75   width         x="SummarizedExperiment"                                                  BiocGenerics
76   width<-       x="SummarizedExperiment"                                                  BiocGenerics

On Wed, Mar 4, 2015 at 8:32 AM, Hector Corrada Bravo<hcorr...@gmail.com>
wrote:

May I advocate for  'IndexedDataFrame' or 'IndexedFrame'? 'rowIndices' can
return whatever makes sense (GRanges, or other data structures -thinking
taxonomy for metagenomics for example-). GRangesFrame can inherit from
this.

On Wed, Mar 4, 2015 at 3:28 AM, Hervé Pagès<hpa...@fredhutch.org>  wrote:

GRangesFrame is an interesting idea and I gave it some thought.

There is this nice symmetry between GRanges and GRangesFrame:

- GRanges = a naked GRanges + a DataFrame accessible via mcols()

- GRangesFrame = a DataFrame + a naked GRanges accessible via
                  some accessor (e.g. rowRanges())

So GRanges and GRangesFrame are equivalent in terms of what they
can hold, but different in terms of API: the former has the ranges
API as primary API and the DataFrame API on its mcols() component,
and the latter has the DataFrame API as primary API and the ranges
API on its rowRanges() component. Nice switch!

What does this API switch bring us? A GRangesFrame object is now
an object that fully behaves like a DataFrame and people can also
perform range-based operations on its rowRanges() component.
Here is what I'm afraid is going to happen: people will also want
to be able to perform range-based operations *directly* on
these objects, i.e. without having to call rowRanges() first.
So for example when they do subsetByOverlaps(), subsetting
happens vertically. Also the Hits object returned by findOverlaps()
would contain row indices. Problem with this is that these objects
now start to suffer from the "dual personality syndrome". For
example, it's not clear anymore what their length should be.
Strictly speaking it should be their number of columns (that's
what the length of a DataFrame is), but the ranges API that
we're trying to put on them also makes them feel like vectors
along the vertical dimension so it also feels that their length
should be their number of rows. Same thing with 1D subsetting.
Why does it subset the columns and not the rows? Most people
are now confused.
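
The length ambiguity described above can be seen in a small sketch. It
uses a base R data.frame as a stand-in for DataFrame (an assumption made
so the example runs without Bioconductor), and a plain integer vector of
hypothetical range starts as a stand-in for the ranges component.

```r
# Stand-in for the DataFrame personality: a list of columns.
annot <- data.frame(score = c(0.1, 0.5, 0.9),
                    strand = c("+", "-", "+"),
                    stringsAsFactors = FALSE)

# Stand-in for the ranges personality: one element per row (hypothetical values).
starts <- c(100L, 200L, 300L)

# The two personalities disagree about what "length" means.
frame_length  <- length(annot)   # number of columns
ranges_length <- length(starts)  # number of rows

# They also disagree about what 1D subsetting selects.
first_of_frame  <- names(annot[1])  # a column
first_of_ranges <- starts[1]        # a row
```

An object exposing both APIs directly would have to pick one of these
answers for length() and for x[i], which is the confusion at issue.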

It's interesting to note that the same thing happens with GRanges
objects, but in the opposite direction: people wish they could
do DataFrame operations directly on them without calling mcols()
first. But in order to preserve the good health of GRanges objects,
we've not done that (except for $, a shortcut for mcols(x)$,
the pressure was just too strong).

H.



On 03/03/2015 04:35 PM, Michael Lawrence wrote:

Should be possible for the annotations to be of any type, as long as they
satisfy a simple contract of NROW() and 2D "[". Then, you could have a
DataFrame, GRanges, or whatever in there. But it would be nice to have a
special class for the container with range information. The contract for
the range annotation would be to have a granges() method.

I agree it would be nice if there was a way with the methods package to
easily assert such contracts. For example, one could define an interface
with a set of generics (and optionally the relevant position in the
generic signature). Then, once all of the methods have been assigned for a
particular class, it is made to inherit from that contract class. There
are lots of gotchas though. Not sure how useful it would be in practice.
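
One way to picture such a contract check is the sketch below, written
with the methods package only. Everything named in it is hypothetical:
rowCount() and rangesOf() are stand-ins for NROW() and granges(), the two
annotation classes are toys, and satisfiesContract() is an illustrative
helper, not an existing function. It only tests whether methods exist; it
does not make a class inherit from a contract class.

```r
library(methods)

# Hypothetical generics standing in for the contract members named above.
setGeneric("rowCount", function(x) standardGeneric("rowCount"))
setGeneric("rangesOf", function(x) standardGeneric("rangesOf"))

# Two toy annotation classes: one plain, one carrying range information.
setClass("PlainAnnotation", slots = c(values = "data.frame"))
setClass("RangeAnnotation",
         slots = c(values = "data.frame", start = "integer", end = "integer"))

setMethod("rowCount", "PlainAnnotation", function(x) nrow(x@values))
setMethod("rowCount", "RangeAnnotation", function(x) nrow(x@values))
setMethod("rangesOf", "RangeAnnotation",
          function(x) data.frame(start = x@start, end = x@end))

# A contract is just a character vector of generic names.
annotationContract <- "rowCount"
rangeContract <- c("rowCount", "rangesOf")

# TRUE when `class` has a directly defined method for every generic listed.
satisfiesContract <- function(class, generics) {
  all(vapply(generics, function(g) existsMethod(g, class), logical(1)))
}

plain_ok      <- satisfiesContract("PlainAnnotation", annotationContract)
plain_ranged  <- satisfiesContract("PlainAnnotation", rangeContract)
ranged_ranged <- satisfiesContract("RangeAnnotation", rangeContract)
```

Because existsMethod() only sees directly defined methods, inherited
methods would need hasMethod(..., inherited = TRUE) instead, which is one
of the gotchas such a scheme would have to settle.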


On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty<haverty.pe...@gene.com>
wrote:

  There are some nice similarities in these new imaginary types.  A
"GRangesFrame" is a list of dimensionally identical things (columns) and
some row meta-data (the GRanges).  The SE-like object is similarly a list
of dimensionally like things (matrices, RleDataFrames, BigMatrix objects,
HDF5-backed things) with some row meta-data (a DataFrame or GRangesFrame).
Elegant?  Maybe they would actually be relatives in the class tree.

I wonder if this kind of thing would be easier if we had Java-style
Interfaces or duck-typing.  The "x" slot of "y" holds something that
implements this set of methods ...

Oh, and kinda apropos, the genoset class will probably go away or become
an extension to this new SE-like thing.  The extra stuff that comes along
with genoset will still be available.

Pete

____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr.<tim.tri...@gmail.com>
wrote:

  This.


It would be damned near perfect as a return value for assays coming out of
an object that held several such assays at several time points in a
population, where there are both assay-wise and covariate-wise "holes" that
could nonetheless be usefully imputed across assays.


Statistics is the grammar of science.
Karl Pearson<http://en.wikipedia.org/wiki/The_Grammar_of_Science>

On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty<haverty.pe...@gene.com>
wrote:



   I still think GRanges should be a subclass of DataFrame, which would
make this easy, but I don't seem to be winning that argument.

Just impossible. As Michael mentioned back in November, they have
conflicting APIs.



Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
(without mcols) as an index?


          [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

--
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550
