My response was meant to address this:

1) The fixed-dimension, fixed-sample-set case is a solved problem, and SE is that solution.
2) The multi-assay case, with "holes" across samples, remains an ugly, thorny problem that may need a new API.
So why not keep SE as stable as possible, and dump all the explosive changes
into the latter?

Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>

On Wed, Mar 4, 2015 at 9:12 AM, Vincent Carey <st...@channing.harvard.edu> wrote:

> On Wed, Mar 4, 2015 at 12:01 PM, Robert Castelo <robert.cast...@upf.edu> wrote:
>
>> some of the goals behind this discussion are IMO similar to the ones for
>> biocMultiAssay:
>>
>> https://github.com/vjcitn/biocMultiAssay
>>
>> maybe Vince can confirm.
>
> It is true that there are connections between the concerns. But the way I
> see it, the container design we are talking about in this thread addresses
> the management of a fixed common assay type over a fixed set of samples.
>
> The biocMultiAssay deals with the management of multiple assay types over
> multiple samples, with possible disparities in sample sets over the
> different assay types.
>
>> robert.
>>
>> On 03/04/2015 05:16 PM, Tim Triche, Jr. wrote:
>>
>>> Oh, I don't disagree. Perhaps the two problems can be addressed
>>> simultaneously by
>>>
>>> 1) deciding on what contracts a multi-assay container can/would demand
>>>    to be useful
>>> 2) calling it something besides SummarizedExperiment, say,
>>>    ExperimentCollection
>>>
>>> Then the SE API could stay the same as it is (which is already very
>>> useful) and progress could be sought in the offshoot (ExperimentCollection
>>> or whatever) without breaking things that rely on SE.
>>>
>>> Just off the top of my head, a most generically useful container for DNA
>>> methylation & CNV data (which can of course be called from the same assay)
>>> is Kasper & JP's GenomicRatioSet, which already has some weird quirks for
>>> eSet backwards compatibility (e.g. sampleNames(x) works, but
>>> sampleNames(x)<- does not work; pData(x) calls colData(x); fData(x) calls
>>> rowData(x)). There are little niggles that I should probably just send in
>>> a patch for, but a cleaner overall container would be better, if for no
>>> other reason than the aforementioned ability to easily experiment with
>>> imputation. An approach that I've been using is to stuff the SNPs, CNV (as
>>> GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE). This is...
>>> somewhat less than optimal, especially when subsetting.
>>>
>>> But it does suggest that I could define a coercion from the current
>>> rambling wreck into a nice clean new class/API (ExperimentCollection or
>>> whatever) and I'll bet other package authors could, too. The presence of a
>>> GRangesFrame would then be handy for returning a given assay's results, so
>>> that the user could be blissfully ignorant of the storage backing (ff,
>>> BigMatrix, Matrix, matrix, Rle, whatever) but not lose the data management
>>> advantages of a SummarizedExperiment.
>>>
>>> JMHO
>>>
>>> On Wed, Mar 4, 2015 at 6:40 AM, Vincent Carey <st...@channing.harvard.edu> wrote:
>>>
>>>> I am a bit concerned about any major alterations to the
>>>> SummarizedExperiment API. We have two papers and plenty of working code
>>>> that use it in meaningful ways. Effort required to keep new formulations
>>>> back-compatible as well as bug-free has to be weighed seriously.
>>>>
>>>> I agree that the name is not ideal. We are learning as we go.
>>>>
>>>> Seems to make sense to start with the contracts we want the instances of
>>>> a class to satisfy.
>>>> I have long felt that X[i, j] idiom is one users and developers should
>>>> be comfortable with, even insist on, and for consistency with matrix
>>>> operations idiom, it should work in a natural way for numeric indexing.
>>>> This seems like an important constraint. subsetBy* is a useful idiom,
>>>> but it is conceivable that we would adopt filter() for row-oriented
>>>> selections and select() for column-oriented selections. Do we have to
>>>> make any special design considerations to allow very smooth
>>>> interoperation with out-of-memory resources for certain components for
>>>> developers who want to allow this?
>>>>
>>>> We should have a reasonable way to get data on what is out there, what
>>>> is used, how it is most effectively used. What's the SE API? Is it
>>>> well-adapted to requirements of DESeq2? Other killer packages that
>>>> use/don't use it? Even getting data on the formal API for a class is not
>>>> all that familiar. And if folks are writing non-S4 interfaces (i.e.,
>>>> naked functions) we have no way of identifying them. See below for one
>>>> way of discovering the API for SummarizedExperiment.
>>>>
>>>> In summary, I think we have to be careful about overdesigning too early.
>>>> Getting clear on contracts seems the best way to ensure reuse, and we
>>>> really want that so that reliability is continually assessed. My sense
>>>> is that it is good to give developers something they'll gladly extend,
>>>> not necessarily reuse directly. So we don't have to have broad consensus
>>>> on class details, but on the minimal abstraction and on obligatory tests
>>>> on its basic implementation.
>>>>
>>>> methods(class="SummarizedExperiment")  # perhaps an obsolete version of methods cataloguer by MTM
>>>>
>>>> DataFrame with 76 rows and 3 columns
>>>>          generic  signature                                                                 package
>>>>      <character>  <character>                                                               <character>
>>>> 1              [  x="SummarizedExperiment", i="ANY", j="ANY", drop="ANY"                    base
>>>> 2              [  x="SummarizedExperiment", i="ANY", j="missing", value="ANY"               base
>>>> 3              [  x="SummarizedExperiment", i="ANY", j="missing"                            base
>>>> 4            [<-  x="SummarizedExperiment", i="ANY", j="ANY", value="SummarizedExperiment"  base
>>>> 5          assay  x="SummarizedExperiment", i="character"                                   GenomicRanges
>>>> ...          ...  ...                                                                       ...
>>>> 72  updateObject  object="SummarizedExperiment"                                             BiocGenerics
>>>> 73        values  x="SummarizedExperiment"                                                  S4Vectors
>>>> 74      values<-  x="SummarizedExperiment"                                                  S4Vectors
>>>> 75         width  x="SummarizedExperiment"                                                  BiocGenerics
>>>> 76       width<-  x="SummarizedExperiment"                                                  BiocGenerics
>>>>
>>>> On Wed, Mar 4, 2015 at 8:32 AM, Hector Corrada Bravo <hcorr...@gmail.com> wrote:
>>>>
>>>>> May I advocate for 'IndexedDataFrame' or 'IndexedFrame'? 'rowIndices' can
>>>>> return whatever makes sense (GRanges, or other data structures -thinking
>>>>> taxonomy for metagenomics for example-). GRangesFrame can inherit from
>>>>> this.
>>>>>
>>>>> On Wed, Mar 4, 2015 at 3:28 AM, Hervé Pagès <hpa...@fredhutch.org> wrote:
>>>>>
>>>>>> GRangesFrame is an interesting idea and I gave it some thoughts.
>>>>>>
>>>>>> There is this nice symmetry between GRanges and GRangesFrame:
>>>>>>
>>>>>>   - GRanges = a naked GRanges + a DataFrame accessible via mcols()
>>>>>>
>>>>>>   - GRangesFrame = a DataFrame + a naked GRanges accessible via
>>>>>>     some accessor (e.g. rowRanges())
>>>>>>
>>>>>> So GRanges and GRangesFrame are equivalent in terms of what they
>>>>>> can hold, but different in terms of API: the former has the ranges
>>>>>> API as primary API and the DataFrame API on its mcols() component,
>>>>>> and the latter has the DataFrame API as primary API and the ranges
>>>>>> API on its rowRanges() component. Nice switch!
>>>>>>
>>>>>> What does this API switch bring us? A GRangesFrame object is now
>>>>>> an object that fully behaves like a DataFrame and people can also
>>>>>> perform range-based operations on its rowRanges() component.
>>>>>> Here is what I'm afraid is going to happen: people will also want
>>>>>> to be able to perform range-based operations *directly* on
>>>>>> these objects, i.e. without having to call rowRanges() first.
>>>>>> So for example when they do subsetByOverlaps(), subsetting
>>>>>> happens vertically. Also the Hits object returned by findOverlaps()
>>>>>> would contain row indices. Problem with this is that these objects
>>>>>> now start to suffer from the "dual personality syndrome". For
>>>>>> example, it's not clear anymore what their length should be.
>>>>>> Strictly speaking it should be their number of columns (that's
>>>>>> what the length of a DataFrame is), but the ranges API that
>>>>>> we're trying to put on them also makes them feel like vectors
>>>>>> along the vertical dimension so it also feels that their length
>>>>>> should be their number of rows. Same thing with 1D subsetting.
>>>>>> Why does it subset the columns and not the rows? Most people
>>>>>> are now confused.
>>>>>>
>>>>>> It's interesting to note that the same thing happens with GRanges
>>>>>> objects, but in the opposite direction: people wish they could
>>>>>> do DataFrame operations directly on them without calling mcols()
>>>>>> first. But in order to preserve the good health of GRanges objects,
>>>>>> we've not done that (except for $, a shortcut for mcols(x)$,
>>>>>> the pressure was just too strong).
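
The length and 1D-subsetting ambiguity described above can be seen in a few lines. This is only a sketch: it uses a base-R data.frame as a stand-in for DataFrame (an assumption, made to avoid any Bioconductor dependency), and the column names are invented for illustration.

```r
# Base-R data.frame standing in for S4Vectors::DataFrame (assumption).
df <- data.frame(score = c(0.1, 0.5, 0.9), strand = c("+", "-", "+"))

n_len  <- length(df)     # counts columns, the "list of columns" view
n_rows <- nrow(df)       # counts rows, the "vector of ranges" view

one_d <- df[1]           # 1D subsetting picks a column
two_d <- df[1:2, ]       # 2D subsetting picks rows

cat("length:", n_len, " nrow:", n_rows, "\n")
cat("df[1] keeps column(s):", names(one_d), "\n")
cat("df[1:2, ] keeps", nrow(two_d), "rows\n")
```

A range-oriented user would expect length() and 1D "[" to follow the rows, which is exactly the conflict between the two APIs.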
>>>>>>
>>>>>> H.
>>>>>>
>>>>>> On 03/03/2015 04:35 PM, Michael Lawrence wrote:
>>>>>>
>>>>>>> Should be possible for the annotations to be of any type, as long as
>>>>>>> they satisfy a simple contract of NROW() and 2D "[". Then, you could
>>>>>>> have a DataFrame, GRanges, or whatever in there. But it would be nice
>>>>>>> to have a special class for the container with range information. The
>>>>>>> contract for the range annotation would be to have a granges() method.
>>>>>>>
>>>>>>> I agree it would be nice if there was a way with the methods package
>>>>>>> to easily assert such contracts. For example, one could define an
>>>>>>> interface with a set of generics (and optionally the relevant position
>>>>>>> in the generic signature). Then, once all of the methods have been
>>>>>>> assigned for a particular class, it is made to inherit from that
>>>>>>> contract class. There are lots of gotchas though. Not sure how useful
>>>>>>> it would be in practice.
>>>>>>>
>>>>>>> On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty <haverty.pe...@gene.com> wrote:
>>>>>>>
>>>>>>>> There are some nice similarities in these new imaginary types. A
>>>>>>>> "GRangesFrame" is a list of dimensionally identical things (columns)
>>>>>>>> and some row meta-data (the GRanges). The SE-like object is similarly
>>>>>>>> a list of dimensionally like things (matrices, RleDataFrames,
>>>>>>>> BigMatrix objects, HDF5-backed things) with some row meta-data (a
>>>>>>>> DataFrame or GRangesFrame). Elegant? Maybe they would actually be
>>>>>>>> relatives in the class tree.
>>>>>>>>
>>>>>>>> I wonder if this kind of thing would be easier if we had Java-style
>>>>>>>> Interfaces or duck-typing. The "x" slot of "y" holds something that
>>>>>>>> implements this set of methods ...
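
For what it's worth, the NROW() plus 2D "[" contract mentioned above can already be probed by duck typing, without any formal interface machinery. A minimal sketch follows; satisfies_row_contract is a hypothetical helper invented here, not a function from the methods package or from any Bioconductor package, and it only checks behaviour on one probe row.

```r
# Hypothetical helper: does `x` behave as the NROW() + 2D "[" contract asks?
satisfies_row_contract <- function(x) {
  k <- min(NROW(x), 1L)                          # probe one row if any exist
  sub <- tryCatch(x[seq_len(k), , drop = FALSE], # 2D row subsetting
                  error = function(e) NULL)      # no 2D "[" -> contract fails
  !is.null(sub) && NROW(sub) == k
}

ok_df  <- satisfies_row_contract(data.frame(gene = c("a", "b"), score = c(1, 2)))
ok_mat <- satisfies_row_contract(matrix(1:6, nrow = 3))
ok_vec <- satisfies_row_contract(1:5)            # plain vector: no 2D "["

print(c(data.frame = ok_df, matrix = ok_mat, vector = ok_vec))
```

A behavioural probe like this says nothing about which methods a class formally defines, so it complements rather than replaces the contract-class idea.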
>>>>>>>>
>>>>>>>> Oh, and kinda apropos, the genoset class will probably go away or
>>>>>>>> become an extension to this new SE-like thing. The extra stuff that
>>>>>>>> comes along with genoset will still be available.
>>>>>>>>
>>>>>>>> Pete
>>>>>>>>
>>>>>>>> ____________________
>>>>>>>> Peter M. Haverty, Ph.D.
>>>>>>>> Genentech, Inc.
>>>>>>>> phave...@gene.com
>>>>>>>>
>>>>>>>> On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. <tim.tri...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> This.
>>>>>>>>>
>>>>>>>>> It would be damned near perfect as a return value for assays coming
>>>>>>>>> out of an object that held several such assays at several time
>>>>>>>>> points in a population, where there are both assay-wise and
>>>>>>>>> covariate-wise "holes" that could nonetheless be usefully imputed
>>>>>>>>> across assays.
>>>>>>>>>
>>>>>>>>> On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty <haverty.pe...@gene.com> wrote:
>>>>>>>>>
>>>>>>>>>>>> I still think GRanges should be a subclass of DataFrame, which
>>>>>>>>>>>> would make this easy, but I don't seem to be winning that
>>>>>>>>>>>> argument.
>>>>>>>>>>>
>>>>>>>>>>> Just impossible. As Michael mentioned back in November, they have
>>>>>>>>>>> conflicting APIs.
>>>>>>>>>>
>>>>>>>>>> Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
>>>>>>>>>> (without mcols) as an index?
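
As a rough illustration of the "DataFrame plus a range index" idea, here is a base-R S4 sketch. Everything in it is an assumption made for illustration: the class name RangeIndexedFrame and its slots are hypothetical, a plain data.frame stands in for both DataFrame and the naked GRanges, and rowRanges() is defined locally (named after the accessor suggested in the thread) rather than taken from any package.

```r
library(methods)

# Hypothetical class: data columns plus a row-aligned range index.
# `ranges` is a plain data.frame standing in for a GRanges without mcols.
setClass("RangeIndexedFrame",
         slots = c(columns = "data.frame", ranges = "data.frame"),
         validity = function(object) {
           if (nrow(object@columns) != nrow(object@ranges))
             return("columns and ranges must have the same number of rows")
           TRUE
         })

# Local accessor for the index (an assumption, not a package generic).
setGeneric("rowRanges", function(x, ...) standardGeneric("rowRanges"))
setMethod("rowRanges", "RangeIndexedFrame", function(x, ...) x@ranges)

# 2D subsetting keeps the columns and the range index in sync.
setMethod("[", "RangeIndexedFrame", function(x, i, j, ..., drop = TRUE) {
  if (missing(i)) i <- seq_len(nrow(x@columns))
  if (missing(j)) j <- seq_len(ncol(x@columns))
  new("RangeIndexedFrame",
      columns = x@columns[i, j, drop = FALSE],
      ranges  = x@ranges[i, , drop = FALSE])
})

gf <- new("RangeIndexedFrame",
          columns = data.frame(beta = c(0.2, 0.8, 0.5), cn = c(2, 3, 1)),
          ranges  = data.frame(seqnames = c("chr1", "chr1", "chr2"),
                               start = c(100, 500, 50),
                               end   = c(200, 600, 150)))

sub <- gf[c(1, 3), "beta"]   # rows 1 and 3, column "beta"
print(rowRanges(sub))        # the index rows follow the subset
```

The sketch deliberately defines no length() or 1D "[" method, since those are the operations where the DataFrame and ranges views disagree.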
>>>>>> --
>>>>>> Hervé Pagès
>>>>>>
>>>>>> Program in Computational Biology
>>>>>> Division of Public Health Sciences
>>>>>> Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>> P.O. Box 19024
>>>>>> Seattle, WA 98109-1024
>>>>>>
>>>>>> E-mail: hpa...@fredhutch.org
>>>>>> Phone:  (206) 667-5791
>>>>>> Fax:    (206) 667-1319

>> --
>> Robert Castelo, PhD
>> Associate Professor
>> Dept. of Experimental and Health Sciences
>> Universitat Pompeu Fabra (UPF)
>> Barcelona Biomedical Research Park (PRBB)
>> Dr Aiguader 88
>> E-08003 Barcelona, Spain
>> telf: +34.933.160.514
>> fax: +34.933.160.550

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel