Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-03 Thread Gabe Becker
Jim et al.,

Why have two accessors (rowRanges, rowData), each of which is less
flexible than the underlying structure and thus will fail (return NULL? or
GRanges()/DataFrame() ?) in some proportion of valid objects?

~G

On Tue, Mar 3, 2015 at 2:37 PM, Jim Hester  wrote:

> Motivated by the discussion thread from November (https://stat.ethz.ch/
> pipermail/bioc-devel/2014-November/006686.html) the Bioconductor core team
> is planning on making changes to the SummarizedExperiment class.  Our end
> goal is to allow the @rowData slot to become more flexible and hold either
> a DataFrame or GRanges type object.
>
> To this end we have currently deprecated the current rowData accessor in
> favor of a rowRanges accessor.  This change has resulted in a few broken
> builds in devel, which we are in the process of fixing now.  We will
> contact any package authors directly if needed for this migration.
>
> The rowData accessor will be deprecated in this release, however eventually
> the plan is to re-purpose this function to serve as an accessor for
> DataFrame data on the rows.
>
> Please let us know if you have any questions with the above and if you need
> any assistance with the transition.
>
> ___
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>



-- 
Gabriel Becker, Ph.D
Computational Biologist
Genentech Research



Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-03 Thread Michael Lawrence
Seems like rowData could be made to work universally through coercion.
rowRanges would not, however, and one would like a convenient mechanism to
condition on whether range information is available. One way is to
introduce a new class and rely on dispatch. But that adds complexity.
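
A minimal sketch of the coercion-plus-dispatch idea, using only the methods
package. The classes ToySE and ToyRangedSE, the generics toyRowData() and
toyRowRanges(), and the helper hasRanges() are all hypothetical stand-ins
invented for illustration; they are not the Bioconductor classes or
accessors, and a plain data.frame with start/end columns plays the role of
a GRanges here.

```r
library(methods)

# Toy stand-ins (hypothetical; not the real Bioconductor classes).
setClass("ToySE", representation(rowAnnot = "ANY"))
setClass("ToyRangedSE", contains = "ToySE")

setGeneric("toyRowData", function(x) standardGeneric("toyRowData"))
setGeneric("toyRowRanges", function(x) standardGeneric("toyRowRanges"))

# rowData-like accessor: defined for every object, via coercion.
setMethod("toyRowData", "ToySE", function(x) as.data.frame(x@rowAnnot))

# rowRanges-like accessor: only defined for the ranged subclass.
setMethod("toyRowRanges", "ToyRangedSE",
          function(x) x@rowAnnot[, c("start", "end"), drop = FALSE])

# Conditioning on range information becomes a class test.
hasRanges <- function(x) is(x, "ToyRangedSE")

plain  <- new("ToySE", rowAnnot = list(gene = c("a", "b")))
ranged <- new("ToyRangedSE",
              rowAnnot = data.frame(start = c(1, 10), end = c(5, 20),
                                    gene = c("a", "b")))

toyRowData(plain)     # works through coercion
toyRowData(ranged)    # inherited method, also works
hasRanges(plain)
hasRanges(ranged)
```

Calling toyRowRanges(plain) signals an error because no method applies,
which is the "adds complexity" trade-off: callers must either test the
class first or be prepared for dispatch to fail.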



Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-03 Thread Peter Haverty
I'd like to see a basic class that takes a DataFrame and a sub-class that
takes a GRanges.  I still think GRanges should be a subclass of DataFrame,
which would make this easy, but I don't seem to be winning that argument.

While the hood is up, can we try some different names?
SummarizedExperiment never seemed like a great fit to me because it doesn't
necessarily contain experiments or summaries thereof.  It's a collection of
like-sized rectangular things with metadata on the two dimensions.  Maybe
the name could reflect what it holds rather than a common use case?
AnnotatedMatrixList?

Anyway, I'm excited to see a version on the way that takes a DataFrame as
rowData.  I'm glad you guys are working on that.

Regards,

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com



Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-03 Thread Hervé Pagès

On 03/03/2015 03:06 PM, Peter Haverty wrote:

> I'd like to see a basic class that takes a DataFrame and a sub-class that
> takes a GRanges.

Yes.

> I still think GRanges should be a subclass of DataFrame,
> which would make this easy, but I don't seem to be winning that argument.

Just impossible. As Michael mentioned back in November, they have
conflicting APIs.

> While the hood is up, can we try some different names?
> SummarizedExperiment never seemed like a great fit to me because it doesn't
> necessarily contain experiments or summaries thereof.  It's a collection of
> like-sized rectangular things with metadata on the two dimensions.  Maybe
> the name could reflect what it holds rather than a common use case?
> AnnotatedMatrixList?

We actually need 2 names: 1 for the parent class, 1 for the child. I'm
starting to think that introducing 2 new names would maybe make the
migration a little bit easier, especially since the plan is to move the
"refactored SummarizedExperiment" to its own package. With 2 new names
we can start the new package, implement the 2 new classes in it, and
have the old SummarizedExperiment (in GenomicRanges) and the 2 new
classes peacefully cohabit during the time of the migration.

Cheers,
H.






--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:(206) 667-1319



Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-03 Thread Peter Haverty
>> I still think GRanges should be a subclass of DataFrame,
>> which would make this easy, but I don't seem to be winning that argument.
>
> Just impossible. As Michael mentioned back in November, they have
> conflicting APIs.


Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
(without mcols) as an index?




Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-03 Thread Tim Triche, Jr.
This.

It would be damned near perfect as a return value for assays coming out of
an object that held several such assays at several time points in a
population, where there are both assay-wise and covariate-wise "holes" that
could nonetheless be usefully imputed across assays.


Statistics is the grammar of science.
Karl Pearson 



Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-03 Thread Peter Haverty
There are some nice similarities in these new imaginary types.  A
"GRangesFrame" is a list of dimensionally identical things (columns) and
some row meta-data (the GRanges).  The SE-like object is similarly a list
of dimensionally like things (matrices, RleDataFrames, BigMatrix objects,
HDF5-backed things) with some row meta-data (a DataFrame or GRangesFrame).
Elegant?  Maybe they would actually be relatives in the class tree.

I wonder if this kind of thing would be easier if we had Java-style
Interfaces or duck-typing.  The "x" slot of "y" holds something that
implements this set of methods ...

Oh, and kinda apropos, the genoset class will probably go away or become an
extension to this new SE-like thing.  The extra stuff that comes along with
genoset will still be available.

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com



Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-03 Thread Michael Lawrence
Should be possible for the annotations to be of any type, as long as they
satisfy a simple contract of NROW() and 2D "[". Then, you could have a
DataFrame, GRanges, or whatever in there. But it would be nice to have a
special class for the container with range information. The contract for
the range annotation would be to have a granges() method.

I agree it would be nice if there was a way with the methods package to
easily assert such contracts. For example, one could define an interface
with a set of generics (and optionally the relevant position in the generic
signature). Then, once all of the methods have been assigned for a
particular class, it is made to inherit from that contract class. There are
lots of gotchas though. Not sure how useful it would be in practice.
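
A rough sketch of what checking such contracts might look like with the
methods package as it stands. The helpers satisfiesRowContract() and
implementsInterface(), the class ToyRanges, and the generic toyGranges()
(standing in for granges()) are hypothetical names made up for this
example; nothing here is an existing Bioconductor facility.

```r
library(methods)

# Hypothetical helper: TRUE if `x` supports NROW() and 2D "[" on its rows.
satisfiesRowContract <- function(x) {
  i <- seq_len(min(NROW(x), 1L))          # first row, or none if empty
  sub <- tryCatch(x[i, , drop = FALSE], error = function(e) NULL)
  !is.null(sub) && NROW(sub) == length(i)
}

# Hypothetical "interface" check: does class `cls` have a method (direct or
# inherited) for every generic named in `generics`?
implementsInterface <- function(cls, generics) {
  all(vapply(generics, function(g) hasMethod(g, cls), logical(1)))
}

# Toy range annotation, with a toy generic playing the role of granges().
setClass("ToyRanges", representation(start = "numeric", end = "numeric"))
setGeneric("toyGranges", function(x) standardGeneric("toyGranges"))
setMethod("toyGranges", "ToyRanges", function(x) x)

satisfiesRowContract(data.frame(a = 1:3))     # data.frame: rows subsettable
satisfiesRowContract(matrix(1:6, nrow = 3))   # matrix: rows subsettable
satisfiesRowContract(list(1, 2))              # plain list: no 2D "["

implementsInterface("ToyRanges", "toyGranges")
implementsInterface("numeric", "toyGranges")
```

This only tests the contract at run time on a given object or class; it
does not make a class inherit from a contract class once its methods are
defined, which is the part that would need new support in methods.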




Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Hervé Pagès

GRangesFrame is an interesting idea and I gave it some thought.

There is this nice symmetry between GRanges and GRangesFrame:

- GRanges = a naked GRanges + a DataFrame accessible via mcols()

- GRangesFrame = a DataFrame + a naked GRanges accessible via
 some accessor (e.g. rowRanges())

So GRanges and GRangesFrame are equivalent in terms of what they
can hold, but different in terms of API: the former has the ranges
API as primary API and the DataFrame API on its mcols() component,
and the latter has the DataFrame API as primary API and the ranges
API on its rowRanges() component. Nice switch!

What does this API switch bring us? A GRangesFrame object is now
an object that fully behaves like a DataFrame and people can also
perform range-based operations on its rowRanges() component.
Here is what I'm afraid is going to happen: people will also want
to be able to perform range-based operations *directly* on
these objects, i.e. without having to call rowRanges() first.
So for example when they do subsetByOverlaps(), subsetting
happens vertically. Also the Hits object returned by findOverlaps()
would contain row indices. Problem with this is that these objects
now start to suffer from the "dual personality syndrome". For
example, it's not clear anymore what their length should be.
Strictly speaking it should be their number of columns (that's
what the length of a DataFrame is), but the ranges API that
we're trying to put on them also makes them feel like vectors
along the vertical dimension so it also feels that their length
should be their number of rows. Same thing with 1D subsetting.
Why does it subset the columns and not the rows? Most people
are now confused.

It's interesting to note that the same thing happens with GRanges
objects, but in the opposite direction: people wish they could
do DataFrame operations directly on them without calling mcols()
first. But in order to preserve the good health of GRanges objects,
we've not done that (except for $, a shortcut for mcols(x)$,
the pressure was just too strong).

H.
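
The length and 1D-subsetting ambiguity described above can be seen with a
base R data.frame, used here only as a stand-in for DataFrame (an
assumption for illustration; the variable names are made up). A data.frame
is a list of columns, while a vector of ranges would have one element per
row.

```r
# Base R data.frame standing in for a DataFrame.
df <- data.frame(score = c(0.1, 0.5, 0.9),
                 gene  = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

length(df)   # counts columns, not rows
df[1]        # 1D subsetting selects a column
nrow(df)     # rows need the 2D API

# A vector-like object with one element per row (think: range starts).
starts <- c(100, 200, 300)
length(starts)   # counts rows
starts[1]        # 1D subsetting selects a row
```

An object that tried to be both at once would have to pick one answer for
length() and for x[1], which is the conflict described in the message.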



Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Hector Corrada Bravo
May I advocate for 'IndexedDataFrame' or 'IndexedFrame'? 'rowIndices' can
return whatever makes sense: GRanges, or other data structures (a taxonomy
for metagenomics, for example). GRangesFrame can inherit from this.


Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Vincent Carey
I am a bit concerned about any major alterations to the
SummarizedExperiment API.  We have two papers and plenty of working code
that use it in meaningful ways.  Effort required to keep new formulations
back-compatible as well as bug-free has to be weighed seriously.

I agree that the name is not ideal.  We are learning as we go.

Seems to make sense to start with the contracts we want the instances of a
class to satisfy.  I have long felt that X[i, j] idiom is one users and
developers should be comfortable with, even insist on, and for consistency
with matrix operations idiom, it should work in a natural way for numeric
indexing.  This seems like an important constraint.  subsetBy* is a useful
idiom, but it is conceivable that we would adopt filter() for row-oriented
selections and select() for column-oriented selections.  Do we have to make
any special design considerations to allow very smooth interoperation with
out-of-memory resources for certain components for developers who want to
allow this?

We should have a reasonable way to get data on what is out there, what is
used, how it is most effectively used.  What's the SE API?  Is it
well-adapted to requirements of DESeq2?  Other killer packages that
use/don't use it?  Even getting data on the formal API for a class is not
all that familiar.  And if folks are writing non-S4 interfaces (i.e., naked
functions) we have no way of identifying them.  See below for one way of
discovering the API for SummarizedExperiment.

In summary, I think we have to be careful about overdesigning too early.
Getting clear on contracts seems the best way to ensure reuse, and we
really want that so that reliability is continually assessed.  My sense is
that it is good to give developers something they'll gladly extend, not
necessarily reuse directly.  So we don't have to have broad consensus on
class details, but on the minimal abstraction and on obligatory tests on
its basic implementation.

> methods(class="SummarizedExperiment")  # perhaps an obsolete version of methods cataloguer by MTM

DataFrame with 76 rows and 3 columns

         generic  signature                                                                 package
1              [  x="SummarizedExperiment", i="ANY", j="ANY", drop="ANY"                    base
2              [  x="SummarizedExperiment", i="ANY", j="missing", value="ANY"               base
3              [  x="SummarizedExperiment", i="ANY", j="missing"                            base
4            [<-  x="SummarizedExperiment", i="ANY", j="ANY", value="SummarizedExperiment"  base
5          assay  x="SummarizedExperiment", i="character"                                   GenomicRanges
...          ...  ...                                                                       ...
72  updateObject  object="SummarizedExperiment"                                             BiocGenerics
73        values  x="SummarizedExperiment"                                                  S4Vectors
74      values<-  x="SummarizedExperiment"                                                  S4Vectors
75         width  x="SummarizedExperiment"                                                  BiocGenerics
76       width<-  x="SummarizedExperiment"                                                  BiocGenerics
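
As an illustration of the X[i, j] contract discussed above, here is a
minimal sketch of a container whose "[" method subsets every assay and both
annotation tables consistently. ToyContainer and its slots are hypothetical
and exist only for this example; this is not how SummarizedExperiment is
implemented.

```r
library(methods)

# Toy container (hypothetical): a list of equally sized matrices plus
# per-row and per-column annotation tables.
setClass("ToyContainer",
         representation(assays  = "list",
                        rowInfo = "data.frame",
                        colInfo = "data.frame"))

# X[i, j]: subset rows and columns of every component in one step.
setMethod("[", "ToyContainer", function(x, i, j, ..., drop = FALSE) {
  if (missing(i)) i <- seq_len(nrow(x@rowInfo))
  if (missing(j)) j <- seq_len(nrow(x@colInfo))
  new("ToyContainer",
      assays  = lapply(x@assays, function(a) a[i, j, drop = FALSE]),
      rowInfo = x@rowInfo[i, , drop = FALSE],
      colInfo = x@colInfo[j, , drop = FALSE])
})

x <- new("ToyContainer",
         assays  = list(counts = matrix(1:12, nrow = 4)),
         rowInfo = data.frame(id = paste0("r", 1:4), stringsAsFactors = FALSE),
         colInfo = data.frame(sample = c("a", "b", "c"),
                              stringsAsFactors = FALSE))

sub <- x[2:3, 1]          # numeric indexing, as for a matrix
dim(sub@assays$counts)    # rows and columns of the subsetted assay
sub@rowInfo$id            # row annotation stays in step with the assay
```

Keeping the whole contract in one method like this is also what would let a
test suite exercise it on any extension of the class.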


Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Tim Triche, Jr.
Oh, I don't disagree.  Perhaps the two problems can be addressed
simultaneously by

1) deciding on what contracts a multi-assay container can/would demand to
be useful
2) calling it something besides SummarizedExperiment, say,
ExperimentCollection

Then the SE API could stay the same as it is (which is already very useful)
and progress could be sought in the offshoot (ExperimentCollection or
whatever) without breaking things that rely on SE.

Just off the top of my head, a most generically useful container for DNA
methylation & CNV data (which can of course be called from the same assay)
is Kasper & JP's GenomicRatioSet, which already has some weird quirks for
eSet backwards compatibility.  (e.g. sampleNames(x) works, but
sampleNames(x) <- does not work; pData(x) calls colData(x); fData(x) calls
rowData(x))  There are little niggles that I should probably just send in a
patch for, but a cleaner overall container would be better, if for no other
reason than the aforementioned ability to easily experiment with
imputation. An approach that I've been using is to stuff the SNPs, CNV (as
GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE).  This is...
somewhat less than optimal, especially when subsetting.

But it does suggest that I could define a coercion from the current
rambling wreck into a nice clean new class/API (ExperimentCollection or
whatever) and I'll bet other package authors could, too.  The presence of a
GRangesFrame would then be handy for returning a given assay's results, so
that the user could be blissfully ignorant of the storage backing (ff,
BigMatrix, Matrix, matrix, Rle, whatever) but not lose the data management
advantages of a SummarizedExperiment.
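
A rough sketch of what such a coercion could look like with setAs(), assuming two made-up classes: LegacyContainer (one primary assay plus a free-form list of extras, standing in for the exptData approach) and ToyCollection (a plain named list of assays). Neither class exists in any package; the names are for illustration only.

```r
library(methods)

# Hypothetical "current" container: one assay plus extras stuffed in a list.
setClass("LegacyContainer",
         representation(assay = "matrix", extras = "list"))

# Hypothetical "clean" container: every assay is a first-class list element.
setClass("ToyCollection", representation(assays = "list"))

# Coercion from the old layout to the new one.
setAs("LegacyContainer", "ToyCollection", function(from) {
  new("ToyCollection",
      assays = c(list(primary = from@assay), from@extras))
})

samples <- c("s1", "s2")
old <- new("LegacyContainer",
           assay  = matrix(1:4, nrow = 2, dimnames = list(NULL, samples)),
           extras = list(cnv = matrix(0, nrow = 3, ncol = 2,
                                      dimnames = list(NULL, samples))))

converted <- as(old, "ToyCollection")
names(converted@assays)
```

Once the extras are ordinary assays, subsetting by sample can be applied to every element in the same way, which is the part that is awkward while they live in a side list.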

JMHO







Statistics is the grammar of science.
Karl Pearson 

On Wed, Mar 4, 2015 at 6:40 AM, Vincent Carey 
wrote:

>  I am a bit concerned about any major alterations to the
> SummarizedExperiment API.  We have
> two papers and plenty of working code that use it in meaningful ways.
> Effort required to keep new
> formulations back-compatible as well as bug-free has to be weighed
> seriously.
>
>  I agree that the name is not ideal.  We are learning as we go.
>
>  Seems to make sense to start with the contracts we want the instances of
> a class to satisfy.  I have long felt
> that X[i, j] idiom is one users and developers should be comfortable with,
> even insist on, and for consistency
> with matrix operations idiom, it should work in a natural way for numeric
> indexing.  This seems like an important
> constraint.  subsetBy* is a useful idiom, but it is conceivable that we
> would adopt filter() for row-oriented selections
> and select() for column-oriented selections.  Do we have to make any
> special design considerations to allow
> very smooth interoperation with out-of-memory resources for certain
> components for developers who want to allow this?
>
>  We should have a reasonable way to get data on what is out there, what
> is used, how it is most effectively used.
> What's the SE API?  Is it well-adapted to requirements of DESeq2?  Other
> killer packages that use/don't use it?
> Even getting data on the formal API for a class is not all that familiar.
> And if folks are writing non-S4 interfaces (i.e., naked
> functions) we have no way of identifying them.  See below for one way of
> discovering the API for SummarizedExperiment.
>
>  In summary, I think we have to be careful about overdesigning too
> early.  Getting clear on contracts seems the best
> way to ensure reuse, and we really want that so that reliability is
> continually assessed.  My sense is that it is good
> to give developers something they'll gladly extend, not necessarily reuse
> directly.  So we don't have to have
> broad consensus on class details, but on the minimal abstraction and on
> obligatory tests on its basic implementation.
>
> > methods(class="SummarizedExperiment")  # perhaps an obsolete version of
> methods cataloguer by MTM
>
> DataFrame with 76 rows and 3 columns
>
>  generic
>   signature   package
>
>  
>
>
> 1  [   x="SummarizedExperiment", i="ANY",
> j="ANY", drop="ANY"  base
>
> 2  [  x="SummarizedExperiment", i="ANY",
> j="missing", value="ANY"  base
>
> 3  [   x="SummarizedExperiment",
> i="ANY", j="missing"  base
>
> 4[<- x="SummarizedExperiment", i="ANY", j="ANY",
> value="SummarizedExperiment"  base
>
> 5  assay
> x="SummarizedExperiment", i="character" GenomicRanges
>
> ...  ...
> ...   ...
>
> 72  updateObject
> object="SummarizedExperiment"  BiocGenerics
>
> 73values
> x="SummarizedExperiment" S4Vectors
>
> 74  values<-
> x="SummarizedExperiment" S4Vectors
>
> 75 width
> x="SummarizedExperiment"  BiocGenerics
>
> 76   width<-
> x="SummarizedExperime

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Vincent Carey
On Wed, Mar 4, 2015 at 12:01 PM, Robert Castelo 
wrote:

> some of the goals behind this discussion are IMO similar to the ones for
> biocMultiAssay:
>
> https://github.com/vjcitn/biocMultiAssay
>
> maybe Vince can confirm.
>


It is true that there are connections between the concerns.  But the way I
see it, the container design we
are talking about in this thread addresses the management of a fixed common
assay type over a fixed set of samples.

The biocMultiAssay deals with the management of multiple assay types over
multiple samples, with possible
disparities in sample sets over the different assay types.
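
To illustrate the distinction with plain matrices (a sketch only; this is not the biocMultiAssay API, and all object names are made up): a single assay over a fixed sample set is one features-by-samples matrix, while multiple assay types with differing sample sets need an explicit alignment step.

```r
# One assay type over one fixed set of samples.
expr <- matrix(1:6, nrow = 2,
               dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))

# Several assay types whose sample sets only partly overlap.
multi <- list(
  expr = expr,
  meth = matrix(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6), nrow = 2,
                dimnames = list(c("cg1", "cg2"), c("s2", "s3", "s4")))
)

# Samples present in every assay.
commonSamples <- Reduce(intersect, lapply(multi, colnames))

# Restrict each assay to the shared samples, in the same order.
aligned <- lapply(multi, function(m) m[, commonSamples, drop = FALSE])
commonSamples
```

Restricting to the intersection is only one possible policy; keeping the union and filling in missing values is another, and the sketch takes no position on which the container should adopt.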



> robert.
>
> On 03/04/2015 05:16 PM, Tim Triche, Jr. wrote:
>
>> Oh, I don't disagree.  Perhaps the two problems can be addressed
>> simultaneously by
>>
>> 1) deciding on what contracts a multi-assay container can/would demand to
>> be useful
>> 2) calling it something besides SummarizedExperiment, say,
>> ExperimentCollection
>>
>> Then the SE API could stay the same as it is (which is already very
>> useful)
>> and progress could be sought in the offshoot (ExperimentCollection or
>> whatever) without breaking things that rely on SE.
>>
>> Just off the top of my head, a most generically useful container for DNA
>> methylation & CNV data (which can of course be called from the same assay)
>> is Kasper & JP's GenomicRatioSet, which already has some weird quirks for
>> eSet backwards compatibility.  (e.g. sampleNames(x) works, but
>> sampleNames(x) <- does not work; pData(x) calls colData(x); fData(x) calls
>> rowData(x))  There are little niggles that I should probably just send in
>> a
>> patch for, but a cleaner overall container would be better, if for no
>> other
>> reason than the aforementioned ability to easily experiment with
>> imputation. An approach that I've been using is to stuff the SNPs, CNV (as
>> GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE).  This is...
>> somewhat less than optimal, especially when subsetting.
>>
>> But it does suggest that I could define a coercion from the current
>> rambling wreck into a nice clean new class/API (ExperimentCollection or
>> whatever) and I'll bet other package authors could, too.  The presence of
>> a
>> GRangesFrame would then be handy for returning a given assay's results, so
>> that the user could be blissfully ignorant of the storage backing (ff,
>> BigMatrix, Matrix, matrix, Rle, whatever) but not lose the data management
>> advantages of a SummarizedExperiment.
>>
>> JMHO
>>
>>
>>
>>
>>
>>
>>
>> Statistics is the grammar of science.
>> Karl Pearson
>>
>>
>> On Wed, Mar 4, 2015 at 6:40 AM, Vincent Carey
>> wrote:
>>
>>I am a bit concerned about any major alterations to the
>>> SummarizedExperiment API.  We have
>>> two papers and plenty of working code that use it in meaningful ways.
>>> Effort required to keep new
>>> formulations back-compatible as well as bug-free has to be weighed
>>> seriously.
>>>
>>>   I agree that the name is not ideal.  We are learning as we go.
>>>
>>>   Seems to make sense to start with the contracts we want the instances
>>> of
>>> a class to satisfy.  I have long felt
>>> that X[i, j] idiom is one users and developers should be comfortable
>>> with,
>>> even insist on, and for consistency
>>> with matrix operations idiom, it should work in a natural way for numeric
>>> indexing.  This seems like an important
>>> constraint.  subsetBy* is a useful idiom, but it is conceivable that we
>>> would adopt filter() for row-oriented selections
>>> and select() for column-oriented selections.  Do we have to make any
>>> special design considerations to allow
>>> very smooth interoperation with out-of-memory resources for certain
>>> components for developers who want to allow this?
>>>
>>>   We should have a reasonable way to get data on what is out there, what
>>> is used, how it is most effectively used.
>>> What's the SE API?  Is it well-adapted to requirements of DESeq2?  Other
>>> killer packages that use/don't use it?
>>> Even getting data on the formal API for a class is not all that familiar.
>>> And if folks are writing non-S4 interfaces (i.e., naked
>>> functions) we have no way of identifying them.  See below for one way of
>>> discovering the API for SummarizedExperiment.
>>>
>>>   In summary, I think we have to be careful about overdesigning too
>>> early.  Getting clear on contracts seems the best
>>> way to ensure reuse, and we really want that so that reliability is
>>> continually assessed.  My sense is that it is good
>>> to give developers something they'll gladly extend, not necessarily reuse
>>> directly.  So we don't have to have
>>> broad consensus on class details, but on the minimal abstraction and on
>>> obligatory tests on its basic implementation.
>>>
>>> > methods(class="SummarizedExperiment")  # perhaps an obsolete version of
>>> methods cataloguer by MTM
>>>
>>> DataFrame with 76 rows and 3 columns
>>>
>>>  

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Tim Triche, Jr.
so I'm told:

https://github.com/vjcitn/biocMultiAssay/blob/master/R/triche.R



Statistics is the grammar of science.
Karl Pearson 

On Wed, Mar 4, 2015 at 9:01 AM, Robert Castelo 
wrote:

> some of the goals behind this discussion are IMO similar to the ones for
> biocMultiAssay:
>
> https://github.com/vjcitn/biocMultiAssay
>
> maybe Vince can confirm.
>
> robert.
>
> On 03/04/2015 05:16 PM, Tim Triche, Jr. wrote:
> > Oh, I don't disagree.  Perhaps the two problems can be addressed
> > simultaneously by
> >
> > 1) deciding on what contracts a multi-assay container can/would demand to
> > be useful
> > 2) calling it something besides SummarizedExperiment, say,
> > ExperimentCollection
> >
> > Then the SE API could stay the same as it is (which is already very
> useful)
> > and progress could be sought in the offshoot (ExperimentCollection or
> > whatever) without breaking things that rely on SE.
> >
> > Just off the top of my head, a most generically useful container for DNA
> > methylation & CNV data (which can of course be called from the same
> > assay)
> > is Kasper & JP's GenomicRatioSet, which already has some weird quirks for
> > eSet backwards compatibility.  (e.g. sampleNames(x) works, but
> > sampleNames(x) <- does not work; pData(x) calls colData(x); fData(x) calls
> > rowData(x))  There are little niggles that I should probably just send
> in a
> > patch for, but a cleaner overall container would be better, if for no
> other
> > reason than the aforementioned ability to easily experiment with
> > imputation. An approach that I've been using is to stuff the SNPs, CNV
> (as
> > GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE).  This is...
> > somewhat less than optimal, especially when subsetting.
> >
> > But it does suggest that I could define a coercion from the current
> > rambling wreck into a nice clean new class/API (ExperimentCollection or
> > whatever) and I'll bet other package authors could, too.  The presence
> of a
> > GRangesFrame would then be handy for returning a given assay's results,
> so
> > that the user could be blissfully ignorant of the storage backing (ff,
> > BigMatrix, Matrix, matrix, Rle, whatever) but not lose the data
> management
> > advantages of a SummarizedExperiment.
> >
> > JMHO
> >
> >
> >
> >
> >
> >
> >
> > Statistics is the grammar of science.
> > Karl Pearson
> >
> > On Wed, Mar 4, 2015 at 6:40 AM, Vincent Carey
> > wrote:
> >
> >>   I am a bit concerned about any major alterations to the
> >> SummarizedExperiment API.  We have
> >> two papers and plenty of working code that use it in meaningful ways.
> >> Effort required to keep new
> >> formulations back-compatible as well as bug-free has to be weighed
> >> seriously.
> >>
> >>   I agree that the name is not ideal.  We are learning as we go.
> >>
> >>   Seems to make sense to start with the contracts we want the instances
> of
> >> a class to satisfy.  I have long felt
> >> that X[i, j] idiom is one users and developers should be comfortable
> with,
> >> even insist on, and for consistency
> >> with matrix operations idiom, it should work in a natural way for
> numeric
> >> indexing.  This seems like an important
> >> constraint.  subsetBy* is a useful idiom, but it is conceivable that we
> >> would adopt filter() for row-oriented selections
> >> and select() for column-oriented selections.  Do we have to make any
> >> special design considerations to allow
> >> very smooth interoperation with out-of-memory resources for certain
> >> components for developers who want to allow this?
> >>
> >>   We should have a reasonable way to get data on what is out there, what
> >> is used, how it is most effectively used.
> >> What's the SE API?  Is it well-adapted to requirements of DESeq2?  Other
> >> killer packages that use/don't use it?
> >> Even getting data on the formal API for a class is not all that
> familiar.
> >> And if folks are writing non-S4 interfaces (i.e., naked
> >> functions) we have no way of identifying them.  See below for one way of
> >> discovering the API for SummarizedExperiment.
> >>
> >>   In summary, I think we have to be careful about overdesigning too
> >> early.  Getting clear on contracts seems the best
> >> way to ensure reuse, and we really want that so that reliability is
> >> continually assessed.  My sense is that it is good
> >> to give developers something they'll gladly extend, not necessarily
> reuse
> >> directly.  So we don't have to have
> >> broad consensus on class details, but on the minimal abstraction and on
> >> obligatory tests on its basic implementation.
> >>
> >>> methods(class="SummarizedExperiment")  # perhaps an obsolete version of
> >> methods cataloguer by MTM
> >>
> >> DataFrame with 76 rows and 3 columns
> >>
> >>   generic
> >>signature   package
> >>
> >>   
> >>  
> >>
> >> 1  [  

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Michael Lawrence
I think we need to make sure that there are enough benefits of something
like GRangesFrame before we introduce yet another complicated and
overlapping data structure into the framework. Prior to summarization, the
ranges seem primary, after summarization, it may often make sense for them
to be secondary. But I'm just not sure what we gain from a new data
structure.
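
The "dual personality" concern in Hervé's message quoted below can be seen with a base R data.frame, used here as a stand-in on the assumption that DataFrame follows the same convention (Hervé states that its length is its number of columns). The column names are made up.

```r
# Base R data.frame as a stand-in for a DataFrame-like object.
ranges_like <- data.frame(
  start = c(1L, 10L, 20L, 30L),
  end   = c(5L, 15L, 25L, 35L),
  score = c(0.1, 0.5, 0.9, 0.3)
)

length(ranges_like)   # counts columns (the list-like view)
nrow(ranges_like)     # counts rows (what a ranges-style API would suggest)

cols_1d <- ranges_like[1]     # 1D subsetting selects a column
rows_2d <- ranges_like[1, ]   # 2D subsetting with i selects a row
```

An object that answers to both the DataFrame API and the ranges API has to pick one of these meanings for length() and for 1D "[", and whichever it picks will surprise users of the other API.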

On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès  wrote:

> GRangesFrame is an interesting idea and I gave it some thoughts.
>
> There is this nice symmetry between GRanges and GRangesFrame:
>
> - GRanges = a naked GRanges + a DataFrame accessible via mcols()
>
> - GRangesFrame = a DataFrame + a naked GRanges accessible via
>  some accessor (e.g. rowRanges())
>
> So GRanges and GRangesFrame are equivalent in terms of what they
> can hold, but different in terms of API: the former has the ranges
> API as primary API and the DataFrame API on its mcols() component,
> and the latter has the DataFrame API as primary API and the ranges
> API on its rowRanges() component. Nice switch!
>
> What does this API switch bring us? A GRangesFrame object is now
> an object that fully behaves like a DataFrame and people can also
> perform range-based operations on its rowRanges() component.
> Here is what I'm afraid is going to happen: people will also want
> to be able to perform range-based operations *directly* on
> these objects, i.e. without having to call rowRanges() first.
> So for example when they do subsetByOverlaps(), subsetting
> happens vertically. Also the Hits object returned by findOverlaps()
> would contain row indices. Problem with this is that these objects
> now start to suffer from the "dual personality syndrome". For
> example, it's not clear anymore what their length should be.
> Strictly speaking it should be their number of columns (that's
> what the length of a DataFrame is), but the ranges API that
> we're trying to put on them also makes them feel like vectors
> along the vertical dimension so it also feels that their length
> should be their number of rows. Same thing with 1D subsetting.
> Why does it subset the columns and not the rows? Most people
> are now confused.
>
> It's interesting to note that the same thing happens with GRanges
> objects, but in the opposite direction: people wish they could
> do DataFrame operations directly on them without calling mcols()
> first. But in order to preserve the good health of GRanges objects,
> we've not done that (except for $, a shortcut for mcols(x)$,
> the pressure was just too strong).
>
> H.
>
>
>
> On 03/03/2015 04:35 PM, Michael Lawrence wrote:
>
>> Should be possible for the annotations to be of any type, as long as they
>> satisfy a simple contract of NROW() and 2D "[". Then, you could have a
>> DataFrame, GRanges, or whatever in there. But it would be nice to have a
>> special class for the container with range information. The contract for
>> the range annotation would be to have a granges() method.
>>
>> I agree it would be nice if there was a way with the methods package to
>> easily assert such contracts. For example, one could define an interface
>> with a set of generics (and optionally the relevant position in the
>> generic
>> signature). Then, once all of the methods have been assigned for a
>> particular class, it is made to inherit from that contract class. There
>> are
>> lots of gotchas though. Not sure how useful it would be in practice.
>>
>>
>> On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty 
>> wrote:
>>
>>  There are some nice similarities in these new imaginary types.  A
>>> "GRangesFrame" is a list of dimensionally identical things (columns) and
>>> some row meta-data (the GRanges).  The SE-like object is similarly a list
>>> of dimensionally like things (matrices, RleDataFrames, BigMatrix objects,
>>> HDF5-backed things) with some row meta-data (a DataFrame or
>>> GRangesFrame).
>>> Elegant?  Maybe they would actually be relatives in the class tree.
>>>
>>> I wonder if this kind of thing would be easier if we had Java-style
>>> Interfaces or duck-typing.  The "x" slot of "y" holds something that
>>> implements this set of methods ...
>>>
>>> Oh, and kinda apropos, the genoset class will probably go away or become
>>> an extension to this new SE-like thing.  The extra stuff that comes along
>>> with genoset will still be available.
>>>
>>> Pete
>>>
>>> 
>>> Peter M. Haverty, Ph.D.
>>> Genentech, Inc.
>>> phave...@gene.com
>>>
>>> On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. 
>>> wrote:
>>>
>>>  This.

>>>> It would be damned near perfect as a return value for assays coming out
>>>> of an object that held several such assays at several time points in a
>>>> population, where there are both assay-wise and covariate-wise "holes"
>>>> that could nonetheless be usefully imputed across assays.
>>>>
>>>>
>>>> Statistics is the grammar of science.
>>>> Karl Pearson
>>

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Peter Haverty
Michael has a good point. The complexity of the BioC universe of classes
hurts our ability to attract new users. More classes would be a minus there
... but a small set of common, explicit APIs would simplify things.
Rectangular things implement the matrix Interface.  :-) Deprecating old
stuff, like eSet, might help more than it hurts, on the simplicity front.
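
As a sketch of how small such an explicit API could be, here is a check for the row-annotation contract Michael proposed earlier in the thread (NROW() plus 2D "["). The function name hasRowContract is hypothetical, and the check is run-time duck typing rather than anything the methods package enforces.

```r
# Returns TRUE if x supports NROW() and row-wise 2D subsetting.
hasRowContract <- function(x) {
  tryCatch({
    n <- NROW(x)
    i <- seq_len(min(n, 1L))                 # first row, if there is one
    NROW(x[i, , drop = FALSE]) == length(i)  # 2D "[" must keep the rows asked for
  }, error = function(e) FALSE)
}

hasRowContract(matrix(1:6, nrow = 2))        # a matrix qualifies
hasRowContract(data.frame(a = 1:3, b = 4:6)) # so does a data.frame
hasRowContract(1:5)                          # a bare vector has no 2D "["
```

A check like this could run in a validity method, so any object passing it would be accepted as row annotation without a new class being declared for it.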

P.S. apropos of understanding this universe of classes, I *love* the
methods(class=x) thing Vincent mentioned.

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence 
wrote:

> I think we need to make sure that there are enough benefits of something
> like GRangesFrame before we introduce yet another complicated and
> overlapping data structure into the framework. Prior to summarization, the
> ranges seem primary, after summarization, it may often make sense for them
> to be secondary. But I'm just not sure what we gain from a new data
> structure.
>
> On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès wrote:
>
>> GRangesFrame is an interesting idea and I gave it some thoughts.
>>
>> There is this nice symmetry between GRanges and GRangesFrame:
>>
>> - GRanges = a naked GRanges + a DataFrame accessible via mcols()
>>
>> - GRangesFrame = a DataFrame + a naked GRanges accessible via
>>  some accessor (e.g. rowRanges())
>>
>> So GRanges and GRangesFrame are equivalent in terms of what they
>> can hold, but different in terms of API: the former has the ranges
>> API as primary API and the DataFrame API on its mcols() component,
>> and the latter has the DataFrame API as primary API and the ranges
>> API on its rowRanges() component. Nice switch!
>>
>> What does this API switch bring us? A GRangesFrame object is now
>> an object that fully behaves like a DataFrame and people can also
>> perform range-based operations on its rowRanges() component.
>> Here is what I'm afraid is going to happen: people will also want
>> to be able to perform range-based operations *directly* on
>> these objects, i.e. without having to call rowRanges() first.
>> So for example when they do subsetByOverlaps(), subsetting
>> happens vertically. Also the Hits object returned by findOverlaps()
>> would contain row indices. Problem with this is that these objects
>> now start to suffer from the "dual personality syndrome". For
>> example, it's not clear anymore what their length should be.
>> Strictly speaking it should be their number of columns (that's
>> what the length of a DataFrame is), but the ranges API that
>> we're trying to put on them also makes them feel like vectors
>> along the vertical dimension so it also feels that their length
>> should be their number of rows. Same thing with 1D subsetting.
>> Why does it subset the columns and not the rows? Most people
>> are now confused.
>>
>> It's interesting to note that the same thing happens with GRanges
>> objects, but in the opposite direction: people wish they could
>> do DataFrame operations directly on them without calling mcols()
>> first. But in order to preserve the good health of GRanges objects,
>> we've not done that (except for $, a shortcut for mcols(x)$,
>> the pressure was just too strong).
>>
>> H.
>>
>>
>>
>> On 03/03/2015 04:35 PM, Michael Lawrence wrote:
>>
>>> Should be possible for the annotations to be of any type, as long as they
>>> satisfy a simple contract of NROW() and 2D "[". Then, you could have a
>>> DataFrame, GRanges, or whatever in there. But it would be nice to have a
>>> special class for the container with range information. The contract for
>>> the range annotation would be to have a granges() method.
>>>
>>> I agree it would be nice if there was a way with the methods package to
>>> easily assert such contracts. For example, one could define an interface
>>> with a set of generics (and optionally the relevant position in the
>>> generic
>>> signature). Then, once all of the methods have been assigned for a
>>> particular class, it is made to inherit from that contract class. There
>>> are
>>> lots of gotchas though. Not sure how useful it would be in practice.
>>>
>>>
>>> On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty 
>>> wrote:
>>>
>>>  There are some nice similarities in these new imaginary types.  A
>>>> "GRangesFrame" is a list of dimensionally identical things (columns) and
>>>> some row meta-data (the GRanges).  The SE-like object is similarly a
>>>> list of dimensionally like things (matrices, RleDataFrames, BigMatrix
>>>> objects, HDF5-backed things) with some row meta-data (a DataFrame or
>>>> GRangesFrame).
>>>> Elegant?  Maybe they would actually be relatives in the class tree.
>>>>
>>>> I wonder if this kind of thing would be easier if we had Java-style
>>>> Interfaces or duck-typing.  The "x" slot of "y" holds something that
>>>> implements this set of methods ...
>>>>
>>>> Oh, and kinda apropos, the genoset class will probably go away or become
>>>> an extension to thi

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Tim Triche, Jr.
What complexity?  The Nature Methods paper laid it out: for most people,
most of the time, use an SE.

That way, the organization of metadata and covariates is enforced for you,
like an ExpressionSet (another winning data structure) but without its
baggage.

Maybe the "Summarized" in the name isn't such a bad idea after all.
 "AfterTheDataMungingIsDone" doesn't have the same ring to it.

What would be equally awesome IMHO is to have a similarly unifying
structure for integrative work.

But that's just, like, my opinion.  I've taken a whack at it when I knew
even less than I do now, and it's hard.  However, data management for
expression arrays was hard, too.  If I'm not mistaken, there were benefits
to solving that data management problem, too.  Some sort of a software
project.  I think it was called "MADMAN".  I'll have to go look.  ;-)



Statistics is the grammar of science.
Karl Pearson 

On Wed, Mar 4, 2015 at 10:03 AM, Peter Haverty 
wrote:

>  Michael has a good point. The complexity of the BioC universe of classes
> hurts our ability to attract new users. More classes would be a minus there
> ... but a small set of common, explicit APIs would simplify things.
> Rectangular things implement the matrix Interface.  :-) Deprecating old
> stuff, like eSet, might help more than it hurts, on the simplicity front.
>
>  P.S. apropos of understanding this universe of classes, I *love* the
> methods(class=x) thing Vincent mentioned.
>
>  Pete
>
> 
> Peter M. Haverty, Ph.D.
> Genentech, Inc.
> phave...@gene.com
>
> On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence <
> lawrence.mich...@gene.com> wrote:
>
>> I think we need to make sure that there are enough benefits of something
>> like GRangesFrame before we introduce yet another complicated and
>> overlapping data structure into the framework. Prior to summarization, the
>> ranges seem primary, after summarization, it may often make sense for them
>> to be secondary. But I'm just not sure what we gain from a new data
>> structure.
>>
>> On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès 
>> wrote:
>>
>>> GRangesFrame is an interesting idea and I gave it some thoughts.
>>>
>>> There is this nice symmetry between GRanges and GRangesFrame:
>>>
>>> - GRanges = a naked GRanges + a DataFrame accessible via mcols()
>>>
>>> - GRangesFrame = a DataFrame + a naked GRanges accessible via
>>>  some accessor (e.g. rowRanges())
>>>
>>> So GRanges and GRangesFrame are equivalent in terms of what they
>>> can hold, but different in terms of API: the former has the ranges
>>> API as primary API and the DataFrame API on its mcols() component,
>>> and the latter has the DataFrame API as primary API and the ranges
>>> API on its rowRanges() component. Nice switch!
>>>
>>> What does this API switch bring us? A GRangesFrame object is now
>>> an object that fully behaves like a DataFrame and people can also
>>> perform range-based operations on its rowRanges() component.
>>> Here is what I'm afraid is going to happen: people will also want
>>> to be able to perform range-based operations *directly* on
>>> these objects, i.e. without having to call rowRanges() first.
>>> So for example when they do subsetByOverlaps(), subsetting
>>> happens vertically. Also the Hits object returned by findOverlaps()
>>> would contain row indices. Problem with this is that these objects
>>> now start to suffer from the "dual personality syndrome". For
>>> example, it's not clear anymore what their length should be.
>>> Strictly speaking it should be their number of columns (that's
>>> what the length of a DataFrame is), but the ranges API that
>>> we're trying to put on them also makes them feel like vectors
>>> along the vertical dimension so it also feels that their length
>>> should be their number of rows. Same thing with 1D subsetting.
>>> Why does it subset the columns and not the rows? Most people
>>> are now confused.
>>>
>>> It's interesting to note that the same thing happens with GRanges
>>> objects, but in the opposite direction: people wish they could
>>> do DataFrame operations directly on them without calling mcols()
>>> first. But in order to preserve the good health of GRanges objects,
>>> we've not done that (except for $, a shortcut for mcols(x)$,
>>> the pressure was just too strong).
>>>
>>> H.
>>>
>>>
>>>
>>> On 03/03/2015 04:35 PM, Michael Lawrence wrote:
>>>
>>>> Should be possible for the annotations to be of any type, as long as
>>>> they satisfy a simple contract of NROW() and 2D "[". Then, you could have a
>>>> DataFrame, GRanges, or whatever in there. But it would be nice to have a
>>>> special class for the container with range information. The contract for
>>>> the range annotation would be to have a granges() method.
>>>>
>>>> I agree it would be nice if there was a way with the methods package to
>>>> easily assert such contracts. For example, one could define an interface
>>>> with

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Peter Haverty
Clarification:  the complexity of the full BioC class universe, not the
SE/eSet part. GenomicRanges, GRanges, GRangesList, RangesView,
RangesViewsList, ... I think all of that intimidates new people.  Maybe
that's not generally the case.  Sorry, I've taken this thread way off
topic.  I'll stop now.

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Wed, Mar 4, 2015 at 10:08 AM, Tim Triche, Jr. 
wrote:

> What complexity?  The Nature Methods paper laid it out: for most people,
> most of the time, use an SE.
>
> That way, the organization of metadata and covariates is enforced for you,
> like an ExpressionSet (another winning data structure) but without its
> baggage.
>
> Maybe the "Summarized" in the name isn't such a bad idea after all.
>  "AfterTheDataMungingIsDone" doesn't have the same ring to it.
>
> What would be equally awesome IMHO is to have a similarly unifying
> structure for integrative work.
>
> But that's just, like, my opinion.  I've taken a whack at it when I knew
> even less than I do now, and it's hard.  However, data management for
> expression arrays was hard, too.  If I'm not mistaken, there were benefits
> to solving that data management problem, too.  Some sort of a software
> project.  I think it was called "MADMAN".  I'll have to go look.  ;-)
>
>
>
> Statistics is the grammar of science.
> Karl Pearson 
>
> On Wed, Mar 4, 2015 at 10:03 AM, Peter Haverty 
> wrote:
>
>>  Michael has a good point. The complexity of the BioC universe of
>> classes hurts our ability to attract new users. More classes would be a
>> minus there ... but a small set of common, explicit APIs would simplify
>> things.  Rectangular things implement the matrix Interface.  :-)
>> Deprecating old stuff, like eSet, might help more than it hurts, on the
>> simplicity front.
>>
>>  P.S. apropos of understanding this universe of classes, I *love* the
>> methods(class=x) thing Vincent mentioned.
>>
>>  Pete
>>
>> 
>> Peter M. Haverty, Ph.D.
>> Genentech, Inc.
>> phave...@gene.com
>>
>> On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence <
>> lawrence.mich...@gene.com> wrote:
>>
>>> I think we need to make sure that there are enough benefits of something
>>> like GRangesFrame before we introduce yet another complicated and
>>> overlapping data structure into the framework. Prior to summarization, the
>>> ranges seem primary, after summarization, it may often make sense for them
>>> to be secondary. But I'm just not sure what we gain from a new data
>>> structure.
>>>
>>> On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès
>>> wrote:
>>>
 GRangesFrame is an interesting idea and I gave it some thoughts.

 There is this nice symmetry between GRanges and GRangesFrame:

 - GRanges = a naked GRanges + a DataFrame accessible via mcols()

 - GRangesFrame = a DataFrame + a naked GRanges accessible via
  some accessor (e.g. rowRanges())

 So GRanges and GRangesFrame are equivalent in terms of what they
 can hold, but different in terms of API: the former has the ranges
 API as primary API and the DataFrame API on its mcols() component,
 and the latter has the DataFrame API as primary API and the ranges
 API on its rowRanges() component. Nice switch!

 What does this API switch bring us? A GRangesFrame object is now
 an object that fully behaves like a DataFrame and people can also
 perform range-based operations on its rowRanges() component.
 Here is what I'm afraid is going to happen: people will also want
 to be able to perform range-based operations *directly* on
 these objects, i.e. without having to call rowRanges() first.
 So for example when they do subsetByOverlaps(), subsetting
 happens vertically. Also the Hits object returned by findOverlaps()
 would contain row indices. Problem with this is that these objects
 now start to suffer from the "dual personality syndrome". For
 example, it's not clear anymore what their length should be.
 Strictly speaking it should be their number of columns (that's
 what the length of a DataFrame is), but the ranges API that
 we're trying to put on them also makes them feel like vectors
 along the vertical dimension so it also feels that their length
 should be their number of rows. Same thing with 1D subsetting.
 Why does it subset the columns and not the rows? Most people
 are now confused.
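
 To make the length / 1D-subsetting ambiguity concrete, here is a small
 base-R sketch. It uses a plain data.frame as a stand-in for DataFrame
 (an assumption made only so the example runs without Bioconductor);
 the column names are made up for illustration.

```r
# Base-R illustration only; data.frame stands in for DataFrame.
df <- data.frame(score = c(0.1, 0.5, 0.9), gc = c(0.40, 0.55, 0.61))

length(df)   # number of columns: a data.frame is a list of columns
nrow(df)     # number of rows: what a ranges-style API would call the length

# 1D subsetting follows the list view and selects columns ...
one_d <- df[1]
# ... while 2D subsetting with a row index selects rows.
two_d <- df[1, ]
```

 An object that carries both APIs at the top level has to pick one of
 these two meanings for length() and for x[i], which is the "dual
 personality" problem described above.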

 It's interesting to note that the same thing happens with GRanges
 objects, but in the opposite direction: people wish they could
 do DataFrame operations directly on them without calling mcols()
 first. But in order to preserve the good health of GRanges objects,
 we've not done that (except for $, a shortcut for mcols(x)$,
 the pressure was just too strong).

 H.



 On 03/03/2015 04:35 PM, Michael

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Tim Triche, Jr.
My response was meant to address this:

1) fixed-dimension, fixed sample set is a solved problem, and SE is that
solution.
2) multi-assay, "holes" across samples remains an ugly thorny problem,
maybe needs a new API

So why not keep SE as stable as possible, and dump all the explosive
changes into the latter?


Statistics is the grammar of science.
Karl Pearson 

On Wed, Mar 4, 2015 at 9:12 AM, Vincent Carey 
wrote:

>
>
> On Wed, Mar 4, 2015 at 12:01 PM, Robert Castelo 
> wrote:
>
>> some of the goals behind this discussion are IMO similar to the ones for
>> biocMultiAssay:
>>
>> https://github.com/vjcitn/biocMultiAssay
>>
>> maybe Vince can confirm.
>>
>
>
> It is true that there are connections between the concerns.  But the way I
> see it, the container design we
> are talking about in this thread addresses the management of a fixed
> common assay type over a fixed set of samples.
>
> The biocMultiAssay deals with the management of multiple assay types over
> multiple samples, with possible
> disparities in sample sets over the different assay types.
>
>
>
>> robert.
>>

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Robert Castelo
some of the goals behind this discussion are IMO similar to the ones for 
biocMultiAssay:


https://github.com/vjcitn/biocMultiAssay

maybe Vince can confirm.

robert.

On 03/04/2015 05:16 PM, Tim Triche, Jr. wrote:

Oh, I don't disagree.  Perhaps the two problems can be addressed
simultaneously by

1) deciding on what contracts a multi-assay container can/would demand to
be useful
2) calling it something besides SummarizedExperiment, say,
ExperimentCollection

Then the SE API could stay the same as it is (which is already very useful)
and progress could be sought in the offshoot (ExperimentCollection or
whatever) without breaking things that rely on SE.

Just off the top of my head, a most generically useful container for DNA
methylation & CNV data (which can of course be called from the same assay)
is Kasper & JP's GenomicRatioSet, which already has some weird quirks for
eSet backwards compatibility.  (e.g. sampleNames(x) works, but
sampleNames(x)<- does not work; pData(x) calls colData(x); fData(x) calls
rowData(x))  There are little niggles that I should probably just send in a
patch for, but a cleaner overall container would be better, if for no other
reason than the aforementioned ability to easily experiment with
imputation. An approach that I've been using is to stuff the SNPs, CNV (as
GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE).  This is...
somewhat less than optimal, especially when subsetting.

But it does suggest that I could define a coercion from the current
rambling wreck into a nice clean new class/API (ExperimentCollection or
whatever) and I'll bet other package authors could, too.  The presence of a
GRangesFrame would then be handy for returning a given assay's results, so
that the user could be blissfully ignorant of the storage backing (ff,
BigMatrix, Matrix, matrix, Rle, whatever) but not lose the data management
advantages of a SummarizedExperiment.

JMHO







Statistics is the grammar of science.
Karl Pearson

On Wed, Mar 4, 2015 at 6:40 AM, Vincent Carey
wrote:


  I am a bit concerned about any major alterations to the
SummarizedExperiment API.  We have
two papers and plenty of working code that use it in meaningful ways.
Effort required to keep new
formulations back-compatible as well as bug-free has to be weighed
seriously.

  I agree that the name is not ideal.  We are learning as we go.

  Seems to make sense to start with the contracts we want the instances of
a class to satisfy.  I have long felt
that X[i, j] idiom is one users and developers should be comfortable with,
even insist on, and for consistency
with matrix operations idiom, it should work in a natural way for numeric
indexing.  This seems like an important
constraint.  subsetBy* is a useful idiom, but it is conceivable that we
would adopt filter() for row-oriented selections
and select() for column-oriented selections.  Do we have to make any
special design considerations to allow
very smooth interoperation with out-of-memory resources for certain
components for developers who want to allow this?

  We should have a reasonable way to get data on what is out there, what
is used, how it is most effectively used.
What's the SE API?  Is it well-adapted to requirements of DESeq2?  Other
killer packages that use/don't use it?
Even getting data on the formal API for a class is not all that familiar.
And if folks are writing non-S4 interfaces (i.e., naked
functions) we have no way of identifying them.  See below for one way of
discovering the API for SummarizedExperiment.

  In summary, I think we have to be careful about overdesigning too
early.  Getting clear on contracts seems the best
way to ensure reuse, and we really want that so that reliability is
continually assessed.  My sense is that it is good
to give developers something they'll gladly extend, not necessarily reuse
directly.  So we don't have to have
broad consensus on class details, but on the minimal abstraction and on
obligatory tests on its basic implementation.
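
One way to read "obligatory tests on its basic implementation" is a small
predicate that any rectangular container must pass. The sketch below is
only that: satisfies_rect_contract() is a hypothetical helper, not part of
any Bioconductor package, and it checks just the two-dimensional shape and
numeric X[i, j] indexing mentioned above.

```r
# Hypothetical helper, not part of any Bioconductor package.
# Checks a minimal "rectangular" contract: 2D dim() and numeric x[i, j].
satisfies_rect_contract <- function(x) {
  d <- dim(x)
  if (is.null(d) || length(d) != 2L) return(FALSE)
  if (d[1] < 1L || d[2] < 1L) return(TRUE)   # empty object: nothing to index
  sub <- tryCatch(x[1L, 1L, drop = FALSE], error = function(e) NULL)
  !is.null(sub) && identical(dim(sub), c(1L, 1L))
}

satisfies_rect_contract(matrix(1:6, nrow = 2))
satisfies_rect_contract(data.frame(a = 1:2, b = 3:4))
satisfies_rect_contract(1:10)
```

A class author could run such a check in unit tests, so the contract is
asserted on every build rather than agreed on informally.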


methods(class="SummarizedExperiment")  # perhaps an obsolete version of methods cataloguer by MTM

DataFrame with 76 rows and 3 columns
          generic  signature                                                                 package
1               [  x="SummarizedExperiment", i="ANY", j="ANY", drop="ANY"                    base
2               [  x="SummarizedExperiment", i="ANY", j="missing", value="ANY"               base
3               [  x="SummarizedExperiment", i="ANY", j="missing"                            base
4             [<-  x="SummarizedExperiment", i="ANY", j="ANY", value="SummarizedExperiment"  base
5           assay  x="SummarizedExperiment", i="character"                                   GenomicRanges
...           ...  ...                                                                       ...
72   updateObject  object="SummarizedExperiment"                                             BiocGenerics
73         values  x="SummarizedExperiment"                                                  S4Vectors
74       values<-  x="SummarizedExperiment"                                                  S4Vectors
75          width

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Martin Morgan

On 03/04/2015 10:03 AM, Peter Haverty wrote:

Michael has a good point. The complexity of the BioC universe of classes
hurts our ability to attract new users. More classes would be a minus there
... but a small set of common, explicit APIs would simplify things.
Rectangular things implement the matrix Interface.  :-) Deprecating old
stuff, like eSet, might help more than it hurts, on the simplicity front.

P.S. apropos of understanding this universe of classes, I *love* the
methods(class=x) thing Vincent mentioned.


The current version, under R-devel, is at

  devtools::source_gist("https://gist.github.com/mtmorgan/9f98871adb9f0c1891a4")

  > methods(class="SummarizedExperiment")
   [1] [                 [[                [[<-              [<-
   [5] $                 $<-               assay             assay<-
   [9] assayNames        assayNames<-      assays            assays<-
  [13] cbind             coerce            colData           colData<-
  [17] compare           Compare           countOverlaps     coverage
  [21] dim               dimnames          dimnames<-        disjointBins
  [25] distance          distanceToNearest duplicated        elementMetadata
  [29] elementMetadata<- end               end<-             exptData
  [33] exptData<-        extractROWS       findOverlaps      flank
  [37] follow            granges           isDisjoint        mcols
  [41] mcols<-           narrow            nearest           order
  [45] overlapsAny       precede           ranges            ranges<-
  [49] rank              rbind             replaceROWS       resize
  [53] restrict          rowData           rowData<-         seqinfo
  [57] seqinfo<-         seqnames          shift             show
  [61] sort              split             start             start<-
  [65] strand            strand<-          subset            subsetByOverlaps
  [69] updateObject      values            values<-          width
  [73] width<-

  see ?"methods" for accessing help and source code

and

> head(attr(methods(class="SummarizedExperiment"), "info"))
                                                             generic visible
[,SummarizedExperiment,ANY-method                                  [    TRUE
[[,SummarizedExperiment,ANY,missing-method                        [[    TRUE
[[<-,SummarizedExperiment,ANY,missing-method                    [[<-    TRUE
[<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method     [<-    TRUE
$,SummarizedExperiment-method                                      $    TRUE
$<-,SummarizedExperiment-method                                  $<-    TRUE
                                                             isS4          from
[,SummarizedExperiment,ANY-method                            TRUE GenomicRanges
[[,SummarizedExperiment,ANY,missing-method                   TRUE GenomicRanges
[[<-,SummarizedExperiment,ANY,missing-method                 TRUE GenomicRanges
[<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method TRUE GenomicRanges
$,SummarizedExperiment-method                                TRUE GenomicRanges
$<-,SummarizedExperiment-method                              TRUE GenomicRanges
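
For readers without a Bioconductor devel install, the same kind of query
can be tried on a base class. The sketch below assumes R >= 3.2.0, where
methods() returns an object with an "info" attribute; "data.frame" is used
only as a stand-in for SummarizedExperiment.

```r
# Assumes R >= 3.2.0; "data.frame" is only a stand-in class.
m <- methods(class = "data.frame")
info <- attr(m, "info")   # data.frame with one row per method

head(info)

# Generics that have at least one method for the class:
generics <- sort(unique(info$generic))
head(generics)

# Which of those methods are S4 rather than S3:
s4_rows <- info[info$isS4, , drop = FALSE]
nrow(s4_rows)
```

The "from" column shows which package defines each method, which is one
way to see how a class's API is spread across packages.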

Martin




Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-06 Thread Michael Love
hi all,

just a practical issue: I have GenomicRanges version 1.19.42 on my
computer which does not have rowRanges defined, although the 1.19.42
version on the Bioc website does have rowRanges in the man page:

http://master.bioconductor.org/packages/3.1/bioc/html/GenomicRanges.html

So I pass check locally but not in the devel branch on Bioc servers.

> library(GenomicRanges)
> rowRanges
Error: object 'rowRanges' not found
> sessionInfo()
R Under development (unstable) (2014-12-08 r67137)
Platform: x86_64-apple-darwin12.5.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils
[8] methods   base

other attached packages:
[1] GenomicRanges_1.19.42 GenomeInfoDb_1.3.13   IRanges_2.1.41        S4Vectors_0.5.21
[5] BiocGenerics_0.13.6   RUnit_0.4.28          devtools_1.7.0        knitr_1.9
[9] BiocInstaller_1.17.5
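
A version number alone does not show whether a symbol is available, as the
session above illustrates. A minimal sketch of a direct check follows;
has_export() is a hypothetical helper written for this note, and it is
demonstrated on the stats package so that it runs anywhere.

```r
# Hypothetical helper: TRUE if `pkg` is installed and exports `name`.
has_export <- function(pkg, name) {
  requireNamespace(pkg, quietly = TRUE) &&
    name %in% getNamespaceExports(pkg)
}

has_export("stats", "median")
has_export("stats", "noSuchFunctionHere")

# For the situation above one would call:
#   has_export("GenomicRanges", "rowRanges")
```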




Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-06 Thread Valerie Obenchain

Hi Mike,

Our error - we didn't bump GenomicRanges when rowRanges was added. 
Hopefully 1.19.43 will propagate today and things will be sorted out.


Val



Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-09 Thread Kasper Daniel Hansen
It sounds like the proposed changes are already made.  However (like
others) I am still a bit mystified why this was necessary.  The old version
did allow for a GRanges inside the DataFrame of the rowData, as far as I
recall.  So I assume this is for efficiency.  But why?  What kind of
data/use cases is this for?

I am happy to hear that SummarizedExperiment is going to be spun out into
its own package.  When that happens, I have some comments, which I'll
include here in anticipation:
  1) I now very strongly believe it was a design mistake to not have
colnames on the assays.  The advantage of this choice is that sampleNames
are only stored in one place.  The extreme disadvantage is the high
inefficiency when you want colnames on an extracted assay.
  2) I still strongly believe we should support pData, sampleNames etc etc
on SummarizedExperiments.
  3) Having developed a package (minfi) where eSets co-exist with
SummarizedExperiment, I have to mention that for the developer there are a
number of places where the different internals of these two classes make
life irritating.  For this reason I would support a "modern" implementation
of eSet, in parallel with SummarizedExperiment.

Best,
Kasper

On Fri, Mar 6, 2015 at 10:59 AM, Valerie Obenchain 
wrote:

> Hi Mike,
>
> Our error - we didn't bump GenomicRanges when rowRanges was added.
> Hopefully 1.19.43 will propagate today and things will be sorted out.
>
> Val
>
>

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-09 Thread Kasper Daniel Hansen
On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey 
wrote:

> I am glad you are keeping this discussion alive Kasper.
>
> On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
> kasperdanielhan...@gmail.com> wrote:
>
>> It sounds like the proposed changes are already made.  However (like
>> others) I am still a bit mystified why this was necessary.  The old
>> version
>> did allow for a GRanges inside the DataFrame of the rowData, as far as I
>> recall.  So I assume this is for efficiency.  But why?  What kind of
>> data/use cases is this for?
>>
>> I am happy to hear that SummarizedExperiment is going to be spun out into
>> its own package.  When that happens, I have some comments, which I'll
>> include here in anticipation
>>   1) I now very strongly believe it was a design mistake to not have
>> colnames on the assays.  The advantage of this choice is that sampleNames
>> are only stored in one place.  The extreme disadvantage is the high
>> inefficiency when you want colnames on an extracted assay.
>>
>
> after example(SummarizedExperiment)
>
> > colnames(assays(se1)[[1]])
> [1] "A" "B" "C" "D" "E" "F"
>
> so this seems to be optional.  But attempts to set rownames will fail
> silently
>
> > rownames(assays(se1)[[1]]) = as.character(1:200)
>
> > rownames(assays(se1)[[1]])
>
> NULL
> seems we could issue a warning there
>


Vince, you need to be careful here.

The assays are stored without colnames (unless something has recently
changed).  The default is to - upon extraction - set the colnames of the
matrix.  This however requires a copy of the entire matrix.  So
essentially, upon extraction, each assay is needlessly duplicated to add
the colnames.  This is what I mean by inefficient. I would prefer to store
the assays with colnames.  This means that changing sampleNames of the
object will be inefficient (as it is for eSets) since it would require a
complete copy of everything.  But I would rather - much rather - copy when
setting sampleNames than copy when extracting an assay.
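
To illustrate the trade-off, here is a toy sketch in base R. It is not the
real SummarizedExperiment internals; `toy`, `sample_names` and get_assay()
are hypothetical names used only to show where the copy happens.

```r
# Toy container (hypothetical, not the real SummarizedExperiment internals).
toy <- list(
  assay        = matrix(1:6, nrow = 2),   # stored without dimnames
  sample_names = c("A", "B", "C")         # names kept in one place
)

# Extraction decorates the matrix. Because `m` still refers to the matrix
# held by the container, setting colnames makes R duplicate the whole
# matrix first (copy-on-modify).
get_assay <- function(x) {
  m <- x$assay
  colnames(m) <- x$sample_names
  m
}

extracted <- get_assay(toy)
colnames(extracted)
colnames(toy$assay)   # still NULL: the stored matrix is untouched
```

Storing the colnames on the matrix itself would make extraction a plain
lookup, and move the cost of the copy to the (rarer) renaming of samples.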

Best,
Kasper

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-09 Thread Vincent Carey
I am glad you are keeping this discussion alive Kasper.

On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
kasperdanielhan...@gmail.com> wrote:

> It sounds like the proposed changes are already made.  However (like
> others) I am still a bit mystified why this was necessary.  The old version
> did allow for a GRanges inside the DataFrame of the rowData, as far as I
> recall.  So I assume this is for efficiency.  But why?  What kind of
> data/use cases is this for?
>
> I am happy to hear that SummarizedExperiment is going to be spun out into
> its own package.  When that happens, I have some comments, which I'll
> include here in anticipation
>   1) I now very strongly believe it was a design mistake to not have
> colnames on the assays.  The advantage of this choice is that sampleNames
> are only stored one place.  The extreme disadvantage is the high
> inefficiency when you want colnames on an extracted assay.
>

after example(SummarizedExperiment)

> colnames(assays(se1)[[1]])
[1] "A" "B" "C" "D" "E" "F"

so this seems to be optional.  But attempts to set rownames will fail
silently

> rownames(assays(se1)[[1]]) = as.character(1:200)

> rownames(assays(se1)[[1]])

NULL
seems we could issue a warning there

>   2) I still strongly believe we should support pData, sampleNames etc etc
> on SummarizedExperiments.
>

worthy of discussion


>   3) Having developed a package (minfi) where eSets co-exist with
> SummarizedExperiment, I have to mention that for the developer there are a
> number of places where the different internals of these two classes make
> life irritating.  For this reason I would support a "modern" implementation
> of eSet, in parallel with SummarizedExperiment.
>
>
also worthy of further discussion IMHO


> Best,
> Kasper
>
> On Fri, Mar 6, 2015 at 10:59 AM, Valerie Obenchain wrote:
>
> > Hi Mike,
> >
> > Our error - we didn't bump GenomicRanges when rowRanges was added.
> > Hopefully 1.19.43 will propagate today and things will be sorted out.
> >
> > Val
> >
> >
> > On 03/06/2015 07:40 AM, Michael Love wrote:
> >
> >> hi all,
> >>
> >> just a practical issue: I have GenomicRanges version 1.19.42 on my
> >> computer which does not have rowRanges defined, although the 1.19.42
> >> version on the Bioc website does have rowRanges in the man page:
> >>
> >>
> http://master.bioconductor.org/packages/3.1/bioc/html/GenomicRanges.html
> >>
> >> So I pass check locally but not in the devel branch on Bioc servers.
> >>
> >> > library(GenomicRanges)
> >> > rowRanges
> >> Error: object 'rowRanges' not found
> >>
> >> > sessionInfo()
> >> R Under development (unstable) (2014-12-08 r67137)
> >> Platform: x86_64-apple-darwin12.5.0 (64-bit)
> >>
> >> locale:
> >> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> >>
> >> attached base packages:
> >> [1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base
> >>
> >> other attached packages:
> >> [1] GenomicRanges_1.19.42 GenomeInfoDb_1.3.13   IRanges_2.1.41        S4Vectors_0.5.21
> >> [5] BiocGenerics_0.13.6   RUnit_0.4.28          devtools_1.7.0        knitr_1.9
> >> [9] BiocInstaller_1.17.5
> >>
> >>
> >>
> >> On Wed, Mar 4, 2015 at 3:03 PM, Martin Morgan 
> >> wrote:
> >>
> >>>
> >>> On 03/04/2015 10:03 AM, Peter Haverty wrote:
> >>>
> 
>  Michael has a good point. The complexity of the BioC universe of
> classes
>  hurts our ability to attract new users. More classes would be a minus
>  there
>  ... but a small set of common, explicit APIs would simplify things.
>  Rectangular things implement the matrix Interface.  :-) Deprecating
> old
>  stuff, like eSet, might help more than it hurts, on the simplicity
>  front.
> 
>  P.S. apropos of understanding this universe of classes, I *love* the
>  methods(class=x) thing Vincent mentioned.
> 
> >>>
> >>>
> >>> The current version, under R-devel, is at
> >>>
> >>> devtools::source_gist("https://gist.github.com/mtmorgan/9f98871adb9f0c1891a4")
> >>>
> >>> > methods(class="SummarizedExperiment")
> >>>  [1] [                 [[                [[<-              [<-
> >>>  [5] $                 $<-               assay             assay<-
> >>>  [9] assayNames        assayNames<-      assays            assays<-
> >>> [13] cbind             coerce            colData           colData<-
> >>> [17] compare           Compare           countOverlaps     coverage
> >>> [21] dim               dimnames          dimnames<-        disjointBins
> >>> [25] distance          distanceToNearest duplicated        elementMetadata
> >>> [29] elementMetadata<- end               end<-             exptData
> >>> [33] exptData<-        extractROWS       findOverlaps      flank
> >>> [37] follow            granges           isDisjoint        mcols
> >>> [41] mcols<-           narrow            nearest           order
> >>> [45] overlapsAny       precede           ranges

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-09 Thread Michael Love
Some guidance on how to avoid duplication of the matrix for developers
would be greatly appreciated.

Another trouble point: if I am given an SE with an unnamed assay and
need to give the assay a name, this can also expand the memory used.
I had found a solution (which works with GenomicRanges 1.18 / current
release) with:

names(assays(se, withDimnames=FALSE))[1] <- "foo"

But now I'm looking in devel and this appears to no longer work. The
memory used expands, equivalent to:

names(assays(se))[1] <- "foo"

Here's some code to try this:

m <- matrix(1:1e7,ncol=10,dimnames=list(1:1e6,1:10))
se <- SummarizedExperiment(m)
names(assays(se, withDimnames=FALSE))[1] <- "foo"
names(assays(se))[1] <- "foo"

while running gc() in between steps.
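
One way to read the gc() numbers between steps, sketched in base R only. `peak_growth_bytes()` is a hypothetical helper, and the example measures a plain matrix rename rather than an SE, so it illustrates the measuring technique and not the regression itself.

```r
# Hypothetical helper: peak growth of the vector heap (in bytes) while
# evaluating 'expr', read from the "used" and "max used" columns of gc().
peak_growth_bytes <- function(expr) {
    base <- gc(reset = TRUE)["Vcells", "used"]   # collect and reset "max used"
    force(expr)                                   # run the step being measured
    peak <- gc()["Vcells", "max used"]            # peak since the reset
    8 * (peak - base)                             # Vcells are 8-byte units
}

m <- matrix(1:1e6, ncol = 10)                     # about 4 MB of integers

# Renaming the columns of a shared matrix forces a duplicate of the data.
grown <- peak_growth_bytes({ x <- m; colnames(x) <- LETTERS[1:10]; x })
grown / 2^20                                      # growth in MB
```

A step that only touches names or metadata should show growth far below the size of the matrix; growth on the order of the matrix size suggests a full copy.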


On Mon, Mar 9, 2015 at 10:36 AM, Kasper Daniel Hansen
 wrote:
> On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey 
> wrote:
>
>> I am glad you are keeping this discussion alive Kasper.
>>
>> On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
>> kasperdanielhan...@gmail.com> wrote:
>>
>>> It sounds like the proposed changes are already made.  However (like
>>> others) I am still a bit mystified why this was necessary.  The old
>>> version
>>> did allow for a GRanges inside the DataFrame of the rowData, as far as I
>>> recall.  So I assume this is for efficiency.  But why?  What kind of
>>> data/use cases is this for?
>>>
>>> I am happy to hear that SummarizedExperiment is going to be spun out into
>>> its own package.  When that happens, I have some comments, which I'll
>>> include here in anticipation
>>>   1) I now very strongly believe it was a design mistake to not have
>>> colnames on the assays.  The advantage of this choice is that sampleNames
>>> are only stored one place.  The extreme disadvantage is the high
>>> inefficiency when you want colnames on an extracted assay.
>>>
>>
>> after example(SummarizedExperiment)
>>
>> > colnames(assays(se1)[[1]])
>> [1] "A" "B" "C" "D" "E" "F"
>>
>> so this seems to be optional.  But attempts to set rownames will fail
>> silently
>>
>> > rownames(assays(se1)[[1]]) = as.character(1:200)
>>
>> > rownames(assays(se1)[[1]])
>>
>> NULL
>> seems we could issue a warning there
>>
>
>
> Vince, you need to be careful here.
>
> The assays are stored without colnames (unless something has recently
> changed).  The default is to - upon extraction - set the colnames of the
> matrix.  This however requires a copy of the entire matrix.  So
> essentially, upon extraction, each assay is needlessly duplicated to add
> the colnames.  This is what I mean by inefficient. I would prefer to store
> the assays with colnames.  This means that changing sampleNames of the
> object will be inefficient (as it is for eSets) since it would require a
> complete copy of everything.  But I would rather - much rather - copy when
> setting sampleNames than copy when extracting an assay.
>
> Best,
> Kasper
>


Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-09 Thread Martin Morgan

On 03/09/2015 07:36 AM, Kasper Daniel Hansen wrote:

On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey 
wrote:


I am glad you are keeping this discussion alive Kasper.

On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
kasperdanielhan...@gmail.com> wrote:


It sounds like the proposed changes are already made.  However (like
others) I am still a bit mystified why this was necessary.  The old
version
did allow for a GRanges inside the DataFrame of the rowData, as far as I
recall.  So I assume this is for efficiency.  But why?  What kind of
data/use cases is this for?


Actually the design has GRanges on the 'outside'; a DataFrame can be emulated 
with GRangesList of 0 elements, with mcols() the DataFrame, but this is 
obviously a hack. Simon Anders argued perhaps 5 years ago for DataFrame on the 
'outside', allowing for GRanges on the inside; maybe that would have been a 
better original design, but I guess I was stuck on the defining 
characteristic of sequencing experiments being range-based, the expressive power 
of reliably overlapping say ranges of differentially expressed genes with ranges 
of variants or ChIP binding sites, and a desire not to introduce a plethora 
(e.g., 2) of classes.


You can think of what has been done so far as simply renaming the accessor, from 
a bland rowData() to more meaningful rowRanges(). This did come about from 
discussion of community input (starting at 
https://stat.ethz.ch/pipermail/bioc-devel/2014-November/006686.html), just 
perhaps not consistent with all opinions expressed. We felt it was important to 
get this first step done 'this release', because it frees us to do more 
substantial refactoring immediately after the coming release while allowing 
rowData() a chance to cycle out of existence.


The exact nature of the refactoring implementation is still not decided, but the 
conceptual ideas are to enable a SummarizedExperiment (sub)class that does not 
require a GRanges* rowData, while retaining a SummarizedExperiment (sub)class 
that is based on GRanges rowData / rowRanges.




I am happy to hear that SummarizedExperiment is going to be spun out into
its own package.  When that happens, I have some comments, which I'll
include here in anticipation
   1) I now very strongly believe it was a design mistake to not have
colnames on the assays.  The advantage of this choice is that sampleNames
are only stored one place.  The extreme disadvantage is the high
inefficiency when you want colnames on an extracted assay.



after example(SummarizedExperiment)


colnames(assays(se1)[[1]])

[1] "A" "B" "C" "D" "E" "F"

so this seems to be optional.  But attempts to set rownames will fail
silently


rownames(assays(se1)[[1]]) = as.character(1:200)



rownames(assays(se1)[[1]])


NULL
seems we could issue a warning there


the rownames issue seems to be a bug; simply accessing row and colnames on the 
object itself is sufficient


  > colnames(se1) = tolower(colnames(se1))
  > colnames(se1)
  [1] "a" "b" "c" "d" "e" "f"
  > rownames(se1) = 1:200
  > head(rownames(se1))
  [1] "1" "2" "3" "4" "5" "6"






Vince, you need to be careful here.

The assays are stored without colnames (unless something has recently
changed).  The default is to - upon extraction - set the colnames of the
matrix.  This however requires a copy of the entire matrix.  So
essentially, upon extraction, each assay is needlessly duplicated to add
the colnames.  This is what I mean by inefficient. I would prefer to store


yes this is certainly a bad design decision, and will be corrected.


the assays with colnames.  This means that changing sampleNames of the
object will be inefficient (as it is for eSets) since it would require a
complete copy of everything.  But I would rather - much rather - copy when
setting sampleNames than copy when extracting an assay.

Best,
Kasper





--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-09 Thread Martin Morgan

On 03/09/2015 07:06 AM, Kasper Daniel Hansen wrote:

It sounds like the proposed changes are already made.  However (like
others) I am still a bit mystified why this was necessary.  The old version
did allow for a GRanges inside the DataFrame of the rowData, as far as I
recall.  So I assume this is for efficiency.  But why?  What kind of
data/use cases is this for?

I am happy to hear that SummarizedExperiment is going to be spun out into
its own package.  When that happens, I have some comments, which I'll
include here in anticipation
   1) I now very strongly believe it was a design mistake to not have
colnames on the assays.  The advantage of this choice is that sampleNames
are only stored one place.  The extreme disadvantage is the high
inefficiency when you want colnames on an extracted assay.
   2) I still strongly believe we should support pData, sampleNames etc etc
on SummarizedExperiments.


I'm not keen on this 'backward compatibility' layer, or introducing functions 
with duplicate functionality, even if their implementation is just a 'one 
liner'; use rownames, colData, etc.



   3) Having developed a package (minfi) where eSets co-exist with
SummarizedExperiment, I have to mention that for the developer there are a
number of places where the different internals of these two classes make
life irritating.  For this reason I would support a "modern" implementation
of eSet, in parallel with SummarizedExperiment.


Yes, the intention is that a SummarizedExperiment (sub) class with rowData() 
being a DataFrame would be a replacement for eSet.


I don't think you were suggesting that eSet itself should be modernized; it has 
a lot of historical baggage.


Martin



Best,
Kasper

On Fri, Mar 6, 2015 at 10:59 AM, Valerie Obenchain 
wrote:


Hi Mike,

Our error - we didn't bump GenomicRanges when rowRanges was added.
Hopefully 1.19.43 will propagate today and things will be sorted out.

Val


On 03/06/2015 07:40 AM, Michael Love wrote:


hi all,

just a practical issue: I have GenomicRanges version 1.19.42 on my
computer which does not have rowRanges defined, although the 1.19.42
version on the Bioc website does have rowRanges in the man page:

http://master.bioconductor.org/packages/3.1/bioc/html/GenomicRanges.html

So I pass check locally but not in the devel branch on Bioc servers.

  library(GenomicRanges)

rowRanges


Error: object 'rowRanges' not found


sessionInfo()


R Under development (unstable) (2014-12-08 r67137)
Platform: x86_64-apple-darwin12.5.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] GenomicRanges_1.19.42 GenomeInfoDb_1.3.13   IRanges_2.1.41        S4Vectors_0.5.21
[5] BiocGenerics_0.13.6   RUnit_0.4.28          devtools_1.7.0        knitr_1.9
[9] BiocInstaller_1.17.5



On Wed, Mar 4, 2015 at 3:03 PM, Martin Morgan 
wrote:



On 03/04/2015 10:03 AM, Peter Haverty wrote:



Michael has a good point. The complexity of the BioC universe of classes
hurts our ability to attract new users. More classes would be a minus
there
... but a small set of common, explicit APIs would simplify things.
Rectangular things implement the matrix Interface.  :-) Deprecating old
stuff, like eSet, might help more than it hurts, on the simplicity
front.

P.S. apropos of understanding this universe of classes, I *love* the
methods(class=x) thing Vincent mentioned.




The current version, under R-devel, is at

devtools::source_gist("https://gist.github.com/mtmorgan/9f98871adb9f0c1891a4")

> methods(class="SummarizedExperiment")
 [1] [                 [[                [[<-              [<-
 [5] $                 $<-               assay             assay<-
 [9] assayNames        assayNames<-      assays            assays<-
[13] cbind             coerce            colData           colData<-
[17] compare           Compare           countOverlaps     coverage
[21] dim               dimnames          dimnames<-        disjointBins
[25] distance          distanceToNearest duplicated        elementMetadata
[29] elementMetadata<- end               end<-             exptData
[33] exptData<-        extractROWS       findOverlaps      flank
[37] follow            granges           isDisjoint        mcols
[41] mcols<-           narrow            nearest           order
[45] overlapsAny       precede           ranges            ranges<-
[49] rank              rbind             replaceROWS       resize
[53] restrict          rowData           rowData<-         seqinfo
[57] seqinfo<-         seqnames          shift             show
[61] sort              split             start             start<-
[65] strand            strand<-          subset            subsetByOverlaps
[69] updateObject      values            values<-          width
[73] width<-

see ?"methods" for accessing help and 

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-09 Thread Martin Morgan

On 03/09/2015 08:07 AM, Michael Love wrote:

Some guidance on how to avoid duplication of the matrix for developers
would be greatly appreciated.


It's unsatisfactory, but using withDimnames=FALSE avoids duplication on 
extraction of assays (but obviously you don't have dimnames on the matrix). Row 
or column subsetting necessarily causes the subsetted assay data to be 
duplicated. There should not be any duplication when rowRanges() or colData() 
are changed without changing their dimension / ordering.
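
A rough base-R illustration of the two cases above, using a plain list as a toy container rather than a real SummarizedExperiment. `same_object()` and `toy` are hypothetical names; the helper relies on tracemem() and returns NA on R builds without memory profiling.

```r
# Hypothetical helper: TRUE if 'a' and 'b' are the same object in memory,
# NA when this R build lacks memory profiling (tracemem unavailable).
same_object <- function(a, b) {
    if (!capabilities("profmem")) return(NA)
    addr_a <- tracemem(a); untracemem(a)
    addr_b <- tracemem(b); untracemem(b)
    identical(addr_a, addr_b)
}

# Toy container (a plain list), standing in for an SE-like object.
toy <- list(assay   = matrix(1:1e6, ncol = 10),
            colData = data.frame(id = LETTERS[1:10]))
before <- toy$assay

# Changing column metadata without changing its dimension or ordering.
toy$colData$treatment <- rep(c("a", "b"), 5)
unchanged <- same_object(before, toy$assay)   # assay expected to be untouched

# Column subsetting has to build a new, smaller matrix.
sub <- toy$assay[, 1:5]
copied <- !same_object(before, sub)
```

The same reasoning is why a metadata-only update can be cheap while any row or column subset pays for a copy of the retained assay data.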



Another example of a trouble point, is that if I am given an SE with
an unnamed assay and I need to give the assay a name, this also can
expand the memory used. I had found a solution (which works with
GenomicRanges 1.18 / current release) with:

names(assays(se, withDimnames=FALSE))[1] <- "foo"

But now I'm looking in devel and this appears to no longer work. The
memory used expands, equivalent to:

names(assays(se))[1] <- "foo"

Here's some code to try this:

m <- matrix(1:1e7,ncol=10,dimnames=list(1:1e6,1:10))
se <- SummarizedExperiment(m)
names(assays(se, withDimnames=FALSE))[1] <- "foo"
names(assays(se))[1] <- "foo"

while running gc() in between steps.


I think this is a regression of some sort, and I'll look into it. Thanks for the 
heads-up.


Martin




On Mon, Mar 9, 2015 at 10:36 AM, Kasper Daniel Hansen
 wrote:

On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey 
wrote:


I am glad you are keeping this discussion alive Kasper.

On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
kasperdanielhan...@gmail.com> wrote:


It sounds like the proposed changes are already made.  However (like
others) I am still a bit mystified why this was necessary.  The old
version
did allow for a GRanges inside the DataFrame of the rowData, as far as I
recall.  So I assume this is for efficiency.  But why?  What kind of
data/use cases is this for?

I am happy to hear that SummarizedExperiment is going to be spun out into
its own package.  When that happens, I have some comments, which I'll
include here in anticipation
   1) I now very strongly believe it was a design mistake to not have
colnames on the assays.  The advantage of this choice is that sampleNames
are only stored one place.  The extreme disadvantage is the high
inefficiency when you want colnames on an extracted assay.



after example(SummarizedExperiment)


colnames(assays(se1)[[1]])

[1] "A" "B" "C" "D" "E" "F"

so this seems to be optional.  But attempts to set rownames will fail
silently


rownames(assays(se1)[[1]]) = as.character(1:200)



rownames(assays(se1)[[1]])


NULL
seems we could issue a warning there




Vince, you need to be careful here.

The assays are stored without colnames (unless something has recently
changed).  The default is to - upon extraction - set the colnames of the
matrix.  This however requires a copy of the entire matrix.  So
essentially, upon extraction, each assay is needlessly duplicated to add
the colnames.  This is what I mean by inefficient. I would prefer to store
the assays with colnames.  This means that changing sampleNames of the
object will be inefficient (as it is for eSets) since it would require a
complete copy of everything.  But I would rather - much rather - copy when
setting sampleNames than copy when extracting an assay.

Best,
Kasper





--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-09 Thread Michael Love
On Mar 9, 2015 12:36 PM, "Martin Morgan"  wrote:
>
> On 03/09/2015 08:07 AM, Michael Love wrote:
>>
>> Some guidance on how to avoid duplication of the matrix for developers
>> would be greatly appreciated.
>
>
> It's unsatisfactory, but using withDimnames=FALSE avoids duplication on
extraction of assays (but obviously you don't have dimnames on the matrix).
Row or column subsetting necessarily causes the subsetted assay data to be
duplicated. There should not be any duplication when rowRanges() or
colData() are changed without changing their dimension / ordering.
>

Thanks Martin for checking into the regression.

Sorry, I should have been more specific earlier, I meant more
guidance/documentation in the man page for SE. I scanned the 'Extension'
section but didn't find a note on withDimnames for extracting the matrix or
this example of renaming the assays (it seems like this could easily be
relevant for other package authors).

A prominent note there might help devs write more memory efficient
packages.

The argument section mentions speed but I'd explicitly mention memory given
that we're often storing big matrices:

"Setting withDimnames=FALSE  increases the speed with which assays are
extracted."

(it's entirely possible the info is there but I missed it)

Best,

Mike

>
>> Another example of a trouble point, is that if I am given an SE with
>> an unnamed assay and I need to give the assay a name, this also can
>> expand the memory used. I had found a solution (which works with
>> GenomicRanges 1.18 / current release) with:
>>
>> names(assays(se, withDimnames=FALSE))[1] <- "foo"
>>
>> But now I'm looking in devel and this appears to no longer work. The
>> memory used expands, equivalent to:
>>
>> names(assays(se))[1] <- "foo"
>>
>> Here's some code to try this:
>>
>> m <- matrix(1:1e7,ncol=10,dimnames=list(1:1e6,1:10))
>> se <- SummarizedExperiment(m)
>> names(assays(se, withDimnames=FALSE))[1] <- "foo"
>> names(assays(se))[1] <- "foo"
>>
>> while running gc() in between steps.
>
>
> I think this is a regression of some sort, and I'll look into it. Thanks
for the heads-up.
>
> Martin
>
>
>>
>>
>> On Mon, Mar 9, 2015 at 10:36 AM, Kasper Daniel Hansen
>>  wrote:
>>>
>>> On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey <
st...@channing.harvard.edu>
>>> wrote:
>>>
 I am glad you are keeping this discussion alive Kasper.

 On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
 kasperdanielhan...@gmail.com> wrote:

> It sounds like the proposed changes are already made.  However (like
> others) I am still a bit mystified why this was necessary.  The old
> version
> did allow for a GRanges inside the DataFrame of the rowData, as far
as I
> recall.  So I assume this is for efficiency.  But why?  What kind of
> data/use cases is this for?
>
> I am happy to hear that SummarizedExperiment is going to be spun out
into
> its own package.  When that happens, I have some comments, which I'll
> include here in anticipation
>1) I now very strongly believe it was a design mistake to not have
> colnames on the assays.  The advantage of this choice is that
sampleNames
> are only stored one place.  The extreme disadvantage is the high
> inefficiency when you want colnames on an extracted assay.
>

 after example(SummarizedExperiment)

> colnames(assays(se1)[[1]])

 [1] "A" "B" "C" "D" "E" "F"

 so this seems to be optional.  But attempts to set rownames will fail
 silently

> rownames(assays(se1)[[1]]) = as.character(1:200)


> rownames(assays(se1)[[1]])


 NULL
 seems we could issue a warning there

>>>
>>>
>>> Vince, you need to be careful here.
>>>
>>> The assays are stored without colnames (unless something has recently
>>> changed).  The default is to - upon extraction - set the colnames of the
>>> matrix.  This however requires a copy of the entire matrix.  So
>>> essentially, upon extraction, each assay is needlessly duplicated to add
>>> the colnames.  This is what I mean by inefficient. I would prefer to
store
>>> the assays with colnames.  This means that changing sampleNames of the
>>> object will be inefficient (as it is for eSets) since it would require a
>>> complete copy of everything.  But I would rather - much rather - copy
when
>>> setting sampleNames than copy when extracting an assay.
>>>
>>> Best,
>>> Kasper
>>>
>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-31 Thread Michael Love
With GenomicRanges 1.19.48, I'm still having issues with re-naming the
first assay and duplication of memory from my March 9 email. I tried
assayNames<- as well. My use case is if I am given a
SummarizedExperiment where the first element is not named "counts"
(albeit the SE is most likely coming from summarizeOverlaps() and
already named "counts"...).

> sessionInfo()
R Under development (unstable) (2015-03-31 r68129)
Platform: x86_64-apple-darwin12.5.0 (64-bit)
Running under: OS X 10.8.5 (Mountain Lion)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] GenomicRanges_1.19.48 GenomeInfoDb_1.3.16   IRanges_2.1.43        S4Vectors_0.5.22
[5] BiocGenerics_0.13.10  testthat_0.9.1        devtools_1.7.0        knitr_1.9
[9] BiocInstaller_1.17.6

loaded via a namespace (and not attached):
[1] formatR_1.1    XVector_0.7.4  tools_3.3.0    stringr_0.6.2  evaluate_0.5.5

On Mon, Mar 9, 2015 at 1:21 PM, Michael Love
 wrote:
>
>
> On Mar 9, 2015 12:36 PM, "Martin Morgan"  wrote:
> >
> > On 03/09/2015 08:07 AM, Michael Love wrote:
> >>
> >> Some guidance on how to avoid duplication of the matrix for developers
> >> would be greatly appreciated.
> >
> >
> > It's unsatisfactory, but using withDimnames=FALSE avoids duplication on 
> > extraction of assays (but obviously you don't have dimnames on the matrix). 
> > Row or column subsetting necessarily causes the subsetted assay data to be 
> > duplicated. There should not be any duplication when rowRanges() or 
> > colData() are changed without changing their dimension / ordering.
> >
>
> Thanks Martin for checking into the regression.
>
> Sorry, I should have been more specific earlier, I meant more 
> guidance/documentation in the man page for SE. I scanned the 'Extension' 
> section but didn't find a note on withDimnames for extracting the matrix or 
> this example of renaming the assays (it seems like this could easily be 
> relevant for other package authors).
>
> A prominent note there might help devs write more memory efficient packages.
>
> The argument section mentions speed but I'd explicitly mention memory given 
> that we're often storing big matrices:
>
> "Setting withDimnames=FALSE  increases the speed with which assays are 
> extracted."
>
> (it's entirely possible the info is there but I missed it)
>
> Best,
>
> Mike
>
> >
> >> Another example of a trouble point, is that if I am given an SE with
> >> an unnamed assay and I need to give the assay a name, this also can
> >> expand the memory used. I had found a solution (which works with
> >> GenomicRanges 1.18 / current release) with:
> >>
> >> names(assays(se, withDimnames=FALSE))[1] <- "foo"
> >>
> >> But now I'm looking in devel and this appears to no longer work. The
> >> memory used expands, equivalent to:
> >>
> >> names(assays(se))[1] <- "foo"
> >>
> >> Here's some code to try this:
> >>
> >> m <- matrix(1:1e7,ncol=10,dimnames=list(1:1e6,1:10))
> >> se <- SummarizedExperiment(m)
> >> names(assays(se, withDimnames=FALSE))[1] <- "foo"
> >> names(assays(se))[1] <- "foo"
> >>
> >> while running gc() in between steps.
> >
> >
> > I think this is a regression of some sort, and I'll look into it. Thanks 
> > for the heads-up.
> >
> > Martin
> >
> >
> >>
> >>
> >> On Mon, Mar 9, 2015 at 10:36 AM, Kasper Daniel Hansen
> >>  wrote:
> >>>
> >>> On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey 
> >>> 
> >>> wrote:
> >>>
>  I am glad you are keeping this discussion alive Kasper.
> 
>  On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
>  kasperdanielhan...@gmail.com> wrote:
> 
> > It sounds like the proposed changes are already made.  However (like
> > others) I am still a bit mystified why this was necessary.  The old
> > version
> > did allow for a GRanges inside the DataFrame of the rowData, as far as I
> > recall.  So I assume this is for efficiency.  But why?  What kind of
> > data/use cases is this for?
> >
> > I am happy to hear that SummarizedExperiment is going to be spun out 
> > into
> > its own package.  When that happens, I have some comments, which I'll
> > include here in anticipation
> >1) I now very strongly believe it was a design mistake to not have
> > colnames on the assays.  The advantage of this choice is that 
> > sampleNames
> > are only stored one place.  The extreme disadvantage is the high
> > inefficiency when you want colnames on an extracted assay.
> >
> 
>  after example(SummarizedExperiment)
> 
> > colnames(assays(se1)[[1]])
> 
>  [1] "A" "B" "C" "D" "E" "F"
> 
>  so this seems to be optional.  But attempts to set rownames will fail
>  silently
> 
> > rownames(assays(se1)[[1]]) = as.character(1:200)
> 
> 
> > rownames(assays(se1)[[1]])
> 
> 
>  NULL
>  seems we c

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-31 Thread Michael Love
I forgot to ask my other question. I had gone in early March and fixed
my code to eliminate rowData<-, but the argument to SummarizedExperiment
was still called rowData, and a DataFrame could be provided. Then I
didn't check for a few weeks, but the argument for the rowData slot is
now called rowRanges. What's the trick to putting a DataFrame on an
empty GRanges, so I can get the old behavior but now using the rowRanges
argument?

On Tue, Mar 31, 2015 at 3:40 PM, Michael Love
 wrote:
> With GenomicRanges 1.19.48, I'm still having issues with re-naming the
> first assay and duplication of memory from my March 9 email. I tried
> assayNames<- as well. My use case is if I am given a
> SummarizedExperiment where the first element is not named "counts"
> (albeit the SE is most likely coming from summarizeOverlaps() and
> already named "counts"...).
>
>> sessionInfo()
> R Under development (unstable) (2015-03-31 r68129)
> Platform: x86_64-apple-darwin12.5.0 (64-bit)
> Running under: OS X 10.8.5 (Mountain Lion)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base
>
> other attached packages:
> [1] GenomicRanges_1.19.48 GenomeInfoDb_1.3.16   IRanges_2.1.43        S4Vectors_0.5.22
> [5] BiocGenerics_0.13.10  testthat_0.9.1        devtools_1.7.0        knitr_1.9
> [9] BiocInstaller_1.17.6
>
> loaded via a namespace (and not attached):
> [1] formatR_1.1    XVector_0.7.4  tools_3.3.0    stringr_0.6.2  evaluate_0.5.5
>
> On Mon, Mar 9, 2015 at 1:21 PM, Michael Love
>  wrote:
>>
>>
>> On Mar 9, 2015 12:36 PM, "Martin Morgan"  wrote:
>> >
>> > On 03/09/2015 08:07 AM, Michael Love wrote:
>> >>
>> >> Some guidance on how to avoid duplication of the matrix for developers
>> >> would be greatly appreciated.
>> >
>> >
>> > It's unsatisfactory, but using withDimnames=FALSE avoids duplication on 
>> > extraction of assays (but obviously you don't have dimnames on the 
>> > matrix). Row or column subsetting necessarily causes the subsetted assay 
>> > data to be duplicated. There should not be any duplication when 
>> > rowRanges() or colData() are changed without changing their dimension / 
>> > ordering.
>> >
>>
>> Thanks Martin for checking into the regression.
>>
>> Sorry, I should have been more specific earlier, I meant more 
>> guidance/documentation in the man page for SE. I scanned the 'Extension' 
>> section but didn't find a note on withDimnames for extracting the matrix or 
>> this example of renaming the assays (it seems like this could easily be 
>> relevant for other package authors).
>>
>> A prominent note there might help devs write more memory efficient packages.
>>
>> The argument section mentions speed but I'd explicitly mention memory given 
>> that we're often storing big matrices:
>>
>> "Setting withDimnames=FALSE  increases the speed with which assays are 
>> extracted."
>>
>> (it's entirely possible the info is there but I missed it)
>>
>> Best,
>>
>> Mike
>>
>> >
>> >> Another example of a trouble point, is that if I am given an SE with
>> >> an unnamed assay and I need to give the assay a name, this also can
>> >> expand the memory used. I had found a solution (which works with
>> >> GenomicRanges 1.18 / current release) with:
>> >>
>> >> names(assays(se, withDimnames=FALSE))[1] <- "foo"
>> >>
>> >> But now I'm looking in devel and this appears to no longer work. The
>> >> memory used expands, equivalent to:
>> >>
>> >> names(assays(se))[1] <- "foo"
>> >>
>> >> Here's some code to try this:
>> >>
>> >> m <- matrix(1:1e7,ncol=10,dimnames=list(1:1e6,1:10))
>> >> se <- SummarizedExperiment(m)
>> >> names(assays(se, withDimnames=FALSE))[1] <- "foo"
>> >> names(assays(se))[1] <- "foo"
>> >>
>> >> while running gc() in between steps.
>> >
>> >
>> > I think this is a regression of some sort, and I'll look into it. Thanks 
>> > for the heads-up.
>> >
>> > Martin
>> >
>> >
>> >>
>> >>
>> >> On Mon, Mar 9, 2015 at 10:36 AM, Kasper Daniel Hansen
>> >>  wrote:
>> >>>
>> >>> On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey 
>> >>> 
>> >>> wrote:
>> >>>
>> >>>> I am glad you are keeping this discussion alive Kasper.
>> >>>>
>> >>>> On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
>> >>>> kasperdanielhan...@gmail.com> wrote:
>> >>>>
>> >>>>> It sounds like the proposed changes are already made.  However (like
>> >>>>> others) I am still a bit mystified why this was necessary.  The old
>> >>>>> version did allow for a GRanges inside the DataFrame of the rowData,
>> >>>>> as far as I recall.  So I assume this is for efficiency.  But why?
>> >>>>> What kind of data/use cases is this for?
>> >>>>>
>> >>>>> I am happy to hear that SummarizedExperiment is going to be spun out
>> >>>>> into its own package.  When that happens, I have some comments, which
>> >>>>> I'll include here in anticipation
>> >>>>>    1) I now very strongly believe it was a d

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-31 Thread Michael Love
Would this code inspired by the release version of GenomicRanges work?
e.g. if I want to add a DataFrame with 10 rows:

names <- letters[1:10]
x <- relist(GRanges(), PartitioningByEnd(integer(10), names=names))
mcols(x) <- DataFrame(foo=1:10)

Then give x to the rowRanges argument of SummarizedExperiment?

On Tue, Mar 31, 2015 at 3:49 PM, Michael Love
 wrote:
> I forgot to ask my other question. I had gone in early March and fixed
> my code to eliminate rowData<-, but the argument to SummarizedExperiment
> was still called rowData, and a DataFrame could be provided. Then I
> didn't check for a few weeks, but the argument for the rowData slot is
> now called rowRanges. What's the trick to putting a DataFrame on an
> empty GRanges, so I can get the old behavior but now using the rowRanges
> argument?
>

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-04-01 Thread Michael Love
I'll retract those last two emails about empty GRanges. That's simply:

se <- SummarizedExperiment(assays, colData=colData)
mcols(se) <- myDataFrame
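
A minimal, self-contained sketch of that two-line pattern, assuming the current standalone SummarizedExperiment package rather than the GenomicRanges 1.19.x devel version discussed in this thread. The toy matrix, the sample names and the "foo" column are made up for illustration only:

```r
library(SummarizedExperiment)  # assumption: standalone package, not GenomicRanges 1.19.x

# Hypothetical toy inputs: 10 features x 2 samples
counts  <- matrix(1:20, nrow = 10,
                  dimnames = list(letters[1:10], c("s1", "s2")))
colData <- DataFrame(condition = c("ctrl", "trt"),
                     row.names = c("s1", "s2"))
myDataFrame <- DataFrame(foo = 1:10)

# Build the object without any range information ...
se <- SummarizedExperiment(assays = list(counts = counts), colData = colData)

# ... then attach the per-row annotation afterwards
mcols(se) <- myDataFrame
mcols(se)$foo
```

The DataFrame must have one row per feature (here 10); no GRanges has to be constructed by hand.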

On Tue, Mar 31, 2015 at 4:40 PM, Michael Love
 wrote:
> Would this code inspired by the release version of GenomicRanges work?
> e.g. if I want to add a DataFrame with 10 rows:
>
> names <- letters[1:10]
> x <- relist(GRanges(), PartitioningByEnd(integer(10), names=names))
> mcols(x) <- DataFrame(foo=1:10)
>
> Then give x to the rowRanges argument of SummarizedExperiment?
>

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-04-01 Thread Hervé Pagès

Hi Michael,

On 04/01/2015 07:17 AM, Michael Love wrote:

> I'll retract those last two emails about empty GRanges. That's simply:
>
> se <- SummarizedExperiment(assays, colData=colData)
> mcols(se) <- myDataFrame


Glad you found a simple way to do what you wanted.

More below...



> On Tue, Mar 31, 2015 at 4:40 PM, Michael Love
>  wrote:
>>
>> Would this code inspired by the release version of GenomicRanges work?
>> e.g. if I want to add a DataFrame with 10 rows:
>>
>> names <- letters[1:10]
>> x <- relist(GRanges(), PartitioningByEnd(integer(10), names=names))
>> mcols(x) <- DataFrame(foo=1:10)
>>
>> Then give x to the rowRanges argument of SummarizedExperiment?
>>
>> On Tue, Mar 31, 2015 at 3:49 PM, Michael Love
>>  wrote:
>>>
>>> I forgot to ask my other question. I had gone in early March and fixed
>>> my code to eliminate rowData<-, but the argument to SummarizedExperiment
>>> was still called rowData, and a DataFrame could be provided. Then I
>>> didn't check for a few weeks, but the argument for the rowData slot is
>>> now called rowRanges. What's the trick to putting a DataFrame on an
>>> empty GRanges, so I can get the old behavior but now using the rowRanges
>>> argument?


I'm not sure what you meant by "so I can get the old behavior but
now using the rowRanges argument".

Just to clarify: the renaming of rowData to rowRanges is a change
of name only, not a change of behavior. More precisely the new
rowRanges() accessor should behave exactly as the old rowData()
accessor. The same applies to the 'rowRanges' argument of the
SummarizedExperiment() constructor. So whatever you were passing
before to the 'rowData' argument, you should still be able to pass
it to the new 'rowRanges' argument. Please let us know if it's not
the case as this is certainly not intended.

Thanks,
H.
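
As an illustration of the renamed argument, here is a small sketch assuming the current standalone SummarizedExperiment package (which attaches GenomicRanges). The ranges, the zero-filled matrix and the "foo" column are hypothetical values chosen only to make the example run:

```r
library(SummarizedExperiment)  # assumption: standalone package; attaches GenomicRanges

# Hypothetical ranges: ten 5-bp features on chr1, carrying a metadata column
gr <- GRanges("chr1", IRanges(start = seq(1, 91, by = 10), width = 5))
mcols(gr) <- DataFrame(foo = 1:10)

# What used to go to 'rowData' is passed to 'rowRanges' instead
se <- SummarizedExperiment(assays = list(counts = matrix(0L, 10, 2)),
                           rowRanges = gr)

rowRanges(se)   # the GRanges, including its 'foo' column
mcols(se)$foo   # the same column, reached through the experiment object
```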




Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-04-01 Thread Michael Love
Yes, you're right! Sorry for the noise. I forgot this was how it
always behaved. All I had to do was change the argument name.

On Wed, Apr 1, 2015 at 3:51 PM, Hervé Pagès  wrote:
> Hi Michael,
>
> On 04/01/2015 07:17 AM, Michael Love wrote:
>>
>> I'll retract those last two emails about empty GRanges. That's simply:
>>
>> se <- SummarizedExperiment(assays, colData=colData)
>> mcols(se) <- myDataFrame
>
>
> Glad you found a simple way to do what you wanted.
>
> More below...
>
>>
>> On Tue, Mar 31, 2015 at 4:40 PM, Michael Love
>>  wrote:
>>>
>>> Would this code inspired by the release version of GenomicRanges work?
>>> e.g. if I want to add a DataFrame with 10 rows:
>>>
>>> names <- letters[1:10]
>>> x <- relist(GRanges(), PartitioningByEnd(integer(10), names=names))
>>> mcols(x) <- DataFrame(foo=1:10)
>>>
>>> Then give x to the rowRanges argument of SummarizedExperiment?
>>>
>>> On Tue, Mar 31, 2015 at 3:49 PM, Michael Love
>>>  wrote:

>>>> I forgot to ask my other question. I had gone in early March and fixed
>>>> my code to eliminate rowData<-, but the argument to SummarizedExperiment
>>>> was still called rowData, and a DataFrame could be provided. Then I
>>>> didn't check for a few weeks, but the argument for the rowData slot is
>>>> now called rowRanges. What's the trick to putting a DataFrame on an
>>>> empty GRanges, so I can get the old behavior but now using the rowRanges
>>>> argument?
>
>
> I'm not sure what you meant by "so I can get the old behavior but
> now using the rowRanges argument".
>
> Just to clarify: the renaming of rowData to rowRanges is a change
> of name only, not a change of behavior. More precisely the new
> rowRanges() accessor should behave exactly as the old rowData()
> accessor. The same applies to the 'rowRanges' argument of the
> SummarizedExperiment() constructor. So whatever you were passing
> before to the 'rowData' argument, you should still be able to pass
> it to the new 'rowRanges' argument. Please let us know if it's not
> the case as this is certainly not intended.
>
> Thanks,
> H.
>
>


Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-04-01 Thread Martin Morgan

On 03/31/2015 12:40 PM, Michael Love wrote:

> With GenomicRanges 1.19.48, I'm still having issues with re-naming the
> first assay and duplication of memory from my March 9 email. I tried
> assayNames<- as well. My use case is if I am given a
> SummarizedExperiment where the first element is not named "counts"
> (albeit the SE is most likely coming from summarizeOverlaps() and
> already named "counts"...).


Thanks for the prompt Mike and sorry for the slow response. gc() is not the most 
effective tool to track memory use; I compiled my R with 
--enable-memory-profiling, and then used tracemem:


  m <- matrix(1:1e7,ncol=10,dimnames=list(1:1e6,1:10))
  tracemem(m)
  se <- SummarizedExperiment(m)

The original behavior was

> names(assays(se)) <- "foo"
tracemem[0x7f49853a1010 -> 0x7f4981734010]: lapply lapply lapply lapply 
endoapply endoapply assays assays
tracemem[0x7f4981734010 -> 0x7f497f10e010]: lapply lapply lapply lapply 
endoapply endoapply assays<- assays<-

>

which shows a memory copy on the way out (the call stack ending with the assays 
accessor S4 generic, then method) and on the way in (the assays<- setter generic 
and method). withDimnames=FALSE gave me


> names(assays(se, withDimnames=FALSE)) <- "foo"
tracemem[0x7f4981734010 -> 0x7f497f10e010]: lapply lapply lapply lapply 
endoapply endoapply assays<- assays<-

>

with the duplication on the way in. GenomicRanges 1.19.50 gives, on a fresh 'se'

> names(assays(se, withDimnames=FALSE)) <- "foo"
>

with no duplication. assayNames<- (which I guess is the 'preferred' setter) 
behaves this way too.
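
A short sketch of the rename-with-tracemem check described above, assuming the current standalone SummarizedExperiment package. The assay name "raw" is hypothetical, and tracemem() is only used when the running R build supports memory profiling; what it prints (if anything) depends on the package version, so no output is shown:

```r
library(SummarizedExperiment)  # assumption: standalone package, not GenomicRanges 1.19.x

m <- matrix(1:1e6, ncol = 10)

# tracemem() reports duplications of 'm'; it needs memory profiling support
can_trace <- capabilities("profmem")
if (can_trace) invisible(tracemem(m))

# Hypothetical starting point: the first assay is not called "counts"
se <- SummarizedExperiment(list(raw = m))

# Rename through the assayNames<- setter instead of names(assays(se))
assayNames(se)[1] <- "counts"
assayNames(se)

if (can_trace) untracemem(m)
```

Any "tracemem[... -> ...]" line printed during the rename would indicate that the matrix was duplicated at that step.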


Thanks for your report and patience.

Martin



