Re: [R] Class that wraps Data Frame

Martin Morgan Fri, 31 Aug 2012 09:35:27 -0700

I guess there are two issues with data.frame. It comes with more than you probably want to support (e.g., list- and matrix-like subsetter [, the user expecting to be able to independently modify any column). And it comes with less than you'd like (e.g., support for a 'column' of S4 objects). By making a class that contains ('is a') data.frame, you commit to both limitations.

You're probably using data.frame as a way to implement some basic restrictions -- equal-length columns, for instance. But there are additional restrictions, too: columns x, y, z must be present and of type integer, character, numeric respectively. For this scenario one is better off implementing an S4 class (which provides type checking and required columns), a validity method (for enforcing the equal-length constraint), accessors, and sub-setting following the semantics that you'd like to support, e.g., just along the length of the required slots.
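A minimal sketch of that recipe, using only the methods package. The class name Record, its constructor, and the accessor xvals() are made up for illustration and do not come from any package; the slots x, y, z mirror the example columns above.

```r
library(methods)

# Hypothetical class: typed slots give the 'required columns' and type checks.
setClass("Record",
    representation(x = "integer", y = "character", z = "numeric"),
    validity = function(object) {
        # Enforce the equal-length constraint across the required slots.
        n <- length(object@x)
        if (length(object@y) != n || length(object@z) != n)
            "slots 'x', 'y', and 'z' must have equal length"
        else TRUE
    })

# Constructor; new() runs the validity method when slots are supplied.
Record <- function(x = integer(), y = character(), z = numeric())
    new("Record", x = x, y = y, z = z)

setMethod("length", "Record", function(x) length(x@x))

# One accessor shown; y and z would follow the same pattern.
setGeneric("xvals", function(object) standardGeneric("xvals"))
setMethod("xvals", "Record", function(object) object@x)

# Sub-setting only along the length of the required slots.
setMethod("[", "Record", function(x, i, j, ..., drop = TRUE) {
    if (missing(i)) return(x)
    new("Record", x = x@x[i], y = x@y[i], z = x@z[i])
})

r <- Record(x = 1:3, y = c("a", "b", "c"), z = c(0.5, 1.5, 2.5))
xvals(r[2:3])
```

Because there is no matrix-like [ and no $<-, users can only reach the data through the operations the class chooses to define.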

The richest place for this in Bioconductor is the IRanges package, though it can be a bit daunting from an architecture point of view. A couple of things to point to. One is the DataFrame class, which is like a data.frame but supporting a broader (in particular S4) set of columns and allowing 'metadata' (actually, DataFrame, so recursive) on each column. It is relevant if it is important to maintain S4 classes in a data.frame-like structure.

Another is the IRanges class, which in some ways fits your overall use case. It is basically a rectangular data structure, but with required 'columns' (the start and width of the range) and then arbitrary columns the user can add. It's implemented with slots for start and width, and then 'has a' slot containing a DataFrame as 'metadata columns' (the actual implementation is more complicated than this). There are start and width accessors. Sub-setting is always list-like (single-dimensional, along the ranges). Users wanting to access one of 'their' columns use $ or extract the metadata columns (via mcols()) as a DataFrame and then work on that. Maybe it's worth pointing out that the basic definitions are column-oriented: an IRanges instance contains start and width vectors; there is no 'IRange' class.
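To make the shape of that design concrete, here is a base-R sketch. It is not the IRanges implementation: the class SimpleRanges and the accessors rangeStart(), rangeWidth(), and metaCols() are hypothetical names, and a plain data.frame stands in for the DataFrame of metadata columns.

```r
library(methods)

# Hypothetical class: required 'columns' are slots, user columns live in 'meta'.
setClass("SimpleRanges",
    representation(start = "integer", width = "integer", meta = "data.frame"),
    validity = function(object) {
        n <- length(object@start)
        if (length(object@width) != n)
            "'start' and 'width' must have equal length"
        else if (nrow(object@meta) != n)
            "'meta' must have one row per range"
        else TRUE
    })

# By default 'meta' is a zero-column data.frame with one row per range.
SimpleRanges <- function(start, width,
                         meta = data.frame(row.names = seq_along(start)))
    new("SimpleRanges", start = as.integer(start), width = as.integer(width),
        meta = meta)

setGeneric("rangeStart", function(object) standardGeneric("rangeStart"))
setMethod("rangeStart", "SimpleRanges", function(object) object@start)
setGeneric("rangeWidth", function(object) standardGeneric("rangeWidth"))
setMethod("rangeWidth", "SimpleRanges", function(object) object@width)
setGeneric("metaCols", function(object) standardGeneric("metaCols"))
setMethod("metaCols", "SimpleRanges", function(object) object@meta)

setMethod("length", "SimpleRanges", function(x) length(x@start))

# $ reaches into the user's metadata columns only.
setMethod("$", "SimpleRanges", function(x, name) x@meta[[name]])

# List-like, single-dimensional sub-setting along the ranges.
setMethod("[", "SimpleRanges", function(x, i, j, ..., drop = TRUE) {
    if (missing(i)) return(x)
    new("SimpleRanges", start = x@start[i], width = x@width[i],
        meta = x@meta[i, , drop = FALSE])
})

r <- SimpleRanges(start = c(1L, 10L, 20L), width = c(5L, 3L, 7L),
                  meta = data.frame(score = c(0.1, 0.2, 0.3)))
r[c(1, 3)]$score
```

Note that the object stores whole vectors of starts and widths; a single range is just a length-one SimpleRanges.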

The GRanges class (in the GenomicRanges package) 'has a' IRanges, but adds additional required slots ('seqnames' to reference the names of the chromosome sequences to which the ranges refer, 'strand' to indicate the strand to which the range belongs, etc.). So the pattern here avoids the 'is a' relationship that simple class extension would imply.
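The 'has a' pattern itself can be sketched in a few lines of base R. The classes MiniRanges and StrandedRanges and the accessor rangeStart() are hypothetical stand-ins, not the GenomicRanges definitions; the point is only that the outer class holds the inner one in a slot and delegates to it.

```r
library(methods)

# Hypothetical inner class holding just the ranges.
setClass("MiniRanges", representation(start = "integer", width = "integer"))

# Hypothetical outer class: 'has a' MiniRanges plus further required slots.
setClass("StrandedRanges",
    representation(ranges = "MiniRanges", seqnames = "character",
                   strand = "character"),
    validity = function(object) {
        n <- length(object@ranges@start)
        if (length(object@seqnames) != n || length(object@strand) != n)
            "'seqnames' and 'strand' need one element per range"
        else if (!all(object@strand %in% c("+", "-", "*")))
            "'strand' values must be '+', '-', or '*'"
        else TRUE
    })

setGeneric("rangeStart", function(object) standardGeneric("rangeStart"))
setMethod("rangeStart", "MiniRanges", function(object) object@start)
# The outer class delegates to the object it contains.
setMethod("rangeStart", "StrandedRanges",
    function(object) rangeStart(object@ranges))

gr <- new("StrandedRanges",
    ranges = new("MiniRanges", start = c(1L, 50L), width = c(10L, 5L)),
    seqnames = c("chr1", "chr2"), strand = c("+", "-"))
rangeStart(gr)
is(gr, "MiniRanges")   # FALSE: composition, not inheritance
```

Since StrandedRanges is not a MiniRanges, it inherits none of the inner class's methods by accident; each one it supports is an explicit choice.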


The IRanges package is at

  http://bioconductor.org/packages/devel/bioc/html/IRanges.html

I've described the 'devel' version of Bioconductor

  http://bioconductor.org/developers/useDevel/

Martin


On 08/31/2012 08:39 AM, Bert Gunter wrote:

To add to what David said ...

Of course, there are already S3 "getter" and "setter" methods for data
frames ("[.data.frame" and "[<-.data.frame")*. These could clearly be
extended -- i.e. the data.frame class could be extended and appropriate S3
methods written. Whether you use S3 or S4 depends on the degree of control,
type checking, reuse etc. you want/need. David's suggestion to look at
Bioconductor is a good one.

Cheers,
Bert
*If you are unfamiliar with the S3 extract methods, consult the R Language
Definition Manual.
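
A small sketch of the S3 route described above, for comparison. The class name "records" and the helpers validate_records() and as_records() are invented for illustration; the setter simply hands the work to "[<-.data.frame" via NextMethod() and then re-checks the result.

```r
# Hypothetical S3 class "records": a data.frame that must keep an integer
# column 'x' and a character column 'y'.
validate_records <- function(df) {
    if (!all(c("x", "y") %in% names(df)))
        stop("columns 'x' and 'y' are required")
    if (!is.integer(df[["x"]]))
        stop("column 'x' must be integer")
    if (!is.character(df[["y"]]))
        stop("column 'y' must be character")
    df
}

as_records <- function(df) {
    stopifnot(is.data.frame(df))
    class(df) <- c("records", "data.frame")
    validate_records(df)
}

# Setter: let "[<-.data.frame" do the assignment, then validate the outcome.
`[<-.records` <- function(x, i, j, value) {
    out <- NextMethod()
    validate_records(out)
}

df <- as_records(data.frame(x = 1:3, y = c("a", "b", "c"),
                            stringsAsFactors = FALSE))
df[1, "x"] <- 10L      # accepted: 'x' stays integer
# df[1, "x"] <- 2.5    # would signal an error: 'x' would become double
```

This guards only the "[<-" path; other inherited setters such as "$<-" and "[[<-" would need the same treatment, which is part of the cost of the 'is a' relationship.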

On Fri, Aug 31, 2012 at 8:14 AM, David Winsemius <dwinsem...@comcast.net> wrote:


On Aug 31, 2012, at 5:57 AM, Ramiro Barrantes wrote:

Hello,

I have again a "good practices"/programming theory question regarding
data.frames.


One of the fundamental objects that I use is the data frame with a
particular set of columns that I would fill or get information from, and an
entire system would revolve around getting information from or putting
information to such data.frame.


In a different OOP language I would be tempted to create a
class that would "wrap around" that data.frame and create "getter" and
"setter" methods that would return whatever information I need. I started
doing that using S4.


Does anyone have examples of packages that use that approach or any
suggestions?  It just seems to me that a class/object would be a better
idea because it would create a single, hopefully well validated way to
access information and edit the fundamental data.frame object, which would
be helpful if there are several programmers on the team and/or if some of
the data.frame manipulations are not straightforward and are best left
encapsulated in a method of a class, and then have people use that method.
I would just like to know if there are reasons not to do it that way and if
there are any examples of packages that use that approach and that I can
learn from.

You could argue that the entire Bioconductor project represents such an
effort. It makes extensive use of S4 methods. I'm not a user so cannot
readily point to examples of S4 functions that have set. and get. methods
for particular sorts of dataframes, but I suspect you can pose the same
question on the BioC mailing list and get a more informed answer.

--
David Winsemius, MD
Alameda, CA, USA

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

