I guess there are two issues with data.frame. It comes with more than you probably want to support (e.g., both list-like and matrix-like subsetting with [, and the user's expectation of being able to modify any column independently). And it comes with less than you'd like (e.g., no support for a 'column' of S4 objects). By making a class that contains ('is a') data.frame, you commit to both limitations.

You're probably using data.frame as a way to implement some basic restrictions -- equal-length columns, for instance. But there are additional restrictions, too: columns x, y, z must be present and of type integer, character, and numeric, respectively. For this scenario one is better off implementing an S4 class (which provides type checking and required columns), a validity method (for enforcing the equal-length constraint), accessors, and subsetting that follows the semantics you'd like to support, e.g., just along the length of the required slots.
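A minimal sketch of that approach, using only the 'methods' package. The class name 'Records', the constructor, and the z() accessor are hypothetical names chosen for illustration; only the column names and types come from the description above.

```r
library(methods)

## Hypothetical class: typed slots give type checking and required 'columns'.
setClass("Records",
    representation(x = "integer", y = "character", z = "numeric"),
    validity = function(object) {
        ## Enforce the equal-length constraint across the required slots.
        n <- c(length(object@x), length(object@y), length(object@z))
        if (length(unique(n)) != 1L)
            "slots 'x', 'y' and 'z' must all have the same length"
        else
            TRUE
    })

## Constructor; new() runs the validity method when slots are supplied.
Records <- function(x = integer(), y = character(), z = numeric())
    new("Records", x = x, y = y, z = z)

## One getter and one setter, shown for slot 'z' only.
setGeneric("z", function(object) standardGeneric("z"))
setMethod("z", "Records", function(object) object@z)

setGeneric("z<-", function(object, value) standardGeneric("z<-"))
setReplaceMethod("z", "Records", function(object, value) {
    object@z <- value
    validObject(object)   # re-check the invariants after modification
    object
})

## Subsetting is one-dimensional, along the length of the required slots.
setMethod("length", "Records", function(x) length(x@x))
setMethod("[", "Records", function(x, i, j, ..., drop = TRUE)
    new("Records", x = x@x[i], y = x@y[i], z = x@z[i]))

r <- Records(x = 1:3, y = c("a", "b", "c"), z = c(0.5, 1.5, 2.5))
r2 <- r[2:3]
z(r2) <- c(10, 20)

## Unequal lengths are rejected by the validity method.
bad <- tryCatch(Records(x = 1:3, y = "a", z = 1),
                error = function(e) conditionMessage(e))
```

Because the setter calls validObject(), a user cannot change one 'column' in a way that breaks the class invariants, which is exactly what a plain data.frame would allow.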

The richest place for this in Bioconductor is the IRanges package, though it can be a bit daunting from an architecture point of view. A couple of things to point to. One is the DataFrame class, which is like a data.frame but supporting a broader (in particular S4) set of columns and allowing 'metadata' (actually, DataFrame, so recursive) on each column. It is relevant if it is important to maintain S4 classes in a data.frame-like structure.

Another is the IRanges class, which in some ways fits your overall use case. It is basically a rectangular data structure, but with required 'columns' (the start and width of the range) and then arbitrary columns the user can add. It's implemented with slots for start and width, plus a 'has a' slot containing a DataFrame as 'metadata columns' (the actual implementation is more complicated than this). There are start and width accessors. Sub-setting is always list-like (single-dimensional, along the ranges). Users wanting to access one of 'their' columns use $ or extract the metadata columns (via mcols()) as a DataFrame and then work on that. Maybe it's worth pointing out that the basic definitions are column-oriented: an IRanges instance contains start and width vectors; there is no 'IRange' class.
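As a rough illustration of the user-facing side, here is a short sketch. It assumes the IRanges package is installed, and is skipped otherwise; the 'score' column is a made-up example. In current Bioconductor releases the DataFrame class is defined in S4Vectors, which is attached along with IRanges.

```r
if (requireNamespace("IRanges", quietly = TRUE)) {
    suppressPackageStartupMessages(library(IRanges))

    ## Required 'columns': start and width.
    ir <- IRanges(start = c(1L, 10L, 20L), width = c(5L, 3L, 8L))

    ## A user-supplied metadata column ('score' is hypothetical).
    mcols(ir) <- DataFrame(score = c(0.1, 0.5, 0.9))

    ## Accessors for the required columns.
    starts <- start(ir)
    widths <- width(ir)

    ## Sub-setting is single-dimensional, along the ranges; the
    ## metadata columns are carried along.
    sub <- ir[2:3]
    sub_scores <- mcols(sub)$score
}
```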

The GRanges class (in the GenomicRanges package) 'has a' IRanges, but adds additional required slots ('seqnames' to reference the names of the chromosome sequences to which the ranges refer, 'strand' to indicate the strand to which the range belongs, etc.). So the pattern here avoids the 'is a' relationship that simple class extension would imply.
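The 'has a' pattern can be sketched with toy classes in plain S4. The names 'Spans', 'LocatedSpans', and spanStart() are hypothetical stand-ins, not the actual IRanges or GenomicRanges definitions.

```r
library(methods)

## Toy stand-in for a ranges class (hypothetical).
setClass("Spans", representation(start = "integer", width = "integer"))

## 'Has a' Spans in a slot, rather than contains = "Spans" ('is a').
setClass("LocatedSpans",
    representation(seqnames = "character", spans = "Spans"),
    validity = function(object) {
        if (length(object@seqnames) != length(object@spans@start))
            "'seqnames' must have one element per span"
        else
            TRUE
    })

## The outer class exposes only what it chooses to, by delegation.
setGeneric("spanStart", function(x) standardGeneric("spanStart"))
setMethod("spanStart", "Spans", function(x) x@start)
setMethod("spanStart", "LocatedSpans", function(x) spanStart(x@spans))

loc <- new("LocatedSpans",
           seqnames = c("chr1", "chr2"),
           spans = new("Spans", start = c(1L, 10L), width = c(5L, 3L)))

is_a_spans <- is(loc, "Spans")   # FALSE: composition, not inheritance
loc_starts <- spanStart(loc)     # delegated to the contained object
```

Since 'LocatedSpans' does not extend 'Spans', it inherits none of its methods automatically, so the class author decides which operations are supported.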

The IRanges package is at

  http://bioconductor.org/packages/devel/bioc/html/IRanges.html

I've described the 'devel' version of Bioconductor

  http://bioconductor.org/developers/useDevel/

Martin


On 08/31/2012 08:39 AM, Bert Gunter wrote:
To add to what David said ...

Of course, there are already S3 "getters" and "setters" methods for data
frames ("[.data.frame" and "[<-.data.frame" )*. These could clearly be
extended -- i.e. the data.frame class could be extended and appropriate S3
methods written. Whether you use S3 or S4 depends on the degree of control,
type checking, reuse etc. you want/need. David's suggestion to look at
Bioconductor is a good one.

Cheers,
Bert
*If you are unfamiliar with the S3 extract methods, consult the R Language
Definition Manual.
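For comparison, a rough sketch of the S3 route: a data.frame subclass with its own extract method. The class name 'records', the constructor new_records(), and the required columns are assumptions for illustration only.

```r
## Hypothetical required columns for the S3 class 'records'.
required_cols <- c("x", "y", "z")

## Constructor: check the required columns, then add the class.
new_records <- function(df) {
    stopifnot(is.data.frame(df),
              all(required_cols %in% names(df)),
              is.integer(df$x), is.character(df$y), is.numeric(df$z))
    class(df) <- c("records", class(df))
    df
}

## Extract method: defer to "[.data.frame", then drop the 'records'
## class if the result no longer has all required columns.
`[.records` <- function(x, ...) {
    out <- NextMethod()
    if (is.data.frame(out) && !all(required_cols %in% names(out)))
        class(out) <- setdiff(class(out), "records")
    out
}

r <- new_records(data.frame(x = 1:3, y = c("a", "b", "c"),
                            z = c(0.5, 1.5, 2.5),
                            stringsAsFactors = FALSE))
rows <- r[2:3, ]            # still a 'records' object
cols <- r[, c("x", "y")]    # plain data.frame: column 'z' is gone
```

This gives less control than S4: nothing stops a user from modifying a column directly with $<- unless that method is overridden as well.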

On Fri, Aug 31, 2012 at 8:14 AM, David Winsemius <dwinsem...@comcast.net> wrote:


On Aug 31, 2012, at 5:57 AM, Ramiro Barrantes wrote:

Hello,

I have again a "good practices"/programming theory question regarding
data.frames.

One of the fundamental objects that I use is the data frame with a
particular set of columns that I would fill or get information from, and an
entire system would revolve around getting information from or putting
information to such data.frame.

On a different OOP programming language I would be tempted to create a
class that would "wrap-around" that data.frame and create "getters" and
"setters" methods that would return whatever information I need. I started
doing that using S4.

Does anyone have examples of packages that use that approach or any
suggestions?  It just seems to me that a class/object would be a better
idea because it would create a single, hopefully well validated way to
access information and edit the fundamental data.frame object, which would
be helpful if there are several programmers on the team and/or if some of
the data.frame manipulations are not straightforward and are best left
encapsulated in a method of a class, and then have people use that method.
I would just like to know if there are reasons not to do it that way and if
there are any examples of packages that use that approach and that I can
learn from.

You could argue that the entire Bioconductor project represents such an
effort. It makes extensive use of S4 methods. I'm not a user so cannot
readily point to examples of S4 functions that have set. and get. methods
for particular sorts of dataframes, but I suspect you can pose the same
question on the BioC mailing list and get a more informed answer.

--
David Winsemius, MD
Alameda, CA, USA

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.






--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
