On 13-02-13 7:25 AM, peter dalgaard wrote:

On Feb 12, 2013, at 20:19 , Duncan Murdoch wrote:

I think you are misreading what Peter wrote.  He wasn't defending
that point of view, he was describing it.


Yes. However, that being said, there is the point that the whole
thing has been designed to work within the paradigm that I described,
and, for better or worse, things are reasonably coherent and
consistent within that framework.

The thing that always worries me, when people get bothered by some
aspect of software design, is that, if you change only that aspect,
you may find yourself with something that is incoherent and
inconsistent. I have quite a few times found myself realizing that
"Uncle John was right after all".

For instance, if you change the paradigm to say that "character
variables are character, unless explicitly turned into factors", and
then ameliorate the inconvenience by changing code that relies on
factors to convert character variables on the fly, then you will lose
the otherwise automatic consistency of level sets between subsets of
data. (So, the math department not only has zero female professors,
the entire female gender ceases to exist for that subgroup.)
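A minimal sketch of that loss of level-set consistency, using made-up data (the `staff` data frame and its columns are assumptions for illustration only):

```r
# Hypothetical data: two departments, one of which has no women.
staff <- data.frame(
  dept = c("math", "math", "stats", "stats"),
  sex  = factor(c("M", "M", "F", "M"), levels = c("F", "M"))
)

# Subsetting a factor column keeps the full level set,
# so the math department still "knows" that F exists.
math <- staff[staff$dept == "math", ]
levels(math$sex)   # both levels survive the subset
table(math$sex)    # F is reported with a count of zero

# Converting on the fly from character only sees the values present,
# so F disappears from the subgroup entirely.
math_chr <- as.character(math$sex)
levels(factor(math_chr))
```

With the factor column the zero count for F is still reported; with the character column there is nothing left to report it against.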


Sure, if I have a file that contains a column named Sex and it is all M,
I can't expect R to automatically know that there is another
possibility.  That's always been a problem.  If we automatically convert
the data to factors when we read, then maybe we'll be lucky and some
other part of that file that we're planning to throw away will contain
an F, and we'll automatically construct the right factor.
(Except we don't:  lm and glm will throw away the F level if there are
none in the subset we pass to them, factor or not, because they use
drop.unused.levels=TRUE in their call to model.frame().)
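The effect of drop.unused.levels can be seen by calling model.frame() directly. This is only a sketch with invented data; it calls model.frame() itself rather than lm(), since a one-level factor would not make a useful model term anyway:

```r
# Hypothetical subset in which every observation is "M",
# although the factor was declared with both levels.
d <- data.frame(
  y   = c(1.2, 2.3, 3.1, 4.8),
  x   = c(1, 2, 3, 4),
  sex = factor(c("M", "M", "M", "M"), levels = c("F", "M"))
)

# Default behaviour of model.frame(): unused levels are kept.
mf_keep <- model.frame(y ~ x + sex, data = d)
levels(mf_keep$sex)

# What lm() and glm() ask for: unused levels are dropped.
mf_drop <- model.frame(y ~ x + sex, data = d, drop.unused.levels = TRUE)
levels(mf_drop$sex)
```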

There's also the possibility that there will be m and f in there, and
we'll get it wrong.

In R 2.15.2, we do the automatic conversion with a warning, but we do it
wrong, which leads to the inconsistency that Bill Dunlap reported.
R-devel drops the warning and comes closer to getting it right, but it's
really an impossible problem:  if we never see an F, we'll never set the
levels of the factor properly.  If we see a typo like m or f and don't
realize it's a typo, we'll have more than two Sex values.
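To illustrate the typo case, here is a small sketch with made-up values. The cleanup step (upper-casing and declaring the levels explicitly) is just one possible approach and assumes you already know what the valid codes are:

```r
# Hypothetical column as read from a file, with inconsistent case.
sex <- c("M", "F", "m", "F", "M", "f")

# Automatic conversion treats every distinct string as its own level.
nlevels(factor(sex))   # four levels instead of two

# One way to guard against this, assuming the valid codes are known:
# normalise the strings and state the level set explicitly.
cleaned <- factor(toupper(sex), levels = c("F", "M"))
table(cleaned)
anyNA(cleaned)         # anything outside the declared levels would become NA
```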

The current R-devel implementation delays the conversion as much as it
can, and maybe it delays it too far.  It allows model.frame() to
continue to return character columns, as it does in 2.15.2.  This was to
support xtabs(), which treats character columns differently from
factors, and other unforeseen uses.  Another possibility would be to add
an argument ("stringsAsFactors"?) to model.frame() to let modelling
functions choose whether they want factors or not.  xtabs() would say
no, lm() and glm() would say yes.  I think the current implementation is
preferable because it won't require changes to well-written existing
functions.
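As a rough sketch of why a caller might care whether it receives characters or factors, here is one difference that shows up in xtabs() in current versions of R. The data are invented, and this may not be the only difference relevant to the design above:

```r
# Same hypothetical all-"M" data, once as a factor and once as character.
d_fac <- data.frame(sex = factor(c("M", "M", "M"), levels = c("F", "M")))
d_chr <- data.frame(sex = c("M", "M", "M"), stringsAsFactors = FALSE)

# With a factor, xtabs() tabulates over the declared levels,
# so the empty F cell appears with a zero count.
t_fac <- xtabs(~ sex, data = d_fac)
dimnames(t_fac)$sex

# With a character column, only the values actually present are tabulated.
t_chr <- xtabs(~ sex, data = d_chr)
dimnames(t_chr)$sex
```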

With the current R-devel implementation, it is easier than in 2.15.2 to
get errors thrown when the auto-conversion goes wrong.  I don't know of
any examples where you get incorrect results.  I think this is an
improvement.

I'd appreciate hearing of any bugs in it.

Duncan Murdoch

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
