Beating a dead horse...

I am an R beginner trying to understand this factor business.  While the
entire business of finding the median of a factor may be silly from a
practical point of view, this email chain has helped me understand
something.

I have looked at the median function, and it tests whether what is passed
to it is numeric.  If I were building a function and tested for mode
numeric, and the test told me the object was numeric, then like the median
function I would naively assume that I could do arithmetic on it:
> saywhut<-as.factor(c(NA,"1","1","1","1","2","10"))
> mode(saywhut)
[1] "numeric"

It appears to me that when the median function tests for numeric, the test
doesn't have the desired result with an object of class factor (and maybe
other classes?), as was shown by the example.

I have a suspicion that something of class factor has at least two pieces,
one of which is the levels which can possibly be character or something else
and the other piece is the ordering of the levels which is of storage.mode
integer.  Is it this ordering that determines the mode of the factor??  
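A minimal sketch for inspecting the two pieces, using the same example vector. It relies only on base functions (class, mode, typeof, levels, as.integer, is.numeric); the commented results are what a current R session is assumed to give, and older versions may differ.

```r
# Same example vector as above.
saywhut <- as.factor(c(NA, "1", "1", "1", "1", "2", "10"))

class(saywhut)       # "factor"
mode(saywhut)        # "numeric"  (reflects the internal storage)
typeof(saywhut)      # "integer"

# Piece 1: the levels, stored as character labels and sorted as strings.
levels(saywhut)      # "1" "10" "2"

# Piece 2: integer codes that index into the levels (not the original values).
as.integer(saywhut)  # NA 1 1 1 1 3 2

# is.numeric() looks at more than the mode and rejects factors.
is.numeric(saywhut)  # FALSE
```

So the integer piece is a set of codes pointing into the levels, and it is these codes that make mode() report "numeric".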

But if the mode of a factor is truly numeric, why doesn't the median function
use the numeric piece for finding the median (like it did with odd n - not
that anyone would ever really want the median of a factor:)?  I think that
Simon Fear hit on the right idea: the definition of median that is used for
an even number of observations takes the sum of the ordered middle two
observations.  It is the sum (called by the median function) that chokes
on a factor.
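The odd/even difference can be sketched with a simplified stand-in for median. The function naive_median below is hypothetical and is not the actual source of median in any R version; it only assumes that the even case goes through sum, as Simon Fear guessed.

```r
# Hypothetical, simplified median: NOT the real implementation.
naive_median <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  n <- length(x)
  half <- (n + 1) %/% 2
  s <- sort(x)                  # sorting works on a factor
  if (n %% 2 == 1) {
    s[half]                     # odd n: plain indexing, no arithmetic
  } else {
    sum(s[half + 0:1]) / 2      # even n: sum() is called and rejects factors
  }
}

saywhut <- as.factor(c(NA, "1", "1", "1", "1", "2", "10"))

# Odd number of observations: an element of the factor comes back.
odd_result <- naive_median(factor(c("a", "b", "c")))

# Even number of observations (6 after dropping the NA): sum() raises an error.
even_result <- tryCatch(naive_median(saywhut, na.rm = TRUE),
                        error = function(e) e)
inherits(even_result, "error")  # TRUE
```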

> sum(saywhut,na.rm=T)
Error in Summary.factor(..., na.rm = na.rm) : 
        "sum" not meaningful for factors

It appears that whoever built the sum function built in a test for factor
(Simon Fear's first suggestion for median).


On the other hand:
> sd(saywhut,na.rm=T)
[1] 3.614784
(Simon Fear's second suggestion for median)

By the way, mean treats a factor in a different way:
> mean(saywhut)
[1] NA
Warning message: 
argument is not numeric or logical: returning NA in: mean.default(saywhut).


There is an R-FAQ that tells one how to convert a factor to 'numeric' but if
I had tested for something being numeric to begin with I never would have
guessed that I needed to convert it to numeric.  I think what this
conversion is really doing is getting rid of the machinery associated with
the class factor:
> #from the R-FAQ
> test<-as.numeric(as.character(saywhut))
> mode(test)
[1] "numeric"
> median(test,na.rm=T)
[1] 1
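A short sketch of why the detour through as.character matters, assuming the same example vector: as.numeric applied directly to a factor returns the internal codes, not the values printed as labels. The second conversion shown, via the levels, is the variant the R FAQ also describes.

```r
saywhut <- as.factor(c(NA, "1", "1", "1", "1", "2", "10"))

# Direct conversion returns the integer codes, not the labels.
codes <- as.numeric(saywhut)                     # NA 1 1 1 1 3 2

# Going through the labels recovers the intended numbers.
via_char <- as.numeric(as.character(saywhut))    # NA 1 1 1 1 2 10

# Equivalent: convert the (few) levels once, then index by the codes.
via_levels <- as.numeric(levels(saywhut))[as.integer(saywhut)]

identical(via_char, via_levels)                  # TRUE
median(via_char, na.rm = TRUE)                   # 1
```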

And by the way:
> not.a.factor<-c(NA,"1","1","2","10")
> mode(not.a.factor)
[1] "character"
> median(not.a.factor,na.rm=T)
Error in median(not.a.factor, na.rm = T) : 
        need numeric data


<Simon Fear: It seems to me the best way to deal with this "bug" would
be to make calling median with a factor argument be an immediate error.>
Do you think that all base functions (sum, sd, mean, median, ...) should deal
with this in a consistent way?  (This might be much more work.)  Another
thing that would make things consistent would be to take the stop-work
behavior out of sum:)
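One way to picture the "immediate error" suggestion is a thin wrapper around median. The name strict_median and its error message are hypothetical, and newer versions of R may already refuse factors inside median itself, so this is only a sketch of the idea.

```r
# Hypothetical wrapper: fail fast on factors instead of depending on
# whether the number of observations is odd or even.
strict_median <- function(x, na.rm = FALSE) {
  if (is.factor(x)) {
    stop("median is not meaningful for factors")
  }
  median(x, na.rm = na.rm)
}

saywhut <- as.factor(c(NA, "1", "1", "1", "1", "2", "10"))

# The factor is rejected whatever its length.
res <- tryCatch(strict_median(saywhut, na.rm = TRUE),
                error = function(e) conditionMessage(e))

# Plain numeric input is passed through to median() unchanged.
strict_median(c(1, 2, 3, 4))  # 2.5
```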

I don't think there is any real problem in the current behavior of factor as
long as the interaction between functions and classes produces this
stop-work behavior - preferably with a warning - and not unexpected side
effects. I am curious if there are other classes of mode numeric which
median-mean-sum-sd-etc might choke on.
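As one example of another class with mode numeric, dates of class "Date" can be probed the same way. This is a sketch that assumes current base R behavior, where sum is not defined for Date objects; no exact error text is shown because it may vary between versions.

```r
d <- as.Date(c("2003-10-30", "2003-10-31"))

class(d)       # "Date"
mode(d)        # "numeric"  (stored as a count of days)
is.numeric(d)  # FALSE, as for a factor

# sum() on dates stops with an error rather than returning a number.
res <- tryCatch(sum(d), error = function(e) e)
inherits(res, "error")  # TRUE
```

Testing with is.numeric() rather than mode() therefore seems a safer guard before doing arithmetic.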

<tongue-in-cheek on>
Of course, R would produce a median for factors by using the "correct"
definition of a median of samples, i.e., one that agrees with the definition
of median on a CDF, even though this concept gives most people apoplexy.
<off>
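For what the CDF-based definition would look like, here is a sketch: the median is taken as the smallest observed value at which the empirical CDF reaches 0.5, which needs only an ordering and no arithmetic on the values. The function cdf_median is hypothetical, and for a factor the ordering used is simply the order of the levels.

```r
# Hypothetical helper: smallest value x with ECDF(x) >= 0.5.
# With n sorted observations this is the element at position ceiling(n / 2).
cdf_median <- function(x) {
  x <- x[!is.na(x)]
  s <- sort(x)
  s[ceiling(length(s) / 2)]
}

saywhut <- as.factor(c(NA, "1", "1", "1", "1", "2", "10"))

# Works on the factor because only sorting and indexing are involved.
as.character(cdf_median(saywhut))  # "1"

# On numeric data with even n it returns the lower of the two middle values.
cdf_median(c(1, 2, 3, 4))          # 2
```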
Thanks
Bob
Usual disclaimers....


-----Original Message-----
From: Simon Fear [mailto:[EMAIL PROTECTED] 
Sent: Friday, October 31, 2003 6:18 AM
To: Christoph Bier
Cc: [EMAIL PROTECTED]
Subject: RE: [R] Weird problem with median on a factor

Final guess as to observed behaviour: in the first case after
removal of NAs there were an odd number of observations
(so that sum was not called within the code for median).
In your second call I suspect that even though you got
an integer answer, it was found as sum(2,2)/2.

It seems to me the best way to deal with this "bug" would
be to make calling median with a factor argument be an 
immediate error. Or just trust users never to attempt such
a thing ...  
 
Simon Fear 
Senior Statistician 
Syne qua non Ltd 
Tel: +44 (0) 1379 644449 
Fax: +44 (0) 1379 644445 
email: [EMAIL PROTECTED] 
web: http://www.synequanon.com 
  

______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
