Beating a dead horse... I am an R beginner trying to understand this factor business. While finding the median of a factor may be silly from a practical point of view, this email chain has helped me understand something.
I have looked at the median function, and it tests whether what is passed to it is numeric. If I were building a function and tested for mode numeric, and something told me the object was numeric, then like the median function I would naively assume that I could do arithmetic on it:

> saywhut<-as.factor(c(NA,"1","1","1","1","2","10"))
> mode(saywhut)
[1] "numeric"

It appears to me that when the median function tests for numeric, the test doesn't have the desired result with an object of class factor (and maybe other classes?), as was shown by the example. I have a suspicion that something of class factor has at least two pieces: one is the levels, which can possibly be character or something else, and the other is the ordering of the levels, which is of storage.mode integer. Is it this ordering that determines the mode of the factor?? But if the mode of a factor is truly numeric, why doesn't the median function use the numeric piece for finding the median (like it did with odd n - not that anyone would ever really want the median of a factor :)??

I think that Simon Fear hit on the right idea, because the definition of median that is used for an even number of observations takes the sum of the ordered middle two observations. It is the sum (called by the median function) that chokes on a factor.

> sum(saywhut,na.rm=T)
Error in Summary.factor(..., na.rm = na.rm) : "sum" not meaningful for factors

It appears that whoever built the sum function built in a test for factor (Simon Fear's first suggestion for median). On the other hand:

> sd(saywhut,na.rm=T)
[1] 3.614784

(Simon Fear's second suggestion for median)

By the way, mean treats a factor in a different way:

> mean(saywhut)
[1] NA
Warning message:
argument is not numeric or logical: returning NA in: mean.default(saywhut)

There is an R-FAQ that tells one how to convert a factor to 'numeric', but if I had tested for something being numeric to begin with, I never would have guessed that I needed to convert it to numeric.
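The two-piece structure suspected above can be inspected directly. This is a minimal sketch, assuming a reasonably current R release; the names lv, codes and values are made up here for illustration and are not part of any R API.

```r
# Same data as the saywhut example in this thread.
saywhut <- as.factor(c(NA, "1", "1", "1", "1", "2", "10"))

# Piece 1: the levels, a character vector sorted as strings
# (so "10" sorts before "2" in typical locales).
lv <- levels(saywhut)

# Piece 2: integer codes, one per observation, indexing into the levels.
codes <- as.integer(saywhut)

# mode() reports on the underlying integer codes, which is why it says
# "numeric"; is.numeric() is documented to return FALSE for factors.
mode(saywhut)        # "numeric"
typeof(saywhut)      # "integer"
is.numeric(saywhut)  # FALSE

# The labels are recovered by indexing the levels with the codes.
identical(lv[codes], as.character(saywhut))  # TRUE

# Converting via the labels, not the codes, gives the intended numbers.
values <- as.numeric(as.character(saywhut))
median(values, na.rm = TRUE)  # 1
```

If this reading is right, a function that wants to do arithmetic would be safer testing is.numeric(x) than mode(x) == "numeric", since only the former rejects a factor.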
I think what this conversion is really doing is getting rid of the machinery associated with the class factor:

> #from the R-FAQ
> test<-as.numeric(as.character(saywhut))
> mode(test)
[1] "numeric"
> median(test,na.rm=T)
[1] 1

and by the way:

> not.a.factor<-c(NA,"1","1","2","10")
> mode(not.a.factor)
[1] "character"
> median(not.a.factor,na.rm=T)
Error in median(not.a.factor, na.rm = T) : need numeric data

<Simon Fear: It seems to me the best way to deal with this "bug" would be to make calling median with a factor argument be an immediate error.>

Do you think that all base functions (sum, sd, mean, median, ...) should deal with this in a consistent way? (This might be much more work.) Another thing that would make things consistent would be to take the stop-work behavior out of sum :)

I don't think there is any real problem in the current behavior of factor as long as the interaction between functions and classes produces this stop-work behavior, preferably with a warning, and not unexpected side effects. I am curious whether there are other classes of mode numeric which median, mean, sum, sd, etc. might choke on.

<tongue-in-cheek on> Of course, R would produce a median for factors by using the "correct" definition of a median of samples, i.e., one that agrees with the definition of median on a CDF, even though this concept gives most people apoplexy. <off>

Thanks
Bob
Usual disclaimers....

-----Original Message-----
From: Simon Fear [mailto:[EMAIL PROTECTED]
Sent: Friday, October 31, 2003 6:18 AM
To: Christoph Bier
Cc: [EMAIL PROTECTED]
Subject: RE: [R] Weird problem with median on a factor

Final guess as to observed behaviour: in the first case, after removal of NAs there was an odd number of observations (so that sum was not called within the code for median). In your second call I suspect that even though you got an integer answer, it was found as sum(2,2)/2.
It seems to me the best way to deal with this "bug" would be to make calling median with a factor argument be an immediate error. Or just trust users never to attempt such a thing ...

Simon Fear
Senior Statistician
Syne qua non Ltd
Tel: +44 (0) 1379 644449
Fax: +44 (0) 1379 644445
email: [EMAIL PROTECTED]
web: http://www.synequanon.com

______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help