On Oct 16, 2009, at 11:33 AM, Alexander Peterhansl wrote:

Thank you Mohamed and Bill for your replies.  (I did not send the data
because it is unwieldy.)

Yes Bill, the issue arises directly from what you had guessed.  I was
working with a subset of the data (which implicitly had factors for the
complete data set).

On this, what is the best way to take a subset of the data which ignores
these "extraneous" factor levels?

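log <- data.frame(Flag=1:2,
                  RequestID=factor(letters[1:2], levels=letters[1:10]))
log2 <- subset(log, RequestID=="a")

# subset() keeps all ten levels, including the nine unused ones
levels(log2$RequestID)
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

# re-applying factor() keeps only the levels actually present
log2$RequestID <- factor(log2$RequestID)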

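You might expect log2 <- subset(log, RequestID=="a", drop=TRUE) to do that
task, but it doesn't: subset() passes drop on to the `[` indexing operator
of the data frame, which does not drop unused factor levels.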
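
The options for discarding unused levels can be sketched as below. This is
only an illustration, not part of the original replies: the data frame is
named logdf (a hypothetical name, chosen to avoid masking base::log), and the
third option assumes an R version that provides droplevels() (R 2.12.0 or
later).

```r
# Rebuild the example data from the thread under a hypothetical name.
logdf <- data.frame(Flag = 1:2,
                    RequestID = factor(letters[1:2], levels = letters[1:10]))

# Option 1: subset, then re-apply factor() to the column.
sub1 <- subset(logdf, RequestID == "a")
sub1$RequestID <- factor(sub1$RequestID)
levels(sub1$RequestID)

# Option 2: index the factor itself with drop = TRUE.
f <- logdf$RequestID[logdf$RequestID == "a", drop = TRUE]
levels(f)

# Option 3 (assumes R >= 2.12.0): droplevels() treats every factor column.
sub3 <- droplevels(subset(logdf, RequestID == "a"))
levels(sub3$RequestID)
```

Option 1 has to be repeated for each factor column, whereas option 3 handles
all factor columns of the data frame in one call.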

--
DW

In other words, how do I take a subset which yields "a" as the only
level for log2?

Alex

-----Original Message-----
From: William Dunlap [mailto:wdun...@tibco.com]
Sent: Thursday, October 15, 2009 11:59 PM
To: Alexander Peterhansl; r-help@r-project.org
Subject: RE: [R] tapply() and using factor() on a factor

-----Original Message-----
From: r-help-boun...@r-project.org
[mailto:r-help-boun...@r-project.org] On Behalf Of Alexander
Peterhansl
Sent: Thursday, October 15, 2009 2:50 PM
To: r-help@r-project.org
Subject: [R] tapply() and using factor() on a factor

Dear List,
Shouldn't result1 and result2 be equal in the following case?

Note that log$RequestID is a factor.  That is,
is.factor(log$RequestID)
yields TRUE.

result1 <- tapply(log$Flag,factor(log$RequestID),sum)

result2 <- tapply(log$Flag,log$RequestID,sum)

Showing us the output of dput(log) (or str(log) and summary(log))
would let people discover the problem more readily.  Since you
didn't, I'll guess what the dataset may contain.

If log$RequestID is a factor with lots of unused levels, tapply
will output an NA for each unused level.  factor(log$RequestID)
will create a new set of levels, only those actually used,
so tapply will not be forced to fill those spots with NA's.  E.g.,

log<-data.frame(Flag=1:2, RequestID=factor(letters[1:2],
levels=letters[1:10]))
tapply(log$Flag, log$RequestID, sum)
a  b  c  d  e  f  g  h  i  j
1  2 NA NA NA NA NA NA NA NA
tapply(log$Flag, factor(log$RequestID), sum)
a b
1 2

I suppose tapply(X,INDEX,FUN) could call FUN(X[0]) to see
how to fill the cells with no data behind them, but it doesn't.
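
The idea of asking FUN what it returns for no data can be tried by hand. The
following is a rough sketch only, not something tapply does itself; logdf and
empty_value are hypothetical names introduced for the illustration.

```r
# Example data from the thread under a hypothetical name (avoids masking base::log).
logdf <- data.frame(Flag = 1:2,
                    RequestID = factor(letters[1:2], levels = letters[1:10]))

# What sum() returns for a zero-length input, i.e. FUN(X[0]).
empty_value <- sum(logdf$Flag[0])

# tapply() leaves NA in the cells for unused levels; fill them by hand.
res <- tapply(logdf$Flag, logdf$RequestID, sum)
res[is.na(res)] <- empty_value
res
```

Note that this also overwrites any NA that FUN genuinely returned for a
non-empty cell, so it is only safe when the data contain no missing values.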

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com




Yet, when I summarize the output, I get the following:

summary(result1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  11.00   11.00   11.00   26.06   11.00  101.00

summary(result2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  11.00   11.00   11.00   26.06   11.00  101.00  978.00



Why does result2 have 978 NA's?



Any help on this would be appreciated.




David Winsemius, MD
Heritage Laboratories
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
