[R] (Newbie) Aggregate for NA values

2006-02-24 Thread Vivek Satsangi
Folks,

Sorry if this question has been answered before or is obvious (or
worse, statistically bad). I don't understand what was said in one
of the search results that seems somewhat related.

I use aggregate to get a quick summary of the data. Part of what I am
looking for in the summary is how much influence the NAs might have
had if they were included, and whether excluding them from the means
is causing some sort of bias. So I want the summary statistic for the
NAs as well.

Here is a simple example session (edited to remove the typos I made,
comments added later):

> tmp_a <- 1:10
> tmp_b <- rep(1:5,2)
> tmp_c <- rep(1:2,5)
> tmp_d <- c(1,1,1,2,2,2,3,3,3,4)
> tmp_df <- data.frame(tmp_a,tmp_b,tmp_c,tmp_d);
> tmp_df$tmp_c[9:10] <- NA ;
> tmp_df
   tmp_a tmp_b tmp_c tmp_d
1      1     1     1     1
2      2     2     2     1
3      3     3     1     1
4      4     4     2     2
5      5     5     1     2
6      6     1     2     2
7      7     2     1     3
8      8     3     2     3
9      9     4    NA     3
10    10     5    NA     4
> aggregate(tmp_df$tmp_d,by=list(tmp_df$tmp_b,tmp_df$tmp_c),mean);
  Group.1 Group.2 x
1       1       1 1
2       2       1 3
3       3       1 1
4       5       1 2
5       1       2 2
6       2       2 1
7       3       2 3
8       4       2 2
# Only one row for each (tmp_b, tmp_c) combination, NA's getting dropped.

> aggregate(tmp_df$tmp_d,by=list(tmp_df$tmp_c),mean);
  Group.1    x
1       1 1.75
2       2 2.00

What I want in this last aggregate is a mean for the values in tmp_d
that correspond to the tmp_c values of NA. Similarly, perhaps there is
a way to make the second-to-last call to aggregate return the values of
tmp_d for the NA values of tmp_c as well.

How can I achieve this?
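
For concreteness, the number being asked for can be checked by hand with plain subsetting. This is only a sketch of that check using the tmp_c and tmp_d vectors from the session above; the names na_rows, mean_na and mean_non_na are made up for illustration.

```r
# Same values as tmp_c and tmp_d in the session above, after the NA assignment.
tmp_c <- c(rep(1:2, 4), NA, NA)
tmp_d <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4)

# Logical index of the rows whose grouping value is missing.
na_rows <- is.na(tmp_c)

# Mean of tmp_d over the NA rows and over the remaining rows.
mean_na     <- mean(tmp_d[na_rows])    # (3 + 4) / 2 = 3.5
mean_non_na <- mean(tmp_d[!na_rows])   # 15 / 8 = 1.875
```

Comparing the two means gives a rough idea of whether the rows dropped by aggregate differ from the rows it keeps.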

--
-- Vivek Satsangi
Student, Rochester, NY USA

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] (Newbie) Aggregate for NA values

2006-02-24 Thread Adaikalavan Ramasamy
I think it makes perfect sense for R to drop them, since NA carries no
information. I do not know if there is an elegant solution,
but I would suggest that you turn these NAs into an informative value.

Here is one possibility:

> df <- data.frame( AA=1:10, BB=rep(1:5,2), CC=rep(1:2,5), DD=rnorm(10) )
> df[ 9:10, "CC" ] <- NA

> df[is.na(df)] <- "lala"   ## change NA's into informative category ##


> aggregate( df$DD, by=list( df$CC ), mean  )
  Group.1          x
1       1  1.1533763
2       2  0.6427338
3    lala -0.2745249

> aggregate( df$DD, by=list( df$BB, df$CC ), mean  )
   Group.1 Group.2           x
1        1       1  0.47264081
2        2       1  0.63795211
3        3       1  1.66756015
4        5       1  1.83535232
5        1       2  0.89914287
6        2       2  1.11102134
7        3       2  0.22268699
8        4       2  0.33808394
9        4    lala -0.60154608
10       5    lala  0.05249622
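
A variation on the same idea, sketched here as an assumption rather than something proposed in the thread: instead of overwriting the NAs in the data frame, base R's addNA() can make NA an explicit level of the grouping factor only, so the other columns keep their types. The example reuses the tmp_df data from the original question; grp_c, res1 and res2 are illustrative names.

```r
# Rebuild the data from the original question.
tmp_df <- data.frame(
  tmp_a = 1:10,
  tmp_b = rep(1:5, 2),
  tmp_c = rep(1:2, 5),
  tmp_d = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4)
)
tmp_df$tmp_c[9:10] <- NA

# addNA() converts tmp_c to a factor in which NA is a regular level,
# so the rows with a missing key form their own group.
grp_c <- addNA(tmp_df$tmp_c)

# One grouping key: one row per tmp_c level, including the NA level.
res1 <- aggregate(tmp_df$tmp_d, by = list(tmp_c = grp_c), FUN = mean)

# Two grouping keys: the (tmp_b, NA) combinations are kept as well.
res2 <- aggregate(tmp_df$tmp_d,
                  by = list(tmp_b = tmp_df$tmp_b, tmp_c = grp_c),
                  FUN = mean)

print(res1)
print(res2)
```

The same grouping factor also works with tapply, e.g. tapply(tmp_df$tmp_d, addNA(tmp_df$tmp_c), mean), if a named vector is enough.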

Regards, Adai




