Re: [R] persuade tabulate function to count NAs in a data frame

2011-03-19 Thread Jim Lemon

On 03/20/2011 01:58 AM, Bodnar Laszlo EB_HU wrote:

> Hi,
>
> I'd like to ask you a question again. It is basically about data frames, NAs
> and the tabulate function.


Hi Bodnar,
The "freq" function in the prettyR package might do what you want.
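If you would rather stay in base R, table() can also report missing values through its useNA argument. A minimal sketch on a single column, assuming the first ten values of 'a' from the question (the name "counts" is only for illustration):

```r
# First ten values of column 'a' from the question, with one missing value
a <- c(NA, 1, 3, 3, 1, 3, 3, 3, 3, 1)

# Default behaviour: the NA is dropped silently
table(a)

# useNA = "ifany" adds an <NA> cell when at least one NA is present
table(a, useNA = "ifany")

# Fixing the levels keeps empty categories (here the value 2) in the result
counts <- table(factor(a, levels = 1:3), useNA = "ifany")
counts
```

The last call reports the categories 1, 2, 3 and <NA> in that order, so an empty category still shows up with a zero count.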

Jim

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] persuade tabulate function to count NAs in a data frame

2011-03-19 Thread Gavin Simpson
On Sat, 2011-03-19 at 15:58 +0100, Bodnar Laszlo EB_HU wrote:
> Hi,

I'll top-post as the original Q is very lengthy:

tabs <- lapply(df[, 2:6],
               function(x, id) { t(table(addNA(x), id, useNA = "ifany")) },
               df$id)

is one way of doing what you want. More details are here:

http://stackoverflow.com/questions/5362702/persuading-tabulate-function-to-count-nas-in-a-data-frame-in-r

where you also posted your Q.
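As for why your attempt printed 0 for the NAs: tabulate() ignores NA values, so raising nbins only adds an empty bin. If you want to keep tabulate(), one option is to count the NAs separately and append them. A minimal sketch for column 'a', assuming a helper I have called count_with_na (a made-up name, not a base R function):

```r
# Data from the question: grouping variable and column 'a' (with one NA)
id <- rep(1:3, each = 10)
a  <- c(NA, 1, 3, 3, 1, 3, 3, 3, 3, 1,
        3, 2, 1, 2, 1, 3, 3, 2, 1, 1,
        1, 3, 1, 3, 3, 3, 2, 1, 1, 3)

# Hypothetical helper (not part of base R): tabulate() skips NA,
# so count the NAs separately and append them as one extra bin
count_with_na <- function(x, nbins) {
  c(tabulate(x, nbins = nbins), sum(is.na(x)))
}

# One vector of counts per id; the last element of each is the NA count
res  <- tapply(a, id, count_with_na, nbins = 3)
flat <- unlist(res, use.names = FALSE)
flat
# 3 0 6 1 4 3 3 0 4 1 5 0
```

This gives the counts you listed as the correct result for column 'a'. The same helper can be passed to lapply() over the other columns, with nbins set to the largest value in each column.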

HTH

G


> I'd like to ask you a question again. It is basically about data frames, NAs
> and the tabulate function.
>
> Here is my data frame, which I already used in one of my previous questions.
> It is intentionally simple: my real 'df' data frame is much bigger, and I do
> not want to bother anyone with a huge data set. So, my data:
> 
> id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
> a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
> b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
> c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
> d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
> e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,4)
> df <-data.frame(id,a,b,c,d,e)
> df
> 
> I have managed to calculate the distributions of the numbers occurring in
> columns 'a' to 'e', grouped by the id numbers in column 'id'. It works
> fine, check it ->
> 
> matrix(matrix(unlist(lapply(df[,(-(1))],function(x)
> tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,2]))))
> [[1]])),ncol=3,nrow=3,byrow=TRUE)
> matrix(matrix(unlist(lapply(df[,(-(1))],function(x)
> tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,3]))))
> [[2]])),ncol=3,nrow=3,byrow=TRUE)
> matrix(matrix(unlist(lapply(df[,(-(1))],function(x)
> tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,4]))))
> [[3]])),ncol=3,nrow=3,byrow=TRUE)
> matrix(matrix(unlist(lapply(df[,(-(1))],function(x)
> tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,5]))))
> [[4]])),ncol=3,nrow=3,byrow=TRUE)
> matrix(matrix(unlist(lapply(df[,(-(1))],function(x)
> tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,6]))))
> [[5]])),ncol=4,nrow=3,byrow=TRUE)
> 
> Now my problem is: what if my data frame contains NA values here and there
> and I want the built-in tabulate function to count these NAs as well, i.e.
> to report how many NAs occur in each group?
> 
> Here's my modified data frame with the NAs:
> id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
> a <-c(NA,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
> b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
> c <-c(1,3,2,3,2,1,2,3,3,2,2,3,NA,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
> d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
> e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,NA,1,4)
> df <-data.frame(id,a,b,c,d,e)
> df
> 
> At first I tried something like this (the only change is that I added
> exclude=NULL to the factor() call):
> unlist(lapply(df[,(-(1))],function(x)
> tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,2],exclude=NULL))))[[1]])
> 
> At least my code realizes the fact that I have 4 different levels in column 
> 'a' (1,2,3,NA) and not only three (1,2,3). Check it here:
> nlevels(factor(df[,2],exclude=NULL))
> 
> But you see in the result that somehow it did not count the NAs. It says
> 3  0  6  0(!)  4  3  3  0  4  1  5  0
> 
> Instead of the correct:
> 3  0  6  1(!)  4  3  3  0  4  1  5  0
> 
> Or in case of:
> unlist(lapply(df[,(-(1))],function(x)
> tapply(x,df$id,tabulate,nbins=nlevels(factor(df[,4],exclude=NULL))))[[3]])
> 
> It says
> 2  4  4  0  2  3  4  0(!)  1  5  4  0
> 
> Instead of the correct
> 2  4  4  0  2  3  4  1(!)  1  5  4  0
> etc.
> 
> Does someone have any ideas how to "persuade" the function tabulate to count 
> NAs? Is it possible at all?
> Thanks very much and have a pleasant weekend,
> Laszlo