Re: [R] Summary of variables with NA, empty

2012-10-24 Thread Lopez, Dan
The examples I gave--Null, Empty string, white space, etc. were just examples 
based on SPSS Modeler's Data Audit node. 

I just want something that both identifies the columns having missing values-- 
regardless of what they technically are stored as (NA or a field with space bar 
hit a couple of times, etc.) -- and tabulates based on what type of missing 
value. This is a basic data exploration step that I thought just maybe comes 
standard in R and that I just don't know of yet.

Hmisc::describe is good and may have to suffice. Missing for the example 
below using Hmisc::describe was 0 although there was a "". And I know that's 
because of the technical difference. 
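A minimal base-R sketch of that technical difference, using a made-up factor `f` rather than the poster's data: a blank level is an ordinary level, so `is.na()` does not count it until the level is recoded to NA.

```r
# Hypothetical factor with one blank level and one true NA
f <- factor(c("", "15 - 24", "25 - 34", NA, "15 - 24"))

sum(is.na(f))   # counts only the true NA
levels(f)       # the blank "" shows up as an ordinary level

# Recode blank or whitespace-only levels to NA so they count as missing
g <- f
levels(g)[!nzchar(trimws(levels(g)))] <- NA
sum(is.na(g))   # now the former "" entry is counted as well
```

Whether a blank should really be treated as NA is a decision about the data, not something R can decide.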

#EXAMPLE data - just one column in this case
 dput(sample(mydata$COMMUTE_BIN,100))
structure(c(2L, 5L, 3L, 2L, 6L, 3L, 2L, 3L, 4L, 2L, 2L, 4L, 3L, 
4L, 3L, 3L, 3L, 6L, 2L, 2L, 2L, 4L, 6L, 4L, 2L, 3L, 2L, 2L, 6L, 
3L, 2L, 6L, 3L, 2L, 3L, 4L, 4L, 4L, 5L, 7L, 3L, 5L, 2L, 3L, 2L, 
2L, 6L, 7L, 7L, 4L, 3L, 3L, 2L, 2L, 2L, 5L, 2L, 2L, 2L, 2L, 2L, 
2L, 5L, 2L, 3L, 3L, 6L, 4L, 6L, 2L, 7L, 4L, 6L, 2L, 3L, 2L, 2L, 
2L, 3L, 2L, 3L, 4L, 3L, 5L, 3L, 4L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 
4L, 5L, 3L, 2L, 2L, 3L, 1L), .Label = c("", "15", "15 - 24", 
"25 - 34", "35 - 44", "45 - 54", "55+"), class = "factor")

As David mentioned, maybe I will have to create my own function. Maybe something 
similar to what I have here for identifying factor columns, column labels and 
number of levels.
#EXAMPLE of the kind of function I will probably need to create for identifying and listing 
column names and counts of NA and "" and other missing in a dataframe or 
table. In this case, however, I am listing factor columns and excluding columns 
w/ >32 levels
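set.seed(1)
dat1 <- data.frame(col1 = factor(sample(1:25, 10, replace = TRUE)),
                   col2 = sample(letters[1:10], 10, replace = TRUE),
                   col3 = factor(rep(1:5, each = 2)))
# Note: col2 is a factor only where stringsAsFactors defaults to TRUE (R < 4.0)
PrintLvls2 <- function(x) {
  # Report level count and level labels for factor columns with <= 32 levels
  print(data.frame(
    Lvls = sapply(x[sapply(x, function(x) is.factor(x) && length(levels(x)) <= 32)], nlevels),
    Names = sapply(x[sapply(x, function(x) is.factor(x) && length(levels(x)) <= 32)],
                   function(y) paste0(levels(y), collapse = ", "))),
    right = FALSE)
}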
> PrintLvls2(dat1)
     Lvls Names
col1 9    2, 6, 7, 10, 15, 16, 17, 23, 24
col2 7    b, c, d, e, g, h, j
col3 5    1, 2, 3, 4, 5

Thanks.
Dan


Re: [R] Summary of variables with NA, empty

2012-10-24 Thread David Winsemius

On Oct 24, 2012, at 8:32 AM, Lopez, Dan wrote:

> The examples I gave--Null, Empty string, white space, etc. were just examples 
> based on SPSS Modeler's Data Audit node. 
> 
> I just want something that both identifies the columns having missing 
> values-- regardless of what they technically are stored as (NA or a field with 
> space bar hit a couple of times, etc.) -- and tabulates based on what type of 
> missing value. This is a basic data exploration step that I thought just 
> maybe comes standard in R and that I just don't know of yet.

In none of your examples below do you create factor columns with any of the 
features you say that you are hoping to identify. No NA's, no white-space 
levels, no 999 values.  I do not use (and never have used) SPSS Modeler's Data 
Audit, so definition by analogy is not going to work for me.
 
 
> Hmisc::describe is good and may have to suffice. Missing for the example 
> below using Hmisc::describe was 0 although there was a "". And I know that's 
> because of the technical difference. 

Right. Hmisc::describe counts the number of NA's. A value of "" or " " is not 
the same as the R NA missing.

> is.na(factor(""))
[1] FALSE
> is.na(factor(NA))
[1] TRUE
> is.na(factor(" "))
[1] FALSE

If you want to test for an empty string use nchar(vec) == 0

I offered an earlier suggestion for a grepl test for all spaces.
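As a rough sketch of how those two tests can sit alongside `is.na()`, here is one way to label each element of a character vector. The vector `vec` and the whitespace pattern `"^[[:space:]]+$"` are illustrative assumptions, not the exact expressions from this thread. Note that `nchar()` needs a character vector, so a factor would have to go through `as.character()` first.

```r
# Hypothetical character vector mixing the kinds of "missing" under discussion
vec <- c("a", "", "  ", NA, "\t", "b")

is_na    <- is.na(vec)                      # true R missing value
is_empty <- !is_na & nchar(vec) == 0        # zero-length string
is_blank <- !is_na & !is_empty &
  grepl("^[[:space:]]+$", vec)              # only spaces, tabs and the like

data.frame(value = vec, is_na, is_empty, is_blank)
colSums(cbind(is_na, is_empty, is_blank))   # counts per category
```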


> #EXAMPLE data - just one column in this case
> dput(sample(mydata$COMMUTE_BIN,100))
> structure(c(2L, 5L, 3L, 2L, 6L, 3L, 2L, 3L, 4L, 2L, 2L, 4L, 3L, 
> 4L, 3L, 3L, 3L, 6L, 2L, 2L, 2L, 4L, 6L, 4L, 2L, 3L, 2L, 2L, 6L, 
> 3L, 2L, 6L, 3L, 2L, 3L, 4L, 4L, 4L, 5L, 7L, 3L, 5L, 2L, 3L, 2L, 
> 2L, 6L, 7L, 7L, 4L, 3L, 3L, 2L, 2L, 2L, 5L, 2L, 2L, 2L, 2L, 2L, 
> 2L, 5L, 2L, 3L, 3L, 6L, 4L, 6L, 2L, 7L, 4L, 6L, 2L, 3L, 2L, 2L, 
> 2L, 3L, 2L, 3L, 4L, 3L, 5L, 3L, 4L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 
> 4L, 5L, 3L, 2L, 2L, 3L, 1L), .Label = c("", "15", "15 - 24", 
> "25 - 34", "35 - 44", "45 - 54", "55+"), class = "factor")
 
> As David mentioned, maybe I will have to create my own function. Maybe 
> something similar to what I have here for identifying factor columns, column 
> labels and number of levels.
> #EXAMPLE of the kind of function I will probably need to create for identifying and 
> listing column names and counts of NA and "" and other missing in a 
> dataframe or table. In this case, however, I am listing factor columns and 
> excluding columns w/ >32 levels
> set.seed(1)
> dat1 <- data.frame(col1 = factor(sample(1:25, 10, replace = TRUE)),
>                    col2 = sample(letters[1:10], 10, replace = TRUE),
>                    col3 = factor(rep(1:5, each = 2)))
>
> PrintLvls2 <- function(x) {
>   print(data.frame(
>     Lvls = sapply(x[sapply(x, function(x) is.factor(x) && length(levels(x)) <= 32)], nlevels),
>     Names = sapply(x[sapply(x, function(x) is.factor(x) && length(levels(x)) <= 32)],
>                    function(y) paste0(levels(y), collapse = ", "))),
>     right = FALSE)
> }
> > PrintLvls2(dat1)
>      Lvls Names
> col1 9    2, 6, 7, 10, 15, 16, 17, 23, 24
> col2 7    b, c, d, e, g, h, j
> col3 5    1, 2, 3, 4, 5

I find it good to put in counter-examples such as a column that is non-factor. 
I thought that a non-factor column would probably break your code, but happily 
it survived. You might think about writing two functions: one to pick the 
columns to be assessed and the other to return a structured object from the 
candidates.
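One possible shape for that two-function split, offered only as a sketch: the helper names `pick_columns()` and `summarise_missing()`, the output column names and the test data frame `dat` are all hypothetical, and the whitespace definition is an assumption that would need to match whatever operational definition is finally chosen.

```r
# Step 1: pick the columns to assess (here: factor or character columns)
pick_columns <- function(df) {
  which(vapply(df, function(col) is.factor(col) || is.character(col), logical(1)))
}

# Step 2: return a structured object (a data frame) for the candidates
summarise_missing <- function(df, cols = pick_columns(df)) {
  one_col <- function(col) {
    x     <- as.character(col)
    na    <- is.na(x)
    empty <- !na & !nzchar(x)
    white <- !na & !empty & grepl("^[[:space:]]+$", x)
    c(Valid = sum(!na & !empty & !white), NA_count = sum(na),
      Empty = sum(empty), Whitespace = sum(white))
  }
  out <- as.data.frame(t(vapply(df[cols], one_col, numeric(4))))
  out$ColIndex <- unname(cols)   # position in the original data frame
  # Keep only the columns that actually contain something missing
  out[out$Valid < nrow(df), , drop = FALSE]
}

# Counter-examples included: a non-factor column (id) and a clean column (b)
dat <- data.frame(id = 1:4,
                  a  = c("x", "", " ", NA),
                  b  = c("p", "q", "r", "s"),
                  stringsAsFactors = TRUE)
summarise_missing(dat)
```

Returning a data frame rather than printing means the result can be sorted, filtered or merged later.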

-- 
david.

 

Re: [R] Summary of variables with NA, empty

2012-10-23 Thread David Winsemius

On Oct 23, 2012, at 11:17 AM, Lopez, Dan wrote:

> Hi,
> 
> Is there a function I can use on my dataframe to give me a concise summary of 
> variables that are NA, blank, etc.? Basically all Null values, Empty strings, 
> white space, blank values. Ideally it would look something like the below:
> 
> # it should only include the fields with NAs, blanks, etc. Added bonus would 
> be to include column Index.
> #Valid Records = records that are not NA, blank, etc.
> #ColIndex - what place is column in the original dataframe...1,2,3, ...xth
> 
>        Valid Records  Null (NA?)  Empty String  White Space  Blank Value  ColIndex

Would a Valid Record be defined by grep("[^ ]", column)? ... i.e. has a 
non-space character in it.
What is a ColIndex?
How is an Empty String different from White Space or a Blank Value?



> Var1   528   2
> Var2   40   20   10   10   3
> Var3   58   2   20
> ..
 

I generally use describe from package:Hmisc. There are other versions of 
describe in other packages. It's not going to classify items composed entirely 
of a varying number of spaces and other non-character items like tabs as a 
single group. And it's unclear what you will use as an operational definition 
to separate blanks and white-space. You will probably need to code that 
yourself. You might want to look at the code for Hmisc::describe as a starting 
point.


> I know there is summary() but I am not sure if that always displays NAs and 
> blanks, especially with factor variables that have several levels (lumps them 
> in 'Other' when I run the entire dataframe).
>
> In these instances I can run the individual field separately and see all 
> levels but that would be inefficient to do for a dataframe with over 50 
> variables.

How were you going to run the individual field? If you show us code, there 
might be more rapid progress. It would probably be very easy to turn that into 
a function that could then be run with `lapply`.
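For instance, if the per-field check were a `table()` call, a sketch of the `lapply` version might look like this. The data frame `dat` and the helper name `show_levels` are made up for illustration; `useNA = "ifany"` is the `table()` argument that adds an NA count when NAs are present.

```r
set.seed(1)
# Hypothetical data frame; column names are illustrative only
dat <- data.frame(grp = factor(sample(c("a", "b", "", NA), 20, replace = TRUE)),
                  num = sample(c(1, 2, NA), 20, replace = TRUE))

# What one might run for an individual field: every level plus any NA count
show_levels <- function(col) table(col, useNA = "ifany")

# The same check for every column at once, returned as a named list
lapply(dat, show_levels)
```

Unlike `summary()` on a whole data frame, this does not lump less frequent levels together.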
 
 
-- 

David Winsemius, MD
Alameda, CA, USA

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Summary of variables with NA, empty

2012-10-23 Thread Bert Gunter
To highlight:

"Basically all Null values" is a meaningless phrase in R. ?NULL ?NA
?NaN have **very specific meanings** in R and have nothing to do with
the various sorts of whitespace characters that David mentions
(spaces, tabs...). If you wish to use R, you **must** understand the
distinctions (the Intro to R tutorial discusses some of this -- have
you read it?).

There is functionality to test for these sorts of things (is.na,
is.null, etc). You need to put in the effort to learn about this if
you mean to use R in any serious way, as these will occur in either
data I/O (NA's) or data manipulation (e.g. 0/0)
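A short base-R illustration of those distinctions (the vector `x` is just an example):

```r
x <- c(1, NA, 0/0)

is.na(x)             # FALSE TRUE TRUE  : is.na() is TRUE for NA and for NaN
is.nan(x)            # FALSE FALSE TRUE : only 0/0 is NaN

is.null(NULL)        # TRUE: NULL is an empty object, not a value in a vector
length(c(1, NULL))   # 1: NULL simply vanishes when combined

is.na(c("", " ", NA))  # FALSE FALSE TRUE: blank or whitespace strings are not NA
```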

-- Bert



-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
