Re: [R] Summary of variables with NA, empty
The examples I gave -- Null, empty string, white space, etc. -- were just examples based on SPSS Modeler's Data Audit node. I just want something that both identifies the columns having missing values, regardless of how they are technically stored (NA, or a field where the space bar was hit a couple of times, etc.), and tabulates them by type of missing value. This is a basic data exploration step that I thought might come standard in R and that I just don't know of yet.

Hmisc::describe is good and may have to suffice. Missing for the example below using Hmisc::describe was 0 although there was a "". And I know that's because of the technical difference.

#EXAMPLE data - just one column in this case
dput(sample(mydata$COMMUTE_BIN,100))
structure(c(2L, 5L, 3L, 2L, 6L, 3L, 2L, 3L, 4L, 2L, 2L, 4L, 3L, 4L, 3L, 3L, 3L, 6L, 2L, 2L, 2L, 4L, 6L, 4L, 2L, 3L, 2L, 2L, 6L, 3L, 2L, 6L, 3L, 2L, 3L, 4L, 4L, 4L, 5L, 7L, 3L, 5L, 2L, 3L, 2L, 2L, 6L, 7L, 7L, 4L, 3L, 3L, 2L, 2L, 2L, 5L, 2L, 2L, 2L, 2L, 2L, 2L, 5L, 2L, 3L, 3L, 6L, 4L, 6L, 2L, 7L, 4L, 6L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 3L, 4L, 3L, 5L, 3L, 4L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 4L, 5L, 3L, 2L, 2L, 3L, 1L), .Label = c("", "15", "15 - 24", "25 - 34", "35 - 44", "45 - 54", "55+"), class = "factor")

As David mentioned maybe I will have to create my own function. Maybe something similar to what I have here for identifying factor columns, column labels and number of levels.

#EXAMPLE of the function I will probably need to create for identifying and listing column names and counts of NA and other missing values in a dataframe or table.
In this case however I am listing factor columns and excluding columns w/ >32 levels

set.seed(1)
dat1 <- data.frame(col1=factor(sample(1:25,10,replace=TRUE)),col2=sample(letters[1:10],10,replace=TRUE),col3=factor(rep(1:5,each=2)))
PrintLvls2 <- function(x) {print(data.frame(Lvls=sapply(x[sapply(x,function(x) is.factor(x) && length(levels(x))<=32)],nlevels), Names=sapply(x[sapply(x, function(x) is.factor(x) && length(levels(x))<=32)], function(y) paste0(levels(y), collapse=", "))), right=FALSE)}
PrintLvls2(dat1)
     Lvls Names
col1 9    2, 6, 7, 10, 15, 16, 17, 23, 24
col2 7    b, c, d, e, g, h, j
col3 5    1, 2, 3, 4, 5

Thanks.
Dan

-----Original Message-----
From: Bert Gunter [mailto:gunter.ber...@gene.com]
Sent: Tuesday, October 23, 2012 3:15 PM
To: David Winsemius
Cc: Lopez, Dan; R help (r-help@r-project.org)
Subject: Re: [R] Summary of variables with NA, empty

To highlight: "Basically all Null values" is a meaningless phrase in R.

?NULL
?NA
?NaN

have **very specific meanings** in R and have nothing to do with the various sorts of whitespace characters that David mentions (spaces, tabs...). If you wish to use R, you **must** understand the distinctions (the Intro to R tutorial discusses some of this -- have you read it?). There is functionality to test for these sorts of things (is.na, is.null, etc). You need to put in the effort to learn about this if you mean to use R in any serious way, as these will occur in either data I/O (NA's) or data manipulation (e.g. 0/0)

-- Bert

On Tue, Oct 23, 2012 at 2:44 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>
> On Oct 23, 2012, at 11:17 AM, Lopez, Dan wrote:
>
>> Hi,
>>
>> Is there a function I can use on my dataframe to give me a concise summary of variables that are NA, blank, etc? Basically all Null values, Empty strings, white space, blank values.
>>
>> Ideally it would look something like the below:
>> # it should only include the fields with NAs, blanks, etc. Added bonus would be to include column Index.
>> #Valid Records = records that are not NA, blank, etc
>> #ColIndex - what place is column in the original dataframe...1,2,3, ...xth
>>
>> Valid Records  Null (NA?)  Empty String  White Space  Blank Value  ColIndex
>
> Would a Valid Record be defined by grep("[^ ]", column)? ... i.e. has a non-space character in it
>
> What is a ColIndex?
>
> How is an Empty String different than White Space or a Blank Value?
>
>> Var1 52 8 2
>> Var2 40 20 10 10 3
>> Var3 58 2 20
>> ..
>
> I generally use describe from package:Hmisc. There are other versions of describe in other packages. It's not going to classify items composed entirely of a varying number of spaces and other non-character items like tabs as a single group. And it's unclear what you will use as an operational definition to separate blanks and white
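
Editorial sketch, not part of the original message: one way to make standard tools such as is.na(), summary() or Hmisc::describe count blank entries is to recode them to NA first. The helper name blank_to_na and the sample vector are made up for illustration, and treating whitespace-only values as missing is an assumption about what "blank" should mean.

```r
# Hypothetical helper: recode empty or whitespace-only values to NA.
# Intended for factor or character vectors only.
blank_to_na <- function(x) {
  blank_pat <- "^[[:space:]]*$"            # empty string or only whitespace
  chr <- as.character(x)
  chr[grepl(blank_pat, chr)] <- NA_character_   # grepl() is FALSE for NA input
  if (is.factor(x)) {
    # keep the original level order, minus the blank levels
    keep <- levels(x)[!grepl(blank_pat, levels(x))]
    factor(chr, levels = keep)
  } else {
    chr
  }
}

# Made-up data standing in for a column like COMMUTE_BIN
commute <- factor(c("", "15 - 24", " ", "25 - 34", NA, "15 - 24"))

sum(is.na(commute))               # 1: only the true NA is counted
sum(is.na(blank_to_na(commute)))  # 3: "" and " " are now counted as well
```

After this recoding the distinction between NA, empty and whitespace-only is lost, so it suits a "how much is missing" check rather than a tabulation by type.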
Re: [R] Summary of variables with NA, empty
On Oct 24, 2012, at 8:32 AM, Lopez, Dan wrote:

> The examples I gave -- Null, empty string, white space, etc. -- were just examples based on SPSS Modeler's Data Audit node. I just want something that both identifies the columns having missing values, regardless of how they are technically stored (NA, or a field where the space bar was hit a couple of times, etc.), and tabulates them by type of missing value. This is a basic data exploration step that I thought might come standard in R and that I just don't know of yet.

In none of your examples below do you create factor columns with any of the features you say that you are hoping to identify. No NA's, no white-space levels, no 999 values. I do not use (and never have used) SPSS Modeler's Data Audit, so definition by analogy is not going to work for me.

> Hmisc::describe is good and may have to suffice. Missing for the example below using Hmisc::describe was 0 although there was a "". And I know that's because of the technical difference.

Right. Hmisc::describe counts the number of NA's. A value of "" or " " is not the same as the R NA missing.

is.na(factor(""))
[1] FALSE
is.na(factor(NA))
[1] TRUE
is.na(factor(" "))
[1] FALSE

If you want to test for an empty string use nchar(vec) == 0

I offered an earlier suggestion for a grepl test for all spaces.

> #EXAMPLE data - just one column in this case
> dput(sample(mydata$COMMUTE_BIN,100))
> structure(c(2L, 5L, 3L, 2L, 6L, 3L, 2L, 3L, 4L, 2L, 2L, 4L, 3L, 4L, 3L, 3L, 3L, 6L, 2L, 2L, 2L, 4L, 6L, 4L, 2L, 3L, 2L, 2L, 6L, 3L, 2L, 6L, 3L, 2L, 3L, 4L, 4L, 4L, 5L, 7L, 3L, 5L, 2L, 3L, 2L, 2L, 6L, 7L, 7L, 4L, 3L, 3L, 2L, 2L, 2L, 5L, 2L, 2L, 2L, 2L, 2L, 2L, 5L, 2L, 3L, 3L, 6L, 4L, 6L, 2L, 7L, 4L, 6L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 3L, 4L, 3L, 5L, 3L, 4L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 4L, 5L, 3L, 2L, 2L, 3L, 1L), .Label = c("", "15", "15 - 24", "25 - 34", "35 - 44", "45 - 54", "55+"), class = "factor")
>
> As David mentioned maybe I will have to create my own function.
> Maybe something similar to what I have here for identifying factor columns, column labels and number of levels.
>
> #EXAMPLE of the function I will probably need to create for identifying and listing column names and counts of NA and other missing values in a dataframe or table.
> In this case however I am listing factor columns and excluding columns w/ >32 levels
>
> set.seed(1)
> dat1 <- data.frame(col1=factor(sample(1:25,10,replace=TRUE)),col2=sample(letters[1:10],10,replace=TRUE),col3=factor(rep(1:5,each=2)))
> PrintLvls2 <- function(x) {print(data.frame(Lvls=sapply(x[sapply(x,function(x) is.factor(x) && length(levels(x))<=32)],nlevels), Names=sapply(x[sapply(x, function(x) is.factor(x) && length(levels(x))<=32)], function(y) paste0(levels(y), collapse=", "))), right=FALSE)}
> PrintLvls2(dat1)
>      Lvls Names
> col1 9    2, 6, 7, 10, 15, 16, 17, 23, 24
> col2 7    b, c, d, e, g, h, j
> col3 5    1, 2, 3, 4, 5

I find it good to put in counter-examples such as a column that is non-factor. I thought that a non-factor column would probably break your code, but happily it survived. You might think about writing two functions: one to pick the columns to be assessed and the other to return a structured object from the candidates.

-- david.

> Thanks.
> Dan
>
> -----Original Message-----
> From: Bert Gunter [mailto:gunter.ber...@gene.com]
> Sent: Tuesday, October 23, 2012 3:15 PM
> To: David Winsemius
> Cc: Lopez, Dan; R help (r-help@r-project.org)
> Subject: Re: [R] Summary of variables with NA, empty
>
> To highlight: "Basically all Null values" is a meaningless phrase in R.
>
> ?NULL
> ?NA
> ?NaN
>
> have **very specific meanings** in R and have nothing to do with the various sorts of whitespace characters that David mentions (spaces, tabs...). If you wish to use R, you **must** understand the distinctions (the Intro to R tutorial discusses some of this -- have you read it?). There is functionality to test for these sorts of things (is.na, is.null, etc). You need to put in the effort to learn about this if you mean to use R in any serious way, as these will occur in either data I/O (NA's) or data manipulation (e.g. 0/0)
>
> -- Bert
>
> On Tue, Oct 23, 2012 at 2:44 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>>
>> On Oct 23, 2012, at 11:17 AM, Lopez, Dan wrote:
>>
>>> Hi,
>>>
>>> Is there a function I can use on my dataframe to give me a concise summary of variables that are NA, blank, etc? Basically all Null values, Empty strings, white space, blank values.
>>>
>>> Ideally it would look something like the below:
>>> # it should only include the fields with NAs, blanks, etc. Added bonus would be to include column Index.
>>> #Valid Records = records that are not NA, blank, etc
>>> #ColIndex - what place is column in the original dataframe...1,2,3, ...xth
>>>
>>> Valid Records Null (NA
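
Editorial sketch, not part of the original message: the two-function split suggested above (one function to pick the columns, one to return a structured object) could look roughly like this. The names pick_factor_cols and level_summary, the max_levels argument and the sample data are all hypothetical; this is one possible reading of the suggestion, not code from the thread.

```r
# Function 1 (hypothetical): names of factor columns with at most max_levels levels
pick_factor_cols <- function(df, max_levels = 32) {
  keep <- vapply(df,
                 function(col) is.factor(col) && nlevels(col) <= max_levels,
                 logical(1))
  names(df)[keep]
}

# Function 2 (hypothetical): return a data frame instead of printing one
level_summary <- function(df, cols) {
  data.frame(
    Column   = cols,
    ColIndex = match(cols, names(df)),   # position in the original data frame
    Lvls     = vapply(df[cols], nlevels, integer(1), USE.NAMES = FALSE),
    Names    = vapply(df[cols],
                      function(col) paste(levels(col), collapse = ", "),
                      character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Made-up data, including a non-factor column as a counter-example
dat <- data.frame(
  f1 = factor(c("a", "b", "a")),
  n1 = c(1.5, 2.5, 3.5),
  f2 = factor(c("x", "y", "z"))
)

res <- level_summary(dat, pick_factor_cols(dat))
print(res, right = FALSE)
```

Returning the data frame rather than printing it inside the function means the result can be filtered, sorted or merged with other per-column summaries afterwards.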
[R] Summary of variables with NA, empty
Hi,

Is there a function I can use on my dataframe to give me a concise summary of variables that are NA, blank, etc? Basically all Null values, Empty strings, white space, blank values.

Ideally it would look something like the below:
# it should only include the fields with NAs, blanks, etc. Added bonus would be to include column Index.
#Valid Records = records that are not NA, blank, etc
#ColIndex - what place is column in the original dataframe...1,2,3, ...xth

Valid Records  Null (NA?)  Empty String  White Space  Blank Value  ColIndex
Var1 52 8 2
Var2 40 20 10 10 3
Var3 58 2 20
..

I know there is summary() but I am not sure if that always displays NAs and blanks, especially with factor variables that have several levels (it lumps them in 'Other' when I run the entire dataframe). In these instances I can run the individual field separately and see all levels, but that would be inefficient to do for a dataframe with over 50 variables.

Dan

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Summary of variables with NA, empty
On Oct 23, 2012, at 11:17 AM, Lopez, Dan wrote:

> Hi,
>
> Is there a function I can use on my dataframe to give me a concise summary of variables that are NA, blank, etc? Basically all Null values, Empty strings, white space, blank values.
>
> Ideally it would look something like the below:
> # it should only include the fields with NAs, blanks, etc. Added bonus would be to include column Index.
> #Valid Records = records that are not NA, blank, etc
> #ColIndex - what place is column in the original dataframe...1,2,3, ...xth
>
> Valid Records  Null (NA?)  Empty String  White Space  Blank Value  ColIndex

Would a Valid Record be defined by grep("[^ ]", column)? ... i.e. has a non-space character in it

What is a ColIndex?

How is an Empty String different than White Space or a Blank Value?

> Var1 52 8 2
> Var2 40 20 10 10 3
> Var3 58 2 20
> ..

I generally use describe from package:Hmisc. There are other versions of describe in other packages. It's not going to classify items composed entirely of a varying number of spaces and other non-character items like tabs as a single group. And it's unclear what you will use as an operational definition to separate blanks and white-space. You will probably need to code that yourself. You might want to look at the code for Hmisc::describe as a starting point.

> I know there is summary() but I am not sure if that always displays NAs and blanks, especially with factor variables that have several levels (it lumps them in 'Other' when I run the entire dataframe). In these instances I can run the individual field separately and see all levels, but that would be inefficient to do for a dataframe with over 50 variables.

How were you going to run the individual field? If you show us code, there might be more rapid progress. It would probably be very easy to turn that into a function that could then be run with `lapply`.
--
David Winsemius, MD
Alameda, CA, USA
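
Editorial sketch, not part of the original message: a per-column check run over the whole data frame with lapply, as suggested above, might look like the following. The function name missing_summary, the output column names and the sample data are hypothetical, and the three categories (NA, empty string, whitespace-only) are just one possible operational definition; neither base R nor Hmisc::describe defines them this way.

```r
# Hypothetical function: tabulate kinds of "missing" per column
missing_summary <- function(df) {
  rows <- lapply(seq_along(df), function(i) {
    x   <- df[[i]]
    chr <- as.character(x)                   # so factors are tested on their labels
    is_na    <- is.na(x)
    is_empty <- !is_na & chr == ""
    is_white <- !is_na & !is_empty & grepl("^[[:space:]]+$", chr)
    data.frame(
      Variable   = names(df)[i],
      Valid      = sum(!(is_na | is_empty | is_white)),
      NA_count   = sum(is_na),
      Empty      = sum(is_empty),
      Whitespace = sum(is_white),
      ColIndex   = i,                        # position in the original data frame
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  # keep only the columns that have at least one missing-like value
  out[out$Valid < nrow(df), , drop = FALSE]
}

# Made-up data: a character column, a clean integer column, a factor column
dat <- data.frame(
  a = c("x", "", " ", NA, "y"),
  b = 1:5,
  c = factor(c("lo", "hi", "", "lo", "\t")),
  stringsAsFactors = FALSE
)

missing_summary(dat)
```

Column b has nothing missing and is dropped from the result; columns a and c are reported with their original positions 1 and 3.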
Re: [R] Summary of variables with NA, empty
To highlight: "Basically all Null values" is a meaningless phrase in R.

?NULL
?NA
?NaN

have **very specific meanings** in R and have nothing to do with the various sorts of whitespace characters that David mentions (spaces, tabs...). If you wish to use R, you **must** understand the distinctions (the Intro to R tutorial discusses some of this -- have you read it?). There is functionality to test for these sorts of things (is.na, is.null, etc). You need to put in the effort to learn about this if you mean to use R in any serious way, as these will occur in either data I/O (NA's) or data manipulation (e.g. 0/0)

-- Bert

On Tue, Oct 23, 2012 at 2:44 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>
> On Oct 23, 2012, at 11:17 AM, Lopez, Dan wrote:
>
>> Hi,
>>
>> Is there a function I can use on my dataframe to give me a concise summary of variables that are NA, blank, etc? Basically all Null values, Empty strings, white space, blank values.
>>
>> Ideally it would look something like the below:
>> # it should only include the fields with NAs, blanks, etc. Added bonus would be to include column Index.
>> #Valid Records = records that are not NA, blank, etc
>> #ColIndex - what place is column in the original dataframe...1,2,3, ...xth
>>
>> Valid Records  Null (NA?)  Empty String  White Space  Blank Value  ColIndex
>
> Would a Valid Record be defined by grep("[^ ]", column)? ... i.e. has a non-space character in it
>
> What is a ColIndex?
>
> How is an Empty String different than White Space or a Blank Value?
>
>> Var1 52 8 2
>> Var2 40 20 10 10 3
>> Var3 58 2 20
>> ..
>
> I generally use describe from package:Hmisc. There are other versions of describe in other packages. It's not going to classify items composed entirely of a varying number of spaces and other non-character items like tabs as a single group. And it's unclear what you will use as an operational definition to separate blanks and white-space. You will probably need to code that yourself. You might want to look at the code for Hmisc::describe as a starting point.
>> I know there is summary() but I am not sure if that always displays NAs and blanks, especially with factor variables that have several levels (it lumps them in 'Other' when I run the entire dataframe). In these instances I can run the individual field separately and see all levels, but that would be inefficient to do for a dataframe with over 50 variables.
>
> How were you going to run the individual field? If you show us code, there might be more rapid progress. It would probably be very easy to turn that into a function that could then be run with `lapply`.
>
> --
> David Winsemius, MD
> Alameda, CA, USA

--
Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
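
Editorial sketch, not part of the original message: the distinctions described above can be checked directly at the console. Everything below uses base R only; the vector x is made up for illustration.

```r
x <- c(1, NA, 0/0)     # 0/0 evaluates to NaN

is.na(x)               # FALSE TRUE TRUE: is.na() is TRUE for NaN as well
is.nan(x)              # FALSE FALSE TRUE: is.nan() is TRUE only for NaN

is.null(NULL)          # TRUE
is.null(NA)            # FALSE
length(NULL)           # 0: NULL is an empty object, not a missing value
length(NA)             # 1: NA is a value that occupies a position

# Blank strings are ordinary character values, not NA
is.na(c("", " ", NA))  # FALSE FALSE TRUE
```

This is why a tool that counts NA's reports nothing for a column whose "missing" entries are stored as empty or whitespace-only strings.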