On Oct 23, 2012, at 11:17 AM, Lopez, Dan wrote:

> Hi,
> 
> Is there a function I can use on my dataframe to give me a concise summary of 
> variables that are NA, blank, etc.? Basically all null values, empty strings, 
> white space, and blank values. Ideally it would look something like the below:
> 
> # It should include only the fields with NAs, blanks, etc. An added bonus would 
> # be to include the column index.
> # Valid Records = records that are not NA, blank, etc.
> # ColIndex = the column's position in the original dataframe: 1, 2, 3, ..., xth
> 
>                Valid Records  Null (NA?)  Empty String  White Space  Blank Value  ColIndex

Would a "Valid Record" be defined by grep("[^ ]", column), i.e. one that has a 
non-space character in it?
What is a "ColIndex"?
How is an "Empty String" different from "White Space" or a "Blank Value"?
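
To make those questions concrete, here is a minimal sketch of one possible 
set of operational definitions. The vector x and the three categories are 
assumptions for illustration, not the original poster's data or definitions. 
Note that the pattern "[^ ]" counts a tab as a non-space character, so a 
stricter test would use the [[:space:]] class instead.

```r
# Hypothetical example vector; the values are made up for illustration.
x <- c("a", "", "   ", "\t", NA, " b ")

is_na    <- is.na(x)
is_empty <- !is_na & x == ""                      # zero-length string
is_white <- !is_na & grepl("^[[:space:]]+$", x)   # only whitespace (spaces, tabs, ...)
is_valid <- !is_na & grepl("[^[:space:]]", x)     # at least one non-whitespace character

# grep() returns the indices of matching elements, grepl() a logical vector.
# With "[^ ]" the tab in element 4 still counts as a match.
idx <- grep("[^ ]", x)
idx
```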



> Var1                       52        8                                        
>                                 2
> Var2                       40           20                                    
>        10                           10                                        
>    3
> Var3                       58                                                 
>           2                                                                   
>            20
> ..
> 

I generally use describe() from package:Hmisc. (Other packages have their own 
versions of describe().) It will not treat items composed entirely of a varying 
number of spaces, or of other whitespace characters such as tabs, as a single 
group. It is also unclear what operational definition you would use to separate 
blanks from white space, so you will probably need to code that yourself. The 
code for Hmisc::describe might be a useful starting point.
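
If you do code it yourself, something along these lines might serve as a 
starting point. This is only a sketch: the function name missing_summary, the 
output column names, and the choice of three categories (NA, empty string, 
whitespace-only) are all assumptions, and it does not use Hmisc.

```r
# Sketch: count NA, empty-string and whitespace-only entries per column.
# The function name and the categories are hypothetical, chosen for illustration.
missing_summary <- function(df) {
  counts <- lapply(df, function(col) {
    x <- as.character(col)                     # treat factors and numerics alike
    n_na    <- sum(is.na(x))
    n_empty <- sum(!is.na(x) & x == "")
    n_white <- sum(grepl("^[[:space:]]+$", x)) # grepl() gives FALSE for NA
    c(Valid = length(x) - n_na - n_empty - n_white,
      NAs = n_na, Empty = n_empty, WhiteSpace = n_white)
  })
  out <- data.frame(ColIndex = seq_along(df),
                    do.call(rbind, counts),
                    row.names = names(df))
  # keep only the columns that have at least one non-valid entry
  out[out$Valid < nrow(df), , drop = FALSE]
}

# Made-up data frame for demonstration
dat <- data.frame(id   = 1:4,
                  name = c("a", "", "  ", NA),
                  grp  = factor(c("x", NA, "y", "x")),
                  stringsAsFactors = FALSE)
res <- missing_summary(dat)
res
```

The result has one row per affected column, so a fully populated column such 
as id above is dropped from the summary.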


> I know there is summary(), but I am not sure that it always displays NAs and 
> blanks, especially with factor variables that have several levels (it lumps 
> them in 'Other' when I run it on the entire dataframe).


> In these instances I can run the individual field separately and see all 
> levels but that would be inefficient to do for a dataframe with over 50 
> variables.

How were you going to "run the individual field"? If you show us code, there 
might be more rapid progress. It would probably be very easy to turn that into 
a function that could then be "run" with `lapply`.
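
As a rough illustration of the lapply approach, assuming that "running the 
individual field" means tabulating each column separately, one could apply 
table() to every column with NA counts included. The data frame below is 
made up for the purpose.

```r
# Made-up data frame; the column names are illustrative only.
dat <- data.frame(site = factor(c("a", "b", "", NA, "a")),
                  note = c("ok", " ", NA, "ok", ""),
                  stringsAsFactors = FALSE)

# Tabulate every distinct value of every column, showing NA when present
level_counts <- lapply(dat, table, useNA = "ifany")
level_counts$site
```

Unlike summary() on the whole data frame, this lists every level of each 
factor, including the empty-string level, instead of lumping the less 
frequent ones together.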
-- 

David Winsemius, MD
Alameda, CA, USA

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
