[R] Removing columns that are na or constant

2012-11-20 Thread Brian Feeny

I have a dataset that has many columns which are NA or constant, and so I 
remove them like so:


same <- sapply(dataset, function(.col){
  all(is.na(.col)) || all(.col[1L] == .col)
})
dataset <- dataset[!same]

This works GREAT (thanks to the r-users list archive I found this)

however, then when I do my data sampling like so:

testSize <- floor(nrow(x) * 10/100)
test <- sample(1:nrow(x), testSize)

train_data <- x[-test,]
test_data <- x[test, -1]
test_class <- x[test, 1]

It is now possible that test_data or train_data contain columns that are 
constants, however as one dataset they did not.

So the solution for me is to just re-run lines to remove all constants..not 
a problem, but is this normal?  is this how I should
be handling this in R?  many models I am attempting to use (SVM, lda, etc) 
don't like if a column has all the same value...
so as a beginner, this is how I am handling it in R, but I am looking for 
someone to sanity check what I am doing is sound.

Brian

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Removing columns that are na or constant

2012-11-20 Thread Rui Barradas

Hello,

Inline.
On 20-11-2012 22:03, Brian Feeny wrote:

I have a dataset that has many columns which are NA or constant, and so I 
remove them like so:


same <- sapply(dataset, function(.col){
  all(is.na(.col)) || all(.col[1L] == .col)
})
dataset <- dataset[!same]

This works GREAT (thanks to the r-users list archive I found this)

however, then when I do my data sampling like so:

testSize <- floor(nrow(x) * 10/100)
test <- sample(1:nrow(x), testSize)

train_data <- x[-test,]
test_data <- x[test, -1]
test_class <- x[test, 1]

It is now possible that test_data or train_data contain columns that are 
constants, however as one dataset they did not.


Suppose they do. If you now remove those columns from one of train_data 
or test_data, and not from the other, then their structures are no 
longer the same.
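
A small illustration of that mismatch, as a sketch only. The data frame df, its column names and the fixed test rows are made up for the example (the original post uses sample(), which is replaced here so the result is reproducible):

```r
# Made-up data: 'rare' varies over the full data set,
# but it is constant within the small test subset below.
df <- data.frame(class = rep(c("a", "b"), 10),
                 x     = 1:20,
                 rare  = c(rep(0, 19), 1))

# Same check as in the original post: all NA, or every value equal to the first.
is_same <- function(.col) all(is.na(.col)) || all(.col[1L] == .col)

test <- 1:2                    # fixed rows instead of sample(), for reproducibility
train_data <- df[-test, ]
test_data  <- df[test, -1]

# Columns that would survive if each subset were cleaned on its own.
train_kept <- names(train_data)[!sapply(train_data, is_same)]
test_kept  <- names(test_data)[!sapply(test_data, is_same)]

train_kept   # "class" "x" "rare"
test_kept    # "x"
```

Cleaning each subset separately leaves 'rare' in train_data but not in test_data, so a model fitted on one could not be applied to the other.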


So the solution for me is to just re-run lines to remove all constants


Or write a function. I would have the function return the indices of the 
good columns and then intersect the results for train_data and test_data.


notSame <- function(dataset){
  same <- sapply(dataset, function(.col){
    all(is.na(.col)) || all(.col[1L] == .col)
  })
  which(!same)
}

good1 <- notSame(train_data)
good2 <- notSame(test_data)
dataset <- dataset[intersect(good1, good2)]


Now you can sample from a safe subset of your dataset.
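
One caveat with positional indices: in the original post test_data is built with x[test, -1], so its column positions are shifted by one relative to train_data, and intersecting which() results may pair up the wrong columns. A possible variant, offered as a sketch rather than as the method above, intersects column names instead. The helper varying_cols and the example data are hypothetical. The sketch also assumes that NAs should be ignored when deciding whether a column is constant, which differs from the original check (that one yields NA for a column that is only partly NA).

```r
# Hypothetical helper: names of columns that are neither all-NA
# nor constant once NAs are dropped.
varying_cols <- function(dataset) {
  same <- vapply(dataset, function(.col) {
    vals <- .col[!is.na(.col)]
    length(vals) == 0L || all(vals == vals[1L])
  }, logical(1))
  names(dataset)[!same]
}

# Made-up example data; column 1 is the class label, as in the original post.
x <- data.frame(class = rep(c("a", "b"), 10),
                v1    = 1:20,
                v2    = c(rep(0, 19), 1),
                v3    = c(NA, 2:20))

test <- 1:2                    # fixed rows instead of sample(), for reproducibility
train_data <- x[-test, ]
test_data  <- x[test, -1]

# Keep only predictors that vary in BOTH subsets, matched by name.
predictors <- intersect(varying_cols(train_data), varying_cols(test_data))

train_data <- train_data[c("class", predictors)]
test_data  <- test_data[predictors]
```

Because the match is by name, it does not matter that test_data lacks the class column; both subsets end up with the same predictor columns in the same order.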


..not a problem, but is this normal?  is this how I should
be handling this in R?  many models I am attempting to use (SVM, lda, etc) 
don't like if a column has all the same value...
so as a beginner, this is how I am handling it in R, but I am looking for 
someone to sanity check what I am doing is sound.


Only you can tell whether it's sound to eliminate variables from your 
analysis, and which ones.


Hope this helps,

Rui Barradas

