Hi,

I've implemented a class that does Chi-squared feature selection for 
RDD[LabeledPoint]. It also computes basic class/feature occurrence statistics 
and other methods like mutual information or information gain can be easily 
implemented. I would like to make a pull request. However, the MLlib master 
branch doesn't have any feature selection methods implemented, so I need to 
create a proper interface that my class will extend or mix in. It should be 
easy to use from both the developer's and the user's perspective.

I was thinking that there should be a FeatureEvaluator that, for each feature 
in RDD[LabeledPoint], returns RDD[((featureIndex: Int, label: Double), value: 
Double)].
Then there should be a FeatureSelector that selects the top N features, or the 
top N features grouped by class, etc.
And the simplest one, a FeatureFilter that filters the data based on a set of 
feature indices.

Additionally, there should be an interface for FeatureEvaluators that don't 
use class labels, i.e. for RDD[Vector].

I am concerned that such a design looks rather "disconnected" because it 
consists of 3 separate objects.

In terms of usage, I would like to see something like "val filteredData = 
Filter(data, ChiSquared(data).selectTop(100))".
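
To make the idea more concrete, here is a rough sketch of how the three pieces 
could fit together. It is only an illustration under several assumptions: plain 
Scala Seq stands in for RDD, a local LabeledPoint case class stands in for the 
MLlib one, and the names FeatureScores, OccurrenceCount and score are 
hypothetical. OccurrenceCount is a toy evaluator that counts non-zero 
occurrences per (feature, label) pair; it is a placeholder, not Chi-squared.

```scala
// Stand-in for MLlib's LabeledPoint (assumption: dense features only).
final case class LabeledPoint(label: Double, features: Array[Double])

// Hypothetical wrapper around evaluator output, keyed by (featureIndex, label).
final class FeatureScores(val scores: Seq[((Int, Double), Double)]) {
  // Selector step: keep the n features with the highest score, using each
  // feature's best score across labels; ties go to the lower index.
  def selectTop(n: Int): Set[Int] =
    scores
      .groupBy { case ((feature, _), _) => feature }
      .map { case (feature, group) =>
        (feature, group.map(_._2).reduce((a, b) => math.max(a, b)))
      }
      .toSeq
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(n)
      .map(_._1)
      .toSet
}

// Evaluator step: concrete evaluators only implement score().
trait FeatureEvaluator {
  def score(data: Seq[LabeledPoint]): Seq[((Int, Double), Double)]
  def apply(data: Seq[LabeledPoint]): FeatureScores =
    new FeatureScores(score(data))
}

// Toy evaluator (placeholder, not Chi-squared): counts how often a feature
// is non-zero together with each label.
object OccurrenceCount extends FeatureEvaluator {
  def score(data: Seq[LabeledPoint]): Seq[((Int, Double), Double)] =
    data
      .flatMap { p =>
        p.features.indices
          .filter(i => p.features(i) != 0.0)
          .map(i => (i, p.label))
      }
      .groupBy(identity)
      .map { case (key, hits) => (key, hits.size.toDouble) }
      .toSeq
}

// Filter step: keeps only the selected feature indices, in ascending order.
object FeatureFilter {
  def apply(data: Seq[LabeledPoint], keep: Set[Int]): Seq[LabeledPoint] = {
    val indices = keep.toArray.sorted
    data.map(p => p.copy(features = indices.map(i => p.features(i))))
  }
}

object Demo {
  val data: Seq[LabeledPoint] = Seq(
    LabeledPoint(1.0, Array(1.0, 0.0, 1.0)),
    LabeledPoint(1.0, Array(1.0, 0.0, 0.0)),
    LabeledPoint(0.0, Array(0.0, 1.0, 1.0))
  )
  // Same shape as the one-liner above, with the toy evaluator plugged in.
  val filteredData: Seq[LabeledPoint] =
    FeatureFilter(data, OccurrenceCount(data).selectTop(2))
}
```

With this shape, the evaluator, selector and filter stay separate, but the 
apply method on the evaluator lets them chain in one expression. A real 
ChiSquared evaluator would only need to implement score. The label-free case 
for RDD[Vector] is not covered by this sketch.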

Any ideas or suggestions?

Best regards, Alexander
