Hi Ignacio,

Please create a JIRA and send a PR for the information gain computation, so it is easy to track the progress.

The sparse vector support for NaiveBayes is already implemented in branch-1.0 and master. You only need to provide an RDD of sparse vectors (created from Vectors.sparse). MLUtils.loadLibSVMData reads sparse features in LIBSVM format.

Best,
Xiangrui

On Thu, Apr 10, 2014 at 5:18 PM, Ignacio Zendejas <ignacio.zendejas...@gmail.com> wrote:
> Hi, again -
>
> As part of the next step, I'd like to make a more substantive contribution
> and propose some initial work on feature selection, primarily as it relates
> to text classification.
>
> Specifically, I'd like to contribute very straightforward code to perform
> information gain feature evaluation. Below's a good primer that shows that
> Information Gain is a very good option in many cases. If successful, BNS
> (introduced in the paper) would be another approach worth looking into as
> it actually improves the F-score with a smaller feature space.
>
> http://machinelearning.wustl.edu/mlpapers/paper_files/Forman03.pdf
>
> And here's my first cut:
> https://github.com/izendejas/spark/commit/e5a0620838841c99865ffa4fb0d2b449751236a8
>
> I don't like that I do two passes to compute the class priors and joint
> distributions, so I'll look into using combineByKey as in the NaiveBayes
> implementation. Also, this is still untested code, but it gets my ideas
> out there, and I think it'd be best to define a FeatureEval trait or whatnot
> that helps with ranking and selecting.
>
> I also realize the above methods are probably more suitable for MLI than
> MLlib, but there doesn't seem to be much activity on the former.
>
> Second, is there a plan to support sparse vector representations for
> NaiveBayes? This will probably be more efficient in, for example, text
> classification tasks with lots of features (consider the case where n-grams
> with n > 1 are used).
>
> And on a related note, MLUtils.loadLabeledData doesn't support loading
> sparse data. Any plans here to do so? There also doesn't seem to be a
> defined file format for MLlib. Has there been any consideration to support
> multiple standard formats, rather than defining one: e.g., csv, tsv, Weka's
> arff, etc.?
>
> Thanks for your time,
> Ignacio
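
For reference, the information gain evaluation discussed in this thread can be sketched in plain Python over LIBSVM-style sparse rows. This is only an illustration under stated assumptions. It is not the code from the linked commit and not an MLlib API. The names parse_libsvm_line, entropy, and information_gain are hypothetical, each feature is treated as binary (present when its value is nonzero), and the class and joint counts are gathered in a single pass over the data, which is the same idea a combineByKey aggregation would distribute.

```python
import math
from collections import Counter, defaultdict


def parse_libsvm_line(line):
    """Parse 'label index:value index:value ...' into (label, {index: value}).

    Indices are kept exactly as written in the line (LIBSVM files are 1-based).
    """
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


def entropy(counts):
    """Shannon entropy in bits of a collection of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(examples):
    """Return {feature index: information gain in bits}.

    Assumption: a feature counts as "present" when its value is nonzero.
    Class counts and per-feature class counts are collected in one pass.
    """
    class_counts = Counter()
    present = defaultdict(Counter)  # feature -> class -> examples containing it
    for label, features in examples:
        class_counts[label] += 1
        for index, value in features.items():
            if value != 0.0:
                present[index][label] += 1

    n = sum(class_counts.values())
    h_class = entropy(class_counts.values())
    gains = {}
    for index, with_f in present.items():
        n_with = sum(with_f.values())
        # Counts for examples lacking the feature follow by subtraction.
        without_f = [class_counts[c] - with_f[c] for c in class_counts]
        h_cond = (n_with / n) * entropy(with_f.values()) \
            + ((n - n_with) / n) * entropy(without_f)
        gains[index] = h_class - h_cond
    return gains


# Usage: rank features of a tiny, made-up data set by information gain.
rows = ["1 1:1.0 3:2.0", "1 1:1.0", "0 2:1.0", "0 2:1.0 3:1.0"]
examples = [parse_libsvm_line(r) for r in rows]
ranking = sorted(information_gain(examples).items(), key=lambda kv: -kv[1])
```

Only nonzero entries are visited, so the cost scales with the number of stored values rather than with the full feature dimension, which is the point of using a sparse representation for text features.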