Hi Ignacio,

Please create a JIRA and send a PR for the information gain
computation, so it is easy to track the progress.

The sparse vector support for NaiveBayes is already implemented in
branch-1.0 and master. You only need to provide an RDD of sparse
vectors (created from Vectors.sparse).

MLUtils.loadLibSVMData reads sparse features in LIBSVM format.
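
For reference, here is a rough sketch of what a LIBSVM-format line looks like
and how it maps onto a sparse representation. This is plain Python for
illustration only, not the MLlib code; `parse_libsvm_line` is a hypothetical
helper, and shifting the one-based LIBSVM indices to zero-based (as the
indices passed to Vectors.sparse are) is an assumption of the sketch.

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM-format line into (label, indices, values).

    A line has the form:  <label> <index1>:<value1> <index2>:<value2> ...
    Feature indices in the file are one-based; this hypothetical helper
    shifts them to zero-based. Features that are absent are implicitly zero.
    """
    tokens = line.split()
    label = float(tokens[0])
    indices = []
    values = []
    for token in tokens[1:]:
        index, value = token.split(":")
        indices.append(int(index) - 1)  # one-based -> zero-based
        values.append(float(value))
    return label, indices, values


# Example: a document with label 1 and three non-zero features.
label, indices, values = parse_libsvm_line("1 3:0.5 10:2.0 42:1.0")
```

Only the non-zero entries are stored, which is what makes this format a
natural fit for text features.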

Best,
Xiangrui

On Thu, Apr 10, 2014 at 5:18 PM, Ignacio Zendejas
<ignacio.zendejas...@gmail.com> wrote:
> Hi, again -
>
> As part of the next step, I'd like to make a more substantive contribution
> and propose some initial work on feature selection, primarily as it relates
> to text classification.
>
> Specifically, I'd like to contribute very straightforward code to perform
> information gain feature evaluation. The paper below is a good primer
> showing that information gain is a very good option in many cases. If that
> works out, BNS (Bi-Normal Separation, introduced in the paper) would be
> another approach worth looking into, as it actually improves the F-score
> with a smaller feature space.
>
> http://machinelearning.wustl.edu/mlpapers/paper_files/Forman03.pdf
>
> And here's my first cut:
> https://github.com/izendejas/spark/commit/e5a0620838841c99865ffa4fb0d2b449751236a8
>
> I don't like that I do two passes to compute the class priors and joint
> distributions, so I'll look into using combineByKey as in the NaiveBayes
> implementation. Also, this is still untested code, but it gets my ideas
> out there, and I think it'd be best to define a FeatureEval trait or
> whatnot that helps with ranking and selecting.
>
> I also realize the above methods are probably more suitable for MLI than
> MLlib, but there doesn't seem to be much activity on the former.
>
> Second, is there a plan to support sparse vector representations for
> NaiveBayes? This would probably be more efficient in, for example, text
> classification tasks with lots of features (consider the case where n-grams
> with n > 1 are used).
>
> And on a related note, MLUtils.loadLabeledData doesn't support loading
> sparse data. Any plans here to do so? There also doesn't seem to be a
> defined file format for MLlib. Has there been any consideration of
> supporting multiple standard formats rather than defining one, e.g., CSV,
> TSV, Weka's ARFF, etc.?
>
> Thanks for your time,
> Ignacio
