Re: Using TF-IDF from MLlib

Joseph Bradley Mon, 16 Mar 2015 17:17:35 -0700

This was brought up again in
https://issues.apache.org/jira/browse/SPARK-6340, so I'll answer one question
raised there: whether zipping RDDs is reliable.  It should be reliable, and
if it is not, that should be reported as a bug.  This general approach should
work (with explicit types to make it clear):


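import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val data: RDD[LabeledPoint] = ...
val labels: RDD[Double] = data.map(_.label)
val features1: RDD[Vector] = data.map(_.features)
// Schematic: HashingTF.transform takes documents as sequences of terms, not Vectors.
val features2: RDD[Vector] = new HashingTF(numFeatures = 100).transform(features1)
// idfModel is assumed to be an IDFModel fitted beforehand.
val features3: RDD[Vector] = idfModel.transform(features2)
// zip yields (label, features) tuples, so destructure them with a case pattern.
val finalData: RDD[LabeledPoint] =
  labels.zip(features3).map { case (label, features) => LabeledPoint(label, features) }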

If you run into problems with zipping like this, please report them!
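
For readers starting from raw (label, terms) pairs, as described further down the thread, here is a minimal sketch of the same zip-based approach. The function name tfidfLabeledPoints and the input RDD docs are hypothetical and not from this thread; the sketch assumes the spark.mllib HashingTF, IDF and IDFModel classes and fits the IDF model on the same data it transforms.

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF, IDFModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: turns (label, terms) pairs into TF-IDF LabeledPoints.
def tfidfLabeledPoints(
    docs: RDD[(Double, Seq[String])],
    numFeatures: Int = 100): RDD[LabeledPoint] = {
  // Split the pairs into two RDDs that share the same partitioning and order.
  val labels: RDD[Double] = docs.map(_._1)
  val terms: RDD[Seq[String]] = docs.map(_._2)

  // Hash each document's terms into a term-frequency vector.
  val tf: RDD[Vector] = new HashingTF(numFeatures).transform(terms)
  tf.cache() // tf is used twice: once to fit the IDF, once to transform

  // Fit inverse document frequencies, then rescale the TF vectors.
  val idfModel: IDFModel = new IDF().fit(tf)
  val tfidf: RDD[Vector] = idfModel.transform(tf)

  // Re-attach labels. Both sides derive from docs through per-element
  // transformations only, so partitions and element order line up.
  labels.zip(tfidf).map { case (label, features) => LabeledPoint(label, features) }
}
```

Note that RDD.zip requires both RDDs to have the same number of partitions and the same number of elements in each partition. That holds here because neither branch shuffles or filters; if docs comes from a non-deterministic source, caching it before the split is a reasonable precaution.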

Thanks,
Joseph

On Mon, Dec 29, 2014 at 4:06 PM, Xiangrui Meng <men...@gmail.com> wrote:

> Hopefully the new pipeline API addresses this problem. We have a code
> example here:
>
>
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala
>
> -Xiangrui
>
> On Mon, Dec 29, 2014 at 5:22 AM, andy petrella <andy.petre...@gmail.com>
> wrote:
> > Here is what I did for this case:
> > https://github.com/andypetrella/tf-idf
> >
> >
> > On Mon, Dec 29, 2014 at 11:31, Sean Owen <so...@cloudera.com> wrote:
> >
> >> Given (label, terms) you can just transform the values to a TF vector,
> >> then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can
> >> make a LabeledPoint from (label, vector) pairs. Is that what you're
> >> looking for?
> >>
> >> On Mon, Dec 29, 2014 at 3:37 AM, Yao <y...@ford.com> wrote:
> >> > I found the TF-IDF feature extraction, and all the MLlib code that
> >> > works with pure Vector RDDs, very difficult to work with because there
> >> > is no way to associate a vector back to the original data. Why can't
> >> > Spark MLlib support LabeledPoint?
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context:
> >> > http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html
> >> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> > For additional commands, e-mail: user-h...@spark.apache.org
> >> >
> >>
> >
