Re: Using TF-IDF from MLlib

Sean Owen Mon, 16 Mar 2015 18:06:57 -0700

Dang, I can't seem to find the JIRA now, but I am sure we had a discussion
with Matei about this, and the conclusion was that RDD order is not
guaranteed unless a sort is involved.
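
One way to sidestep the ordering question is to keep each label attached to
its document through every step, so that nothing has to be zipped back
together. What follows is only a minimal sketch under some assumptions: the
input is an RDD of (label, terms) pairs, Spark is 1.3 or later (where
IDFModel.transform accepts a single Vector), and the names KeyedTfIdf and
labeledTfIdf are hypothetical, not part of MLlib.

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object KeyedTfIdf {
  // Hypothetical helper: turns (label, terms) pairs into TF-IDF LabeledPoints
  // without ever separating a label from its document.
  def labeledTfIdf(docs: RDD[(Double, Seq[String])],
                   numFeatures: Int = 100): RDD[LabeledPoint] = {
    val hashingTF = new HashingTF(numFeatures)
    // Per-document transform: each label stays paired with its TF vector.
    val tf = docs.mapValues(terms => hashingTF.transform(terms)).cache()
    // IDF weights are fit on the vectors alone; element order is irrelevant here.
    val idfModel = new IDF().fit(tf.values)
    // Assumes Spark 1.3+, where IDFModel.transform works on a single Vector.
    tf.map { case (label, v) => LabeledPoint(label, idfModel.transform(v)) }
  }
}
```

Because the label travels inside the same record as its vector, the result
does not depend on two RDDs being iterated in the same order.
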
On Mar 17, 2015 12:14 AM, "Joseph Bradley" <jos...@databricks.com> wrote:


> This was brought up again in
> https://issues.apache.org/jira/browse/SPARK-6340 so I'll answer one
> question that was raised there: the reliability of zipping RDDs. Basically,
> it should be reliable, and if it is not, then it should be reported as a bug.
> This general approach should work (with explicit types to make it clear):
>
> val data: RDD[LabeledPoint] = ...
> val labels: RDD[Double] = data.map(_.label)
> val features1: RDD[Vector] = data.map(_.features)
> val features2: RDD[Vector] = new HashingTF(numFeatures = 100).transform(features1)
> val features3: RDD[Vector] = idfModel.transform(features2)
> // zip yields (label, features) tuples, so the function literal must
> // destructure them with a `case` pattern.
> val finalData: RDD[LabeledPoint] = labels.zip(features3).map {
>   case (label, features) => LabeledPoint(label, features)
> }
>
> If you run into problems with zipping like this, please report them!
>
> Thanks,
> Joseph
>
> On Mon, Dec 29, 2014 at 4:06 PM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> Hopefully the new pipeline API addresses this problem. We have a code
>> example here:
>>
>>
>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala
>>
>> -Xiangrui
>>
>> On Mon, Dec 29, 2014 at 5:22 AM, andy petrella <andy.petre...@gmail.com>
>> wrote:
>> > Here is what I did for this case: https://github.com/andypetrella/tf-idf
>> >
>> >
>> > On Mon, Dec 29, 2014 at 11:31, Sean Owen <so...@cloudera.com> wrote:
>> >
>> >> Given (label, terms) you can just transform the values to a TF vector,
>> >> then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can
>> >> make a LabeledPoint from (label, vector) pairs. Is that what you're
>> >> looking for?
>> >>
>> >> On Mon, Dec 29, 2014 at 3:37 AM, Yao <y...@ford.com> wrote:
>> >> > I found the TF-IDF feature extraction, and all the MLlib code that
>> >> > works with pure Vector RDDs, very difficult to work with because
>> >> > there is no way to associate a vector back to the original data.
>> >> > Why can't Spark MLlib support LabeledPoint?
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > View this message in context:
>> >> > http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html
>> >> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> > For additional commands, e-mail: user-h...@spark.apache.org
>> >> >
>> >>
>> >>
>> >
>>
>>
>>
>
