Re: Using TF-IDF from MLlib

Shivaram Venkataraman Mon, 16 Mar 2015 18:47:19 -0700

FWIW the JIRA I was thinking about is
https://issues.apache.org/jira/browse/SPARK-3098
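
For anyone following the ordering question in the quoted messages, here is a minimal sketch of why zipping two map-derived datasets lines up. It uses plain Scala collections, not Spark: `Partitioned`, `mapElems` and `zipPartitions` are hypothetical stand-ins for an RDD, `map` and `zip`. It assumes, as Spark's `RDD.zip` does, that both sides have the same number of partitions and the same number of elements in each partition.

```scala
object ZipAlignmentSketch {
  // Hypothetical stand-in for an RDD: a sequence of partitions, each ordered.
  type Partitioned[A] = Vector[Vector[A]]

  // Element-wise map: keeps the partition count and the order inside each partition.
  def mapElems[A, B](p: Partitioned[A])(f: A => B): Partitioned[B] =
    p.map(_.map(f))

  // Partition-by-partition zip; fails if the two sides are not laid out identically.
  def zipPartitions[A, B](l: Partitioned[A], r: Partitioned[B]): Partitioned[(A, B)] = {
    require(l.length == r.length, "different number of partitions")
    l.zip(r).map { case (a, b) =>
      require(a.length == b.length, "partitions differ in size")
      a.zip(b)
    }
  }

  def main(args: Array[String]): Unit = {
    // Two partitions of (label, text) records.
    val data: Partitioned[(Double, String)] = Vector(
      Vector((1.0, "spark mllib"), (0.0, "hello world")),
      Vector((1.0, "tf idf")))
    // Both sides are derived from the same parent by element-wise maps only.
    val labels = mapElems(data)(_._1)
    val lengths = mapElems(data)(_._2.length)
    // Each label is paired with the value computed from its own record.
    println(zipPartitions(labels, lengths).flatten)
  }
}
```

If either side went through a step that regroups or reorders elements (the shuffle case mentioned in the thread), the layout assumption above would no longer hold and the pairing could not be trusted.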


On Mon, Mar 16, 2015 at 6:10 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> I vaguely remember that JIRA and AFAIK Matei's point was that the order is
> not guaranteed *after* a shuffle. If you only use operations like map which
> preserve partitioning, ordering should be guaranteed from what I know.
>
> On Mon, Mar 16, 2015 at 6:06 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Dang, I can't seem to find the JIRA now, but I am sure we had a discussion
>> with Matei about this, and the conclusion was that RDD order is not
>> guaranteed unless a sort is involved.
>> On Mar 17, 2015 12:14 AM, "Joseph Bradley" <jos...@databricks.com> wrote:
>>
>>> This was brought up again in
>>> https://issues.apache.org/jira/browse/SPARK-6340  so I'll answer one
>>> item which was asked about the reliability of zipping RDDs.  Basically, it
>>> should be reliable, and if it is not, then it should be reported as a bug.
>>> This general approach should work (with explicit types to make it clear):
>>>
>>> val data: RDD[LabeledPoint] = ...
>>> val labels: RDD[Double] = data.map(_.label)
>>> val features1: RDD[Vector] = data.map(_.features)
>>> val features2: RDD[Vector] = new HashingTF(numFeatures=100).transform(features1)
>>> // idfModel is assumed to be a fitted IDFModel, e.g. new IDF().fit(features2)
>>> val features3: RDD[Vector] = idfModel.transform(features2)
>>> // zip yields (label, features) tuples, so the function literal needs a case pattern
>>> val finalData: RDD[LabeledPoint] = labels.zip(features3).map { case (label, features) => LabeledPoint(label, features) }
>>>
>>> If you run into problems with zipping like this, please report them!
>>>
>>> Thanks,
>>> Joseph
>>>
>>> On Mon, Dec 29, 2014 at 4:06 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>
>>>> Hopefully the new pipeline API addresses this problem. We have a code
>>>> example here:
>>>>
>>>>
>>>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala
>>>>
>>>> -Xiangrui
>>>>
>>>> On Mon, Dec 29, 2014 at 5:22 AM, andy petrella <andy.petre...@gmail.com>
>>>> wrote:
>>>> > Here is what I did for this case: https://github.com/andypetrella/tf-idf
>>>> >
>>>> >
>>>> > On Mon, Dec 29, 2014 at 11:31, Sean Owen <so...@cloudera.com> wrote:
>>>> >
>>>> >> Given (label, terms) you can just transform the values to a TF vector,
>>>> >> then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can
>>>> >> make a LabeledPoint from (label, vector) pairs. Is that what you're
>>>> >> looking for?
>>>> >>
>>>> >> On Mon, Dec 29, 2014 at 3:37 AM, Yao <y...@ford.com> wrote:
>>>> >> > I found the TF-IDF feature extraction, and all the MLlib code that
>>>> >> > works with pure Vector RDDs, very difficult to use because there is
>>>> >> > no way to associate a vector back to the original data. Why can't
>>>> >> > Spark MLlib support LabeledPoint?
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > --
>>>> >> > View this message in context:
>>>> >> > http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html
>>>> >> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>> >> >
>>>> >> > ---------------------------------------------------------------------
>>>> >> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> >> > For additional commands, e-mail: user-h...@spark.apache.org
>>>> >> >
>
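
As a rough illustration of the suggestion in the thread to go from (label, terms) to a TF vector, then a TF-IDF vector, then a labeled pair, here is a sketch in plain Scala collections. It does not use Spark or MLlib: `hashingTF`, `idf` and `tfIdf` are hypothetical helpers, the hashing is a simplified stand-in for what MLlib's HashingTF does, and the IDF weight assumes the smoothed form log((m + 1) / (df + 1)). The point is only that the label travels with its features through every step, so no zip is needed.

```scala
object LabeledTfIdfSketch {
  // Hashed term-frequency vector as a sparse map: bucket index -> count.
  def hashingTF(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
    terms
      .groupBy(t => ((t.hashCode % numFeatures) + numFeatures) % numFeatures)
      .map { case (bucket, ts) => bucket -> ts.size.toDouble }

  // Smoothed IDF per bucket: log((m + 1) / (df + 1)), with m the number of documents
  // and df the number of documents in which the bucket is non-zero.
  def idf(tfs: Seq[Map[Int, Double]]): Map[Int, Double] = {
    val m = tfs.size.toDouble
    tfs.flatMap(_.keys).groupBy(identity).map { case (bucket, hits) =>
      bucket -> math.log((m + 1.0) / (hits.size + 1.0))
    }
  }

  // Keeps each label next to its vector while TF and then IDF weights are applied.
  def tfIdf(docs: Seq[(Double, Seq[String])], numFeatures: Int): Seq[(Double, Map[Int, Double])] = {
    val tfs = docs.map { case (label, terms) => (label, hashingTF(terms, numFeatures)) }
    val weights = idf(tfs.map(_._2))
    tfs.map { case (label, tf) =>
      (label, tf.map { case (bucket, count) => bucket -> count * weights(bucket) })
    }
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(
      (1.0, Seq("spark", "mllib", "spark")),
      (0.0, Seq("hello", "world")))
    tfIdf(docs, numFeatures = 100).foreach(println)
  }
}
```

With real RDDs the same shape would be a map over (label, terms) pairs; whether that or the zip-based version quoted above is preferable depends on how the IDF model is fitted in your job.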
