[ 
https://issues.apache.org/jira/browse/SPARK-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364266#comment-14364266
 ] 

Joseph K. Bradley commented on SPARK-6340:
------------------------------------------

You should be able to reliably zip the RDDs back together.  I just sent an 
update to that post, which I'll copy here:

{quote}
This was brought up again in https://issues.apache.org/jira/browse/SPARK-6340  
so I'll answer one question that was asked: whether zipping RDDs is reliable.  
It should be reliable, and if it is not, that should be reported as a bug.  
This general approach should work (with explicit types to make it clear):

{code}
val data: RDD[LabeledPoint] = ...
val labels: RDD[Double] = data.map(_.label)
val features1: RDD[Vector] = data.map(_.features)
val features2: RDD[Vector] = new HashingTF(numFeatures=100).transform(features1)
val idfModel = new IDF().fit(features2)
val features3: RDD[Vector] = idfModel.transform(features2)
val finalData: RDD[LabeledPoint] = labels.zip(features3).map {
  case (label, features) => LabeledPoint(label, features)
}
{code}
{quote}
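
For anyone who wants to try the approach quoted above end to end, here is a 
minimal sketch. It rests on a few assumptions that are not in the original 
post: the input is an RDD of (label, tokens) pairs, since MLlib's 
HashingTF.transform takes an RDD of term sequences rather than vectors; the 
object LabeledTfIdf and the helper tfidfWithLabels are hypothetical names, not 
part of MLlib; and the local master and sample documents are only there to 
make it runnable.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object LabeledTfIdf {
  // Hypothetical helper (not an MLlib API): computes TF-IDF features for
  // labelled, already-tokenized documents and re-attaches the labels.
  def tfidfWithLabels(docs: RDD[(Double, Seq[String])],
                      numFeatures: Int): RDD[LabeledPoint] = {
    val labels: RDD[Double] = docs.map(_._1)
    val terms: RDD[Seq[String]] = docs.map(_._2)

    // Term frequencies via the hashing trick; cached because IDF makes
    // one pass to fit and a second pass to transform.
    val tf: RDD[Vector] = new HashingTF(numFeatures).transform(terms)
    tf.cache()

    val idfModel = new IDF().fit(tf)
    val tfidf: RDD[Vector] = idfModel.transform(tf)

    // Both sides derive from `docs` through per-element transformations,
    // so they have the same partitioning and element order, which is
    // what zip requires.
    labels.zip(tfidf).map { case (label, features) =>
      LabeledPoint(label, features)
    }
  }

  def main(args: Array[String]): Unit = {
    // Local master chosen only so the sketch can run on one machine.
    val conf = new SparkConf().setAppName("LabeledTfIdf").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val docs = sc.parallelize(Seq(
        (1.0, Seq("spark", "mllib", "tf", "idf")),
        (0.0, Seq("unrelated", "document")),
        (1.0, Seq("spark", "rdd", "zip"))
      ))
      tfidfWithLabels(docs, numFeatures = 100).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

One caveat on the design: zip pairs elements by position, so it relies on both 
sides being recomputed identically from the same parent. If the parent RDD is 
not deterministic (for example, unseeded sampling upstream), persisting it 
before splitting off the labels is probably the safer choice.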

Do report it if you run into problems with this!  Thanks.

> mllib.IDF for LabelPoints
> -------------------------
>
>                 Key: SPARK-6340
>                 URL: https://issues.apache.org/jira/browse/SPARK-6340
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>         Environment: python 2.7.8
> pyspark
> OS: Linux Mint 17 Qiana (Cinnamon 64-bit)
>            Reporter: Kian Ho
>            Priority: Minor
>              Labels: feature
>
> as per: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-td19429.html#a19528
> Having IDF.fit accept LabeledPoints would be useful since, correct me if 
> I'm wrong, there currently isn't a way of keeping track of which labels 
> belong to which documents if one needs to apply a conventional tf-idf 
> transformation on labelled text data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
