RE: Using TF-IDF from MLlib

Daniel, Ronald (ELS-SDG) Fri, 21 Nov 2014 15:26:36 -0800

Thanks for the info Andy. A big help.

One thing - I think you can figure out which document is responsible for which 
vector without checking in more code.
Start with a PairRDD of (doc_id, doc_string) pairs, one per document, and split 
it into two RDDs, one for each column.
Each value in the doc_string RDD gets split into terms, turned into a Seq, and 
fed to TF-IDF.
You can then take the resulting RDD[Vector] and zip it with the doc_id RDD. 
Presto!
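
Here is a rough sketch of those steps, in case it helps. The object and method names (TfIdfWithIds, tfidfById) are made up for illustration, and the whitespace tokenizer is only an assumption about how the doc strings get split. It uses HashingTF and IDF from org.apache.spark.mllib.feature. The zip only lines up because both RDDs come from the same parent through element-wise transformations, so they keep the same partitioning and the same number of elements per partition.

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of MLlib.
object TfIdfWithIds {

  // Takes (doc_id, doc_string) pairs and returns (doc_id, TF-IDF vector) pairs.
  def tfidfById(docs: RDD[(String, String)]): RDD[(String, Vector)] = {
    // Split the pair RDD into one RDD per column.
    val ids: RDD[String] = docs.map(_._1)

    // Assumption: plain whitespace tokenization is good enough here.
    val terms: RDD[Seq[String]] = docs.map(_._2.split("\\s+").toSeq)

    // Term frequencies via the hashing trick, then IDF weighting.
    val tf: RDD[Vector] = new HashingTF().transform(terms)
    tf.cache() // tf is read twice: once by fit, once by transform
    val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)

    // Both RDDs derive from docs without any shuffle or filter,
    // so zip pairs each id with the vector for the same document.
    ids.zip(tfidf)
  }
}
```

If anything between the split and the zip filters or repartitions either RDD, the zip will fail or pair the wrong rows, so in that case it is safer to carry the doc_id along some other way.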


Best regards,
Ron
