Thanks for the info, Andy. It was a big help.

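One thing: I think you can figure out which document is responsible for which 
vector without checking in more code.
Start with a PairRDD of [doc_id, doc_string], one pair per document, and split 
it into two RDDs, one for each column.
Split each value in the doc_string RDD into terms, turn the result into a Seq, 
and feed it to TFIDF.
You can then take the resulting RDD[Vector] and zip it with the doc_id RDD. 
As far as I can tell, both RDDs come from the same parent through element-wise 
transformations, so they keep the same partitioning and ordering, and zip 
lines each vector up with its doc_id. 
Presto!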
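Here is a rough sketch of what I mean, in case it helps. It assumes the MLlib 
HashingTF and IDF classes are what "TFIDF" refers to, uses naive whitespace 
tokenization, and the names DocIdTfIdf and tfidfByDocId are made up for 
illustration. Treat it as a starting point, not tested production code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object DocIdTfIdf {

  // Hypothetical helper: pairs each document's TF-IDF vector with its doc_id.
  def tfidfByDocId(docs: RDD[(String, String)]): RDD[(String, Vector)] = {
    // Cache the parent so both column RDDs are derived from the same data.
    docs.cache()

    // One RDD per column.
    val docIds: RDD[String] = docs.map(_._1)
    // Assumption: splitting on whitespace is a good enough tokenizer here.
    val terms: RDD[Seq[String]] = docs.map(_._2.split("\\s+").toSeq)

    // Term frequencies, then inverse document frequency weighting.
    val tf: RDD[Vector] = new HashingTF().transform(terms)
    tf.cache() // IDF reads tf twice: once in fit, once in transform
    val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)

    // zip needs the same number of partitions and the same number of
    // elements per partition. Both sides come from `docs` through
    // element-wise transformations, so that should hold.
    docIds.zip(tfidf)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("doc-id-tfidf").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample documents.
      val docs = sc.parallelize(Seq(
        "doc1" -> "spark makes tf idf easy",
        "doc2" -> "zip keeps the document ids aligned"))

      tfidfByDocId(docs).collect().foreach {
        case (id, vec) => println(s"$id -> $vec")
      }
    } finally {
      sc.stop()
    }
  }
}
```

The one thing to watch is that zip only works when nothing between the parent 
RDD and the two sides reshuffles or filters the data, so keep the tokenizing 
step a plain map.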

Best regards,
Ron


