Thanks for the info, Andy. A big help. One thing: I think you can figure out which document is responsible for which vector without checking in more code. Start with a PairRDD of [doc_id, doc_string] for each document and split it into two RDDs, one per column. Each value in the doc_string RDD gets split into terms, turned into a Seq, and fed to TFIDF. You can then take the resulting RDD[Vector] and zip it with the doc_id RDD. Presto!
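Here is a minimal sketch of the steps above, not tested against any particular Spark version. It assumes the TFIDF step is MLlib's HashingTF followed by IDF, that splitting on whitespace is good enough for tokenizing, and that Spark runs in local mode. The sample documents and the names DocIdTfidf, docs, ids, terms and byDoc are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object DocIdTfidf {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DocIdTfidf").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical sample data: (doc_id, doc_string) pairs.
    val docs: RDD[(String, String)] = sc.parallelize(Seq(
      ("doc1", "spark makes big data simple"),
      ("doc2", "tf idf weights terms by rarity")
    ))
    docs.cache()

    // One RDD per column.
    val ids: RDD[String] = docs.keys
    // Assumption: whitespace tokenization; swap in a real tokenizer as needed.
    val terms: RDD[Seq[String]] = docs.values.map(_.split("\\s+").toSeq)

    // Term frequencies, then inverse document frequency weighting.
    val tf: RDD[Vector] = new HashingTF().transform(terms)
    tf.cache()
    val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)

    // Pair each vector back up with the id of the document it came from.
    val byDoc: RDD[(String, Vector)] = ids.zip(tfidf)

    byDoc.collect().foreach { case (id, v) => println(s"$id -> $v") }
    sc.stop()
  }
}
```

One caveat: zip expects both RDDs to have the same number of partitions and the same number of elements in each partition. That should hold here because both sides come from the same PairRDD through element-wise transformations, but it would break if either side were filtered or repartitioned along the way.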
Best regards, Ron