Yeah, I initially used zip, but I was wondering how reliable it is. I mean, is the order guaranteed? What if some node fails and the data is pulled from different nodes? And even if it does work, I find this implicit semantics quite uncomfortable, don't you?
My 0.2c

On Fri, 21 Nov 2014 at 15:26, Daniel, Ronald (ELS-SDG) <r.dan...@elsevier.com> wrote:

> Thanks for the info Andy. A big help.
>
> One thing - I think you can figure out which document is responsible for
> which vector without checking in more code.
> Start with a PairRDD of [doc_id, doc_string] for each document and split
> that into one RDD for each column.
> The values in the doc_string RDD get split and turned into a Seq and fed
> to TFIDF.
> You can take the resulting RDD[Vector]s and zip them with the doc_id RDD.
> Presto!
>
> Best regards,
> Ron