Yeah, I initially used zip, but I was wondering how reliable it is. I mean,
is the order guaranteed? What if some node fails and the data is pulled
from different nodes?
And even if it works, I find this implicit semantic quite
uncomfortable. Don't you?

My 2c

On Fri, Nov 21, 2014 at 15:26, Daniel, Ronald (ELS-SDG) <r.dan...@elsevier.com>
wrote:

> Thanks for the info Andy. A big help.
>
> One thing - I think you can figure out which document is responsible for
> which vector without checking in more code.
> Start with a PairRDD of [doc_id, doc_string] for each document and split
> that into one RDD for each column.
> The values in the doc_string RDD get split and turned into a Seq and fed
> to TFIDF.
> You can take the resulting RDD[Vector]s and zip them with the doc_id RDD.
> Presto!
>
> Best regards,
> Ron
>
>
>
>
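
For reference, the zip-based approach described in the quoted message could look roughly like the sketch below. It is not code from the thread: the sample documents, the whitespace tokenizer, and the object name ZipTfIdfSketch are assumptions made for illustration, and it uses the RDD-based MLlib classes HashingTF and IDF. Note that RDD.zip is documented to assume both RDDs have the same number of partitions and the same number of elements in each partition. That holds here only because both sides are derived from the same parent RDD through map-style transformations, which is exactly the implicit contract questioned above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical driver object, for illustration only.
object ZipTfIdfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("zip-tfidf-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Assumed sample data: (doc_id, doc_string) pairs.
    val docs: RDD[(String, String)] = sc.parallelize(Seq(
      ("doc1", "spark makes rdds"),
      ("doc2", "zip pairs rdds by position"),
      ("doc3", "tf idf weights terms")
    )).cache()

    // Split the pair RDD into one RDD per column.
    val ids: RDD[String] = docs.map(_._1)
    // Assumed tokenizer: plain whitespace split.
    val tokens: RDD[Seq[String]] = docs.map(_._2.split(" ").toSeq)

    // Term frequencies, then IDF weighting fitted on the same RDD.
    val tf: RDD[Vector] = new HashingTF().transform(tokens).cache()
    val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)

    // Pair each vector with its doc_id purely by position.
    val byId: RDD[(String, Vector)] = ids.zip(tfidf)

    byId.collect().foreach { case (id, vec) => println(s"$id -> $vec") }
    sc.stop()
  }
}
```

Nothing in byId records why a given id belongs with a given vector; the pairing rests entirely on ids and tfidf sharing the same lineage and partitioning.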
