In preparing a DataFrame (Spark 1.4) for use with MLlib's KMeans.train
method, is there a cleaner way to create the Vectors than this?

data.map{r => Vectors.dense(r.getDouble(0), r.getDouble(3), r.getDouble(4),
r.getDouble(5), r.getDouble(6))}
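
For illustration, one possible tidier form is sketched below. It keeps the
column positions in a single list and builds the vector from that list. The
names Featurize, featureIdx and toVector are hypothetical, and the sketch
assumes every selected column holds a non-null Double.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row

object Featurize {
  // Positions of the Double columns used for clustering (from the snippet above).
  val featureIdx: Seq[Int] = Seq(0, 3, 4, 5, 6)

  // Build a dense vector from the chosen positions of a Row.
  // Assumption: each selected column is a non-null Double.
  def toVector(r: Row, idx: Seq[Int] = featureIdx): Vector =
    Vectors.dense(idx.map(i => r.getDouble(i)).toArray)
}

// Usage with the DataFrame from the question:
//   val vectors = data.map(r => Featurize.toVector(r))   // RDD[Vector] in Spark 1.4
```

Spark 1.4 also includes org.apache.spark.ml.feature.VectorAssembler, which
may be worth a look if selecting the columns by name is preferable to
selecting them by position.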


Second, once I train the model and call predict on my vectorized dataset,
what's the best way to relate the cluster assignments back to the rows of
the original DataFrame?


That is, I started with df1, which has a bunch of domain information in
each row and also the doubles I use to cluster.  I vectorize the doubles
and then train on them.  I use the resulting model to predict clusters for
the vectors.  I'd like to look at the original domain information for each
row in light of the cluster to which that row is now assigned.
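
To make the goal concrete, here is a minimal sketch of one way this might be
done: keep each original Row paired with its vector, train on the vectors
alone, then predict per pair so the Row travels with its cluster id. The
names ClusterJoin, toVector and assign are hypothetical, and k and
maxIterations are placeholders to be chosen by the caller.

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

object ClusterJoin {
  // Hypothetical helper: the same feature columns as in the snippet above.
  // Assumption: each selected column is a non-null Double.
  def toVector(r: Row): Vector =
    Vectors.dense(Seq(0, 3, 4, 5, 6).map(i => r.getDouble(i)).toArray)

  // Returns (clusterId, originalRow) pairs for every row of df1.
  def assign(df1: DataFrame, k: Int, maxIterations: Int): RDD[(Int, Row)] = {
    // Pair each Row with its vector so the two never get separated.
    val pairs: RDD[(Row, Vector)] = df1.rdd.map(r => (r, toVector(r))).cache()

    // Train on the vectors only.
    val model: KMeansModel = KMeans.train(pairs.map(_._2), k, maxIterations)

    // Predict one point at a time, keeping the Row next to its cluster id.
    pairs.map { case (row, v) => (model.predict(v), row) }
  }
}
```

With the result in hand, something like
ClusterJoin.assign(df1, 5, 20).filter(_._1 == 0) would list the original
rows that landed in cluster 0 (the 5 and 20 are arbitrary example values).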
