In preparing a DataFrame (Spark 1.4) for use with MLlib's KMeans.train method, is there a cleaner way to create the Vectors than this?

    data.map { r => Vectors.dense(r.getDouble(0), r.getDouble(3), r.getDouble(4), r.getDouble(5), r.getDouble(6)) }

Second, once I train the model and call predict on my vectorized dataset, what's the best way to relate the cluster assignments back to the original DataFrame? That is, I started with df1, which has a bunch of domain information in each row as well as the doubles I use to cluster. I vectorize the doubles and train on them, then use the resulting model to predict clusters for the vectors. I'd like to look at the original domain information in light of the cluster to which each row is now assigned.
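
To make the workflow described above concrete, here is a minimal sketch of one possible approach, not a confirmed best practice: keep each Row paired with its feature vector, train on the vectors alone, then predict per pair so the cluster id stays attached to the original row. The names `ClusterRows`, `featureCols`, `toVector`, and `assignClusters` are hypothetical, `k` and `maxIterations` are placeholders to be chosen by the caller, and the column indices are simply copied from the snippet in the question.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

object ClusterRows {
  // Indices of the Double columns used as features (assumed, taken from the question).
  val featureCols: Seq[Int] = Seq(0, 3, 4, 5, 6)

  // Build a dense vector from the selected Double columns of a single row.
  def toVector(row: Row, cols: Seq[Int]): Vector =
    Vectors.dense(cols.map(row.getDouble).toArray)

  // Train k-means on the feature vectors and return each original row
  // paired with the cluster id predicted for it.
  def assignClusters(df1: DataFrame, k: Int, maxIterations: Int): RDD[(Row, Int)] = {
    // Keep the row next to its vector so the link is never lost.
    val rowsWithVectors = df1.rdd.map(r => (r, toVector(r, featureCols))).cache()

    // Train on the vectors only.
    val model = KMeans.train(rowsWithVectors.values, k, maxIterations)

    // Predict per vector and attach the result to the original row.
    rowsWithVectors.map { case (row, vec) => (row, model.predict(vec)) }
  }
}
```

With this shape, something like `ClusterRows.assignClusters(df1, 5, 20)` would yield an RDD of (original row, cluster id) pairs, so the domain columns can be inspected alongside each assignment without any join back to df1.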