SPARK-7879 <https://issues.apache.org/jira/browse/SPARK-7879> seems to
address your use case: running KMeans on a DataFrame and having the
cluster assignments added as an additional column.
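
In the meantime, here is a rough sketch of doing both steps by hand with
the MLlib API in 1.4. The object name ClusterBack, the helpers features
and assign, and the k / maxIterations parameters are made up for
illustration. The column positions are copied from your snippet, and the
code assumes those columns hold non-null doubles. Predicting per row
inside a map keeps each original Row paired with its cluster id, so no
join is needed afterwards.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper object; none of these names come from Spark itself.
object ClusterBack {
  // Positions of the double columns used for clustering (from the question).
  val featureIdx: Seq[Int] = Seq(0, 3, 4, 5, 6)

  // Build a dense feature vector from the selected columns of one row.
  def features(r: Row): Vector =
    Vectors.dense(featureIdx.map(i => r.getDouble(i)).toArray)

  // Train on the vectorized rows, then pair every original row with the
  // cluster it is assigned to.
  def assign(df: DataFrame, k: Int, maxIterations: Int): RDD[(Row, Int)] = {
    val vectors = df.rdd.map(r => features(r)).cache()
    val model = KMeans.train(vectors, k, maxIterations)
    df.rdd.map(r => (r, model.predict(features(r))))
  }
}
```

Something like ClusterBack.assign(df1, 5, 20) would then give you an
RDD of (row, cluster) pairs that still carries all of the domain columns
from df1. The 5 and 20 are placeholder values.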

On Wed, Jul 1, 2015 at 5:53 PM, Eric Friedman <eric.d.fried...@gmail.com>
wrote:

> In preparing a DataFrame (spark 1.4) to use with MLlib's kmeans.train
> method, is there a cleaner way to create the Vectors than this?
>
> data.map{r => Vectors.dense(r.getDouble(0), r.getDouble(3),
> r.getDouble(4), r.getDouble(5), r.getDouble(6))}
>
>
> Second, once I train the model and call predict on my vectorized dataset,
> what's the best way to relate the cluster assignments back to the original
> data frame?
>
>
> That is, I started with df1, which has a bunch of domain information in
> each row and also the doubles I use to cluster.  I vectorize the doubles
> and then train on them.  I use the resulting model to predict clusters for
> the vectors.  I'd like to look at the original domain information in light
> of the clusters to which they are now assigned.
>
>
>
