Hi everyone, I am trying to use mllib.clustering.GaussianMixture, but am blocked by the fact that the API only accepts RDD[Vector].
Broadly speaking, I need to run the clustering on an RDD[(key, Iterable[Vector])], e.g. (fabricated):

    val WebsiteUserAgeRDD: RDD[(url, userAgeVector)]
    val ageClusterByUrl = WebsiteUserAgeRDD.groupByKey().mapValues(new GaussianMixture().setK(x).run)

This obviously does not work, because the function passed to mapValues is called on an Iterable[Vector], while run requires an RDD[Vector].

As I see it, parallelizing this Iterable is not possible, since that would result in an RDD of RDDs?

Does anyone have an idea how to cluster an RDD of (key, Iterable[Vector]), as in the grouped result above?

Many thanks,
Fabian
