Hi everyone,

I am trying to use mllib.clustering.GaussianMixture, but am blocked by the
fact that the API only accepts RDD[Vector].

Broadly speaking, I need to run the clustering on an RDD[(key,
Iterable[Vector])], e.g. (fabricated):

// pairs of (url, userAgeVector)
val WebsiteUserAgeRDD: RDD[(String, Vector)]

val ageClusterByUrl =
  WebsiteUserAgeRDD.groupByKey().mapValues(new GaussianMixture().setK(x).run)

This obviously does not work, as the function passed to mapValues is called
on an Iterable[Vector], whereas run requires an RDD[Vector].
As far as I can see, parallelizing the Iterable inside mapValues is not
possible either, as that would result in an RDD of RDDs.
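
The only fallback I can think of is to loop over the keys on the driver and
filter the RDD once per key, so that every run gets a real RDD[Vector]. A
rough sketch, assuming the number of distinct URLs is small enough to collect
(clusterPerKey and k are names I made up here, not part of MLlib):

```scala
import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper: fits one mixture model per key, driving the loop from the driver.
def clusterPerKey(data: RDD[(String, Vector)], k: Int): Map[String, GaussianMixtureModel] = {
  data.cache() // the RDD is scanned once per key, so keep it in memory

  // Assumption: few enough distinct keys to hold them on the driver.
  val keys = data.keys.distinct().collect()

  keys.map { key =>
    // Each filter yields an ordinary RDD[Vector], which run accepts.
    val vectors = data.filter(_._1 == key).values
    key -> new GaussianMixture().setK(k).run(vectors)
  }.toMap
}
```

This fits the models one after another and rescans the data for every URL, so
I doubt it scales to many keys.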

Does anyone have an idea how to cluster an RDD of (key, Iterable[Vector]), as
in the grouped result above?

Many thanks,
Fabian
