Hi Matt,

similarly to what Christoph does, I first derive the cluster id for each element of my original dataset, and then I train a classification algorithm on that data (the cluster ids being the classes here).
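A minimal sketch of this idea in plain Python, under stated assumptions: the feature names, rows, and cluster ids are made up for illustration, and a single-split decision stump (chosen by weighted Gini impurity) stands in for a full decision tree. In Spark ML the analogous pieces would presumably be the `prediction` column produced by the fitted KMeans model and a `DecisionTreeClassifier` trained with that column as the label.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, cluster_ids, feature_names):
    """Find the single (feature, threshold) split that best separates clusters.

    rows: list of numeric feature tuples; cluster_ids: one cluster id per row.
    Returns (feature_name, threshold, weighted_gini).
    """
    best = None
    n = len(rows)
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [c for row, c in zip(rows, cluster_ids) if row[j] <= t]
            right = [c for row, c in zip(rows, cluster_ids) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (name, t, score)
    return best


# Hypothetical data: two features, with cluster ids as a clustering step
# might have assigned them.
feature_names = ["age", "income"]
rows = [(25, 30.0), (27, 80.0), (30, 45.0), (52, 35.0), (55, 75.0), (60, 50.0)]
cluster_ids = [0, 0, 0, 1, 1, 1]

name, threshold, score = best_split(rows, cluster_ids, feature_names)
print(f"rule separating the clusters: {name} <= {threshold} (weighted Gini {score:.2f})")
```

The resulting rule names the feature and threshold that distinguish the clusters, which is the kind of "reason" a readable classifier can provide.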
For this method to be useful you need a "human-readable" model; tree-based models are generally a good choice (e.g., a decision tree). However, since those models tend to be verbose, you still need a way to summarize them to make them easier to read (there must be some literature on this topic, although I have no pointers to provide).

HTH,
Alessandro

On 1 March 2018 at 21:59, Christoph Brücke <carabo...@gmail.com> wrote:

> Hi Matt,
>
> I see. You could use the trained model to predict the cluster id for each
> training point. You should then be able to create a dataset with your
> original input data and the associated cluster id for each data point in
> the input data. Now you can group this dataset by cluster id and aggregate
> over the original 5 features, e.g., take the mean for numerical data or the
> most frequent value for categorical data.
>
> The exact aggregation is use-case dependent.
>
> I hope this helps,
> Christoph
>
> On 01.03.2018 at 21:40, "Matt Hicks" <m...@outr.com> wrote:
>
> Thanks for the response Christoph.
>
> I'm converting large amounts of data into clustering training data and I'm
> just having a hard time reasoning about reversing the clusters (in code)
> back to the original format to properly understand the dominant values in
> each cluster.
>
> For example, if I have five fields of data and I've trained ten clusters
> of data, I'd like to output the five fields of data as represented by each
> of the ten clusters.
>
> On Thu, Mar 1, 2018 2:36 PM, Christoph Brücke carabo...@gmail.com wrote:
>
>> Hi Matt,
>>
>> the clusters are defined by their centroids / cluster centers. All the
>> points belonging to a certain cluster are closer to its centroid than to
>> the centroids of any other cluster.
>>
>> What I typically do is convert the cluster centers back to the
>> original input format or, if that is not possible, use the point nearest
>> to the cluster center as a representation of the whole cluster.
>>
>> Can you be a little bit more specific about your use-case?
>>
>> Best,
>> Christoph
>>
>> On 01.03.2018 at 20:53, "Matt Hicks" <m...@outr.com> wrote:
>>
>> I'm using K Means clustering for a project right now, and it's working
>> very well. However, I'd like to determine from the clusters what
>> distinctions in the data define each cluster so I can explain the
>> "reasons" data fits into a specific cluster.
>>
>> Is there a proper way to do this in Spark ML?
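
Christoph's suggestion of grouping by cluster id and aggregating over the original fields could be sketched roughly as follows. This is plain Python rather than Spark, and the records, field names, and the `summarize_clusters` helper are hypothetical, chosen only to illustrate "mean for numerical data, most frequent value for categorical data".

```python
from collections import Counter, defaultdict
from statistics import mean


def summarize_clusters(records, cluster_ids):
    """Group records by cluster id; mean for numeric fields, mode for others."""
    groups = defaultdict(list)
    for record, cid in zip(records, cluster_ids):
        groups[cid].append(record)

    summary = {}
    for cid, members in groups.items():
        profile = {}
        for field in members[0]:
            values = [m[field] for m in members]
            numeric = all(
                isinstance(v, (int, float)) and not isinstance(v, bool)
                for v in values
            )
            if numeric:
                profile[field] = mean(values)
            else:
                # Most frequent value; ties resolve to the first one seen.
                profile[field] = Counter(values).most_common(1)[0][0]
        summary[cid] = profile
    return summary


# Hypothetical input: original records plus the cluster id predicted for each.
records = [
    {"age": 20, "plan": "free"},
    {"age": 30, "plan": "free"},
    {"age": 50, "plan": "pro"},
    {"age": 60, "plan": "pro"},
    {"age": 70, "plan": "free"},
]
cluster_ids = [0, 0, 1, 1, 1]

summary = summarize_clusters(records, cluster_ids)
for cid in sorted(summary):
    print(cid, summary[cid])
```

As noted in the thread, the exact aggregation is use-case dependent; a median or a value histogram may describe a cluster better than a mean when the data is skewed.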