I see. So it's basically like dummy variables in regression. Thanks, Sean.
On Jul 11, 2014, at 10:11 AM, Sean Owen <so...@cloudera.com> wrote:

> Since you can't define your own distance function, you will need to
> convert these to numeric dimensions. 1-of-n encoding can work OK,
> depending on your use case. So a dimension that takes on 3 categorical
> values becomes 3 dimensions, of which all are 0 except one that has
> value 1.
>
> On Fri, Jul 11, 2014 at 3:07 PM, Wen Phan <wen.p...@mac.com> wrote:
>> Hi Folks,
>>
>> Does anyone have experience or recommendations on incorporating categorical
>> features (attributes) into k-means clustering in Spark? In other words, I
>> want to cluster on a set of attributes that include categorical variables.
>>
>> I know I could probably implement some custom code to parse and calculate my
>> own similarity function, but I wanted to reach out before I did so. I’d
>> also prefer to take advantage of the k-means|| initialization feature
>> of the model in MLlib, so an MLlib-based implementation would be preferred.
>>
>> Thanks in advance.
>>
>> Best,
>>
>> -Wen
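
For readers following the thread, here is a minimal sketch of the 1-of-n encoding Sean describes, written in plain Python with no Spark dependency. The helper name `one_of_n`, the sample rows, and the choice to order levels alphabetically are assumptions made for illustration; they are not part of MLlib.

```python
def one_of_n(rows, categorical_cols):
    """Encode the listed column indices as 1-of-n; pass other columns through as floats.

    Hypothetical helper for illustration only, not an MLlib API.
    """
    # Collect the distinct values (levels) of each categorical column, sorted
    # so that the position of each level in the output is deterministic.
    levels = {c: sorted({row[c] for row in rows}) for c in categorical_cols}

    encoded = []
    for row in rows:
        vec = []
        for c, value in enumerate(row):
            if c in levels:
                # One dimension per level: 1.0 for the matching level, 0.0 elsewhere.
                vec.extend(1.0 if value == lvl else 0.0 for lvl in levels[c])
            else:
                vec.append(float(value))
        encoded.append(vec)
    return encoded, levels


# Made-up sample data: one numeric column and one categorical column with 3 values.
rows = [
    (1.5, "red"),
    (0.2, "green"),
    (3.1, "blue"),
    (2.0, "red"),
]

vectors, levels = one_of_n(rows, categorical_cols=[1])
for v in vectors:
    print(v)
```

The categorical column with three values becomes three dimensions, exactly one of which is 1 in each row. Purely numeric rows like these are the kind of input that MLlib's k-means, including its k-means|| initialization, expects. Whether the 0/1 dimensions should be rescaled relative to the numeric ones depends on the use case.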