I see.  So, basically, it's like using dummy variables in a regression.
Thanks, Sean.

On Jul 11, 2014, at 10:11 AM, Sean Owen <so...@cloudera.com> wrote:

> Since you can't define your own distance function, you will need to
> convert these to numeric dimensions. 1-of-n encoding can work OK,
> depending on your use case. So a dimension that takes on 3 categorical
> values becomes 3 dimensions, all of which are 0 except one that has
> value 1.
> 
> On Fri, Jul 11, 2014 at 3:07 PM, Wen Phan <wen.p...@mac.com> wrote:
>> Hi Folks,
>> 
>> Does anyone have experience or recommendations on incorporating categorical 
>> features (attributes) into k-means clustering in Spark?  In other words, I 
>> want to cluster on a set of attributes that include categorical variables.
>> 
>> I know I could probably implement some custom code to parse and calculate my 
>> own similarity function, but I wanted to reach out before I did so.  I’d 
>> also prefer to take advantage of the k-means|| (parallel) initialization feature 
>> of the model in MLlib, so an MLlib-based implementation would be preferred.
>> 
>> Thanks in advance.
>> 
>> Best,
>> 
>> -Wen
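
For reference, the 1-of-n encoding described in the reply above can be sketched in a few lines of plain Python. This is only an illustration under assumed inputs: the sample rows, the "color" attribute, and the helper names (build_index, one_hot, encode_row) are hypothetical and do not come from MLlib.

```python
def build_index(values):
    """Map each distinct category to a column position (sorted for determinism)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def one_hot(value, index):
    """Return a vector of 0.0s with a single 1.0 at the category's position."""
    vec = [0.0] * len(index)
    vec[index[value]] = 1.0
    return vec


def encode_row(numeric, categorical, indexes):
    """Concatenate numeric features with the one-hot expansion of each categorical feature."""
    out = list(numeric)
    for value, index in zip(categorical, indexes):
        out.extend(one_hot(value, index))
    return out


# Hypothetical records: (numeric features, categorical features)
rows = [
    ([1.5], ["red"]),
    ([0.2], ["green"]),
    ([3.1], ["blue"]),
]

# One index per categorical column; here there is only the assumed "color" column.
color_index = build_index(cat[0] for _, cat in rows)
encoded = [encode_row(num, cat, [color_index]) for num, cat in rows]
```

Each encoded row is an ordinary numeric vector, so it could then be passed to MLlib's k-means like any other numeric data. Since k-means uses Euclidean distance, it may be worth scaling the numeric columns so they do not swamp the 0/1 columns.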
