Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-17 Thread Xiangrui Meng
You can try feature hashing to control the feature dimension. MLlib's k-means implementation can handle sparse data efficiently if the number of features is not huge.
-Xiangrui

On Tue, Jun 16, 2015 at 2:44 PM, Rex X dnsr...@gmail.com wrote:
Hi Sujit, That's a good point. But 1-hot encoding will make
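The hashing suggestion in this reply can be sketched in plain Python. This is a minimal illustration under stated assumptions, not MLlib's implementation: the bucket count `NUM_BUCKETS`, the helper `hash_features`, and the sample values are all hypothetical, and CRC32 is used only because it is deterministic and in the standard library. Each `attribute=value` pair is hashed into one of a fixed number of buckets, so the feature dimension stays bounded no matter how many distinct categorical values exist, and only the non-zero entries are stored.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 1 << 10  # hypothetical fixed feature dimension (1024)


def hash_features(record, num_buckets=NUM_BUCKETS):
    """Map {attribute: categorical value} to a sparse {index: count} vector."""
    vec = defaultdict(float)
    for attr, value in record.items():
        # Prefix the value with its attribute name so that the same string
        # appearing under two different attributes hashes to different tokens.
        token = f"{attr}={value}".encode("utf-8")
        idx = zlib.crc32(token) % num_buckets
        vec[idx] += 1.0  # colliding tokens simply add up in the same bucket
    return dict(vec)


# Hypothetical row using two of the attribute names from the original question.
row = {"city": "Seattle", "zip": "98101"}
sparse = hash_features(row)
```

A row with tens of categorical attributes yields at most that many non-zero entries, so storage grows with the number of attributes rather than with the number of distinct values. The trade-off is that unrelated values can collide in one bucket; a larger `num_buckets` makes collisions rarer at the cost of a wider vector.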

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-16 Thread Rex X
Hi Sujit, That's a good point. But 1-hot encoding would grow our data from terabytes to petabytes, because we have tens of categorical attributes, and some of them take thousands of distinct values. Is there any way to strike a good balance between data size and the right representation of

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-16 Thread Sujit Pal
Hi Rex, In general (i.e. not Spark-specific), it's best to convert categorical data to 1-hot encoding rather than integers. That way the algorithm doesn't use the ordering implicit in the integer representation.
-sujit

On Tue, Jun 16, 2015 at 1:17 PM, Rex X dnsr...@gmail.com wrote:
Is it
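The point about implicit ordering can be seen with a small worked example. This is a sketch with hypothetical city names and helper functions (`one_hot`, `euclidean`), not code from the thread. With integer codes, the distance between two categories depends on the arbitrary order in which the codes were assigned; with 1-hot vectors, every pair of distinct categories is equally far apart.

```python
import math


def one_hot(value, vocabulary):
    """Return a 1-hot list with a single 1.0 at the position of `value`."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


cities = ["Austin", "Boston", "Chicago"]  # hypothetical category values

# Integer encoding: Austin=0, Boston=1, Chicago=2 (arbitrary assignment order).
int_code = {c: float(i) for i, c in enumerate(cities)}
d_int_ab = abs(int_code["Austin"] - int_code["Boston"])
d_int_ac = abs(int_code["Austin"] - int_code["Chicago"])

# 1-hot encoding: each city gets its own dimension.
d_hot_ab = euclidean(one_hot("Austin", cities), one_hot("Boston", cities))
d_hot_ac = euclidean(one_hot("Austin", cities), one_hot("Chicago", cities))
```

Under the integer encoding, Chicago ends up twice as far from Austin as Boston is, purely because of the code assignment. Under 1-hot encoding both distances are the same (the square root of 2), which is what a distance-based algorithm such as k-means should see for unordered categories.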

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-16 Thread Rex X
Is it necessary to convert categorical data into integers? Any tips would be greatly appreciated!
-Rex

On Sun, Jun 14, 2015 at 10:05 AM, Rex X dnsr...@gmail.com wrote:
For clustering analysis, we need a way to measure distances. When the data contains different levels of measurement -

What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-14 Thread Rex X
For clustering analysis, we need a way to measure distances. When the data contains different levels of measurement - *binary / categorical (nominal), counts (ordinal), and ratio (scale)* To be concrete, consider working with the attributes *city, zip, satisfaction_level, price* In the