Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

Sujit Pal Tue, 16 Jun 2015 13:29:21 -0700

Hi Rexx,

In general (i.e., not Spark-specific), it's best to convert categorical data to
one-hot encoding rather than integers. That way the algorithm doesn't use
the ordering implicit in the integer representation.
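
To illustrate the point, here is a minimal sketch in plain Python (not the
MLlib API). The city names and the helper names one_hot and euclidean are
made up for the example. It compares distances under integer codes with
distances under one-hot vectors:

```python
from math import sqrt

def one_hot(values):
    """Map each value to a 0/1 indicator vector, one slot per distinct category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0.0] * len(categories)
        vec[index[v]] = 1.0
        vectors.append(vec)
    return categories, vectors

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical categorical attribute with no natural ordering.
cities = ["Austin", "Boston", "Chicago"]

# Integer encoding: Austin=0, Boston=1, Chicago=2.
int_codes = {c: float(i) for i, c in enumerate(sorted(set(cities)))}
d_int_ab = abs(int_codes["Austin"] - int_codes["Boston"])
d_int_ac = abs(int_codes["Austin"] - int_codes["Chicago"])

# One-hot encoding: every pair of distinct cities is equally far apart.
categories, encoded = one_hot(cities)
d_hot_ab = euclidean(encoded[0], encoded[1])
d_hot_ac = euclidean(encoded[0], encoded[2])

print("integer codes:", d_int_ab, d_int_ac)
print("one-hot:      ", d_hot_ab, d_hot_ac)
```

With integer codes, Austin ends up twice as far from Chicago as from Boston,
purely because of the order in which the codes were assigned. With one-hot
vectors, all distinct cities are the same distance apart, so a distance-based
algorithm sees no artificial ordering.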


-sujit


On Tue, Jun 16, 2015 at 1:17 PM, Rex X <dnsr...@gmail.com> wrote:

> Is it necessary to convert categorical data into integers?
>
> Any tips would be greatly appreciated!
>
> -Rex
>
> On Sun, Jun 14, 2015 at 10:05 AM, Rex X <dnsr...@gmail.com> wrote:
>
>> For clustering analysis, we need a way to measure distances.
>>
>> This is harder when the data contains different levels of measurement:
>> *binary / categorical (nominal), counts (ordinal), and ratio (scale)*
>>
>> To be concrete, consider working with attributes such as
>> *city, zip, satisfaction_level, price*
>>
>> Meanwhile, real data usually also contains string attributes, for
>> example, book titles. The distance between two strings can be measured
>> by minimum edit distance.
>>
>>
>> SPSS provides Two-Step Cluster, which can handle both ratio-scale
>> and ordinal numbers.
>>
>>
>> What is the right algorithm to do hierarchical clustering analysis with
>> all four kinds of attributes above in *MLlib*?
>>
>>
>> If we cannot find the right metric to measure the distance, an alternative
>> solution is to do a topological data analysis (e.g. linkage, etc.).
>> Can we do this kind of analysis with *GraphX*?
>>
>>
>> -Rex
>>
>>
>
