Hi Sujit,

That's a good point. But 1-hot encoding would blow our data up from
terabytes to petabytes, because we have tens of categorical attributes, and
some of them contain thousands of distinct values.

Is there any way to strike a good balance between data size and a faithful
representation of the categories?
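
To make the size concern concrete, here is a minimal, non-Spark sketch in
plain Python. The rows, values, and helper names (build_index, encode_sparse,
to_dense) are made up for illustration; only the column names come from this
thread. It assumes the 1-hot vector can be stored sparsely, i.e. only the
positions that hold a 1 are kept:

```python
# Hypothetical toy rows; column names follow the thread, values are invented.
rows = [
    {"city": "Austin", "zip": "78701"},
    {"city": "Boston", "zip": "02108"},
    {"city": "Austin", "zip": "78702"},
]

def build_index(rows, columns):
    """Assign each distinct (column, value) pair one slot in the 1-hot vector."""
    index = {}
    for row in rows:
        for col in columns:
            key = (col, row[col])
            if key not in index:
                index[key] = len(index)
    return index

def encode_sparse(row, columns, index):
    """Keep only the positions that hold a 1: one per categorical column."""
    return sorted(index[(col, row[col])] for col in columns)

def to_dense(active, width):
    """Expand a sparse row to the full 0/1 vector (for comparison only)."""
    dense = [0] * width
    for i in active:
        dense[i] = 1
    return dense

columns = ["city", "zip"]
index = build_index(rows, columns)
width = len(index)  # dense width = total number of distinct values
sparse_rows = [encode_sparse(r, columns, index) for r in rows]

print("dense width per row: ", width)
print("sparse width per row:", len(columns))
```

The dense width grows with the total number of distinct values, while a
sparse row always stores one integer per categorical attribute. As far as I
know, MLlib's sparse vectors (e.g. Vectors.sparse) follow the same idea, so
this may be one way to keep 1-hot semantics without the dense blow-up.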


-Rex


On Tue, Jun 16, 2015 at 1:27 PM, Sujit Pal <sujitatgt...@gmail.com> wrote:

> Hi Rex,
>
> In general (i.e. not Spark-specific), it's best to convert categorical data
> to 1-hot encoding rather than integers - that way the algorithm doesn't use
> the ordering implicit in the integer representation.
>
> -sujit
>
>
> On Tue, Jun 16, 2015 at 1:17 PM, Rex X <dnsr...@gmail.com> wrote:
>
>> Is it necessary to convert categorical data into integers?
>>
>> Any tips would be greatly appreciated!
>>
>> -Rex
>>
>> On Sun, Jun 14, 2015 at 10:05 AM, Rex X <dnsr...@gmail.com> wrote:
>>
>>> For clustering analysis, we need a way to measure distances. The
>>> difficulty arises when the data mixes different levels of measurement:
>>> *binary / categorical (nominal), counts (ordinal), and ratio (scale)*
>>>
>>> To be concrete, consider working with attributes such as
>>> *city, zip, satisfaction_level, price*
>>>
>>> Meanwhile, real data usually also contains string attributes,
>>> for example, book titles. The distance between two strings can be measured
>>> by minimum edit distance.
>>>
>>>
>>> SPSS provides Two-Step Cluster, which can handle both ratio-scale
>>> and ordinal numbers.
>>>
>>>
>>> What is the right algorithm for hierarchical clustering analysis with
>>> all four kinds of attributes above in *MLlib*?
>>>
>>>
>>> If we cannot find a suitable metric to measure the distance, an
>>> alternative solution is to do a topological data analysis (e.g. linkage,
>>> etc.). Can we do this kind of analysis with *GraphX*?
>>>
>>>
>>> -Rex
>>>
>>>
>>
>
