Re: [Dev] [ML] Categorical or Numerical column?

Supun Sethunga Fri, 14 Aug 2015 06:54:15 -0700

Hi all,

+1 for a hybrid solution. But still a -1 for using skewness even in the
hybrid solution :D


One good example of why we shouldn't use skewness is the income distribution
graph in [1]. There, regardless of whether I use the raw data (a continuous
feature) or break it into intervals and categorize the income into several
levels, I would get the same shape for the distribution, i.e. the skewness
would be significant in both cases.

So the point I'm trying to make is that categorical features as well as
continuous features can be skewed or symmetric, so skewness can't really
distinguish between them.
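
A minimal sketch of this point, assuming synthetic log-normal values as a
stand-in for income (this is not the data behind [1]). The bracket width,
the top-level cap and the skewness() helper are hypothetical choices made
only for illustration:

```python
import random
import statistics

def skewness(values):
    """Population skewness: third central moment divided by sigma cubed."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    n = len(values)
    return sum((v - mean) ** 3 for v in values) / (n * sd ** 3)

rng = random.Random(42)
# Assumption: right-skewed synthetic "income" values, used as a continuous feature.
incomes = [rng.lognormvariate(10.5, 0.8) for _ in range(10_000)]

# The same values turned into a categorical feature: income level 0, 1, 2, ...
# using fixed-width brackets, with everything above the cap in one top level.
BRACKET = 10_000
TOP_LEVEL = 20
levels = [min(int(x // BRACKET), TOP_LEVEL) for x in incomes]

raw_skew = skewness(incomes)
binned_skew = skewness(levels)
print(f"raw skewness:    {raw_skew:.2f}")
print(f"binned skewness: {binned_skew:.2f}")
print(f"distinct levels: {len(set(levels))}")
```

Both versions of the feature come out clearly right-skewed, so a skewness
threshold alone would put them in the same class.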

[1]
https://cdn2.vox-cdn.com/uploads/chorus_asset/file/2930990/Distribution_of_Annual_Household_Income_in_the_United_States_2012.0.png
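
For the hybrid solution without skewness, a rough sketch along the lines of
the steps quoted below (sample, filter by data type, distinct-value
threshold, repetition frequency). The function name, parameter names and
default thresholds are assumptions for illustration, not what ML currently
implements:

```python
import random

def suggest_categorical(column, max_distinct=20, min_avg_repeats=5.0,
                        sample_size=1000, seed=0):
    """Suggest (not decide) whether a column looks categorical."""
    values = [v for v in column if v is not None]
    if not values:
        return False

    # 1. Work on a sample so the check stays cheap on large datasets.
    if len(values) > sample_size:
        values = random.Random(seed).sample(values, sample_size)

    # 2. Type filter: only strings and whole numbers are candidates.
    def is_candidate(v):
        if isinstance(v, str):
            return True
        return isinstance(v, (int, float)) and float(v).is_integer()

    if not all(is_candidate(v) for v in values):
        return False

    # 3. Distinct-value threshold (hypothetical default of 20).
    distinct = len(set(values))
    if distinct > max_distinct:
        return False

    # 4. Frequency: on average each distinct value must repeat often enough.
    return len(values) / distinct >= min_avg_repeats

rng = random.Random(1)
gender = [rng.choice(["M", "F"]) for _ in range(500)]
ratings = [rng.randint(1, 5) for _ in range(500)]
heights = [rng.gauss(170.0, 10.0) for _ in range(500)]
user_ids = list(range(500))

for name, col in [("gender", gender), ("ratings", ratings),
                  ("heights", heights), ("user_ids", user_ids)]:
    print(name, suggest_categorical(col))
```

The result is still only a default suggestion; the user should remain able
to override the data type.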


On Fri, Aug 14, 2015 at 1:03 AM, Nirmal Fernando <nir...@wso2.com> wrote:

> Thanks Thushan. Good suggestion on the frequency.
>
> *solutions*
>
> 1. Categorical threshold: if the number of distinct values is less than X,
> it is a categorical feature.
> 2. Make all features with only integers (no decimals) categorical.
> 3. Skewness: if the absolute skewness of a feature's distribution is less
> than X, it is a numerical feature.
> 4. Gaps between consecutive distinct values: if the gaps between the sorted
> distinct values are equal, it is a categorical feature.
> 5. Frequency of distinct values - if they repeat enough, then it is a
> categorical feature.
> 6. Combined solution
>
> So, I guess as suggested by many of you, we need to build a combined
> solution.
>
> On Fri, Aug 14, 2015 at 10:29 AM, Thushan Ganegedara <thu...@gmail.com>
> wrote:
>
>> Moreover, I think a hybrid approach as follows might work well.
>>
>> 1. Select a sample
>>
>> 2. Filter columns by the data type and find potential categorical
>> variables (integer / string)
>>
>> 3. Filter further by checking whether the same values are repeated
>> multiple times in the dataset.
>>
>> On Fri, Aug 14, 2015 at 2:53 PM, Nirmal Fernando <nir...@wso2.com> wrote:
>>
>>> Thanks for all the input.
>>>
>>> So let me summarise;
>>>
>>> *the problem*
>>>
>>> * We need to determine whether a feature is a categorical one or not, to
>>> draw certain graphs to explore a dataset, before a user starts to build
>>> analyses (before user input).
>>> * We can't get 100% accuracy, so what we provide is only a suggestion.
>>> * The question is which method would be the most accurate.
>>>
>>> *solutions*
>>>
>>> 1. Categorical threshold: if the number of distinct values is less than X,
>>> it is a categorical feature.
>>> 2. Make all features with only integers (no decimals) categorical.
>>> 3. Skewness: if the absolute skewness of a feature's distribution is less
>>> than X, it is a numerical feature.
>>> 4. Gaps between consecutive distinct values: if the gaps between the sorted
>>> distinct values are equal, it is a categorical feature.
>>> 5. Combined solution
>>>
>>> On Fri, Aug 14, 2015 at 9:33 AM, Maheshakya Wijewardena <
>>> mahesha...@wso2.com> wrote:
>>>
>>>> Another approach to distinguish between categorical and numerical
>>>> features can be elaborated as follows:
>>>>
>>>> First, we take the unique values from the column and sort them. If
>>>> it's a categorical feature, then the gaps between the elements of this
>>>> sorted list should be equal. In a numerical feature, this is extremely
>>>> unlikely to happen. This behavior is valid in most scenarios, but there
>>>> are a few exceptions as well, e.g. when a numerical ID is used as the
>>>> categorical label: 19933, 19913, 18832, ...
>>>>
>>>> This is a very simple hack that can be easily implemented, but not a
>>>> standard technique.
>>>>
>>>> WDYT?
>>>>
>>>> On Fri, Aug 14, 2015 at 8:55 AM, Srinath Perera <srin...@wso2.com>
>>>> wrote:
>>>>
>>>>> I mean current approach and skewness?
>>>>>
>>>>> On Fri, Aug 14, 2015 at 8:54 AM, Srinath Perera <srin...@wso2.com>
>>>>> wrote:
>>>>>
>>>>>> Can we use a combination of both?
>>>>>>
>>>>>> On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga <sup...@wso2.com>
>>>>>> wrote:
>>>>>>
>>>>>>> When a dataset is large, it is generally said to approximate a Normal
>>>>>>> Distribution. :)  True, that is hypothetical, but the point being made
>>>>>>> is that when datasets are large, properties of a distribution such as
>>>>>>> skewness and variance move closer to those of a Normal Distribution in
>>>>>>> most cases.
>>>>>>>
>>>>>>> On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando <nir...@wso2.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Supun,
>>>>>>>>
>>>>>>>> Thanks for the reply.
>>>>>>>>
>>>>>>>> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <sup...@wso2.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Nirmal,
>>>>>>>>>
>>>>>>>>> IMO I don't think we would be able to use skewness in this case.
>>>>>>>>> Skewness says how symmetric the distribution is. For example, if we
>>>>>>>>> consider a numerical/continuous feature (not categorical) which is
>>>>>>>>> Normally Distributed, then the skewness would be 0. Likewise, for a
>>>>>>>>> categorical (encoded) feature with a symmetric distribution, the
>>>>>>>>> skewness would again be 0.
>>>>>>>>>
>>>>>>>>
>>>>>>>> What's the probability of seeing a normal distribution in a real
>>>>>>>> dataset? IMO it's very low, and since what we're doing here is only a
>>>>>>>> suggestion, do you see it as an issue?
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> We did have this concern at the beginning as well, regarding how
>>>>>>>>> we could determine whether a feature is categorical or continuous.
>>>>>>>>> Usually this is strictly dependent on the domain of the dataset (i.e.
>>>>>>>>> the user has to decide this with knowledge of the data). That was
>>>>>>>>> the idea behind letting the user change the data type. But since we
>>>>>>>>> needed a default option, we had to go for the threshold approach,
>>>>>>>>> which was the only option we could come up with. I did a bit of
>>>>>>>>> research on this too, but found no other solution :(
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Supun
>>>>>>>>>
>>>>>>>>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando <nir...@wso2.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> We have a feature in ML where we suggest whether a given data column
>>>>>>>>>> of a dataset is categorical or numerical. Currently, we determine this
>>>>>>>>>> using a threshold value (the maximum number of categories that a
>>>>>>>>>> non-string categorical feature can have; if it is exceeded, the
>>>>>>>>>> feature is treated as a numerical feature). But this is not a
>>>>>>>>>> successful measure for most datasets.
>>>>>>>>>>
>>>>>>>>>> Can we use the 'skewness' of a distribution as a measure to
>>>>>>>>>> determine this? Can we say a column is numerical if the modulus of
>>>>>>>>>> the skewness of the distribution is less than a certain threshold
>>>>>>>>>> (say 0.01)?
>>>>>>>>>>
>>>>>>>>>> *References*:
>>>>>>>>>>
>>>>>>>>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> Thanks & regards,
>>>>>>>>>> Nirmal
>>>>>>>>>>
>>>>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>>>>> Mobile: +94715779733
>>>>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> ============================
>>>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>>>> Site: http://people.apache.org/~hemapani/
>>>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>>>> Phone: 0772360902
>>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Pruthuvi Maheshakya Wijewardena
>>>> Software Engineer
>>>> WSO2 : http://wso2.com/
>>>> Email: mahesha...@wso2.com
>>>> Mobile: +94711228855
>>>>
>>>>
>>>>
>>
>>
>> --
>> Regards,
>>
>> Thushan Ganegedara
>> School of IT
>> University of Sydney, Australia
>>


-- 
*Supun Sethunga*
Software Engineer
WSO2, Inc.
http://wso2.com/
lean | enterprise | middleware
Mobile : +94 716546324

_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev
