Re: [Dev] [ML] Categorical or Numerical column?

Supun Sethunga Thu, 13 Aug 2015 08:18:15 -0700

When a dataset is large, it is generally said to approximate a Normal
Distribution. :)  True, it is hypothetical, but the point they make is that
when datasets are large, properties of the distribution such as skewness and
variance become closer to those of a Normal Distribution in most cases.
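
To make the two heuristics in this thread concrete, here is a minimal Python
sketch of the unique-value threshold and the proposed skewness test (both
quoted below). The function names, the default of 20 categories and the sample
data are assumptions made up for illustration; this is not the WSO2 ML
implementation. Only the 0.01 cut-off comes from the thread.

```python
import random


def skewness(values):
    """Population skewness: third central moment divided by sigma cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant column: treat as perfectly symmetric
    return m3 / m2 ** 1.5


def suggest_by_threshold(values, max_categories=20):
    """Threshold heuristic: few distinct values -> categorical.

    max_categories=20 is an assumed default, not a value from the thread.
    """
    return "categorical" if len(set(values)) <= max_categories else "numerical"


def suggest_by_skewness(values, max_abs_skew=0.01):
    """Proposed heuristic: |skewness| below the cut-off -> numerical."""
    return "numerical" if abs(skewness(values)) < max_abs_skew else "categorical"


# Hypothetical encoded categorical feature, symmetric over the codes 0..4.
encoded_categorical = [0] * 10 + [1] * 20 + [2] * 40 + [3] * 20 + [4] * 10

# Hypothetical continuous feature drawn from a Normal distribution.
rng = random.Random(42)
normal_feature = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

print(skewness(encoded_categorical))              # 0.0
print(suggest_by_threshold(encoded_categorical))  # categorical
print(suggest_by_skewness(encoded_categorical))   # numerical
print(suggest_by_threshold(normal_feature))       # numerical
print(abs(skewness(normal_feature)))              # small, sample dependent
```

The symmetric encoded feature has a skewness of exactly 0, so the skewness
test labels it numerical even though it has only five distinct values. The
sample skewness of the Normal feature is small but varies from sample to
sample, so a cut-off as tight as 0.01 may or may not be met on real data.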


On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando <[email protected]> wrote:

> Hi Supun,
>
> Thanks for the reply.
>
> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <[email protected]> wrote:
>
>> Hi Nirmal,
>>
>> IMO I don't think we would be able to use skewness in this case. Skewness
>> measures how asymmetric the distribution is. For example, if we consider a
>> numerical/continuous feature (not categorical) which is Normally
>> Distributed, then the skewness would be 0. But for a categorical (encoded)
>> feature having a symmetric distribution, the skewness would again be 0, so
>> skewness alone cannot tell the two apart.
>>
>
> What's the probability of seeing a normal distribution in a real dataset?
> IMO it's very low, and since what we're doing here is only a suggestion, do
> you see it as an issue?
>
>
>>
>> We did have this concern at the beginning as well, regarding how we could
>> determine whether a feature is categorical or continuous. Usually this is
>> strictly dependent on the domain of the dataset (i.e. the user has to decide
>> this with knowledge about the data). That was the idea behind letting the
>> user change the data type. But since we needed a default option, we had to
>> go for the threshold approach, which was the only option we could come up
>> with. I did a bit of research on this too, but found no other
>> solution :(
>>
>> Thanks,
>> Supun
>>
>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> We have a feature in ML where we suggest whether a given data column of a
>>> dataset is categorical or numerical. Currently, we determine this by
>>> using a threshold value (the maximum number of categories that a
>>> non-string categorical feature can have; if a column exceeds it, the
>>> feature is treated as a numerical feature). But this does not work well
>>> for most datasets.
>>>
>>> Can we use the 'skewness' of a distribution as a measure to determine
>>> this? Can we say a column is numerical if the absolute value of the
>>> skewness of the distribution is less than a certain threshold (say 0.01)?
>>>
>>> *References*:
>>>
>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>
>>> --
>>>
>>> Thanks & regards,
>>> Nirmal
>>>
>>> Team Lead - WSO2 Machine Learner
>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>> Mobile: +94715779733
>>> Blog: http://nirmalfdo.blogspot.com/
>>>
>>>
>>>
>>
>>
>> --
>> *Supun Sethunga*
>> Software Engineer
>> WSO2, Inc.
>> http://wso2.com/
>> lean | enterprise | middleware
>> Mobile : +94 716546324
>>
>
>
>


-- 
*Supun Sethunga*
Software Engineer
WSO2, Inc.
http://wso2.com/
lean | enterprise | middleware
Mobile : +94 716546324

_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev