When a dataset is large, it is generally said to approximate a Normal Distribution. :) True, that is hypothetical, but the point is that when datasets are large, properties of the distribution such as skewness and variance get closer to those of a Normal Distribution in most cases.
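To make the concern concrete, here is a rough sketch (not the WSO2 ML implementation) comparing the two heuristics discussed in this thread on synthetic data. The function names, the sample sizes, and the `max_categories=20` default are assumptions made up for illustration. It suggests that a normally distributed continuous feature and a symmetric encoded categorical feature can both show skewness near 0, so skewness alone may not separate them, while the distinct-value threshold does in this particular case.

```python
import random


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5.

    Assumes the column is not constant (variance > 0).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def suggest_type(values, max_categories=20):
    """Distinct-value threshold heuristic (max_categories is a hypothetical default)."""
    return "categorical" if len(set(values)) <= max_categories else "numerical"


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Continuous feature drawn from a normal distribution.
continuous = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

# Encoded categorical feature whose five codes are symmetrically distributed.
categorical = rng.choices([0, 1, 2, 3, 4], weights=[1, 2, 4, 2, 1], k=100_000)

skew_continuous = skewness(continuous)
skew_categorical = skewness(categorical)

print("continuous :", round(skew_continuous, 4), suggest_type(continuous))
print("categorical:", round(skew_categorical, 4), suggest_type(categorical))
```

Both skewness values should come out close to 0 here, which is why a rule like "|skewness| below a threshold means numerical" could label the encoded categorical column as numerical too.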
On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando <nir...@wso2.com> wrote:

> Hi Supun,
>
> Thanks for the reply.
>
> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <sup...@wso2.com> wrote:
>
>> Hi Nirmal,
>>
>> IMO we would not be able to use skewness in this case. Skewness
>> says how symmetric the distribution is. For example, if we consider a
>> numerical/continuous feature (not categorical) which is Normally
>> Distributed, then the skewness would be 0. Also, for a categorical
>> (encoded) feature having a symmetric distribution, the skewness would
>> again be 0.
>>
>
> What's the probability of seeing a normal distribution in a real dataset?
> IMO it's very low, and since what we're doing here is a suggestion, do
> you see it as an issue?
>
>>
>> We did have this concern at the beginning as well, regarding how we could
>> determine whether a feature is categorical or continuous. Usually this is
>> strictly dependent on the domain of the dataset (i.e. the user has to
>> decide this with knowledge about the data). That was the idea behind
>> letting the user change the data type. But since we needed a default
>> option, we had to go for the threshold, which was the only option we
>> could come up with. I did a bit of research on this too, but found no
>> other solution :(
>>
>> Thanks,
>> Supun
>>
>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando <nir...@wso2.com> wrote:
>>
>>> Hi All,
>>>
>>> We have a feature in ML where we suggest whether a given data column of
>>> a dataset is categorical or numerical. Currently, we determine this by
>>> using a threshold value (the maximum number of categories that a
>>> non-string categorical feature can have; if exceeded, the feature is
>>> treated as a numerical feature). But this is not a successful
>>> measurement for most datasets.
>>>
>>> Can we use the 'skewness' of a distribution as a measurement to
>>> determine this? Can we say a column is numerical if the modulus of the
>>> skewness of the distribution is less than a certain threshold
>>> (say 0.01)?
>>>
>>> *References*:
>>>
>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>
>>> --
>>>
>>> Thanks & regards,
>>> Nirmal
>>>
>>> Team Lead - WSO2 Machine Learner
>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>> Mobile: +94715779733
>>> Blog: http://nirmalfdo.blogspot.com/
>>>
>
> --
>
> Thanks & regards,
> Nirmal

--
*Supun Sethunga*
Software Engineer
WSO2, Inc.
http://wso2.com/
lean | enterprise | middleware
Mobile : +94 716546324
_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev