I mean, a combination of the current approach and skewness?

On Fri, Aug 14, 2015 at 8:54 AM, Srinath Perera <[email protected]> wrote:
> Can we use a combination of both?
>
> On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga <[email protected]> wrote:
>
>> When a dataset is large, it is generally said to approximate a
>> Normal Distribution. :) True, it is hypothetical, but the point they
>> make is that when datasets are large, properties of a distribution
>> such as skewness, variance, etc. become closer to the properties of a
>> Normal Distribution in most cases.
>>
>> On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando <[email protected]>
>> wrote:
>>
>>> Hi Supun,
>>>
>>> Thanks for the reply.
>>>
>>> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <[email protected]> wrote:
>>>
>>>> Hi Nirmal,
>>>>
>>>> IMO, I don't think we would be able to use skewness in this case.
>>>> Skewness says how symmetric the distribution is. For example, if we
>>>> consider a numerical/continuous feature (not categorical) which is
>>>> Normally Distributed, then the skewness would be 0. But for a
>>>> categorical (encoded) feature having a symmetric distribution, the
>>>> skewness would again be 0.
>>>>
>>>
>>> What's the probability of seeing a normal distribution in a real
>>> dataset? IMO it's very low, and since what we're doing here is only a
>>> suggestion, do you see it as an issue?
>>>
>>>>
>>>> We did have this concern at the beginning as well, regarding how we
>>>> could determine whether a feature is categorical or continuous.
>>>> Usually this is strictly dependent on the domain of the dataset
>>>> (i.e. the user has to decide this with knowledge about the data).
>>>> That was the idea behind letting the user change the data type. But
>>>> since we needed a default option, we had to go for the threshold
>>>> approach, which was the only option we could come up with. I did a
>>>> bit of research on this too, but found no other solution :(
>>>>
>>>> Thanks,
>>>> Supun
>>>>
>>>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> We have a feature in ML where we suggest whether a given data
>>>>> column of a dataset is categorical or numerical. Currently, we
>>>>> determine this using a threshold value (the maximum number of
>>>>> categories that a non-string categorical feature can have; if it
>>>>> is exceeded, the feature is treated as a numerical feature). But
>>>>> this is not a successful measure for most datasets.
>>>>>
>>>>> Can we use the 'skewness' of a distribution as a measure to
>>>>> determine this? Can we say a column is numerical if the modulus of
>>>>> the skewness of the distribution is less than a certain threshold
>>>>> (say 0.01)?
>>>>>
>>>>> *References*:
>>>>>
>>>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>>>
>>>>> --
>>>>>
>>>>> Thanks & regards,
>>>>> Nirmal
>>>>>
>>>>> Team Lead - WSO2 Machine Learner
>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>> Mobile: +94715779733
>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>
>>>>
>>>> --
>>>> *Supun Sethunga*
>>>> Software Engineer
>>>> WSO2, Inc.
>>>> http://wso2.com/
>>>> lean | enterprise | middleware
>>>> Mobile : +94 716546324
>>>>

--
============================
Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
Site: http://people.apache.org/~hemapani/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902
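
For reference, here is a minimal sketch of the two rules discussed in this thread and one possible way to combine them. It is only an illustration under assumptions, not the WSO2 ML implementation: the function names, the default of 20 for the category threshold, and the "agree, otherwise unsure" combination are all hypothetical. Only the 0.01 skewness threshold comes from the thread.

```python
from statistics import fmean


def skewness(values):
    """Population skewness m3 / m2**1.5 (0.0 for a constant column)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        return 0.0  # no spread at all: treat as symmetric
    return m3 / m2 ** 1.5


def by_distinct_count(values, max_categories=20):
    """Current approach: more distinct values than the threshold -> numerical.

    max_categories=20 is an assumed default, not a value from the thread.
    """
    return "numerical" if len(set(values)) > max_categories else "categorical"


def by_skewness(values, skew_threshold=0.01):
    """Proposed rule: |skewness| below the threshold -> numerical."""
    return "numerical" if abs(skewness(values)) < skew_threshold else "categorical"


def suggest_type(values, max_categories=20, skew_threshold=0.01):
    """Hypothetical combination: answer only when both rules agree."""
    a = by_distinct_count(values, max_categories)
    b = by_skewness(values, skew_threshold)
    return a if a == b else "unsure"


# A symmetric, encoded categorical column: the case raised as an objection.
encoded_categories = [0, 1, 2] * 100
print(by_distinct_count(encoded_categories))  # distinct-count vote
print(by_skewness(encoded_categories))        # skewness vote
print(suggest_type(encoded_categories))       # combined suggestion
```

On the symmetric encoded column the two rules disagree (the skewness is 0 even though the feature has only three categories), which is the objection raised above. A combined rule would have to decide what to do in that case; this sketch simply reports it as undecided.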
_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev
