Re: [Dev] [ML] Categorical or Numerical column?

2015-08-14 Thread Supun Sethunga
> > Combined solution; > * if a feature contains strings -> categorical > * Frequency of distinct values - if they repeat enough (80% default) and > if it doesn't have decimal values, then it is a categorical feature. +1 On Fri, Aug 14, 2015 at 12:13 PM, Nirmal Fernando wrote: > Combined solut

Re: [Dev] [ML] Categorical or Numerical column?

2015-08-14 Thread Nirmal Fernando
Combined solution; * if a feature contains strings -> categorical * Frequency of distinct values - if they repeat enough (80% default) and if it doesn't have decimal values, then it is a categorical feature. On Fri, Aug 14, 2015 at 7:22 PM, Supun Sethunga wrote: > Hi all, > > +1 for a hybrid so
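
A rough sketch of the combined rule above, in Python for illustration only. The function name `looks_categorical` and the parameter `repeat_threshold` are made up here, and reading "repeat enough (80%)" as "at least 80% of rows hold a value that occurs more than once" is an assumption, not something the thread confirms.

```python
from collections import Counter

def looks_categorical(values, repeat_threshold=0.8):
    """Heuristic guess only: True if the column looks categorical."""
    values = [v for v in values if v is not None]
    if not values:
        return False
    # Rule 1: a feature that contains strings is treated as categorical.
    if any(isinstance(v, str) for v in values):
        return True
    # Rule 2a: any value with a fractional part rules out categorical.
    if any(not float(v).is_integer() for v in values):
        return False
    # Rule 2b (assumed reading of "repeat enough"): the share of rows
    # whose value occurs more than once must reach the threshold.
    counts = Counter(values)
    repeated_rows = sum(c for c in counts.values() if c > 1)
    return repeated_rows / len(values) >= repeat_threshold
```

With this reading, a column of unique integer IDs is rejected by rule 2b, while a small set of repeated integer codes is accepted.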

Re: [Dev] [ML] Categorical or Numerical column?

2015-08-14 Thread Supun Sethunga
Hi all, +1 for a hybrid solution. But still a -1 for using skewness even in the hybrid solution :D One good example of why we shouldn't use skewness is the income distribution graph in [1]. There, regardless of whether I'm using the raw data (then it's a continuous feature) or whether I'm breaking the

Re: [Dev] [ML] Categorical or Numerical column?

2015-08-13 Thread Nirmal Fernando
Thanks Thushan. Good suggestion on the frequency. *solutions* 1. Categorical threshold: if the number of distinct values is less than X, it is a categorical feature. 2. Make all features with only integers (no decimals) categorical. 3. Skewness: if skewness of a distribution of a feature is less than X,

Re: [Dev] [ML] Categorical or Numerical column?

2015-08-13 Thread Thushan Ganegedara
Moreover, I think a hybrid approach as follows might work well. 1. Select a sample 2. Filter columns by the data type and find potential categorical variables (integer / string) 3. Filter further by checking if same values are repeated multiple times in the dataset. On Fri, Aug 14, 2015 at 2:53
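
The three steps above could look roughly like this in Python. Everything named here (`candidate_categoricals`, `sample_size`, `min_repeats`, the dict-of-lists table layout) is hypothetical, and "repeated multiple times" is interpreted, as an assumption, as "each distinct value occurs at least `min_repeats` times on average in the sample".

```python
import random

def candidate_categoricals(columns, sample_size=1000, min_repeats=2, seed=0):
    """columns: dict mapping column name -> list of values (equal lengths).

    Returns the names of columns that might be categorical.
    """
    n_rows = len(next(iter(columns.values()), []))
    rng = random.Random(seed)
    # Step 1: select a sample of row indices.
    idx = rng.sample(range(n_rows), min(sample_size, n_rows))
    result = []
    for name, col in columns.items():
        sample = [col[i] for i in idx]
        if not sample:
            continue
        # Step 2: keep only columns whose sampled values are integers or strings.
        if not all(isinstance(v, (int, str)) for v in sample):
            continue
        # Step 3 (assumed criterion): average occurrences per distinct value.
        if len(sample) / len(set(sample)) >= min_repeats:
            result.append(name)
    return result
```

Sampling keeps the check cheap on large datasets, at the cost of possibly missing rare values.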

Re: [Dev] [ML] Categorical or Numerical column?

2015-08-13 Thread Nirmal Fernando
Thanks for all the input. So let me summarise; *the problem* * We need to determine whether a feature is a categorical one or not, to draw certain graphs to explore a dataset, before a user starts to build analyses (before user input). * We can't get 100% accuracy, hence it is of course a sugg

Re: [Dev] [ML] Categorical or Numerical column?

2015-08-13 Thread Maheshakya Wijewardena
Another approach to distinguish between categorical and numerical features can be elaborated as follows: First, we take out the unique values from the column and sort them. If it's a categorical feature, then the gaps between the elements of this sorted list should be equal. In a numerical feature
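
A small Python sketch of the equal-gap check described above, covering only the part of the idea visible in this excerpt. The name `has_equal_gaps` and the tolerance `tol` are assumptions added for illustration, as is returning True when there are fewer than two gaps to compare.

```python
def has_equal_gaps(values, tol=1e-9):
    """True if the sorted unique values are evenly spaced (within tol)."""
    # Take the unique values from the column and sort them.
    uniq = sorted(set(values))
    if len(uniq) < 3:
        # Fewer than two gaps, so there is nothing to compare (assumed True).
        return True
    # Gaps between neighbouring elements of the sorted list.
    gaps = [b - a for a, b in zip(uniq, uniq[1:])]
    return max(gaps) - min(gaps) <= tol
```

Note that this only works for numerically encoded columns; string categories have no gaps to measure.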

Re: [Dev] [ML] Categorical or Numerical column?

2015-08-13 Thread Srinath Perera
I mean current approach and skewness? On Fri, Aug 14, 2015 at 8:54 AM, Srinath Perera wrote: > Can we use a combination of both? > > On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga wrote: > >> When a dataset is large, it is generally said to approximate a >> Normal Distribution. :) True

Re: [Dev] [ML] Categorical or Numerical column?

2015-08-13 Thread Srinath Perera
Can we use a combination of both? On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga wrote: > When a dataset is large, it is generally said to approximate a > Normal Distribution. :) True, it is hypothetical, but the point they make is that > when datasets are large, the properties of a distrib

Re: [Dev] [ML] Categorical or Numerical column?

2015-08-13 Thread Thushan Ganegedara
Hi all, To add to what Supun said, yes, the normal (or Gaussian) distribution is considered to be a common naturally occurring phenomenon. There are many ML techniques that assume a Gaussian distribution and apply really well to real-world problems. For example, Gaussian processes assume Gaussian

Re: [Dev] [ML] Categorical or Numerical column?

2015-08-13 Thread Seshika Fernando
In addition, there are lots of datasets in economics, stocks, physics that are normally or approximately normally distributed, which will be used for predictive modelling On 13 Aug 2015 20:46, "Supun Sethunga" wrote: > When a dataset is large, it is generally said to approximate a > Normal Di

Re: [Dev] [ML] Categorical or Numerical column?

2015-08-13 Thread Supun Sethunga
When a dataset is large, it is generally said to approximate a Normal Distribution. :) True, it is hypothetical, but the point they make is that when datasets are large, the properties of a distribution, such as skewness and variance, become closer to the properties of a Normal Distribution in most

Re: [Dev] [ML] Categorical or Numerical column?

2015-08-13 Thread Nirmal Fernando
Hi Supun, Thanks for the reply. On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga wrote: > Hi Nirmal, > > IMO, I don't think we would be able to use skewness in this case. Skewness > says how symmetric the distribution is. For example, if we consider a > numerical/continuous feature (not categorical

Re: [Dev] [ML] Categorical or Numerical column?

2015-08-13 Thread Supun Sethunga
Hi Nirmal, IMO, I don't think we would be able to use skewness in this case. Skewness says how symmetric the distribution is. For example, if we consider a numerical/continuous feature (not categorical) which is Normally Distributed, then the skewness would be 0. Also for a categorical (encoded) feat
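
To illustrate the point that skewness only measures symmetry, here is a minimal Python sketch of the population (moment-based) skewness, m3 / m2^1.5. The function name `skewness` is a placeholder, and the sample columns in the usage note are invented for illustration, not taken from the thread.

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant column: treat as perfectly symmetric
    return m3 / m2 ** 1.5
```

A symmetric set of continuous values such as `[-1.5, -0.5, 0.5, 1.5]` and a balanced set of encoded categories such as `[0, 1, 2, 0, 1, 2]` both come out at zero, so the statistic alone cannot tell the two kinds of column apart.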

[Dev] [ML] Categorical or Numerical column?

2015-08-12 Thread Nirmal Fernando
Hi All, We have a feature in ML where we suggest whether a given data column of a dataset is categorical or numerical. Currently, we determine this using a threshold value (the maximum number of categories that a non-string categorical feature can have. If exceeded, the feature will be treated
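
For readers who want a concrete picture of the threshold approach, a hedged Python sketch follows. It is not the actual ML code: the name `suggest_type`, the default `max_categories=20`, the returned labels, and the assumption that string columns are always categorical while columns over the threshold fall back to numerical are all illustrative guesses.

```python
def suggest_type(values, max_categories=20):
    """Suggest "CATEGORICAL" or "NUMERICAL" for one column (heuristic only)."""
    # Assumption: string-valued columns are always categorical.
    if any(isinstance(v, str) for v in values):
        return "CATEGORICAL"
    # Non-string column: compare the distinct-value count to the threshold.
    # max_categories=20 is a hypothetical default, not the product's value.
    if len(set(values)) <= max_categories:
        return "CATEGORICAL"
    return "NUMERICAL"
```

The weakness is visible right away: a numerical column that happens to have few distinct values is labelled categorical, and an encoded categorical column with many levels is labelled numerical.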