+1
On Fri, Aug 14, 2015 at 12:13 PM, Nirmal Fernando wrote:
Combined solution:
* if a feature contains strings -> categorical
* Frequency of distinct values - if they repeat enough (80% default) and if
it doesn't have decimal values, then it is a categorical feature.
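For concreteness, the combined heuristic above could be sketched as follows (in Python with pandas for illustration; the function name, the reading of the 80% default as a repeat fraction, and the pandas dependency are my assumptions, not ML's actual implementation):

```python
import pandas as pd

def is_categorical(col: pd.Series, repeat_ratio: float = 0.8) -> bool:
    # Rule 1: a feature containing strings is categorical.
    if col.dtype == object:
        return True
    values = col.dropna()
    if len(values) == 0:
        return False
    # Rule 2a: decimal values rule the feature out.
    if (values % 1 != 0).any():
        return False
    # Rule 2b: distinct values must repeat enough; the 80% default is
    # read here as "at least 80% of rows repeat an earlier value".
    repeat_fraction = 1 - values.nunique() / len(values)
    return repeat_fraction >= repeat_ratio
```

So a column of repeated labels or a small set of integer codes passes, while a column of mostly-unique values or of decimals does not.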
On Fri, Aug 14, 2015 at 7:22 PM, Supun Sethunga wrote:
Hi all,
+1 for a hybrid solution. But still a -1 for using skewness even in the
hybrid solution :D
One good example of why we shouldn't use skewness is the income distribution
graph in [1]. There, regardless of whether I'm using the raw data (then it's
a continuous feature) or whether I'm breaking the
Thanks Thushan. Good suggestion on the frequency.
*solutions*
1. Categorical threshold: if the # of distinct values is less than X, it is
a categorical feature.
2. Make all features with only integers (no decimals) categorical.
3. Skewness: if skewness of a distribution of a feature is less than X,
Moreover, I think a hybrid approach as follows might work well.
1. Select a sample
2. Filter columns by the data type and find potential categorical variables
(integer / string)
3. Filter further by checking if same values are repeated multiple times in
the dataset.
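A rough sketch of those three steps (hypothetical names and thresholds; the sample size of 10,000 rows and the minimum repeat count of 3 are illustrative assumptions, not anything ML actually uses):

```python
import pandas as pd

def candidate_categorical_columns(df: pd.DataFrame,
                                  sample_size: int = 10000,
                                  min_repeats: int = 3) -> list:
    # 1. Select a sample of the dataset (cheap on very large data).
    sample = df.sample(n=min(sample_size, len(df)), random_state=0)
    candidates = []
    for name in sample.columns:
        col = sample[name].dropna()
        # 2. Filter by data type: only integer or string columns are
        #    potential categorical variables.
        if not (col.dtype == object or pd.api.types.is_integer_dtype(col)):
            continue
        # 3. Filter further: keep columns whose values repeat multiple
        #    times on average in the sample.
        if len(col) > 0 and len(col) / col.nunique() >= min_repeats:
            candidates.append(name)
    return candidates
```

On a frame with a repeated string column, a unique integer ID column, and a float column, only the string column survives all three filters.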
On Fri, Aug 14, 2015 at 2:53
Thanks for all the input.
So let me summarise:
*the problem*
* We need to determine whether a feature is a categorical one or not, to
draw certain graphs to explore a dataset, before a user starts to build
analyses (before user input).
* We can't get a 100% accuracy, hence it is of course a sugg
Another approach to distinguish between categorical and numerical features
can be elaborated as follows:
First, we take out the unique values from the column and sort them. If it's
a categorical feature, then the gaps between the elements of this sorted
list should be equal. In a numerical feature
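That gap test could be sketched as follows (a hypothetical helper; the tolerance value is an assumption, and the message is cut off before it says what happens for a numerical feature, where presumably the gaps would be irregular):

```python
import numpy as np

def equal_gaps(column, tol=1e-9):
    # np.unique returns the sorted distinct values of the column.
    uniques = np.unique(column)
    if len(uniques) < 3:
        return True  # too few distinct values to judge; call it categorical
    # For a categorical feature the gaps between consecutive unique
    # values should all be (nearly) equal.
    gaps = np.diff(uniques)
    return bool(np.allclose(gaps, gaps[0], atol=tol))
```

For example, encoded labels like 0, 1, 2 have uniform gaps, while raw continuous measurements almost never do.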
I mean the current approach and skewness?
On Fri, Aug 14, 2015 at 8:54 AM, Srinath Perera wrote:
Can we use a combination of both?
On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga wrote:
Hi all,
To add to what Supun said, yes, the normal (or Gaussian) distribution is
considered to be a common naturally occurring phenomenon. There are many ML
techniques that assume a Gaussian distribution and apply really well to
real-world problems. For example, Gaussian processes assume Gaussian
In addition, there are lots of datasets in economics, stocks, and physics
that are normally or approximately normally distributed, which will be used
for predictive modelling
On 13 Aug 2015 20:46, "Supun Sethunga" wrote:
When a dataset is large, in general it's said to approximate a Normal
Distribution. :) True, it's hypothetical, but the point they make is, when
the datasets are large, then properties of a distribution like skewness,
variance, etc. become closer to the properties of a Normal Distribution in
most
Hi Supun,
Thanks for the reply.
On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga wrote:
Hi Nirmal,
IMO, I don't think we would be able to use skewness in this case. Skewness
says how symmetric the distribution is. For example, if we consider a
numerical/continuous feature (not categorical) which is Normally
Distributed, then the skewness would be 0. Also for a categorical (encoded)
feat
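Supun's point is easy to check numerically: a symmetric continuous feature and a balanced, encoded binary categorical feature both have skewness near zero, so a skewness threshold cannot separate the two cases. A sketch (the skewness helper and the simulated data are illustrative):

```python
import numpy as np

def skewness(x):
    # Sample skewness: the third standardized moment.
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

rng = np.random.default_rng(0)
# A symmetric numerical/continuous feature (Normally Distributed).
continuous = rng.normal(loc=50, scale=10, size=100_000)
# A balanced, encoded binary categorical feature.
encoded = rng.integers(0, 2, size=100_000)

# Both skewness values come out near 0, so skewness alone cannot
# tell the continuous feature from the encoded categorical one.
print(skewness(continuous), skewness(encoded))
```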
Hi All,
We have a feature in ML where we suggest whether a given data column of a
dataset is categorical or numerical. Currently, how we determine this is by
using a threshold value (the maximum number of categories that a non-string
categorical feature can have; if exceeded, the feature will be treated
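In sketch form, that current threshold rule amounts to the following (the cutoff of 20 is illustrative, not ML's actual default):

```python
def treat_as_categorical(values, max_categories=20):
    # The threshold rule: if the number of distinct values exceeds the
    # maximum category count, treat the feature as numerical instead.
    return len(set(values)) <= max_categories
```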