Hi,
You are right that neither approach is satisfactory. Integer encoding
introduces an ordering that the original feature does not have (so you could
get a split like categorical_variable > 3, which means nothing in terms of
the original categories), while one-hot encoding has the bias problem
Michael mentioned.
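To make the difference concrete, here is a minimal sketch of both encodings
on toy data. The data, the variable names and the choice of
DecisionTreeClassifier are assumptions made up for illustration; nothing here
comes from Xin's actual problem.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical feature with four unordered levels.
colours = np.array(["red", "green", "blue", "yellow"] * 5)
# The target depends on set membership, not on any ordering of the levels.
y = np.isin(colours, ["green", "yellow"]).astype(int)

# Integer encoding: np.unique sorts the levels, so the codes are
# blue=0, green=1, red=2, yellow=3 (an arbitrary, alphabetical order).
levels, codes = np.unique(colours, return_inverse=True)
X_int = codes.reshape(-1, 1)

# One-hot encoding: one 0/1 column per level.
X_onehot = (codes[:, None] == np.arange(len(levels))).astype(int)

tree_int = DecisionTreeClassifier(random_state=0).fit(X_int, y)
tree_onehot = DecisionTreeClassifier(random_state=0).fit(X_onehot, y)

# Internal nodes have feature >= 0; leaves are marked with a negative value.
int_thresholds = sorted(
    float(t) for t in tree_int.tree_.threshold[tree_int.tree_.feature >= 0]
)

print("levels:", list(levels))
print("thresholds on the integer codes:", int_thresholds)
print("depth with integer codes:", tree_int.get_depth())
print("one-hot shape:", X_onehot.shape)
print("depth with one-hot columns:", tree_onehot.get_depth())
```

The thresholds printed for the integer-encoded tree are cuts on the arbitrary
code order, which is the kind of split I mean above. The one-hot matrix gets
one column per level, so a feature with many levels contributes many columns.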
For now I'd recommend using the R package.
This isn't easy to fix without a pandas dependency, as you'd need a nice
way of encoding factors.
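For reference, this is roughly what I mean by encoding factors with pandas.
It is only a sketch of what pandas itself offers on a made-up DataFrame (the
column names are hypothetical), not a description of anything scikit-learn
does or plans to do.

```python
import pandas as pd

# Hypothetical data frame with one categorical and one numeric column.
df = pd.DataFrame(
    {
        "colour": ["red", "green", "blue", "green"],
        "size": [1.0, 2.5, 0.5, 3.0],
    }
)

# A pandas categorical stores the level names alongside integer codes,
# much like a factor in R.
df["colour"] = df["colour"].astype("category")
factor_levels = list(df["colour"].cat.categories)
factor_codes = list(df["colour"].cat.codes)

# One-hot (dummy) columns derived from the same categorical column.
dummies = pd.get_dummies(df, columns=["colour"])

print(factor_levels)
print(factor_codes)
print(list(dummies.columns))
```

The level names stay attached to the column, so the integer codes can always
be mapped back to the original categories.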
Best,
James McMurray
On 29 October 2014 17:27, Xin Shuai <[email protected]> wrote:
> Hi, Michael:
> Thank you for your comment. I currently use a one-hot encoding strategy,
> but I don't find it satisfactory.
> I do hope the scikit-learn developers can improve this, because it is a big
> issue for decision tree methods.
>
> On Wed, Oct 29, 2014 at 12:18 PM, Michael Eickenberg <
> [email protected]> wrote:
>
>> Hi Xin,
>>
>> as far as I know, the only ways of working around this problem right now
>> are one-hot encoding or using integers to represent your categories.
>> The former enlarges your feature space and can cause bias if different
>> categorical features take different numbers of values (a feature with more
>> values gets more columns, so it is selected disproportionately often).
>> The latter avoids that problem, but since each decision is a binary
>> threshold split, a tree needs several levels of splits before it can
>> separate individual category values.
>>
>> I cannot comment on future developments, but I have the feeling that
>> better treatment of categorical features may be planned :)
>>
>> Michael
>>
>> On Wed, Oct 29, 2014 at 5:09 PM, Xin Shuai <[email protected]> wrote:
>>
>>> Hi,
>>> I'm a fan of scikit-learn and it is my favorite ML package.
>>> However, I found that it does not handle categorical variables in its
>>> tree-based methods, so I need to convert each categorical variable into
>>> dummy variables before I can fit a tree. This runs counter to the original
>>> decision tree method, which can split on categories directly.
>>> Any improvement on that?
>>> --
>>> Xin(David) Shuai
>>> PhD of Complex System in School of Informatics & Computing
>>> Indiana University Bloomington
>>> 812-606-8969
>>>
>>> The way to success is to do as much as important things, and as less as
>>> unimportant things, as you can...
>>>
>>>
>>> ------------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>
>
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general