Mathieu,
I have no experience with forests on sparse data, nor have I seen much work
on the topic. I would be curious to investigate, however: there may be
problems for which this is useful. I know that Arnaud tried forests on
(densified) 20newsgroups and it actually seemed to work well.
In particular, I have the feeling that supporting sparse data in decision
trees would be helpful for boosting. I am quite confident GBRT would work
well on such data. What do you think, Peter?
Gilles
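
For reference, the experiment Mathieu suggests in the quoted message (select
the top-k features, densify, then compare test accuracy) could be sketched
roughly as follows. This is only an illustration under assumptions that are
not from the thread: the helper name compare_on_densified, the chi2 scoring
function, the choice of LogisticRegression as the linear baseline, and the
small synthetic sparse matrix standing in for News20 are all placeholders.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def compare_on_densified(X_sparse, y, k=100, seed=0):
    """Keep the top-k features by chi2, densify, and return test accuracies.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X_sparse, y, test_size=0.25, random_state=seed, stratify=y
    )
    # Fit the selector on the training split only, to avoid leaking test labels.
    selector = SelectKBest(chi2, k=k).fit(X_train, y_train)
    # Densify only after selection, so the dense arrays have k columns.
    X_train_dense = selector.transform(X_train).toarray()
    X_test_dense = selector.transform(X_test).toarray()

    models = {
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=seed),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    return {
        name: model.fit(X_train_dense, y_train).score(X_test_dense, y_test)
        for name, model in models.items()
    }


# Synthetic stand-in for a sparse text dataset (an assumption, not News20):
# 400 samples, 2000 non-negative features, about 1% non-zero entries.
rng = np.random.RandomState(0)
X = sp.random(400, 2000, density=0.01, format="csr", random_state=rng)
# The label depends only on whether any of the first 50 features is non-zero.
y = (np.asarray(X[:, :50].sum(axis=1)).ravel() > 0).astype(int)

scores = compare_on_densified(X, y, k=100)
for name, accuracy in scores.items():
    print(name, round(accuracy, 3))
```

To try this on the real data, one could pass the data and target attributes
returned by sklearn.datasets.fetch_20newsgroups_vectorized (which downloads
the dataset) and set k=10000. The numbers on the synthetic matrix above say
nothing about how forests and linear models compare on actual text data.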
On 22 January 2014 09:48, Mathieu Blondel <[email protected]> wrote:
> Hi,
>
> Something I was wondering is whether sparse support in decision trees
> would actually be useful. Do decision trees (or ensembles of them like
> random forests) work better than linear models for high-dimensional data?
>
> It would be nice to take the News20 dataset, pre-select the top 10k
> features (or more if possible) then measure test accuracy on the densified
> dataset. I would be very interested in hearing the results.
>
> And regardless of accuracy, some algorithms (e.g., GMM) scale very poorly
> with n_features. I wonder if it's not the case for decision trees too.
>
> Gilles, Peter, Arnaud, any opinion / experience?
>
> Mathieu
>
>
> On Wed, Jan 22, 2014 at 2:13 PM, Maheshakya Wijewardena <
> [email protected]> wrote:
>
>> Hi,
>>
>> I have been using scikit-learn's OneHotEncoder to encode my data, and the
>> resulting sparse matrix is supported by only a few models, such as logistic
>> regression, SVC, etc. When I convert those sparse matrices to dense ones,
>> with a list comprehension or the toarray() function, the resulting arrays
>> become too large for decision trees or any other tree-based classifier.
>> I saw a GSOC project idea for implementing this, as mentioned here:
>>
>> https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-(GSOC)-2014
>> I'm looking forward to applying for GSOC this year as well, so I would like
>> to start working on this. Where can I get support for this? (There are no
>> possible mentors assigned to it.)
>>
>> Regards,
>> Maheshakya
>>
>>
>> ------------------------------------------------------------------------------
>> CenturyLink Cloud: The Leader in Enterprise Cloud Services.
>> Learn Why More Businesses Are Choosing CenturyLink Cloud For
>> Critical Workloads, Development Environments & Everything In Between.
>> Get a Quote or Start a Free Trial Today.
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
>