I'm testing different classifiers for a BoW problem, and last week I was
disappointed to find that I couldn't use scikit-learn's DecisionTree.
However, using NaiveBayes was awesome! Thanks for this great piece of
software.
So, if you are planning to add support for scipy sparse matrices to
DecisionTree, I'd like to help.

Gilles, I read /sklearn/tree/tree.py and found that there are 4 methods
that receive X as a dense matrix:
BaseDecisionTree.fit()
BaseDecisionTree.predict()
DecisionTreeClassifier.predict_proba()
DecisionTreeClassifier.predict_log_proba()
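
Until those methods accept sparse input, the workaround suggested earlier in
the thread (quoted below) is to densify a subset of the BoW features and fit
the existing tree on that. A minimal sketch, assuming a toy corpus; the helper
name fit_tree_on_dense_subset and the max_features value are my own choices
for illustration, and only CountVectorizer and DecisionTreeClassifier are
real scikit-learn classes:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier


def fit_tree_on_dense_subset(docs, labels, max_features=1000, random_state=0):
    """Build BoW counts, keep the most frequent terms, densify, fit a tree.

    Hypothetical helper, not part of scikit-learn.
    """
    # Keep only the `max_features` most frequent terms so the dense
    # copy stays small enough to fit in memory.
    vectorizer = CountVectorizer(max_features=max_features)
    X_sparse = vectorizer.fit_transform(docs)  # scipy sparse matrix
    X_dense = X_sparse.toarray()               # explicit dense copy
    tree = DecisionTreeClassifier(random_state=random_state)
    tree.fit(X_dense, labels)
    return vectorizer, tree


# Toy corpus, made up for this example.
docs = [
    "the cat sat",
    "the dog barked",
    "cat and dog",
    "stocks fell today",
    "markets and stocks rose",
    "stocks markets today",
]
labels = np.array([0, 0, 0, 1, 1, 1])

vectorizer, tree = fit_tree_on_dense_subset(docs, labels, max_features=5)
X_check = vectorizer.transform(docs).toarray()
predictions = tree.predict(X_check)
```

This only tells us how a tree fares in terms of accuracy on a reduced
vocabulary; it says nothing about the speed or memory use of a real sparse
implementation.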

fit() calls some Cython classes, which I think are the ones you referred to:
_tree.BestSplitter
_tree.PresortBestSplitter
_tree.RandomSplitter
_tree.Gini
_tree.Entropy
_tree.MSE
_tree.FriedmanMSE
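
To check my understanding of what a sparse-aware splitter has to do, here is
a rough pure-Python sketch of a Gini split search on one column of a CSC
matrix. It is not the Cython code in _tree, and the function names (gini,
best_split_on_sparse_column) are hypothetical. It also rebuilds one dense
column at a time for simplicity, which a real sparse splitter would
presumably avoid:

```python
import numpy as np
from scipy import sparse


def gini(y, n_classes):
    """Gini impurity of an integer label vector (0.0 for an empty one)."""
    if y.size == 0:
        return 0.0
    p = np.bincount(y, minlength=n_classes) / y.size
    return 1.0 - np.sum(p ** 2)


def best_split_on_sparse_column(X, y, feature, n_classes):
    """Return (threshold, weighted_gini) of the best `x <= threshold` split.

    Returns (None, inf) if the column holds a single distinct value.
    """
    X_csc = sparse.csc_matrix(X)
    n_samples = X_csc.shape[0]

    # CSC stores each column contiguously: indptr delimits the slice of
    # row indices and stored values that belong to `feature`.
    start, end = X_csc.indptr[feature], X_csc.indptr[feature + 1]
    rows = X_csc.indices[start:end]
    vals = X_csc.data[start:end]

    # Rebuild this one column only; entries that are not stored are zeros.
    column = np.zeros(n_samples, dtype=X_csc.dtype)
    np.add.at(column, rows, vals)

    best_threshold, best_score = None, np.inf
    distinct = np.unique(column)
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct[:-1], distinct[1:]):
        threshold = (lo + hi) / 2.0
        left = column <= threshold
        n_left = left.sum()
        score = (n_left * gini(y[left], n_classes)
                 + (n_samples - n_left) * gini(y[~left], n_classes)) / n_samples
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy data, made up for this example: 4 samples, 2 features.
X_toy = sparse.csc_matrix(np.array([[0, 2], [0, 0], [3, 0], [1, 0]]))
y_toy = np.array([0, 0, 1, 1])
split_f0 = best_split_on_sparse_column(X_toy, y_toy, feature=0, n_classes=2)
split_f1 = best_split_on_sparse_column(X_toy, y_toy, feature=1, n_classes=2)
```

The point is only that the implicit zeros must take part in the split like
any other value, so a sparse implementation should give the same splits as
the dense one.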


Where to start?


On Thu, Jan 23, 2014 at 9:42 AM, Gilles Louppe <[email protected]> wrote:

> > How much code in our current implementation depends on the data
> > representation?
>
> Not much, actually. It now basically boils down to writing a new
> splitter object. Everything else remains the same. So I would say that
> it amounts to ~300 lines of Cython (out of the 2300 lines in our
> implementation).
>
>
> On 23 January 2014 12:37, Mathieu Blondel <[email protected]> wrote:
>
>> > I will try using sparse data on 20newsgroups data and let you know the
>> > results.
>>
>> What I was suggesting is to densify the News20 dataset (using a subset of
>> the features so that it fits in memory) and try it on our current
>> implementation. Of course it will be really slow but the goal is to
>> evaluate how decision trees would fare in terms of *accuracy*. A correct
>> sparse implementation should give exactly the same results (unless it
>> handles the zeros differently).
>>
>> Don't get me wrong, adding sparse data support to decision trees is not
>> necessarily a bad idea: I'm just trying to evaluate the cost (in terms of
>> maintenance) vs. benefits. How much code in our current implementation
>> depends on the data representation?
>>
>> Mathieu
>>
>>
>> On Thu, Jan 23, 2014 at 7:00 PM, Olivier Grisel <[email protected]> wrote:
>>
>>> 2014/1/23 Maheshakya Wijewardena <[email protected]>:
>>> > Hi
>>> >
>>> > I think that by using sparse data we can enhance the descriptiveness of
>>> > the data while keeping it smaller than the dense data, without losing
>>> > information.
>>>
>>> I don't understand what you mean by "sparse data we can enhance the
>>> descriptiveness of the data".
>>>
>>> > I will try using sparse data on 20newsgroups data and let you know the
>>> > results.
>>>
>>> What do you mean? 20newsgroups data is inherently sparse in the sense
>>> that the extracted BoW features are mostly zero-valued. The problem is that
>>> the current implementation of Decision Trees requires a dense
>>> *representation* of that sparse data to work. Making Decision Trees
>>> work on a sparse representation (e.g. a CSC sparse matrix) would
>>> require re-implementing a lot of the code.
>>>
>>> > Arnaud,
>>> > I've gone through those messages and I've already started working on
>>> > patches. Last year I did a project for a module at our university: it
>>> > was to implement Bagging in scikit-learn. As Gilles had already begun
>>> > that, I was not able to get my code merged. Moreover, I have not
>>> > implemented feature bootstrapping, as it was beyond the scope of my
>>> > original proposal for the project.
>>> >
>>> https://github.com/maheshakya/scikit-learn/blob/bagging2/sklearn/ensemble/bagging.py
>>> >
>>> > I would appreciate it if you could review my implementation and give
>>> > me some feedback on it and on what I could do further.
>>>
>>> I don't really see the point in spending time reviewing past
>>> alternative implementations of existing features. There are already
>>> 129 pull requests that need reviewers' time:
>>>
>>>   https://github.com/scikit-learn/scikit-learn/pulls
>>>
>>> In my opinion it would be much more productive to fix bugs in the
>>> current code base.
>>>
>>> --
>>> Olivier
>>> http://twitter.com/ogrisel - http://github.com/ogrisel
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> CenturyLink Cloud: The Leader in Enterprise Cloud Services.
>>> Learn Why More Businesses Are Choosing CenturyLink Cloud For
>>> Critical Workloads, Development Environments & Everything In Between.
>>> Get a Quote or Start a Free Trial Today.
>>>
>>> http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>
>>
>>
>>
>>
>
>
>
>
