> How much code in our current implementation depends on the data
> representation?
Not much, actually. It now basically boils down to writing a new
splitter object. Everything else remains the same. I would say that it
amounts to ~300 lines of Cython (out of the 2300 lines in our
implementation).
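
To illustrate the point, here is a minimal pure-Python sketch. It is not
our actual Cython code: all names (dense_to_csc, column_from_csc,
best_threshold, best_split) are hypothetical, and the exhaustive Gini
search is an assumption chosen for brevity. It only shows that the split
search itself can stay unchanged when the column accessor is swapped.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Matrix = List[List[float]]
CSC = Tuple[List[float], List[int], List[int]]  # (data, indices, indptr)


def dense_to_csc(rows: Matrix) -> CSC:
    """Store only the non-zero entries, column by column."""
    n_cols = len(rows[0]) if rows else 0
    data: List[float] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    for j in range(n_cols):
        for i, row in enumerate(rows):
            if row[j] != 0:
                data.append(row[j])
                indices.append(i)
        indptr.append(len(data))
    return data, indices, indptr


def column_from_csc(csc: CSC, n_rows: int, j: int) -> List[float]:
    """Rebuild column j; entries that are not stored are implicit zeros."""
    data, indices, indptr = csc
    column = [0.0] * n_rows
    for k in range(indptr[j], indptr[j + 1]):
        column[indices[k]] = data[k]
    return column


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts: dict = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(values: Sequence[float],
                   labels: Sequence[int]) -> Optional[Tuple[float, float]]:
    """Return (weighted Gini, threshold) of the best split on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for k in range(1, n):
        low, high = pairs[k - 1][0], pairs[k][0]
        if low == high:
            continue  # no threshold separates equal values
        left = [label for _, label in pairs[:k]]
        right = [label for _, label in pairs[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[0]:
            best = (score, (low + high) / 2.0)
    return best


def best_split(get_column: Callable[[int], Sequence[float]], n_features: int,
               labels: Sequence[int]) -> Optional[Tuple[float, int, float]]:
    """Representation-agnostic search: only get_column knows the storage."""
    best = None
    for j in range(n_features):
        found = best_threshold(get_column(j), labels)
        if found is not None and (best is None or found[0] < best[0]):
            best = (found[0], j, found[1])
    return best


# Toy data (made up for this sketch): mostly zeros, like BoW features.
X = [[0.0, 2.0, 0.0],
     [0.0, 0.0, 1.0],
     [3.0, 0.0, 0.0],
     [0.0, 0.0, 4.0]]
y = [0, 1, 0, 1]

X_csc = dense_to_csc(X)
dense_result = best_split(lambda j: [row[j] for row in X], 3, y)
sparse_result = best_split(lambda j: column_from_csc(X_csc, len(X), j), 3, y)
```

Both calls return the same (impurity, feature, threshold) triple, which
is the property Mathieu mentions below: a correct sparse implementation
should give the same results as the dense one. A real sparse splitter
would of course avoid materializing the zeros; this sketch does not.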
On 23 January 2014 12:37, Mathieu Blondel <[email protected]> wrote:
> > I will try using sparse data on 20newsgroups data and let you know the
> > results.
>
> What I was suggesting is to densify the News20 dataset (using a subset of
> the features so that it fits in memory) and try it on our current
> implementation. Of course it will be really slow but the goal is to
> evaluate how decision trees would fare in terms of *accuracy*. A correct
> sparse implementation should give exactly the same results (unless it
> handles the zeros differently).
>
> Don't get me wrong, adding sparse data support to decision trees is not
> necessarily a bad idea: I'm just trying to evaluate the cost (in terms of
> maintenance) vs. benefits. How much code in our current implementation
> depends on the data representation?
>
> Mathieu
>
>
> On Thu, Jan 23, 2014 at 7:00 PM, Olivier Grisel
> <[email protected]> wrote:
>
>> 2014/1/23 Maheshakya Wijewardena <[email protected]>:
>> > Hi
>> >
>> > I think that by using sparse data we can enhance the descriptiveness of the
>> > data while keeping it smaller than the dense data, without losing
>> > information.
>>
>> I don't understand what you mean by "sparse data we can enhance the
>> descriptiveness of the data".
>>
>> > I will try using sparse data on 20newsgroups data and let you know the
>> > results.
>>
>> What do you mean? 20newsgroups data is inherently sparse in the sense
>> that the extracted BoW features are mostly zero-valued. The problem is
>> that the current implementation of Decision Trees requires a dense
>> *representation* of that sparse data to work. Making Decision Trees
>> work on a sparse representation (e.g. a CSC sparse matrix) would
>> require re-implementing a lot of the code.
>>
>> > Arnaud,
>> > I've gone through those messages and I've already started working on
>> > patches. Last year I did a project for a module at our university. It
>> > was to implement Bagging in Scikit-learn. As Gilles had already begun
>> > that, I was not able to get my code merged. Moreover, I have not
>> > implemented feature bootstrapping, as it was beyond the scope of my
>> > original proposal for the project.
>> >
>> > https://github.com/maheshakya/scikit-learn/blob/bagging2/sklearn/ensemble/bagging.py
>> >
>> > I would appreciate it if you could review my implementation and give
>> > me some feedback on it and on what I could do further.
>>
>> I don't really see the point in spending time reviewing past
>> alternative implementations of existing features. There are already
>> 129 pull requests that need reviewers' time:
>>
>> https://github.com/scikit-learn/scikit-learn/pulls
>>
>> In my opinion it would be much more productive to fix bugs in the
>> current code base.
>>
>> --
>> Olivier
>> http://twitter.com/ogrisel - http://github.com/ogrisel
>>
>>
>> ------------------------------------------------------------------------------
>> CenturyLink Cloud: The Leader in Enterprise Cloud Services.
>> Learn Why More Businesses Are Choosing CenturyLink Cloud For
>> Critical Workloads, Development Environments & Everything In Between.
>> Get a Quote or Start a Free Trial Today.
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>