Given the license, I don’t think you can read the code to make a new bsd
implementation.
Arnaud
On 17 Mar 2014, at 15:19, vamsi kaushik <[email protected]> wrote:
> Hi,
> And of the fest implementation i would say we just rewrite the algorithm in
> cython instead of writing wrappers for it, this dosn't need licence and is
> readable for other contributors and also easily changeable in future. This
> boils down to a few 500 lines of code in cython.
>
>
> On Mon, Mar 17, 2014 at 7:26 PM, vamsi kaushik <[email protected]>
> wrote:
> hi arnaud,
>
> yeah i have already gone through that implementation. I feel at least we
> should have separate splitter classes for the node splitting to be more
> efficient and flexible.
> And as of the choice of matrix, I don't have one in mind but mostly the
> elements are sorted along the feature columns. so I would say, csr maybe. But
> there can be
> other efficient formats too.
>
> And as of the input matrices, we can follow the same procedure followed in
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/cd_fast.pyx
>
> so as to not break the user code. And moreover i have read somewhere on the
> internet about sparse coding format, is it possible to use a more efficient
> sparse structure within the tree algorithm. But the input types remain the
> same, they get converted to another sparse structure once they are passed to
> the tree, since we are working with non-zero element pointers, this should
> not be memory/time inefficient. But my question is, is this acceptable?
>
>
> On Mon, Mar 17, 2014 at 6:49 PM, Arnaud Joly <[email protected]> wrote:
> Hi,
>
>
> The support for sparse matrices should exploit as much as possible the
> sparsity structure of the matrix
> without blowing up memory as to obtain optimal performance. Note that the
> choice of the best sparse matrix
> representation is still open (csc, csr, coo, ... matrix?).
>
> To my knowldege, there is one implementation of sparse decision tree
> https://github.com/fest/fest. But I don’t think that the license is BSD
> compatible.
>
> If input sparsity is added, adaboost, GBRT, bagging and random
> forest should support sparse input data.
>
> Best,
> Arnaud
>
>
> On 16 Mar 2014, at 02:49, vamsi kaushik <[email protected]> wrote:
>
>> Hi arnaud, Eltermann
>>
>> Thanks for the reply. Firstly I am a noob here so excuse me for any stupid
>> questions. Eltermann Just my 2 cents, I felt your approach(in BestSplitter )
>> somewhat inefficient when you are calling the get_sparse_item. Firstly even
>> though we have indices of all the non zero items, it is being called
>> node_samples times, iterating over nnz times. An easier way of doing this
>> would be to sort the samples in sparse matrix along features and then
>> extracting if necessary. because the whole point in using Xf is to sort them
>> which becomes difficult when they are in an n-d array. like for example
>> http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.nonzero.html#scipy.sparse.csr_matrix.nonzero
>> gives the non zero elements. My solution might not be perfect but my point
>> is when we are adding support for sparse matrices, why not exploit their
>> structure as much as we can. Arnaud is this feasible ? Eltermann anything
>> wrong in my thinking ?
>>
>>
>> cheers,
>> kaushik varanasi
>> On Wed, Mar 12, 2014 at 8:29 PM, Arnaud Joly <[email protected]> wrote:
>> For the number of contributions, I would advise you to do as many
>> as possible to get familiar with the codebase and scikit-learn practices.
>> Of course starting to think and to work on how to add sparsity support to
>> decision tree
>> is important to grasp the challenges and write a convincing proposal.
>> Thus, the right proportion is difficult to suggest.
>>
>> The goal of this GSOC is to have an efficient and a maintenable
>> implementation.
>> The overall project plan should be dictated by this goal.
>>
>> Best,
>> Arnaud
>>
>> On 12 Mar 2014, at 13:44, Gilles Louppe <[email protected]> wrote:
>>
>> > On 12 March 2014 13:08, Felipe Eltermann <[email protected]>
>> > wrote:
>> >> Hello Vamsi,
>> >>
>> >> "Firstly, regarding the implementation of sparse functions. _tree.pxy is
>> >> the
>> >> back end cython code to handle the operations Splitting, Evaluating
>> >> impurities at nodes and then constructing the tree."
>> >> That's right.
>> >>
>> >> "The basic implementation that i have in mind is that we duplicate the
>> >> splitter classes and The tree architecture itself."
>> >> Duplicating the code doesn't seem to be the best approach. It will be
>> >> simpler if the same splitter class should handle both dense and sparse
>> >> matrices.
>> >> I started a PR that is already in progress:
>> >> https://github.com/scikit-learn/scikit-learn/pull/2848
>> >> If you want to help, you can read up the commits and help on the open
>> >> issues.
>> >> (however, I don't think this apply as a GSoC project though)
>> >
>> > Just my 2 cents, but I am not sure having both dense and sparse
>> > support in the same codebase is the best choice. Both implementations
>> > should in my opinion be separated, with the common code factored out
>> > to avoid code dupplication.
>> >
>> > In particular, having an *efficient* sparse splitter will require
>> > specific changes to the splitter. I am really not sure having this
>> > code intermingled with the dense support is the best option.
>> >
>> >>
>> >> On Tue, Mar 11, 2014 at 7:07 PM, vamsi kaushik
>> >> <[email protected]>
>> >> wrote:
>> >>>
>> >>> hi Joly,
>> >>>
>> >>> I actually have a few questions:
>> >>>
>> >>> Firstly, regarding the implementation of sparse functions. _tree.pxy is
>> >>> the back end cython code to handle the operations Splitting, Evaluating
>> >>> impurities at nodes and then constructing the tree. The basic
>> >>> implementation
>> >>> that i have in mind is that we duplicate the splitter classes and The
>> >>> tree
>> >>> architecture itself. At each leaf node we might have a sparse matrix
>> >>> representing the output data and at every non-leaf node, splitting
>> >>> should be
>> >>> done on the sparse matrix at that node and produce few other nodes with
>> >>> sparse output. So the project boils down to implementing good access and
>> >>> construction methods for sparse matrices in cython. I just want to know
>> >>> if
>> >>> this is a possible implementation criterion ?. And i would also like to
>> >>> know
>> >>> the implementation plan that you have in mind ?
>> >>>
>> >>> Secondly, In my previous mail i have shown you my contributions. I would
>> >>> just like to know if the patches are upto the mark for a selection. If
>> >>> they
>> >>> are, i will turn my attention to the DT sparse problem, else i will work
>> >>> on
>> >>> the issues a bit more.
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Mar 11, 2014 at 1:12 PM, Arnaud Joly <[email protected]> wrote:
>> >>>>
>> >>>> Thanks for your contribution.
>> >>>>
>> >>>> Keep up!
>> >>>>
>> >>>> Arnaud
>> >>>>
>> >>>> On 10 Mar 2014, at 23:51, vamsi kaushik <[email protected]>
>> >>>> wrote:
>> >>>>
>> >>>> My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
>> >>>>
>> >>>> And i have already contributed a little.
>> >>>> Please take a look at :
>> >>>>
>> >>>> https://github.com/scikit-learn/scikit-learn/pull/2905
>> >>>>
>> >>>> and also:
>> >>>> https://github.com/scikit-learn/scikit-learn/pull/2895
>> >>>> https://github.com/scikit-learn/scikit-learn/issues/2727
>> >>>>
>> >>>> cheers,
>> >>>> kaushik
>> >>>>
>> >>>>
>> >>>> On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
>> >>>> <[email protected]> wrote:
>> >>>>>
>> >>>>> hi Arnaud,
>> >>>>>
>> >>>>> Vamsi kaushik is actually me.
>> >>>>> Thanks for your reply, i'll get to the issue soon
>> >>>>>
>> >>>>> cheers,
>> >>>>> vamsi kaushik
>> >>>>>
>> >>>>>
>> >>>>> On Mon, Mar 10, 2014 at 3:16 PM, Arnaud Joly <[email protected]> wrote:
>> >>>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> Anything concerning the GSOC should pass by the scikit-learn
>> >>>>>> mailing list.
>> >>>>>>
>> >>>>>> Thanks for your interest in the subject. If you intend to apply for a
>> >>>>>> GSOC, I suggest you to read
>> >>>>>>
>> >>>>>> https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
>> >>>>>> and start contributing to scikit-learn.
>> >>>>>>
>> >>>>>> Right now, three people have shown interest for this topic: Maheshakya
>> >>>>>> (who is now
>> >>>>>> applying for a GSOC about LSH), Vamsi Kaushik and you. Several
>> >>>>>> candidates may apply for the
>> >>>>>> same subject. However it is likely that if a GSOC is awarded for a
>> >>>>>> given subject, only the best proposal will
>> >>>>>> be selected.
>> >>>>>>
>> >>>>>> Best regards,
>> >>>>>> Arnaud
>> >>>>>>
>> >>>>>>
>> >>>>>> On 28 Feb 2014, at 22:33, João <[email protected]> wrote:
>> >>>>>>
>> >>>>>> Hi Arnaud,
>> >>>>>>
>> >>>>>> While browsing the current GSoC projects I saw an interesting one for
>> >>>>>> which you are assigned as mentor: "Add scipy.sparse matrix support to
>> >>>>>> the
>> >>>>>> Decision Tree".
>> >>>>>>
>> >>>>>> I am considering applying for this project as I have already faced the
>> >>>>>> necessity of sparse matrices in sklearn and the outcome was not
>> >>>>>> totally
>> >>>>>> satisfactory.
>> >>>>>> Right now I am participating in a kaggle contest
>> >>>>>> (http://www.kaggle.com/c/lshtc/) and facing several difficulties even
>> >>>>>> with
>> >>>>>> algorithms that already support spase matrices (even simple
>> >>>>>> algorithms such
>> >>>>>> as NB). In general I get some memory error and sometimes segfaults.
>> >>>>>> I would be happy if I could implement the necessary support for DT
>> >>>>>> (and
>> >>>>>> related algorithms such as random forest and extra trees) and, if the
>> >>>>>> time
>> >>>>>> constraint allows, improve as much as possible the general support for
>> >>>>>> sparse matrices.
>> >>>>>>
>> >>>>>> With this email, I want to show my interest and ask you if there are
>> >>>>>> already any candidates for this place.
>> >>>>>>
>> >>>>>> Best regards,
>> >>>>>> --
>> >>>>>> João
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> ------------------------------------------------------------------------------
>> >>>>>> Learn Graph Databases - Download FREE O'Reilly Book
>> >>>>>> "Graph Databases" is the definitive new guide to graph databases and
>> >>>>>> their
>> >>>>>> applications. Written by three acclaimed leaders in the field,
>> >>>>>> this first edition is now available. Download your free book today!
>> >>>>>> http://p.sf.net/sfu/13534_NeoTech
>> >>>>>> _______________________________________________
>> >>>>>> Scikit-learn-general mailing list
>> >>>>>> [email protected]
>> >>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>> ------------------------------------------------------------------------------
>> >>>> Learn Graph Databases - Download FREE O'Reilly Book
>> >>>> "Graph Databases" is the definitive new guide to graph databases and
>> >>>> their
>> >>>> applications. Written by three acclaimed leaders in the field,
>> >>>> this first edition is now available. Download your free book today!
>> >>>>
>> >>>> http://p.sf.net/sfu/13534_NeoTech_______________________________________________
>> >>>> Scikit-learn-general mailing list
>> >>>> [email protected]
>> >>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> ------------------------------------------------------------------------------
>> >>>> Learn Graph Databases - Download FREE O'Reilly Book
>> >>>> "Graph Databases" is the definitive new guide to graph databases and
>> >>>> their
>> >>>> applications. Written by three acclaimed leaders in the field,
>> >>>> this first edition is now available. Download your free book today!
>> >>>> http://p.sf.net/sfu/13534_NeoTech
>> >>>> _______________________________________________
>> >>>> Scikit-learn-general mailing list
>> >>>> [email protected]
>> >>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>> ------------------------------------------------------------------------------
>> >>> Learn Graph Databases - Download FREE O'Reilly Book
>> >>> "Graph Databases" is the definitive new guide to graph databases and
>> >>> their
>> >>> applications. Written by three acclaimed leaders in the field,
>> >>> this first edition is now available. Download your free book today!
>> >>> http://p.sf.net/sfu/13534_NeoTech
>> >>> _______________________________________________
>> >>> Scikit-learn-general mailing list
>> >>> [email protected]
>> >>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> >>>
>> >>
>> >>
>> >> ------------------------------------------------------------------------------
>> >> Learn Graph Databases - Download FREE O'Reilly Book
>> >> "Graph Databases" is the definitive new guide to graph databases and their
>> >> applications. Written by three acclaimed leaders in the field,
>> >> this first edition is now available. Download your free book today!
>> >> http://p.sf.net/sfu/13534_NeoTech
>> >> _______________________________________________
>> >> Scikit-learn-general mailing list
>> >> [email protected]
>> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> >>
>> >
>> > ------------------------------------------------------------------------------
>> > Learn Graph Databases - Download FREE O'Reilly Book
>> > "Graph Databases" is the definitive new guide to graph databases and their
>> > applications. Written by three acclaimed leaders in the field,
>> > this first edition is now available. Download your free book today!
>> > http://p.sf.net/sfu/13534_NeoTech
>> > _______________________________________________
>> > Scikit-learn-general mailing list
>> > [email protected]
>> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>> ------------------------------------------------------------------------------
>> Learn Graph Databases - Download FREE O'Reilly Book
>> "Graph Databases" is the definitive new guide to graph databases and their
>> applications. Written by three acclaimed leaders in the field,
>> this first edition is now available. Download your free book today!
>> http://p.sf.net/sfu/13534_NeoTech
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>> ------------------------------------------------------------------------------
>> Learn Graph Databases - Download FREE O'Reilly Book
>> "Graph Databases" is the definitive new guide to graph databases and their
>> applications. Written by three acclaimed leaders in the field,
>> this first edition is now available. Download your free book today!
>> http://p.sf.net/sfu/13534_NeoTech_______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
> ------------------------------------------------------------------------------
> Learn Graph Databases - Download FREE O'Reilly Book
> "Graph Databases" is the definitive new guide to graph databases and their
> applications. Written by three acclaimed leaders in the field,
> this first edition is now available. Download your free book today!
> http://p.sf.net/sfu/13534_NeoTech
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
>
> ------------------------------------------------------------------------------
> Learn Graph Databases - Download FREE O'Reilly Book
> "Graph Databases" is the definitive new guide to graph databases and their
> applications. Written by three acclaimed leaders in the field,
> this first edition is now available. Download your free book today!
> http://p.sf.net/sfu/13534_NeoTech_______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Put Bad Developers to Shame
Dominate Development with Jenkins Continuous Integration
Continuously Automate Build, Test & Deployment
Start a new project now. Try Jenkins in the cloud.
http://p.sf.net/sfu/13600_Cloudbees
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general