Hi,

The support for sparse matrices should exploit as much as possible the sparsity 
structure of the matrix
without blowing up memory as to obtain optimal performance. Note that the 
choice of the best sparse matrix 
representation  is still open (csc, csr, coo, ... matrix?). 

To my knowldege, there is one implementation of sparse decision tree
https://github.com/fest/fest. But I don’t think that the license is BSD 
compatible.

If input sparsity is added, adaboost, GBRT, bagging and random
forest should support sparse input data.

Best,
Arnaud


On 16 Mar 2014, at 02:49, vamsi kaushik <[email protected]> wrote:

> Hi arnaud, Eltermann
> 
> Thanks for the reply. Firstly I am a noob here so excuse me for any stupid 
> questions. Eltermann Just my 2 cents, I felt your approach(in BestSplitter ) 
> somewhat inefficient when you are calling the get_sparse_item. Firstly even 
> though we have indices of all the non zero items, it is being called 
> node_samples times, iterating over nnz times. An easier way of doing this 
> would be to sort the samples in sparse matrix along features and then 
> extracting if necessary. because the whole point in using Xf is to sort them 
> which becomes difficult when they are in an n-d array. like for example
> http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.nonzero.html#scipy.sparse.csr_matrix.nonzero
> gives the non zero elements. My solution might not be perfect but my point is 
> when we are adding support for sparse matrices, why not exploit their 
> structure as much as we can. Arnaud is this feasible ? Eltermann anything 
> wrong in my thinking ?
> 
> 
> cheers,
> kaushik varanasi
> On Wed, Mar 12, 2014 at 8:29 PM, Arnaud Joly <[email protected]> wrote:
> For the number of contributions, I would advise you to do as many
> as possible to get familiar with the codebase and scikit-learn practices.
> Of course starting to think and to work on how to add sparsity support to 
> decision tree
> is important to grasp the challenges and write a convincing proposal.
> Thus, the right proportion is difficult to suggest.
> 
> The goal of this GSOC is to have an efficient and a maintenable 
> implementation.
> The overall project plan should be dictated by this goal.
> 
> Best,
> Arnaud
> 
> On 12 Mar 2014, at 13:44, Gilles Louppe <[email protected]> wrote:
> 
> > On 12 March 2014 13:08, Felipe Eltermann <[email protected]> wrote:
> >> Hello Vamsi,
> >>
> >> "Firstly, regarding the implementation of sparse functions. _tree.pxy is 
> >> the
> >> back end cython code to handle the operations Splitting, Evaluating
> >> impurities at nodes and then constructing the tree."
> >> That's right.
> >>
> >> "The basic implementation that i have in mind is that we duplicate the
> >> splitter classes and The tree architecture itself."
> >> Duplicating the code doesn't seem to be the best approach. It will be
> >> simpler if the same splitter class should handle both dense and sparse
> >> matrices.
> >> I started a PR that is already in progress:
> >> https://github.com/scikit-learn/scikit-learn/pull/2848
> >> If you want to help, you can read up the commits and help on the open
> >> issues.
> >> (however, I don't think this apply as a GSoC project though)
> >
> > Just my 2 cents, but I am not sure having both dense and sparse
> > support in the same codebase is the best choice. Both implementations
> > should in my opinion be separated, with the common code factored out
> > to avoid code dupplication.
> >
> > In particular, having an *efficient* sparse splitter will require
> > specific changes to the splitter. I am really not sure having this
> > code intermingled with the dense support is the best option.
> >
> >>
> >> On Tue, Mar 11, 2014 at 7:07 PM, vamsi kaushik 
> >> <[email protected]>
> >> wrote:
> >>>
> >>> hi Joly,
> >>>
> >>> I actually have a few questions:
> >>>
> >>> Firstly, regarding the implementation of sparse functions. _tree.pxy is
> >>> the back end cython code to handle the operations Splitting, Evaluating
> >>> impurities at nodes and then constructing the tree. The basic 
> >>> implementation
> >>> that i have in mind is that we duplicate the splitter classes and The tree
> >>> architecture itself. At each leaf node we might have a sparse matrix
> >>> representing the output data and at every non-leaf node, splitting should 
> >>> be
> >>> done on the sparse matrix at that node and produce few other nodes with
> >>> sparse output. So the project boils down to implementing good access and
> >>> construction methods for sparse matrices in cython. I just want to know if
> >>> this is a possible implementation criterion ?. And i would also like to 
> >>> know
> >>> the implementation plan that you have in mind ?
> >>>
> >>> Secondly, In my previous mail i have shown you my contributions. I would
> >>> just like to know if the patches are upto the mark for a selection. If 
> >>> they
> >>> are, i will turn my attention to the DT sparse problem, else i will work 
> >>> on
> >>> the issues a bit more.
> >>>
> >>>
> >>>
> >>> On Tue, Mar 11, 2014 at 1:12 PM, Arnaud Joly <[email protected]> wrote:
> >>>>
> >>>> Thanks for your contribution.
> >>>>
> >>>> Keep up!
> >>>>
> >>>> Arnaud
> >>>>
> >>>> On 10 Mar 2014, at 23:51, vamsi kaushik <[email protected]>
> >>>> wrote:
> >>>>
> >>>> My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
> >>>>
> >>>> And i have already contributed a little.
> >>>> Please take a look at :
> >>>>
> >>>> https://github.com/scikit-learn/scikit-learn/pull/2905
> >>>>
> >>>> and also:
> >>>> https://github.com/scikit-learn/scikit-learn/pull/2895
> >>>> https://github.com/scikit-learn/scikit-learn/issues/2727
> >>>>
> >>>> cheers,
> >>>> kaushik
> >>>>
> >>>>
> >>>> On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
> >>>> <[email protected]> wrote:
> >>>>>
> >>>>> hi Arnaud,
> >>>>>
> >>>>> Vamsi kaushik is actually me.
> >>>>> Thanks for your reply, i'll get to the issue soon
> >>>>>
> >>>>> cheers,
> >>>>> vamsi kaushik
> >>>>>
> >>>>>
> >>>>> On Mon, Mar 10, 2014 at 3:16 PM, Arnaud Joly <[email protected]> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Anything concerning the GSOC should pass by the scikit-learn
> >>>>>> mailing list.
> >>>>>>
> >>>>>> Thanks for your interest in the subject. If you intend to apply for a
> >>>>>> GSOC, I suggest you to read
> >>>>>>
> >>>>>> https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
> >>>>>> and start contributing to scikit-learn.
> >>>>>>
> >>>>>> Right now, three people have shown interest for this topic: Maheshakya
> >>>>>> (who is now
> >>>>>> applying for a GSOC about LSH),  Vamsi Kaushik  and you. Several
> >>>>>> candidates may apply for the
> >>>>>> same subject. However it is likely that if a GSOC is awarded for a
> >>>>>> given subject, only the best proposal will
> >>>>>> be selected.
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Arnaud
> >>>>>>
> >>>>>>
> >>>>>> On 28 Feb 2014, at 22:33, João <[email protected]> wrote:
> >>>>>>
> >>>>>> Hi Arnaud,
> >>>>>>
> >>>>>> While browsing the current GSoC projects I saw an interesting one for
> >>>>>> which you are assigned as mentor: "Add scipy.sparse matrix support to 
> >>>>>> the
> >>>>>> Decision Tree".
> >>>>>>
> >>>>>> I am considering applying for this project as I have already faced the
> >>>>>> necessity of sparse matrices in sklearn and the outcome was not totally
> >>>>>> satisfactory.
> >>>>>> Right now I am participating in a kaggle contest
> >>>>>> (http://www.kaggle.com/c/lshtc/) and facing several difficulties even 
> >>>>>> with
> >>>>>> algorithms that already support spase matrices (even simple algorithms 
> >>>>>> such
> >>>>>> as NB). In general I get some memory error and sometimes segfaults.
> >>>>>> I would be happy if I could implement the necessary support for DT (and
> >>>>>> related algorithms such as random forest and extra trees) and, if the 
> >>>>>> time
> >>>>>> constraint allows, improve as much as possible the general support for
> >>>>>> sparse matrices.
> >>>>>>
> >>>>>> With this email, I want to show my interest and ask you if there are
> >>>>>> already any candidates for this place.
> >>>>>>
> >>>>>> Best regards,
> >>>>>> --
> >>>>>> João
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> ------------------------------------------------------------------------------
> >>>>>> Learn Graph Databases - Download FREE O'Reilly Book
> >>>>>> "Graph Databases" is the definitive new guide to graph databases and
> >>>>>> their
> >>>>>> applications. Written by three acclaimed leaders in the field,
> >>>>>> this first edition is now available. Download your free book today!
> >>>>>> http://p.sf.net/sfu/13534_NeoTech
> >>>>>> _______________________________________________
> >>>>>> Scikit-learn-general mailing list
> >>>>>> [email protected]
> >>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> ------------------------------------------------------------------------------
> >>>> Learn Graph Databases - Download FREE O'Reilly Book
> >>>> "Graph Databases" is the definitive new guide to graph databases and
> >>>> their
> >>>> applications. Written by three acclaimed leaders in the field,
> >>>> this first edition is now available. Download your free book today!
> >>>>
> >>>> http://p.sf.net/sfu/13534_NeoTech_______________________________________________
> >>>> Scikit-learn-general mailing list
> >>>> [email protected]
> >>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> ------------------------------------------------------------------------------
> >>>> Learn Graph Databases - Download FREE O'Reilly Book
> >>>> "Graph Databases" is the definitive new guide to graph databases and
> >>>> their
> >>>> applications. Written by three acclaimed leaders in the field,
> >>>> this first edition is now available. Download your free book today!
> >>>> http://p.sf.net/sfu/13534_NeoTech
> >>>> _______________________________________________
> >>>> Scikit-learn-general mailing list
> >>>> [email protected]
> >>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >>>>
> >>>
> >>>
> >>>
> >>> ------------------------------------------------------------------------------
> >>> Learn Graph Databases - Download FREE O'Reilly Book
> >>> "Graph Databases" is the definitive new guide to graph databases and their
> >>> applications. Written by three acclaimed leaders in the field,
> >>> this first edition is now available. Download your free book today!
> >>> http://p.sf.net/sfu/13534_NeoTech
> >>> _______________________________________________
> >>> Scikit-learn-general mailing list
> >>> [email protected]
> >>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >>>
> >>
> >>
> >> ------------------------------------------------------------------------------
> >> Learn Graph Databases - Download FREE O'Reilly Book
> >> "Graph Databases" is the definitive new guide to graph databases and their
> >> applications. Written by three acclaimed leaders in the field,
> >> this first edition is now available. Download your free book today!
> >> http://p.sf.net/sfu/13534_NeoTech
> >> _______________________________________________
> >> Scikit-learn-general mailing list
> >> [email protected]
> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >>
> >
> > ------------------------------------------------------------------------------
> > Learn Graph Databases - Download FREE O'Reilly Book
> > "Graph Databases" is the definitive new guide to graph databases and their
> > applications. Written by three acclaimed leaders in the field,
> > this first edition is now available. Download your free book today!
> > http://p.sf.net/sfu/13534_NeoTech
> > _______________________________________________
> > Scikit-learn-general mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> 
> 
> ------------------------------------------------------------------------------
> Learn Graph Databases - Download FREE O'Reilly Book
> "Graph Databases" is the definitive new guide to graph databases and their
> applications. Written by three acclaimed leaders in the field,
> this first edition is now available. Download your free book today!
> http://p.sf.net/sfu/13534_NeoTech
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> 
> ------------------------------------------------------------------------------
> Learn Graph Databases - Download FREE O'Reilly Book
> "Graph Databases" is the definitive new guide to graph databases and their
> applications. Written by three acclaimed leaders in the field,
> this first edition is now available. Download your free book today!
> http://p.sf.net/sfu/13534_NeoTech_______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to