Hi Arnaud,
I am looking forward to include these in my proposal
http://cam.zcu.cz/~brandner/PSfiles/pooch.pdf
http://arxiv.org/pdf/1301.6685v1.pdf
as algorithms for sparse storage within the decision tree. Have your say on
this one
cheers,
kaushik Varanasi
On Tue, Mar 18, 2014 at 7:45 PM, vamsi kaushik
<[email protected]>wrote:
> Hi Arnaud,
>
> I just saw your mail regarding the fest implementation and dropped it from
> the proposal.
> Im currently working on my proposal and will post it on the wiki by
> tonight.
> The core idea is to alter the _tree.pyx module and then extend sparse
> support to ensemble methods and im also new to the community so i couldn't
> fix a lot of patches.
> I would like to know if there are other important points to mention in the
> proposal regarding the project
>
> cheers,
> kaushik Varanasi
>
>
> On Mon, Mar 17, 2014 at 7:49 PM, vamsi kaushik <
> [email protected]> wrote:
>
>> Hi,
>> And of the fest implementation i would say we just rewrite the algorithm
>> in cython instead of writing wrappers for it, this dosn't need licence and
>> is readable for other contributors and also easily changeable in future.
>> This boils down to a few 500 lines of code in cython.
>>
>>
>> On Mon, Mar 17, 2014 at 7:26 PM, vamsi kaushik <
>> [email protected]> wrote:
>>
>>> hi arnaud,
>>>
>>> yeah i have already gone through that implementation. I feel at least we
>>> should have separate splitter classes for the node splitting to be more
>>> efficient and flexible.
>>> And as of the choice of matrix, I don't have one in mind but mostly the
>>> elements are sorted along the feature columns. so I would say, csr maybe.
>>> But there can be
>>> other efficient formats too.
>>>
>>> And as of the input matrices, we can follow the same procedure followed
>>> in
>>>
>>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/cd_fast.pyx
>>>
>>> so as to not break the user code. And moreover i have read somewhere on
>>> the internet about sparse coding format, is it possible to use a more
>>> efficient sparse structure within the tree algorithm. But the input types
>>> remain the same, they get converted to another sparse structure once they
>>> are passed to the tree, since we are working with non-zero element
>>> pointers, this should not be memory/time inefficient. But my question is,
>>> is this acceptable?
>>>
>>>
>>> On Mon, Mar 17, 2014 at 6:49 PM, Arnaud Joly <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> The support for sparse matrices should exploit as much as possible the
>>>> sparsity structure of the matrix
>>>> without blowing up memory as to obtain optimal performance. Note that
>>>> the choice of the best sparse matrix
>>>> representation is still open (csc, csr, coo, ... matrix?).
>>>>
>>>> To my knowldege, there is one implementation of sparse decision tree
>>>> https://github.com/fest/fest. But I don't think that the license is
>>>> BSD compatible.
>>>>
>>>> If input sparsity is added, adaboost, GBRT, bagging and random
>>>> forest should support sparse input data.
>>>>
>>>> Best,
>>>> Arnaud
>>>>
>>>>
>>>> On 16 Mar 2014, at 02:49, vamsi kaushik <[email protected]>
>>>> wrote:
>>>>
>>>> Hi arnaud, Eltermann
>>>>
>>>> Thanks for the reply. Firstly I am a noob here so excuse me for any
>>>> stupid questions. Eltermann Just my 2 cents, I felt your approach(in
>>>> BestSplitter ) somewhat inefficient when you are calling the
>>>> get_sparse_item. Firstly even though we have indices of all the non zero
>>>> items, it is being called node_samples times, iterating over nnz times. An
>>>> easier way of doing this would be to sort the samples in sparse matrix
>>>> along features and then extracting if necessary. because the whole point in
>>>> using Xf is to sort them which becomes difficult when they are in an n-d
>>>> array. like for example
>>>>
>>>> http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.nonzero.html#scipy.sparse.csr_matrix.nonzero
>>>> gives the non zero elements. My solution might not be perfect but my
>>>> point is when we are adding support for sparse matrices, why not exploit
>>>> their structure as much as we can. Arnaud is this feasible ? Eltermann
>>>> anything wrong in my thinking ?
>>>>
>>>>
>>>> cheers,
>>>> kaushik varanasi
>>>> On Wed, Mar 12, 2014 at 8:29 PM, Arnaud Joly <[email protected]> wrote:
>>>>
>>>>> For the number of contributions, I would advise you to do as many
>>>>> as possible to get familiar with the codebase and scikit-learn
>>>>> practices.
>>>>> Of course starting to think and to work on how to add sparsity support
>>>>> to decision tree
>>>>> is important to grasp the challenges and write a convincing proposal.
>>>>> Thus, the right proportion is difficult to suggest.
>>>>>
>>>>> The goal of this GSOC is to have an efficient and a maintenable
>>>>> implementation.
>>>>> The overall project plan should be dictated by this goal.
>>>>>
>>>>> Best,
>>>>> Arnaud
>>>>>
>>>>> On 12 Mar 2014, at 13:44, Gilles Louppe <[email protected]> wrote:
>>>>>
>>>>> > On 12 March 2014 13:08, Felipe Eltermann <[email protected]>
>>>>> wrote:
>>>>> >> Hello Vamsi,
>>>>> >>
>>>>> >> "Firstly, regarding the implementation of sparse functions.
>>>>> _tree.pxy is the
>>>>> >> back end cython code to handle the operations Splitting, Evaluating
>>>>> >> impurities at nodes and then constructing the tree."
>>>>> >> That's right.
>>>>> >>
>>>>> >> "The basic implementation that i have in mind is that we duplicate
>>>>> the
>>>>> >> splitter classes and The tree architecture itself."
>>>>> >> Duplicating the code doesn't seem to be the best approach. It will
>>>>> be
>>>>> >> simpler if the same splitter class should handle both dense and
>>>>> sparse
>>>>> >> matrices.
>>>>> >> I started a PR that is already in progress:
>>>>> >> https://github.com/scikit-learn/scikit-learn/pull/2848
>>>>> >> If you want to help, you can read up the commits and help on the
>>>>> open
>>>>> >> issues.
>>>>> >> (however, I don't think this apply as a GSoC project though)
>>>>> >
>>>>> > Just my 2 cents, but I am not sure having both dense and sparse
>>>>> > support in the same codebase is the best choice. Both implementations
>>>>> > should in my opinion be separated, with the common code factored out
>>>>> > to avoid code dupplication.
>>>>> >
>>>>> > In particular, having an *efficient* sparse splitter will require
>>>>> > specific changes to the splitter. I am really not sure having this
>>>>> > code intermingled with the dense support is the best option.
>>>>> >
>>>>> >>
>>>>> >> On Tue, Mar 11, 2014 at 7:07 PM, vamsi kaushik <
>>>>> [email protected]>
>>>>> >> wrote:
>>>>> >>>
>>>>> >>> hi Joly,
>>>>> >>>
>>>>> >>> I actually have a few questions:
>>>>> >>>
>>>>> >>> Firstly, regarding the implementation of sparse functions.
>>>>> _tree.pxy is
>>>>> >>> the back end cython code to handle the operations Splitting,
>>>>> Evaluating
>>>>> >>> impurities at nodes and then constructing the tree. The basic
>>>>> implementation
>>>>> >>> that i have in mind is that we duplicate the splitter classes and
>>>>> The tree
>>>>> >>> architecture itself. At each leaf node we might have a sparse
>>>>> matrix
>>>>> >>> representing the output data and at every non-leaf node, splitting
>>>>> should be
>>>>> >>> done on the sparse matrix at that node and produce few other nodes
>>>>> with
>>>>> >>> sparse output. So the project boils down to implementing good
>>>>> access and
>>>>> >>> construction methods for sparse matrices in cython. I just want to
>>>>> know if
>>>>> >>> this is a possible implementation criterion ?. And i would also
>>>>> like to know
>>>>> >>> the implementation plan that you have in mind ?
>>>>> >>>
>>>>> >>> Secondly, In my previous mail i have shown you my contributions. I
>>>>> would
>>>>> >>> just like to know if the patches are upto the mark for a
>>>>> selection. If they
>>>>> >>> are, i will turn my attention to the DT sparse problem, else i
>>>>> will work on
>>>>> >>> the issues a bit more.
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> On Tue, Mar 11, 2014 at 1:12 PM, Arnaud Joly <[email protected]>
>>>>> wrote:
>>>>> >>>>
>>>>> >>>> Thanks for your contribution.
>>>>> >>>>
>>>>> >>>> Keep up!
>>>>> >>>>
>>>>> >>>> Arnaud
>>>>> >>>>
>>>>> >>>> On 10 Mar 2014, at 23:51, vamsi kaushik <
>>>>> [email protected]>
>>>>> >>>> wrote:
>>>>> >>>>
>>>>> >>>> My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
>>>>> >>>>
>>>>> >>>> And i have already contributed a little.
>>>>> >>>> Please take a look at :
>>>>> >>>>
>>>>> >>>> https://github.com/scikit-learn/scikit-learn/pull/2905
>>>>> >>>>
>>>>> >>>> and also:
>>>>> >>>> https://github.com/scikit-learn/scikit-learn/pull/2895
>>>>> >>>> https://github.com/scikit-learn/scikit-learn/issues/2727
>>>>> >>>>
>>>>> >>>> cheers,
>>>>> >>>> kaushik
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
>>>>> >>>> <[email protected]> wrote:
>>>>> >>>>>
>>>>> >>>>> hi Arnaud,
>>>>> >>>>>
>>>>> >>>>> Vamsi kaushik is actually me.
>>>>> >>>>> Thanks for your reply, i'll get to the issue soon
>>>>> >>>>>
>>>>> >>>>> cheers,
>>>>> >>>>> vamsi kaushik
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> On Mon, Mar 10, 2014 at 3:16 PM, Arnaud Joly <[email protected]>
>>>>> wrote:
>>>>> >>>>>>
>>>>> >>>>>> Hi,
>>>>> >>>>>>
>>>>> >>>>>> Anything concerning the GSOC should pass by the scikit-learn
>>>>> >>>>>> mailing list.
>>>>> >>>>>>
>>>>> >>>>>> Thanks for your interest in the subject. If you intend to apply
>>>>> for a
>>>>> >>>>>> GSOC, I suggest you to read
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
>>>>> >>>>>> and start contributing to scikit-learn.
>>>>> >>>>>>
>>>>> >>>>>> Right now, three people have shown interest for this topic:
>>>>> Maheshakya
>>>>> >>>>>> (who is now
>>>>> >>>>>> applying for a GSOC about LSH), Vamsi Kaushik and you. Several
>>>>> >>>>>> candidates may apply for the
>>>>> >>>>>> same subject. However it is likely that if a GSOC is awarded
>>>>> for a
>>>>> >>>>>> given subject, only the best proposal will
>>>>> >>>>>> be selected.
>>>>> >>>>>>
>>>>> >>>>>> Best regards,
>>>>> >>>>>> Arnaud
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>> On 28 Feb 2014, at 22:33, João <[email protected]> wrote:
>>>>> >>>>>>
>>>>> >>>>>> Hi Arnaud,
>>>>> >>>>>>
>>>>> >>>>>> While browsing the current GSoC projects I saw an interesting
>>>>> one for
>>>>> >>>>>> which you are assigned as mentor: "Add scipy.sparse matrix
>>>>> support to the
>>>>> >>>>>> Decision Tree".
>>>>> >>>>>>
>>>>> >>>>>> I am considering applying for this project as I have already
>>>>> faced the
>>>>> >>>>>> necessity of sparse matrices in sklearn and the outcome was not
>>>>> totally
>>>>> >>>>>> satisfactory.
>>>>> >>>>>> Right now I am participating in a kaggle contest
>>>>> >>>>>> (http://www.kaggle.com/c/lshtc/) and facing several
>>>>> difficulties even with
>>>>> >>>>>> algorithms that already support spase matrices (even simple
>>>>> algorithms such
>>>>> >>>>>> as NB). In general I get some memory error and sometimes
>>>>> segfaults.
>>>>> >>>>>> I would be happy if I could implement the necessary support for
>>>>> DT (and
>>>>> >>>>>> related algorithms such as random forest and extra trees) and,
>>>>> if the time
>>>>> >>>>>> constraint allows, improve as much as possible the general
>>>>> support for
>>>>> >>>>>> sparse matrices.
>>>>> >>>>>>
>>>>> >>>>>> With this email, I want to show my interest and ask you if
>>>>> there are
>>>>> >>>>>> already any candidates for this place.
>>>>> >>>>>>
>>>>> >>>>>> Best regards,
>>>>> >>>>>> --
>>>>> >>>>>> João
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> >>>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>>> >>>>>> "Graph Databases" is the definitive new guide to graph
>>>>> databases and
>>>>> >>>>>> their
>>>>> >>>>>> applications. Written by three acclaimed leaders in the field,
>>>>> >>>>>> this first edition is now available. Download your free book
>>>>> today!
>>>>> >>>>>> http://p.sf.net/sfu/13534_NeoTech
>>>>> >>>>>> _______________________________________________
>>>>> >>>>>> Scikit-learn-general mailing list
>>>>> >>>>>> [email protected]
>>>>> >>>>>>
>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>> >>>>>>
>>>>> >>>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> ------------------------------------------------------------------------------
>>>>> >>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>>> >>>> "Graph Databases" is the definitive new guide to graph databases
>>>>> and
>>>>> >>>> their
>>>>> >>>> applications. Written by three acclaimed leaders in the field,
>>>>> >>>> this first edition is now available. Download your free book
>>>>> today!
>>>>> >>>>
>>>>> >>>>
>>>>> http://p.sf.net/sfu/13534_NeoTech_______________________________________________
>>>>> >>>> Scikit-learn-general mailing list
>>>>> >>>> [email protected]
>>>>> >>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> ------------------------------------------------------------------------------
>>>>> >>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>>> >>>> "Graph Databases" is the definitive new guide to graph databases
>>>>> and
>>>>> >>>> their
>>>>> >>>> applications. Written by three acclaimed leaders in the field,
>>>>> >>>> this first edition is now available. Download your free book
>>>>> today!
>>>>> >>>> http://p.sf.net/sfu/13534_NeoTech
>>>>> >>>> _______________________________________________
>>>>> >>>> Scikit-learn-general mailing list
>>>>> >>>> [email protected]
>>>>> >>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>> >>>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> ------------------------------------------------------------------------------
>>>>> >>> Learn Graph Databases - Download FREE O'Reilly Book
>>>>> >>> "Graph Databases" is the definitive new guide to graph databases
>>>>> and their
>>>>> >>> applications. Written by three acclaimed leaders in the field,
>>>>> >>> this first edition is now available. Download your free book today!
>>>>> >>> http://p.sf.net/sfu/13534_NeoTech
>>>>> >>> _______________________________________________
>>>>> >>> Scikit-learn-general mailing list
>>>>> >>> [email protected]
>>>>> >>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>> >>>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> ------------------------------------------------------------------------------
>>>>> >> Learn Graph Databases - Download FREE O'Reilly Book
>>>>> >> "Graph Databases" is the definitive new guide to graph databases
>>>>> and their
>>>>> >> applications. Written by three acclaimed leaders in the field,
>>>>> >> this first edition is now available. Download your free book today!
>>>>> >> http://p.sf.net/sfu/13534_NeoTech
>>>>> >> _______________________________________________
>>>>> >> Scikit-learn-general mailing list
>>>>> >> [email protected]
>>>>> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>> >>
>>>>> >
>>>>> >
>>>>> ------------------------------------------------------------------------------
>>>>> > Learn Graph Databases - Download FREE O'Reilly Book
>>>>> > "Graph Databases" is the definitive new guide to graph databases and
>>>>> their
>>>>> > applications. Written by three acclaimed leaders in the field,
>>>>> > this first edition is now available. Download your free book today!
>>>>> > http://p.sf.net/sfu/13534_NeoTech
>>>>> > _______________________________________________
>>>>> > Scikit-learn-general mailing list
>>>>> > [email protected]
>>>>> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>>> "Graph Databases" is the definitive new guide to graph databases and
>>>>> their
>>>>> applications. Written by three acclaimed leaders in the field,
>>>>> this first edition is now available. Download your free book today!
>>>>> http://p.sf.net/sfu/13534_NeoTech
>>>>> _______________________________________________
>>>>> Scikit-learn-general mailing list
>>>>> [email protected]
>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>> "Graph Databases" is the definitive new guide to graph databases and
>>>> their
>>>> applications. Written by three acclaimed leaders in the field,
>>>> this first edition is now available. Download your free book today!
>>>>
>>>> http://p.sf.net/sfu/13534_NeoTech_______________________________________________
>>>> Scikit-learn-general mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>> "Graph Databases" is the definitive new guide to graph databases and
>>>> their
>>>> applications. Written by three acclaimed leaders in the field,
>>>> this first edition is now available. Download your free book today!
>>>> http://p.sf.net/sfu/13534_NeoTech
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>>
>>>
>>
>
------------------------------------------------------------------------------
Put Bad Developers to Shame
Dominate Development with Jenkins Continuous Integration
Continuously Automate Build, Test & Deployment
Start a new project now. Try Jenkins in the cloud.
http://p.sf.net/sfu/13600_Cloudbees
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general