GitHub user manishamde opened a pull request:

    https://github.com/apache/spark/pull/886

    SPARK-1536: multiclass classification support for decision tree

    The ability to perform multiclass classification is a major advantage of 
decision trees and was a highly requested feature for MLlib. This pull 
request adds multiclass classification support to the MLlib decision tree. It 
also adds sample-weight support through the WeightedLabeledPoint class, which 
helps with unbalanced datasets during classification. Sample weights will also 
support algorithms such as AdaBoost, which require instances to be weighted.
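
    To illustrate why per-instance weights help with unbalanced classes, here 
is a minimal Python sketch of a weighted Gini impurity. It is not taken from 
this pull request: the function name weighted_gini and the (label, weight) 
tuple layout are assumptions made for illustration, and the way MLlib actually 
folds weights into its impurity calculation may differ.

```python
from collections import defaultdict


def weighted_gini(points):
    """Gini impurity over (label, weight) pairs.

    Hypothetical helper: each instance contributes its weight to the
    class total instead of a count of 1.
    """
    totals = defaultdict(float)
    for label, weight in points:
        totals[label] += weight
    total = sum(totals.values())
    if total == 0.0:
        return 0.0
    return 1.0 - sum((w / total) ** 2 for w in totals.values())


# Nine instances of class 0 and one of class 1, all with unit weight.
unweighted = [(0, 1.0)] * 9 + [(1, 1.0)]
# Same instances, with the rare class given nine times the weight.
reweighted = [(0, 1.0)] * 9 + [(1, 9.0)]

print(weighted_gini(unweighted))
print(weighted_gini(reweighted))
```

    With unit weights the node looks nearly pure, so the rare class has 
little influence on split selection. After reweighting, both classes carry 
equal total weight and the impurity rises to that of a balanced node.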
    
    It handles the special case in which categorical variables cannot be 
ordered for multiclass classification, so the optimizations that speed up 
binary classification cannot be applied directly to multiclass classification 
with categorical variables. More specifically, for a categorical feature with 
m categories, it evaluates all 2^(m-1) - 1 categorical splits, provided that 
the number of splits is less than the maxBins value given in the input. This 
condition will not be met for features with a large number of categories. 
Decision trees are generally not recommended for such datasets, since 
categorical features are favored over continuous features. Moreover, the user 
can combine several workarounds (increasing the maximum number of bins for the 
tree algorithm, using binary encoding for categorical features, or using a 
one-vs-all classification strategy) to avoid these constraints.
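
    The split count above can be checked with a short Python sketch. The 
helper names (num_unordered_splits, unordered_splits, can_enumerate) and the 
maxBins value of 100 are assumptions chosen for illustration. They are not 
identifiers or defaults from this pull request.

```python
from itertools import combinations


def num_unordered_splits(m):
    """Number of ways to split m categories into two non-empty groups."""
    return 2 ** (m - 1) - 1


def unordered_splits(categories):
    """Yield the left-hand group of every distinct two-way partition."""
    cats = list(categories)
    if not cats:
        return
    first, rest = cats[0], cats[1:]
    # Pin the first category to the left group so that a partition and its
    # mirror image are produced only once. Stopping before len(rest)
    # keeps the right group non-empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            yield (first,) + extra


def can_enumerate(m, max_bins):
    """Condition from the description: the split count must be below maxBins."""
    return num_unordered_splits(m) < max_bins


MAX_BINS = 100  # illustrative value only

for m in (3, 5, 10):
    print(m, num_unordered_splits(m), can_enumerate(m, MAX_BINS))

print(list(unordered_splits("abc")))
```

    The count grows exponentially in m, which is why features with many 
categories quickly fail the maxBins condition.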
    
    The new code is accompanied by unit tests and has also been tested on the 
iris and covtype datasets.
    
    cc: @mengxr, @etrain, @hirakendu, @atalwalkar, @srowen

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/manishamde/spark multiclass

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/886.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #886
    
----
commit 50b143a4385f209fbc1793f3e03134cab3ab9583
Author: Manish Amde <manish...@gmail.com>
Date:   2014-04-20T20:33:03Z

    adding support for very deep trees

commit abc5a23bf80d792a345d723b44bff3ee217cd5ac
Author: Evan Sparks <spa...@cs.berkeley.edu>
Date:   2014-04-22T01:41:36Z

    Parameterizing max memory.

commit 2f6072c12a1466d783da258d4aa1bde789e7e875
Author: manishamde <manish...@gmail.com>
Date:   2014-04-22T03:43:47Z

    Merge pull request #5 from etrain/deep_tree
    
    Parameterizing max memory.

commit 2f1e093c5187a1ed532f9c19b25f8a2a6a46e27a
Author: Manish Amde <manish...@gmail.com>
Date:   2014-04-22T03:49:46Z

    minor: added doc for maxMemory parameter

commit 02877721328a560f210a7906061108ce5dd4bbbe
Author: Evan Sparks <spa...@cs.berkeley.edu>
Date:   2014-04-22T18:13:27Z

    Fixing scalastyle issue.

commit fecf89a51d6efc9e2ff06e700338ea944a4dd580
Author: manishamde <manish...@gmail.com>
Date:   2014-04-22T18:15:57Z

    Merge pull request #6 from etrain/deep_tree
    
    Fixing scalastyle issue.

commit 719d0098bb08b50e523cec3e388115d5a206512b
Author: Manish Amde <manish...@gmail.com>
Date:   2014-04-24T00:04:05Z

    updating user documentation

commit 9dbdabeeacc5fe5e0f1a729ce1ed8ab6ff399000
Author: Manish Amde <manish...@gmail.com>
Date:   2014-04-29T21:43:19Z

    merge from master

commit 15171550fe83e42fcb707744c9035ed540fb78d1
Author: Manish Amde <manish...@gmail.com>
Date:   2014-04-29T21:45:34Z

    updated documentation

commit 718506b2a0146a5794261a553847d363b7dfb932
Author: Manish Amde <manish...@gmail.com>
Date:   2014-04-30T23:29:24Z

    added unit test

commit e0426ee74d5e233c1e7b14e29135015d09a0370c
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-01T00:36:47Z

    renamed parameter

commit dad96523d740c2b7ced0f0d73ade66e528b64064
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-01T04:59:55Z

    removed unused imports

commit cbd9f140fd8d43941c61acd6055636bad88b358d
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-03T16:16:42Z

    modified scala.math to math

commit 5e822020ce50c6e1bdbdbb3d94d5cbc4c715731e
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-06T06:34:58Z

    added documentation, fixed off by 1 error in max level calculation

commit 4731cda7b08fdcd365dd1b690ac04a26f6e85657
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-06T06:44:39Z

    formatting

commit 5eca9e4fbd0e27e335d5cea0bf26b1a436be0457
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-06T06:47:14Z

    grammar

commit 8053fed22249bc788ba988489caa22f732b6416d
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-06T06:48:02Z

    more formatting

commit 426bb285f16c816b19e5c25518024ae4d2141c1a
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-06T07:16:02Z

    programming guide blurb

commit b27ad2c20edb8a7bf0c0edd5d82a6a683b5d9ea2
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-06T07:19:10Z

    formatting

commit ce004a1ab63405e0a5d0bc892a48b1c96c4d6605
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-06T17:29:04Z

    minor formatting

commit 7fc95457ec66023ddf14e0ef3e1e18cbf828a4db
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-07T17:47:27Z

    added docs

commit 968ca9df9b86c1dd60876c00fb3c48b758ffc34b
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-07T23:19:27Z

    merged master

commit a1a6e09d7858d82a4b91d40dfd3aeb83f4da2a06
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-01T04:57:42Z

    added weighted point class

commit 14aea48d10eca2727a1f79d3f65e508412c911ad
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-01T05:15:41Z

    changing instance format to weighted labeled point

commit 455bea92849f9c8f180cf6cbff8989b368d5b9ab
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-01T05:21:32Z

    fixed tests

commit 46f909c01419603d4526685dfcb2b713c8e3c979
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-04T23:30:04Z

    todo for multiclass support

commit 4d5f70c4688c1183b754f2133a4d5a11d862070a
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-06T05:52:08Z

    added multiclass support for find splits bins

commit 3f85a17d4c36bebb4831767dcd364fb12cf44873
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-07T01:01:37Z

    tests for multiclass classification

commit 46e06ee0ceb223aee50fa811a35d25090a5c4d42
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-07T01:05:58Z

    minor mods

commit 6c7af2206e6bd16e8bcc4feb4626bfccb5837c55
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-07T06:09:46Z

    prepared for multiclass without breaking binary classification

----


