GitHub user manishamde opened a pull request: https://github.com/apache/spark/pull/886
SPARK-1536: multiclass classification support for decision tree

The ability to perform multiclass classification is a major advantage of decision trees and was a highly requested feature for MLlib. This pull request adds multiclass classification support to the MLlib decision tree.

It also adds sample-weight support through the WeightedLabeledPoint class, which helps with unbalanced datasets during classification. Sample weights are also a prerequisite for algorithms such as AdaBoost, which require instances to be weighted.

The change handles the special case of categorical features in multiclass classification: the categories cannot be ordered, so the optimizations that speed up binary classification cannot be applied directly. More specifically, for a categorical feature with m categories, it evaluates all 2^(m-1) - 1 categorical splits, provided that the number of splits is less than the maxBins value given in the input.

This condition will not be met for features with a large number of categories. Decision trees are not recommended for such datasets in general, since categorical features are favored over continuous features. Moreover, the user can combine several tricks to work around these constraints: increasing the bin size of the tree algorithm, using binary encoding for categorical features, or using a one-vs-all classification strategy.

The new code is accompanied by unit tests and has also been tested on the iris and covtype datasets.
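For reviewers who want to check the split count quoted in the description, here is a minimal sketch in plain Python. It is not the Scala code in this pull request. It only illustrates why m unordered categories yield 2^(m-1) - 1 candidate binary splits and how a maxBins-style check could gate the exhaustive search. The function names and the max_bins value of 100 used in the note below are illustrative assumptions.

```python
from itertools import combinations


def num_unordered_splits(m):
    """Number of distinct binary partitions of m unordered categories."""
    return 2 ** (m - 1) - 1


def enumerate_unordered_splits(categories):
    """Yield every (left, right) partition of distinct categories exactly once."""
    cats = list(categories)
    anchor, rest = cats[0], cats[1:]
    # Pinning the first category to the left side removes mirror-image
    # duplicates, e.g. ({a}, {b, c}) and ({b, c}, {a}) are the same split.
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            left = (anchor,) + extra
            right = tuple(c for c in rest if c not in extra)
            if right:  # skip the trivial split that sends everything left
                yield left, right


def can_search_exhaustively(m, max_bins):
    """Hypothetical gate: the split count must be less than max_bins."""
    return num_unordered_splits(m) < max_bins
```

Under these assumptions, a feature with 7 categories has 63 candidate splits and passes a max_bins of 100, while one with 8 categories has 127 and does not. This is the regime in which the description suggests a larger bin size, binary encoding, or one-vs-all.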
cc: @mengxr, @etrain, @hirakendu, @atalwalkar, @srowen

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/manishamde/spark multiclass

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/886.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #886

----
commit 50b143a4385f209fbc1793f3e03134cab3ab9583
Author: Manish Amde <manish...@gmail.com>
Date:   2014-04-20T20:33:03Z

    adding support for very deep trees

commit abc5a23bf80d792a345d723b44bff3ee217cd5ac
Author: Evan Sparks <spa...@cs.berkeley.edu>
Date:   2014-04-22T01:41:36Z

    Parameterizing max memory.

commit 2f6072c12a1466d783da258d4aa1bde789e7e875
Author: manishamde <manish...@gmail.com>
Date:   2014-04-22T03:43:47Z

    Merge pull request #5 from etrain/deep_tree

    Parameterizing max memory.

commit 2f1e093c5187a1ed532f9c19b25f8a2a6a46e27a
Author: Manish Amde <manish...@gmail.com>
Date:   2014-04-22T03:49:46Z

    minor: added doc for maxMemory parameter

commit 02877721328a560f210a7906061108ce5dd4bbbe
Author: Evan Sparks <spa...@cs.berkeley.edu>
Date:   2014-04-22T18:13:27Z

    Fixing scalastyle issue.

commit fecf89a51d6efc9e2ff06e700338ea944a4dd580
Author: manishamde <manish...@gmail.com>
Date:   2014-04-22T18:15:57Z

    Merge pull request #6 from etrain/deep_tree

    Fixing scalastyle issue.

commit 719d0098bb08b50e523cec3e388115d5a206512b
Author: Manish Amde <manish...@gmail.com>
Date:   2014-04-24T00:04:05Z

    updating user documentation

commit 9dbdabeeacc5fe5e0f1a729ce1ed8ab6ff399000
Author: Manish Amde <manish...@gmail.com>
Date:   2014-04-29T21:43:19Z

    merge from master

commit 15171550fe83e42fcb707744c9035ed540fb78d1
Author: Manish Amde <manish...@gmail.com>
Date:   2014-04-29T21:45:34Z

    updated documentation

commit 718506b2a0146a5794261a553847d363b7dfb932
Author: Manish Amde <manish...@gmail.com>
Date:   2014-04-30T23:29:24Z

    added unit test

commit e0426ee74d5e233c1e7b14e29135015d09a0370c
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-01T00:36:47Z

    renamed parameter

commit dad96523d740c2b7ced0f0d73ade66e528b64064
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-01T04:59:55Z

    removed unused imports

commit cbd9f140fd8d43941c61acd6055636bad88b358d
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-03T16:16:42Z

    modified scala.math to math

commit 5e822020ce50c6e1bdbdbb3d94d5cbc4c715731e
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-06T06:34:58Z

    added documentation, fixed off by 1 error in max level calculation

commit 4731cda7b08fdcd365dd1b690ac04a26f6e85657
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-06T06:44:39Z

    formatting

commit 5eca9e4fbd0e27e335d5cea0bf26b1a436be0457
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-06T06:47:14Z

    grammar

commit 8053fed22249bc788ba988489caa22f732b6416d
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-06T06:48:02Z

    more formatting

commit 426bb285f16c816b19e5c25518024ae4d2141c1a
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-06T07:16:02Z

    programming guide blurb

commit b27ad2c20edb8a7bf0c0edd5d82a6a683b5d9ea2
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-06T07:19:10Z

    formatting

commit ce004a1ab63405e0a5d0bc892a48b1c96c4d6605
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-06T17:29:04Z

    minor formatting

commit 7fc95457ec66023ddf14e0ef3e1e18cbf828a4db
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-07T17:47:27Z

    added docs

commit 968ca9df9b86c1dd60876c00fb3c48b758ffc34b
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-07T23:19:27Z

    merged master

commit a1a6e09d7858d82a4b91d40dfd3aeb83f4da2a06
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-01T04:57:42Z

    added weighted point class

commit 14aea48d10eca2727a1f79d3f65e508412c911ad
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-01T05:15:41Z

    changing instance format to weighted labeled point

commit 455bea92849f9c8f180cf6cbff8989b368d5b9ab
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-01T05:21:32Z

    fixed tests

commit 46f909c01419603d4526685dfcb2b713c8e3c979
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-04T23:30:04Z

    todo for multiclass support

commit 4d5f70c4688c1183b754f2133a4d5a11d862070a
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-06T05:52:08Z

    added multiclass support for find splits bins

commit 3f85a17d4c36bebb4831767dcd364fb12cf44873
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-07T01:01:37Z

    tests for multiclass classification

commit 46e06ee0ceb223aee50fa811a35d25090a5c4d42
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-07T01:05:58Z

    minor mods

commit 6c7af2206e6bd16e8bcc4feb4626bfccb5837c55
Author: Manish Amde <manish...@gmail.com>
Date:   2014-05-07T06:09:46Z

    prepared for multiclass without breaking binary classification

----

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---