GitHub user manishamde opened a pull request:
https://github.com/apache/spark/pull/79
MLI-1 Decision Trees
Joint work with @hirakendu, @etrain, @atalwalkar and @harsha2010.
Key features:
+ Supports binary classification and regression
+ Supports Gini, entropy and variance as impurity measures for information gain calculation
+ Supports both continuous and categorical features
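For readers unfamiliar with the impurity measures named above, here is a minimal sketch of how Gini, entropy and variance are commonly defined and how information gain is derived from them. The function names are hypothetical and chosen for illustration; this is not the code in the pull request.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def variance(values):
    """Population variance of a list of regression targets."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def information_gain(impurity, parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    if n == 0:
        return 0.0
    return (impurity(parent)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))
```

Gini and entropy apply to classification labels, while variance applies to regression targets; the same information_gain helper works with any of the three.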
The algorithm has gone through several development iterations over the last
few months, leading to a highly optimized implementation. Optimizations include:
1. Level-wise training to reduce passes over the entire dataset.
2. Bin-wise split calculation to reduce computation overhead.
3. Aggregation over partitions before combining to reduce communication
overhead.
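Optimizations 2 and 3 above can be illustrated with a small single-machine sketch. It assumes binary labels (0 or 1), one continuous feature, plain Python lists standing in for data partitions, and Gini impurity; all names are hypothetical and this is not the pull request's implementation. Each partition is reduced to per-bin label counts, the counts are summed across partitions, and candidate splits are then scored from the merged counts alone.

```python
from bisect import bisect_left
from functools import reduce


def bin_counts(partition, thresholds):
    """Per-bin [label-0 count, label-1 count] for one partition.

    Bin i holds values v with thresholds[i-1] < v <= thresholds[i];
    the last bin holds everything above the final threshold.
    """
    counts = [[0, 0] for _ in range(len(thresholds) + 1)]
    for value, label in partition:
        counts[bisect_left(thresholds, value)][label] += 1
    return counts


def merge(a, b):
    """Element-wise sum of two per-bin count tables."""
    return [[x0 + y0, x1 + y1] for (x0, x1), (y0, y1) in zip(a, b)]


def gini_from_counts(c0, c1):
    """Gini impurity computed directly from the two class counts."""
    n = c0 + c1
    if n == 0:
        return 0.0
    return 1.0 - (c0 / n) ** 2 - (c1 / n) ** 2


def best_split(partitions, thresholds):
    """Return (threshold, gain) with the highest gain, or (None, 0.0)."""
    # Aggregate inside each partition first, then combine the small tables.
    total = reduce(merge, (bin_counts(p, thresholds) for p in partitions))
    n0 = sum(c[0] for c in total)
    n1 = sum(c[1] for c in total)
    n = n0 + n1
    if n == 0:
        return (None, 0.0)
    parent = gini_from_counts(n0, n1)
    best = (None, 0.0)
    l0 = l1 = 0
    # One pass over the bins: running sums give the left-child counts.
    for i, t in enumerate(thresholds):
        l0 += total[i][0]
        l1 += total[i][1]
        r0, r1 = n0 - l0, n1 - l1
        gain = (parent
                - (l0 + l1) / n * gini_from_counts(l0, l1)
                - (r0 + r1) / n * gini_from_counts(r0, r1))
        if gain > best[1]:
            best = (t, gain)
    return best
```

Only the count tables, whose size depends on the number of bins rather than the number of rows, need to be exchanged between partitions in this sketch, which is the intuition behind reducing communication overhead.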
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/manishamde/spark tree
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/79.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #79
----
commit cd53eae11313fd30f71f5ec94b20fe8d4427b8cd
Author: Manish Amde <[email protected]>
Date: 2013-11-28T10:20:27Z
skeletal framework
Signed-off-by: Manish Amde <[email protected]>
commit 92cedce2eb5055e0164c90842d6613c618bfed94
Author: Manish Amde <[email protected]>
Date: 2013-12-02T06:52:29Z
basic building blocks for intermediate RDD calculation. untested.
Signed-off-by: Manish Amde <[email protected]>
commit 8bca1e20b703fd90bc6fcdbed5d36b42a0bdf66e
Author: Manish Amde <[email protected]>
Date: 2013-12-09T03:48:39Z
additional code for creating intermediate RDD
Signed-off-by: Manish Amde <[email protected]>
commit 0012a77eb02e0a6627b7e3e68ac4d0f29d0885e0
Author: Manish Amde <[email protected]>
Date: 2013-12-10T05:08:44Z
basic stump working
Signed-off-by: Manish Amde <[email protected]>
commit 03f534c2f9a8dd739945f92b98a58e93fa5b716a
Author: Manish Amde <[email protected]>
Date: 2013-12-10T06:10:46Z
some more tests
Signed-off-by: Manish Amde <[email protected]>
commit dad0afc85aea64c06b4dd64504b3112c881ae4e6
Author: Manish Amde <[email protected]>
Date: 2013-12-15T08:25:58Z
decision stump functionality working
Signed-off-by: Manish Amde <[email protected]>
commit 4798aae63e898fed71e6240462a163ad81ccd64b
Author: Manish Amde <[email protected]>
Date: 2013-12-15T08:45:23Z
added gain stats class
Signed-off-by: Manish Amde <[email protected]>
commit 80e8c66dd25ad03c706f4993b10ba4caafa54c18
Author: Manish Amde <[email protected]>
Date: 2013-12-16T01:41:59Z
working version of multi-level split calculation
Signed-off-by: Manish Amde <[email protected]>
commit b0eb866cfd2d98a9281127e02e0c159668ca01f4
Author: Manish Amde <[email protected]>
Date: 2013-12-16T04:42:52Z
added logic to handle leaf nodes
Signed-off-by: Manish Amde <[email protected]>
commit 98ec8d57a0a0897b093ced7e3284228ee21ce5f4
Author: Manish Amde <[email protected]>
Date: 2013-12-22T06:39:29Z
tree building and prediction logic
Signed-off-by: Manish Amde <[email protected]>
commit 02c595c65f784061b1a78d4cbd5cac5990d1881d
Author: Manish Amde <[email protected]>
Date: 2013-12-22T20:00:17Z
added command line parsing
Signed-off-by: Manish Amde <[email protected]>
commit 733d6ddf51ddf440efb1a17c818da6d7fd027c4b
Author: Manish Amde <[email protected]>
Date: 2013-12-22T20:20:50Z
fixed tests
Signed-off-by: Manish Amde <[email protected]>
commit 154aa77c925e44a92e8bbf2f55e43cab06e75006
Author: Manish Amde <[email protected]>
Date: 2013-12-23T06:51:17Z
enums for configurations
Signed-off-by: Manish Amde <[email protected]>
commit b0e3e76c47b1b449c91832aee2a6e94cee0a7c6b
Author: Manish Amde <[email protected]>
Date: 2014-01-12T19:45:47Z
adding enum for feature type
Signed-off-by: Manish Amde <[email protected]>
commit c8f6d60c45ec7ec8cfac94b43fb22d8c294221db
Author: Manish Amde <[email protected]>
Date: 2014-01-12T19:46:55Z
adding enum for feature type
Signed-off-by: Manish Amde <[email protected]>
commit e23c2e5089a2bf2a50c5d3f52e5799bf76ca3a16
Author: Manish Amde <[email protected]>
Date: 2014-01-19T21:23:45Z
added regression support
Signed-off-by: Manish Amde <[email protected]>
commit 53108ed6ad241765757c1e4c68189035505b370f
Author: Manish Amde <[email protected]>
Date: 2014-01-20T00:56:15Z
fixing index for highest bin
Signed-off-by: Manish Amde <[email protected]>
commit 6df35b9e70701528b13b33820b687f295bcfb3a4
Author: Manish Amde <[email protected]>
Date: 2014-01-21T04:33:52Z
regression predict logic
Signed-off-by: Manish Amde <[email protected]>
commit dbb7ac13d28fba0848062a7bea40c617cb5f2c80
Author: Manish Amde <[email protected]>
Date: 2014-01-23T04:44:23Z
categorical feature support
Signed-off-by: Manish Amde <[email protected]>
commit d504eb1f8a3f7f06226448d42b709f2f7ec6e91c
Author: Manish Amde <[email protected]>
Date: 2014-01-23T05:59:15Z
more tests for categorical features
Signed-off-by: Manish Amde <[email protected]>
commit 6b7de78e3a59bef8cbb8aff8b2aeed0cd91ab4a1
Author: Manish Amde <[email protected]>
Date: 2014-01-26T01:53:41Z
minor refactoring and tests
Signed-off-by: Manish Amde <[email protected]>
commit b09dc983f4f05da61479c87617526064b0e3dde8
Author: Manish Amde <[email protected]>
Date: 2014-01-26T22:54:43Z
minor refactoring
Signed-off-by: Manish Amde <[email protected]>
commit c0e522b7d1f5e27c81d682e5c8c97543fb4242be
Author: Manish Amde <[email protected]>
Date: 2014-01-27T03:11:43Z
updated predict and split threshold logic
Signed-off-by: Manish Amde <[email protected]>
commit f067d68f0d951e7f0f089419c506fbd5ce2c2fc1
Author: Manish Amde <[email protected]>
Date: 2014-01-27T03:36:21Z
minor cleanup
Signed-off-by: Manish Amde <[email protected]>
commit 5841c2838e6834fc8c767f3c83dba7ef99375fa4
Author: Manish Amde <[email protected]>
Date: 2014-01-27T06:34:49Z
unit tests for categorical features
Signed-off-by: Manish Amde <[email protected]>
commit 0dd7659055879be9fbb3280964f87b14c735f225
Author: manishamde <[email protected]>
Date: 2014-01-27T06:42:06Z
basic doc
Signed-off-by: Manish Amde <[email protected]>
commit dd0c0d799d42c94da3f930065a6c2973143bfd75
Author: Manish Amde <[email protected]>
Date: 2014-01-27T08:01:43Z
minor: some docs
Signed-off-by: Manish Amde <[email protected]>
commit 937277990e80f9a97070c63d39552579f0320fd7
Author: Manish Amde <[email protected]>
Date: 2014-02-17T03:42:48Z
code style: max line length <= 100
Signed-off-by: Manish Amde <[email protected]>
commit 84f85d6d0a1fe7ed60149cc6b29a9ff76ef09abd
Author: Manish Amde <[email protected]>
Date: 2014-02-28T04:57:56Z
code documentation
Signed-off-by: Manish Amde <[email protected]>
----