[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48994002 @manishamde - can you add `[MLlib]` to the title of this pull request? Otherwise it doesn't get filtered properly by our filters. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48998279
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16661/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48958683
QA tests have started for PR 886. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16635/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48958795
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16635/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48960458
QA tests have started for PR 886. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16636/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48960547
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16636/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48961699
QA tests have started for PR 886. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16637/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48961781
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16637/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48962289
QA tests have started for PR 886. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16638/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48971082
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16638/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48972522
QA tests have started for PR 886. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16645/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48978578
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16645/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48992765
QA tests have started for PR 886. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16661/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r14865144
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -768,104 +973,157 @@ object DecisionTree extends Serializable with Logging {
   /**
    * Extracts left and right split aggregates.
    * @param binData Array[Double] of size 2*numFeatures*numSplits
-   * @return (leftNodeAgg, rightNodeAgg) tuple of type (Array[Double],
-   * Array[Double]) where each array is of size(numFeature,2*(numSplits-1))
+   * @return (leftNodeAgg, rightNodeAgg) tuple of type (Array[Array[Array[Double\]\]\],
+   * Array[Array[Array[Double\]\]\]) where each array is of size(numFeature,
+   * (numBins - 1), numClasses)
    */
   def extractLeftRightNodeAggregates(
-    binData: Array[Double]): (Array[Array[Double]], Array[Array[Double]]) = {
+    binData: Array[Double]): (Array[Array[Array[Double]]], Array[Array[Array[Double]]]) = {
+
+
+  def findAggForOrderedFeatureClassification(
+      leftNodeAgg: Array[Array[Array[Double]]],
+      rightNodeAgg: Array[Array[Array[Double]]],
+      featureIndex: Int) {
+
+    // shift for this featureIndex
+    val shift = numClasses * featureIndex * numBins
+
+    var classIndex = 0
+    while (classIndex < numClasses) {
+      // left node aggregate for the lowest split
+      leftNodeAgg(featureIndex)(0)(classIndex) = binData(shift + classIndex)
+      // right node aggregate for the highest split
+      rightNodeAgg(featureIndex)(numBins - 2)(classIndex)
+        = binData(shift + (numClasses * (numBins - 1)) + classIndex)
+      classIndex += 1
+    }
+
+    // Iterate over all splits.
+    var splitIndex = 1
+    while (splitIndex < numBins - 1) {
+      // calculating left node aggregate for a split as a sum of left node aggregate of a
+      // lower split and the left bin aggregate of a bin where the split is a high split
+      var innerClassIndex = 0
+      while (innerClassIndex < numClasses) {
+        leftNodeAgg(featureIndex)(splitIndex)(innerClassIndex)
+          = binData(shift + numClasses * splitIndex + innerClassIndex) +
+            leftNodeAgg(featureIndex)(splitIndex - 1)(innerClassIndex)
+        rightNodeAgg(featureIndex)(numBins - 2 - splitIndex)(innerClassIndex) =
+          binData(shift + (numClasses * (numBins - 1 - splitIndex) + innerClassIndex)) +
+            rightNodeAgg(featureIndex)(numBins - 1 - splitIndex)(innerClassIndex)
+        innerClassIndex += 1
+      }
+      splitIndex += 1
+    }
+  }
+
+  def findAggForUnorderedFeatureClassification(
+      leftNodeAgg: Array[Array[Array[Double]]],
+      rightNodeAgg: Array[Array[Array[Double]]],
+      featureIndex: Int) {
+
+    val rightChildShift = numClasses * numBins * numFeatures
+    var splitIndex = 0
+    while (splitIndex < numBins - 1) {
+      var classIndex = 0
+      while (classIndex < numClasses) {
+        // shift for this featureIndex
+        val shift = numClasses * featureIndex * numBins + splitIndex * numClasses
+        val leftBinValue = binData(shift + classIndex)
+        val rightBinValue = binData(rightChildShift + shift + classIndex)
+        leftNodeAgg(featureIndex)(splitIndex)(classIndex) = leftBinValue
+        rightNodeAgg(featureIndex)(splitIndex)(classIndex) = rightBinValue
+        classIndex += 1
+      }
+      splitIndex += 1
+    }
+  }
+
+  def findAggForRegression(
+      leftNodeAgg: Array[Array[Array[Double]]],
+      rightNodeAgg: Array[Array[Array[Double]]],
+      featureIndex: Int) {
+
+    // shift for this featureIndex
+    val shift = 3 * featureIndex * numBins
+    // left node aggregate for the lowest split
+    leftNodeAgg(featureIndex)(0)(0) = binData(shift + 0)
+    leftNodeAgg(featureIndex)(0)(1) = binData(shift + 1)
+    leftNodeAgg(featureIndex)(0)(2) = binData(shift + 2)
+
+    // right node aggregate for the highest split
+    rightNodeAgg(featureIndex)(numBins - 2)(0) =
+      binData(shift + (3 * (numBins - 1)))
+    rightNodeAgg(featureIndex)(numBins - 2)(1) =
+      binData(shift + (3 * (numBins - 1)) + 1)
+    rightNodeAgg(featureIndex)(numBins - 2)(2) =
+      binData(shift + (3 * (numBins - 1)) + 2)
+
+    // Iterate over all splits.
+    var
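The ordered-feature case in the quoted diff builds each split's left aggregate as a running sum over bins from the low end, and each right aggregate as a running sum from the high end. The following is a minimal sketch of that idea for a single feature, not the code from the PR: it assumes a hypothetical 2-D `binCounts(bin)(class)` array instead of the flat `binData` layout, and the names `OrderedAggSketch` and `leftRight` are illustrative only.

```scala
object OrderedAggSketch {
  /**
   * binCounts(bin)(cls) holds the per-class count in each bin of one feature
   * (hypothetical layout; the PR uses a flat array with a shift).
   * Returns (left, right), each of size (numBins - 1) x numClasses.
   * Assumes at least two bins.
   */
  def leftRight(
      binCounts: Array[Array[Double]]): (Array[Array[Double]], Array[Array[Double]]) = {
    val numBins = binCounts.length
    require(numBins >= 2, "need at least two bins to form a split")
    val numClasses = binCounts(0).length
    val numSplits = numBins - 1
    val left = Array.ofDim[Double](numSplits, numClasses)
    val right = Array.ofDim[Double](numSplits, numClasses)

    var c = 0
    while (c < numClasses) {
      // Lowest split: everything in bin 0 goes left.
      left(0)(c) = binCounts(0)(c)
      // Highest split: everything in the last bin goes right.
      right(numSplits - 1)(c) = binCounts(numBins - 1)(c)
      c += 1
    }

    var s = 1
    while (s < numSplits) {
      c = 0
      while (c < numClasses) {
        // Left aggregate grows by one bin per split, moving upward.
        left(s)(c) = binCounts(s)(c) + left(s - 1)(c)
        // Right aggregate grows by one bin per split, moving downward.
        right(numSplits - 1 - s)(c) =
          binCounts(numBins - 1 - s)(c) + right(numSplits - s)(c)
        c += 1
      }
      s += 1
    }
    (left, right)
  }
}
```

With three bins and two classes, split 0 separates bin 0 from bins 1 and 2, so its right aggregate is the element-wise sum of the last two bins.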
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48867199 Thanks Evan. I have compared to scikit-learn on the covertype dataset and the results looked similar.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r14836561
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -768,104 +973,157 @@ object DecisionTree extends Serializable with Logging {
   /**
    * Extracts left and right split aggregates.
    * @param binData Array[Double] of size 2*numFeatures*numSplits
-   * @return (leftNodeAgg, rightNodeAgg) tuple of type (Array[Double],
-   * Array[Double]) where each array is of size(numFeature,2*(numSplits-1))
+   * @return (leftNodeAgg, rightNodeAgg) tuple of type (Array[Array[Array[Double\]\]\],
+   * Array[Array[Array[Double\]\]\]) where each array is of size(numFeature,
+   * (numBins - 1), numClasses)
    */
   def extractLeftRightNodeAggregates(
-    binData: Array[Double]): (Array[Array[Double]], Array[Array[Double]]) = {
+    binData: Array[Double]): (Array[Array[Array[Double]]], Array[Array[Array[Double]]]) = {
+
+
+  def findAggForOrderedFeatureClassification(
+      leftNodeAgg: Array[Array[Array[Double]]],
+      rightNodeAgg: Array[Array[Array[Double]]],
+      featureIndex: Int) {
+
+    // shift for this featureIndex
+    val shift = numClasses * featureIndex * numBins
+
+    var classIndex = 0
+    while (classIndex < numClasses) {
+      // left node aggregate for the lowest split
+      leftNodeAgg(featureIndex)(0)(classIndex) = binData(shift + classIndex)
+      // right node aggregate for the highest split
+      rightNodeAgg(featureIndex)(numBins - 2)(classIndex)
+        = binData(shift + (numClasses * (numBins - 1)) + classIndex)
+      classIndex += 1
+    }
+
+    // Iterate over all splits.
+    var splitIndex = 1
+    while (splitIndex < numBins - 1) {
+      // calculating left node aggregate for a split as a sum of left node aggregate of a
+      // lower split and the left bin aggregate of a bin where the split is a high split
+      var innerClassIndex = 0
+      while (innerClassIndex < numClasses) {
+        leftNodeAgg(featureIndex)(splitIndex)(innerClassIndex)
+          = binData(shift + numClasses * splitIndex + innerClassIndex) +
+            leftNodeAgg(featureIndex)(splitIndex - 1)(innerClassIndex)
+        rightNodeAgg(featureIndex)(numBins - 2 - splitIndex)(innerClassIndex) =
+          binData(shift + (numClasses * (numBins - 1 - splitIndex) + innerClassIndex)) +
+            rightNodeAgg(featureIndex)(numBins - 1 - splitIndex)(innerClassIndex)
+        innerClassIndex += 1
+      }
+      splitIndex += 1
+    }
+  }
+
+  def findAggForUnorderedFeatureClassification(
+      leftNodeAgg: Array[Array[Array[Double]]],
+      rightNodeAgg: Array[Array[Array[Double]]],
+      featureIndex: Int) {
+
+    val rightChildShift = numClasses * numBins * numFeatures
+    var splitIndex = 0
+    while (splitIndex < numBins - 1) {
+      var classIndex = 0
+      while (classIndex < numClasses) {
+        // shift for this featureIndex
+        val shift = numClasses * featureIndex * numBins + splitIndex * numClasses
+        val leftBinValue = binData(shift + classIndex)
+        val rightBinValue = binData(rightChildShift + shift + classIndex)
+        leftNodeAgg(featureIndex)(splitIndex)(classIndex) = leftBinValue
+        rightNodeAgg(featureIndex)(splitIndex)(classIndex) = rightBinValue
+        classIndex += 1
+      }
+      splitIndex += 1
+    }
+  }
+
+  def findAggForRegression(
+      leftNodeAgg: Array[Array[Array[Double]]],
+      rightNodeAgg: Array[Array[Array[Double]]],
+      featureIndex: Int) {
+
+    // shift for this featureIndex
+    val shift = 3 * featureIndex * numBins
+    // left node aggregate for the lowest split
+    leftNodeAgg(featureIndex)(0)(0) = binData(shift + 0)
+    leftNodeAgg(featureIndex)(0)(1) = binData(shift + 1)
+    leftNodeAgg(featureIndex)(0)(2) = binData(shift + 2)
+
+    // right node aggregate for the highest split
+    rightNodeAgg(featureIndex)(numBins - 2)(0) =
+      binData(shift + (3 * (numBins - 1)))
+    rightNodeAgg(featureIndex)(numBins - 2)(1) =
+      binData(shift + (3 * (numBins - 1)) + 1)
+    rightNodeAgg(featureIndex)(numBins - 2)(2) =
+      binData(shift + (3 * (numBins - 1)) + 2)
+
+    // Iterate over all splits.
+    var splitIndex
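For unordered features, the quoted diff reads left-child and right-child values from two halves of the flat `binData` array, with the right half offset by `numClasses * numBins * numFeatures`. The index arithmetic can be isolated as below. This is only a sketch of the offsets visible in the diff; the object and method names (`UnorderedIndexSketch`, `leftIndex`, `rightIndex`) are hypothetical and do not appear in the PR.

```scala
object UnorderedIndexSketch {
  /** Flat index of the left-child value for (feature, split, class). */
  def leftIndex(
      numClasses: Int,
      numBins: Int,
      featureIndex: Int,
      splitIndex: Int,
      classIndex: Int): Int = {
    // Same arithmetic as `shift + classIndex` in the quoted diff.
    numClasses * featureIndex * numBins + splitIndex * numClasses + classIndex
  }

  /** Flat index of the right-child value; the right half starts after all left values. */
  def rightIndex(
      numClasses: Int,
      numBins: Int,
      numFeatures: Int,
      featureIndex: Int,
      splitIndex: Int,
      classIndex: Int): Int = {
    val rightChildShift = numClasses * numBins * numFeatures
    rightChildShift + leftIndex(numClasses, numBins, featureIndex, splitIndex, classIndex)
  }
}
```

For example, with 3 classes, 4 bins and 2 features, the left value for feature 1, split 2, class 1 sits at index 19 and its right counterpart at 24 + 19 = 43.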
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48767401 I've gone through this in some depth, and aside from a couple of minor style nits - the logic looks good to me. Manish - have you compared output vs. scikit-learn for multiclass datasets and verified that things look at least reasonably similar? Really awesome work!
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48674298
QA tests have started for PR 886. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16530/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48674369
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class WeightedLabeledPoint(label: Double, features: Vector, weight:Double = 1) {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16530/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48674374 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16530/
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48675447
QA tests have started for PR 886. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16531/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48675496
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class WeightedLabeledPoint(label: Double, features: Vector, weight:Double = 1) {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16531/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48675499 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16531/
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48683128
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16538/consoleFull
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48412354 Merged build triggered.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48412365 Merged build started.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48412478 Merged build finished.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48412480 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16428/
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48413445 Merged build started.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48413437 Merged build triggered.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48413580 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16430/
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48413579 Merged build finished.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48415119 Merged build started.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48415107 Merged build triggered.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48415267 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16432/
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48415266 Merged build finished.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48143660 @etrain Added implicit conversion. :-)
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48143754 Merged build triggered.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48143761 Merged build started.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48143827 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16362/
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-48143826 Merged build finished.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r13982468 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -49,6 +49,7 @@ object DecisionTreeRunner { case class Params( input: String = null, algo: Algo = Classification, + numClassesForClassification: Int = 2, --- End diff -- Inferring it from a large dataset could take a lot of time. In general, most practitioners know the number of classes in advance. If not, we can add a pre-processing step. Currently `numClassesForClassification` is our only classification-specific parameter. In general, I agree with you. At the same time, I didn't want to create more configuration classes for the user. Shall we leave it as is for now and handle it with the ensembles PR, where we have more parameters (boosting iterations, num trees, feature subsetting, etc.)?
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r13982568 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -45,7 +46,7 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo * @param input RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] used as training data * @return a DecisionTreeModel that can be used for prediction */ - def train(input: RDD[LabeledPoint]): DecisionTreeModel = { + def train(input: RDD[WeightedLabeledPoint]): DecisionTreeModel = { --- End diff -- Agree. I started with implicit conversions and forgot the reason why I switched. I will give it a try again. If it works, great. If not, I will remember why it doesn't work well. :-)
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r13982597 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -233,13 +234,73 @@ object DecisionTree extends Serializable with Logging { algo: Algo, impurity: Impurity, maxDepth: Int): DecisionTreeModel = { -val strategy = new Strategy(algo,impurity,maxDepth) -new DecisionTree(strategy).train(input: RDD[LabeledPoint]) +val strategy = new Strategy(algo, impurity, maxDepth) +// Converting from standard instance format to weighted input format for tree training +val weightedInput = input.map(x => WeightedLabeledPoint(x.label, x.features)) +new DecisionTree(strategy).train(weightedInput: RDD[WeightedLabeledPoint]) --- End diff -- Thanks. Will remove.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-46593827 Thanks @etrain 1. I will try to use implicits. 2. I agree. We originally had separate trees and then merged them for readability. There is a sweet spot in between that we need to find. Agreed, it's a major refactoring. I think it will be best to do it in or after the ensemble PR, where we will know most of the cases we need to handle.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r13982852 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -49,6 +49,7 @@ object DecisionTreeRunner { case class Params( input: String = null, algo: Algo = Classification, + numClassesForClassification: Int = 2, --- End diff -- Yeah, makes sense. If it doesn't complicate things too much we might consider adding an interface that doesn't have this specified and figures it out in one shot. Worth noting is that in R, an object of type factor (the default for categorical/label data) has this information built in. It can be a big pain at load time while the system tries to figure out the cardinality of the factor, but it leads to a nice compact representation of the data and eliminates situations like this one. I agree on doing the API separation with the ensembles PR.
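Editorial sketch: the "figure it out in one shot" interface discussed above could look roughly like the following. This is a minimal illustration over a plain Scala collection, not code from the PR; `NumClassesSketch` and `inferNumClasses` are hypothetical names, and the sketch assumes class labels are encoded as 0.0, 1.0, ..., k - 1.

```scala
object NumClassesSketch {
  // Infers the class count in a single pass over the labels.
  // Assumption: labels are encoded as 0.0, 1.0, ..., k - 1.
  def inferNumClasses(labels: Iterable[Double]): Int = {
    require(labels.nonEmpty, "need at least one label")
    labels.map(_.toInt).max + 1
  }
}
```

On a distributed dataset the same computation would cost one extra full pass over the data, which is the overhead the thread weighs against asking the user for the value. Taking the maximum label rather than counting distinct labels still works when an intermediate class is missing from the sample, but it undercounts if the highest class never appears.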
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r13983131 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -49,6 +49,7 @@ object DecisionTreeRunner { case class Params( input: String = null, algo: Algo = Classification, + numClassesForClassification: Int = 2, --- End diff -- Good point. Let me create a JIRA ticket for this so that we don't forget. :-)
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r13996228 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -49,6 +49,7 @@ object DecisionTreeRunner { case class Params( input: String = null, algo: Algo = Classification, + numClassesForClassification: Int = 2, --- End diff -- https://issues.apache.org/jira/browse/SPARK-2206
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r13926351 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala --- @@ -49,6 +49,7 @@ object DecisionTreeRunner { case class Params( input: String = null, algo: Algo = Classification, + numClassesForClassification: Int = 2, --- End diff -- Do we want this to be a parameter and not inferred from the data? Also - I'm wondering if it makes sense to subclass params with DecisionTreeParams vs. RegressionTreeParams so that we keep logically separate options separate.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r13926460 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -45,7 +46,7 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo * @param input RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] used as training data * @return a DecisionTreeModel that can be used for prediction */ - def train(input: RDD[LabeledPoint]): DecisionTreeModel = { + def train(input: RDD[WeightedLabeledPoint]): DecisionTreeModel = { --- End diff -- If we're going to change the interface, it might be nice to have an implicit conversion between LabeledPoint and WeightedLabeledPoint (which assigns weight 1 to everything). I think the common case is going to be using unweighted anyway.
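Editorial sketch: the implicit conversion suggested above might be shaped as follows. The two case classes are simplified, hypothetical stand-ins for the MLlib types (features are a plain `Array[Double]`, and a `Seq` stands in for an RDD); only the idea of assigning weight 1 to every unweighted point comes from the comment.

```scala
import scala.language.implicitConversions

object WeightedPointSketch {
  // Hypothetical stand-ins for the MLlib classes discussed in the review.
  case class LabeledPoint(label: Double, features: Array[Double])
  case class WeightedLabeledPoint(label: Double, features: Array[Double], weight: Double = 1.0)

  // Implicit view: an unweighted point becomes a weighted point with weight 1.0.
  implicit def toWeighted(p: LabeledPoint): WeightedLabeledPoint =
    WeightedLabeledPoint(p.label, p.features, 1.0)

  // A view on single elements does not lift to collections,
  // so the collection needs its own conversion.
  implicit def toWeightedSeq(ps: Seq[LabeledPoint]): Seq[WeightedLabeledPoint] =
    ps.map(toWeighted)

  // Stand-in for a training entry point that only accepts weighted points.
  def totalWeight(points: Seq[WeightedLabeledPoint]): Double = points.map(_.weight).sum
}
```

With `import WeightedPointSketch._` in scope, a caller can pass a `Seq[LabeledPoint]` straight to `totalWeight`. The second implicit is the part that is easy to overlook: converting the element type alone would not let a whole collection of unweighted points be passed where a collection of weighted points is expected.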
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r13926555 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -212,7 +211,9 @@ object DecisionTree extends Serializable with Logging { * @return a DecisionTreeModel that can be used for prediction */ def train(input: RDD[LabeledPoint], strategy: Strategy): DecisionTreeModel = { -new DecisionTree(strategy).train(input: RDD[LabeledPoint]) +// Converting from standard instance format to weighted input format for tree training --- End diff -- Maybe this is better served with an implicit since I think we'll want to re-use labeled point elsewhere and having an automatic conversion might be nice.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/886#discussion_r13926606 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -233,13 +234,73 @@ object DecisionTree extends Serializable with Logging { algo: Algo, impurity: Impurity, maxDepth: Int): DecisionTreeModel = { -val strategy = new Strategy(algo,impurity,maxDepth) -new DecisionTree(strategy).train(input: RDD[LabeledPoint]) +val strategy = new Strategy(algo, impurity, maxDepth) +// Converting from standard instance format to weighted input format for tree training +val weightedInput = input.map(x => WeightedLabeledPoint(x.label, x.features)) +new DecisionTree(strategy).train(weightedInput: RDD[WeightedLabeledPoint]) --- End diff -- Not sure why you need to be explicit about the types here?
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-46465872 I've taken a first pass at this and at a high level it looks good. The main two things I'd say are 1) I think an implicit that converts LabeledPoint to WeightedLabeledPoint could go a long way toward removing some of the boilerplate introduced by this PR. 2) I'm getting a little concerned that we could modularize a little better - for example, every time we do a strategy.algo match - it feels like we could just as easily have a separate class for Regression algo, Decision algo, etc. For example, each separate algo could implement its own binSeqOp and a few other methods and the base class could tie these all together. This would be a fairly major refactoring and is maybe better suited for a later PR. I still need to look closely at the principal logical changes in DecisionTree.scala - and will try to get to this before the end of the week. Thanks for your patience!
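Editorial sketch: one possible reading of point 2 above, where each algorithm supplies its own `binSeqOp` and a shared driver ties them together without matching on the algorithm type. Everything here is hypothetical: the trait name, the method signatures, and the simplified aggregates are assumptions made for illustration, not the MLlib design.

```scala
object AlgoModularitySketch {
  // Hypothetical base trait: each algorithm owns its aggregation and prediction logic.
  trait TreeAlgo {
    // Folds one label into the running aggregate (signature assumed for this sketch).
    def binSeqOp(agg: Array[Double], label: Double): Array[Double]
    def predict(agg: Array[Double]): Double
  }

  // Regression keeps (count, sum) and predicts the mean label.
  object RegressionAlgo extends TreeAlgo {
    def binSeqOp(agg: Array[Double], label: Double): Array[Double] =
      Array(agg(0) + 1.0, agg(1) + label)
    def predict(agg: Array[Double]): Double = agg(1) / agg(0)
  }

  // Classification keeps one count per class and predicts the majority class.
  object ClassificationAlgo extends TreeAlgo {
    def binSeqOp(agg: Array[Double], label: Double): Array[Double] =
      agg.updated(label.toInt, agg(label.toInt) + 1.0)
    def predict(agg: Array[Double]): Double =
      agg.zipWithIndex.reduceLeft((a, b) => if (b._1 > a._1) b else a)._2.toDouble
  }

  // Shared driver: no match on the algorithm type is needed here.
  def fit(algo: TreeAlgo, init: Array[Double], labels: Seq[Double]): Double =
    algo.predict(labels.foldLeft(init)((agg, y) => algo.binSeqOp(agg, y)))
}
```

For example, `fit(RegressionAlgo, Array(0.0, 0.0), labels)` returns the mean label, while `fit(ClassificationAlgo, Array(0.0, 0.0, 0.0), labels)` returns the most frequent of three classes.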
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-45959540 Friendly nudge: could somebody please take a look at this PR? It is blocking upcoming ensemble tree PRs.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-45127038 Merged build triggered.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-45127063 Merged build started.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-45127229 Merged build finished.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-45139033 Merged build triggered.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-45139221 Merged build finished.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-45139223 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15453/
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-45141375 I added support for sorting categorical feature values using impurity (gini/entropy) calculated over the corresponding labels in multiclass classification. This heuristic will only be used when it's not feasible to check for all the categorical splits in multiclass classification. cc: @srowen, @etrain
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-44381347 I don't have a reference, and I did look for one. I am sure it is not optimal, and not even that great as a greedy algorithm. Two low-entropy distributions over target values could be high-entropy when combined. You could pick one feature value which makes the target lowest entropy, then pick the next one that would make the combined entropy of the target lowest, and so on. That amounts to testing n^2 instead of 2^n decisions. If the alternative is to fail, or spend years in computation, I think heuristics of some kind are a must. Even random selection of subsets is better than rejecting the problem entirely -- anything is better than that I think.
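Editorial sketch: the greedy procedure described above, written out under stated assumptions. The input maps each categorical value to the class labels observed with it; `GreedyOrderingSketch`, `entropy`, and `greedyOrder` are hypothetical names, and ties are broken by the smaller category id purely to keep the result deterministic.

```scala
object GreedyOrderingSketch {
  // Shannon entropy (natural log) of a collection of class labels.
  def entropy(labels: Seq[Int]): Double = {
    val n = labels.size.toDouble
    labels.groupBy(identity).values.map { g =>
      val p = g.size / n
      -p * math.log(p)
    }.sum
  }

  // Repeatedly picks the category whose labels, pooled with those already
  // chosen, give the lowest combined entropy.
  def greedyOrder(labelsByCategory: Map[Int, Seq[Int]]): Seq[Int] = {
    var remaining = labelsByCategory.keys.toList.sorted
    var pooled = Seq.empty[Int]
    var order = Vector.empty[Int]
    while (remaining.nonEmpty) {
      val scored = remaining.map(c => (c, entropy(pooled ++ labelsByCategory(c))))
      val best = scored.reduceLeft((a, b) => if (b._2 < a._2) b else a)._1
      order = order :+ best
      pooled = pooled ++ labelsByCategory(best)
      remaining = remaining.filterNot(_ == best)
    }
    order
  }
}
```

With n categories the loop scores n + (n - 1) + ... + 1 candidates, which is the n^2 growth mentioned in the comment, compared with the 2^n growth of exhaustive subset search.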
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-44388751 I fully agree. I will give others a day or two to raise any concerns if they have any and then proceed to implement the two-step solution for multiclass classification that I mentioned above. The second step will be the O(k) algorithm (k is the number of categorical feature values) that will come up with k sorted categorical feature splits using the target variable entropy for ordering. The O(n^2) algorithm looked promising at first but I think it might end up dominating the tree computation time. In general, getting O(k) splits is more important than ensuring that they are sorted since we now have a way of dealing with unsorted splits with this PR. I currently don't have a good intuition on what makes a good subset of splits but we could keep adding more heuristics later.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-44243946 @manishamde Yes, for categorical features with high cardinality, you don't want to consider all possible splits. I don't think having a cardinality of 30 or 40 is that unusual though. Honestly I've always resented the fact that R simply can't handle more than 32! There are heuristics however that work well while efficiently considering a number of splits linear in the number of values. For regression, it's apparently optimal to sort the categorical values by average value of the target variable, and then consider just prefixes of that list of values as the subsets to try. Google's PLANET paper claims that is optimal. For classification, where the target itself is categorical, I don't know of a provably optimal way to do it. The heuristic I have used is to sort the categorical values by the entropy of the target value. This seems pretty OK. There is some Java code for creating the decision rules to evaluate here, in `CategoricalDecision.java` and `NumericDecision.java`: https://github.com/cloudera/oryx/tree/master/rdf-common/src/main/java/com/cloudera/oryx/rdf/common/rule It's pretty easy to lift them and Scala-fy it. I'd really like to see functionality like this so MLlib RDF can be comparable and I can move to it.
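Editorial sketch: the regression heuristic described above (sort categories by mean target, then try only prefixes of that ordering) can be illustrated as follows. This is not the Oryx or MLlib code; `OrderedSplitsSketch` and `prefixSplits` are hypothetical names, and the input is assumed to be a local sequence of (category, target) pairs.

```scala
object OrderedSplitsSketch {
  // data: (category, target) pairs. Returns the candidate left-hand subsets:
  // the proper, non-empty prefixes of the categories sorted by mean target.
  def prefixSplits(data: Seq[(Int, Double)]): Seq[Set[Int]] = {
    val meanByCategory: Seq[(Int, Double)] =
      data.groupBy(_._1).toSeq.map { case (c, rows) => (c, rows.map(_._2).sum / rows.size) }
    // Sort by mean target, breaking ties by category id for determinism.
    val ordered: Seq[Int] = meanByCategory
      .sortWith((a, b) => a._2 < b._2 || (a._2 == b._2 && a._1 < b._1))
      .map(_._1)
    (1 until ordered.size).map(i => ordered.take(i).toSet)
  }
}
```

For k distinct categories this yields k - 1 candidate splits, which is the "linear in the number of values" behaviour the comment refers to.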
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-44359841 @srowen It's good to know about the use-case for cardinality in the order of tens. The categorical feature ordering using the average value of the target variable works well for both binary classification and regression (section 9.2.4 of Elements of Statistical Learning) and it's already implemented in MLlib decision tree. This PR handles the scenario where the 'ordering' assumption does not hold for multiclass classification. I like the suggestion of using entropy to sort the categories -- it will be great if we could also find a theoretical reference for it! Here is what I propose for handling categorical features in multiclass classification: 1. We check for all splits of the categorical variable if the bin constraints are met. 2. If the bin constraints are not met, we can use a sorting heuristic (like entropy of the target variable). I think this might be the best tradeoff both from the theoretical and practical perspective and it will save the user a lot of data munging effort which is one of the main advantages of decision trees.
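Editorial sketch: step 2 of the proposal above, ordering categories by the entropy of the labels seen with each category. This is one plausible reading of the heuristic, not the PR's implementation; `EntropyOrderingSketch` and `orderByEntropy` are hypothetical names, and the input is assumed to be a local sequence of (category, class label) pairs.

```scala
object EntropyOrderingSketch {
  // Shannon entropy (natural log) of a collection of class labels.
  def entropy(labels: Seq[Int]): Double = {
    val n = labels.size.toDouble
    labels.groupBy(identity).values.map { g =>
      val p = g.size / n
      -p * math.log(p)
    }.sum
  }

  // Orders categories from the purest label distribution to the most mixed one,
  // breaking ties by category id for determinism.
  def orderByEntropy(data: Seq[(Int, Int)]): Seq[Int] =
    data.groupBy(_._1).toSeq
      .map { case (c, rows) => (c, entropy(rows.map(_._2))) }
      .sortWith((a, b) => a._2 < b._2 || (a._2 == b._2 && a._1 < b._1))
      .map(_._1)
}
```

Once the categories are ordered, candidate splits can be taken as prefixes of that ordering, so the number of candidates stays linear in the number of categories.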
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-44360446 I am worried that exponential growth in the number of split possibilities kills us when we check for all splits when we get to even 20-30 categorical values. That's potentially a billion possible candidates to check. I have a feeling that heuristics will be more practical (but I don't have a reference!). We might add an option for checking for all vs. using an entropy based heuristic and automatically decide which to use at some conservative threshold that is user-configurable.
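Editorial sketch: the growth etrain describes can be checked with the count given in the PR description, 2^(m-1) - 1 binary splits for m unordered categories. The threshold rule below is a hypothetical rendering of "check all splits only if the bin constraints are met"; `SplitCountSketch`, `numUnorderedSplits`, and `useExhaustiveSearch` are illustrative names.

```scala
object SplitCountSketch {
  // Number of distinct binary partitions of m unordered categories: 2^(m-1) - 1.
  def numUnorderedSplits(m: Int): BigInt = {
    require(m >= 1, "need at least one category")
    BigInt(2).pow(m - 1) - 1
  }

  // Hypothetical decision rule: enumerate exhaustively only when the
  // number of splits is less than the bin budget.
  def useExhaustiveSearch(m: Int, maxBins: Int): Boolean =
    numUnorderedSplits(m) < maxBins
}
```

The formula gives 524,287 splits for 20 categories, 536,870,911 for 30, and 1,073,741,823 for 31, so the "billion candidates" figure is reached just past 30 categories.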
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
GitHub user manishamde opened a pull request: https://github.com/apache/spark/pull/886 SPARK-1536: multiclass classification support for decision tree The ability to perform multiclass classification is a big advantage of using decision trees and was a highly requested feature for MLlib. This pull request adds multiclass classification support to the MLlib decision tree. It also adds sample weights support using the WeightedLabeledPoint class for handling unbalanced datasets during classification. It will also support algorithms such as AdaBoost, which require instances to be weighted. It handles the special case where the categorical variables cannot be ordered for multiclass classification and thus the optimizations used for speeding up binary classification cannot be directly used for multiclass classification with categorical variables. More specifically, for m categories in a categorical feature, it analyses all the 2^(m-1) - 1 categorical splits provided that the number of splits is less than the maxBins provided in the input. This condition will not be met for features with a large number of categories -- using decision trees is not recommended for such datasets in general since the categorical features are favored over continuous features. Moreover, the user can use a combination of tricks (increasing the bin size of the tree algorithms, using binary encoding for categorical features, or using a one-vs-all classification strategy) to avoid these constraints. The new code is accompanied by unit tests and has also been tested on the iris and covtype datasets. 
cc: @mengxr, @etrain, @hirakendu, @atalwalkar, @srowen You can merge this pull request into a Git repository by running: $ git pull https://github.com/manishamde/spark multiclass Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/886.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #886 commit 50b143a4385f209fbc1793f3e03134cab3ab9583 Author: Manish Amde manish...@gmail.com Date: 2014-04-20T20:33:03Z adding support for very deep trees commit abc5a23bf80d792a345d723b44bff3ee217cd5ac Author: Evan Sparks spa...@cs.berkeley.edu Date: 2014-04-22T01:41:36Z Parameterizing max memory. commit 2f6072c12a1466d783da258d4aa1bde789e7e875 Author: manishamde manish...@gmail.com Date: 2014-04-22T03:43:47Z Merge pull request #5 from etrain/deep_tree Parameterizing max memory. commit 2f1e093c5187a1ed532f9c19b25f8a2a6a46e27a Author: Manish Amde manish...@gmail.com Date: 2014-04-22T03:49:46Z minor: added doc for maxMemory parameter commit 02877721328a560f210a7906061108ce5dd4bbbe Author: Evan Sparks spa...@cs.berkeley.edu Date: 2014-04-22T18:13:27Z Fixing scalastyle issue. commit fecf89a51d6efc9e2ff06e700338ea944a4dd580 Author: manishamde manish...@gmail.com Date: 2014-04-22T18:15:57Z Merge pull request #6 from etrain/deep_tree Fixing scalastyle issue. 
commit 719d0098bb08b50e523cec3e388115d5a206512b Author: Manish Amde manish...@gmail.com Date: 2014-04-24T00:04:05Z updating user documentation commit 9dbdabeeacc5fe5e0f1a729ce1ed8ab6ff399000 Author: Manish Amde manish...@gmail.com Date: 2014-04-29T21:43:19Z merge from master commit 15171550fe83e42fcb707744c9035ed540fb78d1 Author: Manish Amde manish...@gmail.com Date: 2014-04-29T21:45:34Z updated documentation commit 718506b2a0146a5794261a553847d363b7dfb932 Author: Manish Amde manish...@gmail.com Date: 2014-04-30T23:29:24Z added unit test commit e0426ee74d5e233c1e7b14e29135015d09a0370c Author: Manish Amde manish...@gmail.com Date: 2014-05-01T00:36:47Z renamed parameter commit dad96523d740c2b7ced0f0d73ade66e528b64064 Author: Manish Amde manish...@gmail.com Date: 2014-05-01T04:59:55Z removed unused imports commit cbd9f140fd8d43941c61acd6055636bad88b358d Author: Manish Amde manish...@gmail.com Date: 2014-05-03T16:16:42Z modified scala.math to math commit 5e822020ce50c6e1bdbdbb3d94d5cbc4c715731e Author: Manish Amde manish...@gmail.com Date: 2014-05-06T06:34:58Z added documentation, fixed off by 1 error in max level calculation commit 4731cda7b08fdcd365dd1b690ac04a26f6e85657 Author: Manish Amde manish...@gmail.com Date: 2014-05-06T06:44:39Z formatting commit 5eca9e4fbd0e27e335d5cea0bf26b1a436be0457 Author: Manish Amde manish...@gmail.com Date: 2014-05-06T06:47:14Z grammar commit 8053fed22249bc788ba988489caa22f732b6416d Author: Manish Amde manish...@gmail.com Date: 2014-05-06T06:48:02Z more formatting commit 426bb285f16c816b19e5c25518024ae4d2141c1a Author: Manish Amde manish...@gmail.com Date: 2014-05-06T07:16:02Z programming guide blurb commit b27ad2c20edb8a7bf0c0edd5d82a6a683b5d9ea2 Author: Manish Amde manish...@gmail.com Date: 2014-05-06T07:19:10Z
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-44228078 Merged build triggered.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-44228087 Merged build started.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-44228145 Merged build finished.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-44228146 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15215/
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-44228654 Merged build triggered.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-44228663 Merged build started.
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-44228716 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15216/
[GitHub] spark pull request: SPARK-1536: multiclass classification support ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-44228715 Merged build finished.