[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-15 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48994002
  
@manishamde  - can you add `[MLlib]` to the title of this pull request? 
Otherwise it doesn't get filtered properly by our filters.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48998279
  
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16661/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48958683
  
QA tests have started for PR 886. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16635/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48958795
  
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16635/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48960458
  
QA tests have started for PR 886. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16636/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48960547
  
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16636/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48961699
  
QA tests have started for PR 886. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16637/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48961781
  
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16637/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48962289
  
QA tests have started for PR 886. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16638/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48971082
  
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16638/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48972522
  
QA tests have started for PR 886. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16645/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48978578
  
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16645/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48992765
  
QA tests have started for PR 886. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16661/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-13 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/886#discussion_r14865144
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -768,104 +973,157 @@ object DecisionTree extends Serializable with Logging {
     /**
      * Extracts left and right split aggregates.
      * @param binData Array[Double] of size 2*numFeatures*numSplits
-     * @return (leftNodeAgg, rightNodeAgg) tuple of type (Array[Double],
-     * Array[Double]) where each array is of size(numFeature,2*(numSplits-1))
+     * @return (leftNodeAgg, rightNodeAgg) tuple of type (Array[Array[Array[Double\]\]\],
+     * Array[Array[Array[Double\]\]\]) where each array is of size(numFeature,
+     * (numBins - 1), numClasses)
      */
     def extractLeftRightNodeAggregates(
-        binData: Array[Double]): (Array[Array[Double]], Array[Array[Double]]) = {
+        binData: Array[Double]): (Array[Array[Array[Double]]], Array[Array[Array[Double]]]) = {
+
+
+      def findAggForOrderedFeatureClassification(
+          leftNodeAgg: Array[Array[Array[Double]]],
+          rightNodeAgg: Array[Array[Array[Double]]],
+          featureIndex: Int) {
+
+        // shift for this featureIndex
+        val shift = numClasses * featureIndex * numBins
+
+        var classIndex = 0
+        while (classIndex < numClasses) {
+          // left node aggregate for the lowest split
+          leftNodeAgg(featureIndex)(0)(classIndex) = binData(shift + classIndex)
+          // right node aggregate for the highest split
+          rightNodeAgg(featureIndex)(numBins - 2)(classIndex)
+            = binData(shift + (numClasses * (numBins - 1)) + classIndex)
+          classIndex += 1
+        }
+
+        // Iterate over all splits.
+        var splitIndex = 1
+        while (splitIndex < numBins - 1) {
+          // calculating left node aggregate for a split as a sum of left node aggregate of a
+          // lower split and the left bin aggregate of a bin where the split is a high split
+          var innerClassIndex = 0
+          while (innerClassIndex < numClasses) {
+            leftNodeAgg(featureIndex)(splitIndex)(innerClassIndex)
+              = binData(shift + numClasses * splitIndex + innerClassIndex) +
+                leftNodeAgg(featureIndex)(splitIndex - 1)(innerClassIndex)
+            rightNodeAgg(featureIndex)(numBins - 2 - splitIndex)(innerClassIndex) =
+              binData(shift + (numClasses * (numBins - 1 - splitIndex) + innerClassIndex)) +
+                rightNodeAgg(featureIndex)(numBins - 1 - splitIndex)(innerClassIndex)
+            innerClassIndex += 1
+          }
+          splitIndex += 1
+        }
+      }
+
+      def findAggForUnorderedFeatureClassification(
+          leftNodeAgg: Array[Array[Array[Double]]],
+          rightNodeAgg: Array[Array[Array[Double]]],
+          featureIndex: Int) {
+
+        val rightChildShift = numClasses * numBins * numFeatures
+        var splitIndex = 0
+        while (splitIndex < numBins - 1) {
+          var classIndex = 0
+          while (classIndex < numClasses) {
+            // shift for this featureIndex
+            val shift = numClasses * featureIndex * numBins + splitIndex * numClasses
+            val leftBinValue = binData(shift + classIndex)
+            val rightBinValue = binData(rightChildShift + shift + classIndex)
+            leftNodeAgg(featureIndex)(splitIndex)(classIndex) = leftBinValue
+            rightNodeAgg(featureIndex)(splitIndex)(classIndex) = rightBinValue
+            classIndex += 1
+          }
+          splitIndex += 1
+        }
+      }
+
+      def findAggForRegression(
+          leftNodeAgg: Array[Array[Array[Double]]],
+          rightNodeAgg: Array[Array[Array[Double]]],
+          featureIndex: Int) {
+
+        // shift for this featureIndex
+        val shift = 3 * featureIndex * numBins
+        // left node aggregate for the lowest split
+        leftNodeAgg(featureIndex)(0)(0) = binData(shift + 0)
+        leftNodeAgg(featureIndex)(0)(1) = binData(shift + 1)
+        leftNodeAgg(featureIndex)(0)(2) = binData(shift + 2)
+
+        // right node aggregate for the highest split
+        rightNodeAgg(featureIndex)(numBins - 2)(0) =
+          binData(shift + (3 * (numBins - 1)))
+        rightNodeAgg(featureIndex)(numBins - 2)(1) =
+          binData(shift + (3 * (numBins - 1)) + 1)
+        rightNodeAgg(featureIndex)(numBins - 2)(2) =
+          binData(shift + (3 * (numBins - 1)) + 2)
+
+        // Iterate over all splits.
+        var
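
For readers following the ordered-feature case in the diff quoted above, the left and right aggregates are running sums over bins taken from the two ends. The sketch below is only an illustration, not the code from the pull request: the object name `OrderedAggSketch` and the method `leftRightAggregates` are hypothetical, and it assumes `binData` is the slice for a single feature (that is, with the `shift` offset already applied), laid out as `bin * numClasses + classIndex`.

```scala
// Illustrative sketch only; names are hypothetical and not part of the pull request.
object OrderedAggSketch {

  /** Left/right per-class aggregates for one ordered feature.
    *
    * Assumes binData(bin * numClasses + classIndex) holds the count for that
    * bin and class, i.e. the per-feature slice of the flat array in the diff.
    * Returns arrays indexed as (splitIndex)(classIndex), with numBins - 1 splits.
    */
  def leftRightAggregates(
      binData: Array[Double],
      numBins: Int,
      numClasses: Int): (Array[Array[Double]], Array[Array[Double]]) = {
    require(numBins >= 2, "need at least two bins to form a split")
    require(binData.length == numBins * numClasses, "unexpected binData length")

    val numSplits = numBins - 1
    val left = Array.ofDim[Double](numSplits, numClasses)
    val right = Array.ofDim[Double](numSplits, numClasses)

    var c = 0
    while (c < numClasses) {
      // Lowest split: only bin 0 lies to its left.
      left(0)(c) = binData(c)
      // Highest split: only the last bin lies to its right.
      right(numSplits - 1)(c) = binData(numClasses * (numBins - 1) + c)
      c += 1
    }

    var s = 1
    while (s < numSplits) {
      c = 0
      while (c < numClasses) {
        // Running sum from the left: bins 0..s.
        left(s)(c) = binData(numClasses * s + c) + left(s - 1)(c)
        // Running sum from the right: bins (numBins - 1 - s)..(numBins - 1).
        right(numSplits - 1 - s)(c) =
          binData(numClasses * (numBins - 1 - s) + c) + right(numSplits - s)(c)
        c += 1
      }
      s += 1
    }
    (left, right)
  }
}
```

With three bins and two classes, `OrderedAggSketch.leftRightAggregates(Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0), 3, 2)` puts bin 0 on the left of split 0 and bins 1 and 2 on its right, so at every split the left and right rows add up to the same per-class totals.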

[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-13 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48867199
  
Thanks Evan. I have compared to scikit-learn on the covertype dataset and 
the results looked similar. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-11 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/886#discussion_r14836561
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -768,104 +973,157 @@ object DecisionTree extends Serializable with Logging {
     /**
      * Extracts left and right split aggregates.
      * @param binData Array[Double] of size 2*numFeatures*numSplits
-     * @return (leftNodeAgg, rightNodeAgg) tuple of type (Array[Double],
-     * Array[Double]) where each array is of size(numFeature,2*(numSplits-1))
+     * @return (leftNodeAgg, rightNodeAgg) tuple of type (Array[Array[Array[Double\]\]\],
+     * Array[Array[Array[Double\]\]\]) where each array is of size(numFeature,
+     * (numBins - 1), numClasses)
      */
     def extractLeftRightNodeAggregates(
-        binData: Array[Double]): (Array[Array[Double]], Array[Array[Double]]) = {
+        binData: Array[Double]): (Array[Array[Array[Double]]], Array[Array[Array[Double]]]) = {
+
+
+      def findAggForOrderedFeatureClassification(
+          leftNodeAgg: Array[Array[Array[Double]]],
+          rightNodeAgg: Array[Array[Array[Double]]],
+          featureIndex: Int) {
+
+        // shift for this featureIndex
+        val shift = numClasses * featureIndex * numBins
+
+        var classIndex = 0
+        while (classIndex < numClasses) {
+          // left node aggregate for the lowest split
+          leftNodeAgg(featureIndex)(0)(classIndex) = binData(shift + classIndex)
+          // right node aggregate for the highest split
+          rightNodeAgg(featureIndex)(numBins - 2)(classIndex)
+            = binData(shift + (numClasses * (numBins - 1)) + classIndex)
+          classIndex += 1
+        }
+
+        // Iterate over all splits.
+        var splitIndex = 1
+        while (splitIndex < numBins - 1) {
+          // calculating left node aggregate for a split as a sum of left node aggregate of a
+          // lower split and the left bin aggregate of a bin where the split is a high split
+          var innerClassIndex = 0
+          while (innerClassIndex < numClasses) {
+            leftNodeAgg(featureIndex)(splitIndex)(innerClassIndex)
+              = binData(shift + numClasses * splitIndex + innerClassIndex) +
+                leftNodeAgg(featureIndex)(splitIndex - 1)(innerClassIndex)
+            rightNodeAgg(featureIndex)(numBins - 2 - splitIndex)(innerClassIndex) =
+              binData(shift + (numClasses * (numBins - 1 - splitIndex) + innerClassIndex)) +
+                rightNodeAgg(featureIndex)(numBins - 1 - splitIndex)(innerClassIndex)
+            innerClassIndex += 1
+          }
+          splitIndex += 1
+        }
+      }
+
+      def findAggForUnorderedFeatureClassification(
+          leftNodeAgg: Array[Array[Array[Double]]],
+          rightNodeAgg: Array[Array[Array[Double]]],
+          featureIndex: Int) {
+
+        val rightChildShift = numClasses * numBins * numFeatures
+        var splitIndex = 0
+        while (splitIndex < numBins - 1) {
+          var classIndex = 0
+          while (classIndex < numClasses) {
+            // shift for this featureIndex
+            val shift = numClasses * featureIndex * numBins + splitIndex * numClasses
+            val leftBinValue = binData(shift + classIndex)
+            val rightBinValue = binData(rightChildShift + shift + classIndex)
+            leftNodeAgg(featureIndex)(splitIndex)(classIndex) = leftBinValue
+            rightNodeAgg(featureIndex)(splitIndex)(classIndex) = rightBinValue
+            classIndex += 1
+          }
+          splitIndex += 1
+        }
+      }
+
+      def findAggForRegression(
+          leftNodeAgg: Array[Array[Array[Double]]],
+          rightNodeAgg: Array[Array[Array[Double]]],
+          featureIndex: Int) {
+
+        // shift for this featureIndex
+        val shift = 3 * featureIndex * numBins
+        // left node aggregate for the lowest split
+        leftNodeAgg(featureIndex)(0)(0) = binData(shift + 0)
+        leftNodeAgg(featureIndex)(0)(1) = binData(shift + 1)
+        leftNodeAgg(featureIndex)(0)(2) = binData(shift + 2)
+
+        // right node aggregate for the highest split
+        rightNodeAgg(featureIndex)(numBins - 2)(0) =
+          binData(shift + (3 * (numBins - 1)))
+        rightNodeAgg(featureIndex)(numBins - 2)(1) =
+          binData(shift + (3 * (numBins - 1)) + 1)
+        rightNodeAgg(featureIndex)(numBins - 2)(2) =
+          binData(shift + (3 * (numBins - 1)) + 2)
+
+        // Iterate over all splits.
+        var splitIndex

[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-11 Thread etrain
Github user etrain commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48767401
  
I've gone through this in some depth, and aside from a couple of minor 
style nits - the logic looks good to me. Manish - have you compared output vs. 
scikit-learn for multiclass datasets and verified that things look at least 
reasonably similar?

Really awesome work!




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48674298
  
QA tests have started for PR 886. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16530/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48674369
  
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class WeightedLabeledPoint(label: Double, features: Vector, weight:Double = 1) {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16530/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48674374
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16530/




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48675447
  
QA tests have started for PR 886. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16531/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48675496
  
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class WeightedLabeledPoint(label: Double, features: Vector, weight:Double = 1) {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16531/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48675499
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16531/




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48683128
  
QA results for PR 886:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16538/consoleFull




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48412354
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48412365
  
Merged build started. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48412478
  
Merged build finished. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48412480
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16428/




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48413445
  
Merged build started. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48413437
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48413580
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16430/




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48413579
  
Merged build finished. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48415119
  
Merged build started. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48415107
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48415267
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16432/




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48415266
  
Merged build finished. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-07 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48143660
  
@etrain Added implicit conversion. :-)




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48143754
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48143761
  
Merged build started. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48143827
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16362/




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-48143826
  
Merged build finished. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-19 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/886#discussion_r13982468
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -49,6 +49,7 @@ object DecisionTreeRunner {
   case class Params(
   input: String = null,
   algo: Algo = Classification,
+  numClassesForClassification: Int = 2,
--- End diff --

Inferring the number of classes from a large dataset could take a lot of time. 
In general, most practitioners know it in advance. If not, we can add a 
pre-processing step.

Currently we have only ```numClassesForClassification``` as a 
classification-specific parameter. In general, I agree with you. At the same 
time, I didn't want to create more configuration classes for the user. Shall we 
leave it as is for now and handle it with the ensembles PR, where we will have 
more parameters (boosting iterations, num trees, feature subsetting, etc.)?




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-19 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/886#discussion_r13982568
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -45,7 +46,7 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo
* @param input RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] used as training data
* @return a DecisionTreeModel that can be used for prediction
*/
-  def train(input: RDD[LabeledPoint]): DecisionTreeModel = {
+  def train(input: RDD[WeightedLabeledPoint]): DecisionTreeModel = {
--- End diff --

Agreed. I started with implicit conversions and forgot the reason why I 
switched. I will give it a try again. If it works, great. If not, I will 
remember why it doesn't work well. :-)




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-19 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/886#discussion_r13982597
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -233,13 +234,73 @@ object DecisionTree extends Serializable with Logging 
{
   algo: Algo,
   impurity: Impurity,
   maxDepth: Int): DecisionTreeModel = {
-val strategy = new Strategy(algo,impurity,maxDepth)
-new DecisionTree(strategy).train(input: RDD[LabeledPoint])
+val strategy = new Strategy(algo, impurity, maxDepth)
+// Converting from standard instance format to weighted input format for tree training
+val weightedInput = input.map(x => WeightedLabeledPoint(x.label, x.features))
+new DecisionTree(strategy).train(weightedInput: RDD[WeightedLabeledPoint])
--- End diff --

Thanks. Will remove.




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-19 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-46593827
  
Thanks @etrain 

1. I will try to use implicits
2. I agree. We originally had separate trees and then merged them for 
readability. There is a sweet spot in between that we need to find. I agree that 
it's a major refactoring. I think it will be best to do it in or after the 
ensemble PR, where we will know most of the cases we need to handle. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-19 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/886#discussion_r13982852
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -49,6 +49,7 @@ object DecisionTreeRunner {
   case class Params(
   input: String = null,
   algo: Algo = Classification,
+  numClassesForClassification: Int = 2,
--- End diff --

Yeah, makes sense. If it doesn't complicate things too much we might
consider adding an interface that doesn't have this specified and figures
it out in one shot.

Worth noting is that in R, an object of type factor (the default for
categorical/label data) has this information built in. It can be a big pain
at load time while the system tries to figure out the cardinality of the
factor, but it leads to a nice compact representation of the data and
eliminates situations like this one.
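
As a rough illustration of the "figure it out in one shot" idea, the sketch below infers the class count from the labels alone. It assumes labels are class indices 0, 1, ..., k - 1 stored as doubles, and it works on a plain Seq rather than an RDD; the object and method names are hypothetical and not part of MLlib.

```scala
object InferNumClasses {
  // Assumption: labels are class indices 0, 1, ..., k - 1 stored as doubles,
  // so the class count is the largest index plus one.
  def fromLabels(labels: Seq[Double]): Int = {
    require(labels.nonEmpty, "need at least one label")
    require(labels.forall(l => l >= 0.0 && l == math.floor(l)),
      "labels must be non-negative whole numbers")
    labels.map(_.toInt).max + 1
  }
}
```

On a distributed dataset the same computation would cost one extra pass over the data, which is the load-time cost mentioned above.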

I agree on doing the API separation with the ensembles PR.


On Thu, Jun 19, 2014 at 10:46 AM, manishamde notificati...@github.com
wrote:

> In
> examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala:
>
> > @@ -49,6 +49,7 @@ object DecisionTreeRunner {
> >    case class Params(
> >    input: String = null,
> >    algo: Algo = Classification,
> > +  numClassesForClassification: Int = 2,
>
> Inference from a large dataset could take a lot of time. In general, most
> practitioners know in advance. If not, we can add a pre-processing step.
>
> Currently we have only numClassesForClassification as a classification
> specific parameter. In general, I agree with you. At the same time, didn't
> want to create more configuration classes for the user. Shall we leave it
> as is for now and handle it with the ensembles PR where we have more
> parameters (boosting iterations, num trees, feature subsetting, etc.) ?
>
> --
> Reply to this email directly or view it on GitHub
> https://github.com/apache/spark/pull/886/files#r13982468.




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-19 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/886#discussion_r13983131
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -49,6 +49,7 @@ object DecisionTreeRunner {
   case class Params(
   input: String = null,
   algo: Algo = Classification,
+  numClassesForClassification: Int = 2,
--- End diff --

Good point. Let me create a JIRA ticket for this so that we don't forget. 
:-)




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-19 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/886#discussion_r13996228
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -49,6 +49,7 @@ object DecisionTreeRunner {
   case class Params(
   input: String = null,
   algo: Algo = Classification,
+  numClassesForClassification: Int = 2,
--- End diff --

https://issues.apache.org/jira/browse/SPARK-2206




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-18 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/886#discussion_r13926351
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -49,6 +49,7 @@ object DecisionTreeRunner {
   case class Params(
   input: String = null,
   algo: Algo = Classification,
+  numClassesForClassification: Int = 2,
--- End diff --

Do we want this to be a parameter and not inferred from the data?

Also - I'm wondering if it makes sense to subclass params with 
DecisionTreeParams vs. RegressionTreeParams so that we keep logically separate 
options separate.




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-18 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/886#discussion_r13926460
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -45,7 +46,7 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo
* @param input RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] used as training data
* @return a DecisionTreeModel that can be used for prediction
*/
-  def train(input: RDD[LabeledPoint]): DecisionTreeModel = {
+  def train(input: RDD[WeightedLabeledPoint]): DecisionTreeModel = {
--- End diff --

If we're going to change the interface, it might be nice to have an 
implicit conversion between LabeledPoint and WeightedLabeledPoint (which 
assigns weight 1 to everything). I think the common case is going to be using 
unweighted anyway.
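
A minimal sketch of such a conversion follows. The two case classes are simplified stand-ins for the MLlib types (features are a plain Array here), and the default `weight` field, `weightOf`, and `example` are assumptions made for illustration only.

```scala
import scala.language.implicitConversions

object WeightedPoints {
  // Simplified stand-ins for the MLlib classes.
  case class LabeledPoint(label: Double, features: Array[Double])
  case class WeightedLabeledPoint(label: Double, features: Array[Double], weight: Double = 1.0)

  // Unweighted points are given weight 1.0 wherever a weighted point is expected.
  implicit def labeledToWeighted(p: LabeledPoint): WeightedLabeledPoint =
    WeightedLabeledPoint(p.label, p.features)

  // Hypothetical consumer that only accepts weighted points.
  def weightOf(p: WeightedLabeledPoint): Double = p.weight

  // A LabeledPoint is passed directly; the compiler inserts the conversion.
  def example(): Double = weightOf(LabeledPoint(1.0, Array(0.5, 2.0)))
}
```

With the conversion in scope, callers that only have unweighted data do not need to map over their input by hand.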




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-18 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/886#discussion_r13926555
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -212,7 +211,9 @@ object DecisionTree extends Serializable with Logging {
* @return a DecisionTreeModel that can be used for prediction
   */
   def train(input: RDD[LabeledPoint], strategy: Strategy): DecisionTreeModel = {
-new DecisionTree(strategy).train(input: RDD[LabeledPoint])
+// Converting from standard instance format to weighted input format for tree training
--- End diff --

Maybe this is better served with an implicit since I think we'll want to 
re-use labeled point elsewhere and having an automatic conversion might be nice.




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-18 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/886#discussion_r13926606
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -233,13 +234,73 @@ object DecisionTree extends Serializable with Logging 
{
   algo: Algo,
   impurity: Impurity,
   maxDepth: Int): DecisionTreeModel = {
-val strategy = new Strategy(algo,impurity,maxDepth)
-new DecisionTree(strategy).train(input: RDD[LabeledPoint])
+val strategy = new Strategy(algo, impurity, maxDepth)
+// Converting from standard instance format to weighted input format for tree training
+val weightedInput = input.map(x => WeightedLabeledPoint(x.label, x.features))
+new DecisionTree(strategy).train(weightedInput: RDD[WeightedLabeledPoint])
--- End diff --

Not sure why you need to be explicit about the types here?




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-18 Thread etrain
Github user etrain commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-46465872
  
I've taken a first pass at this and at a high level it looks good.

The two main things I'd say are:
1) I think an implicit that converts LabeledPoint to WeightedLabeledPoint 
could go a long way toward removing some of the boilerplate introduced by this PR.
2) I'm getting a little concerned that we could modularize a little better 
- for example, every time we do a strategy.algo match, it feels like we 
could just as easily have a separate class for Regression algo, Decision algo, 
etc. For example, each separate algo could implement its own binSeqOp and a 
few other methods, and the base class could tie these all together. This would 
be a fairly major refactoring and is maybe better suited for a later PR.
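
The second point could be sketched roughly as below. The signature of binSeqOp is not shown in this thread, so the sketch uses a simpler hypothetical hook (`predict`) to show the shape of the refactoring; all names here are illustrative assumptions.

```scala
object TreeAlgos {
  // Hypothetical per-algorithm hook: how a leaf turns its labels into a prediction.
  trait TreeAlgo {
    def predict(labels: Seq[Double]): Double
  }

  object ClassificationAlgo extends TreeAlgo {
    // Majority vote over the labels that reached the leaf.
    def predict(labels: Seq[Double]): Double =
      labels.groupBy(identity).maxBy(_._2.size)._1
  }

  object RegressionAlgo extends TreeAlgo {
    // Mean of the labels that reached the leaf.
    def predict(labels: Seq[Double]): Double = labels.sum / labels.size
  }

  // Shared driver code calls the hook instead of matching on strategy.algo.
  def leafPrediction(algo: TreeAlgo, labels: Seq[Double]): Double = {
    require(labels.nonEmpty, "a leaf needs at least one label")
    algo.predict(labels)
  }
}
```

Each new algorithm then adds one implementation of the trait rather than a new branch at every match site.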

I still need to look closely at the principal logical changes in 
DecisionTree.scala - and will try to get to this before the end of the week. 
Thanks for your patience!




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-12 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-45959540
  
Friendly nudge: could somebody please take a look at this PR? It is 
blocking upcoming ensemble tree PRs.




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-45127038
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-45127063
  
Merged build started. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-45127229
  
Merged build finished. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-45139033
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-45139221
  
Merged build finished. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-45139223
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15453/




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-06-04 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-45141375
  
I added support for sorting categorical feature values using impurity 
(gini/entropy) calculated over the corresponding labels in multiclass 
classification. This heuristic will only be used when it's not feasible to 
check for all the categorical splits in multiclass classification.

cc: @srowen, @etrain




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-05-28 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-44381347
  
I don't have a reference, and I did look for one. I am sure it is not 
optimal, and not even that great as a greedy algorithm. Two low-entropy 
distributions over target values could be high-entropy when combined. 

You could pick one feature value which makes the target lowest entropy, 
then pick the next one that would make the combined entropy of the target 
lowest, and so on. That amounts to testing n^2 instead of 2^n decisions.
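
The greedy procedure described above can be sketched as follows. The input is assumed to be a map from category id to a vector of per-class counts; the object and method names are hypothetical. Each round scores every remaining category against the counts merged so far, which gives on the order of n^2 entropy evaluations for n categories.

```scala
object GreedyCategoryOrder {
  // Shannon entropy (in bits) of a vector of class counts.
  def entropy(counts: Seq[Double]): Double = {
    val total = counts.sum
    counts.filter(_ > 0.0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2.0)
    }.sum
  }

  // Element-wise sum of two class-count vectors of equal length.
  private def add(a: Seq[Double], b: Seq[Double]): Seq[Double] =
    a.zip(b).map { case (x, y) => x + y }

  // Repeatedly pick the category whose counts, merged with those already
  // picked, give the lowest combined entropy.
  def order(countsByCategory: Map[Int, Seq[Double]]): List[Int] = {
    require(countsByCategory.nonEmpty, "need at least one category")
    var merged: Seq[Double] = Seq.fill(countsByCategory.values.head.size)(0.0)
    var remaining = countsByCategory.keySet
    val picked = List.newBuilder[Int]
    while (remaining.nonEmpty) {
      // Iterate in sorted order so that ties are broken deterministically.
      val scored = remaining.toSeq.sorted.map(c => c -> entropy(add(merged, countsByCategory(c))))
      val best = scored.reduceLeft((a, b) => if (b._2 < a._2) b else a)._1
      merged = add(merged, countsByCategory(best))
      picked += best
      remaining -= best
    }
    picked.result()
  }
}
```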

If the alternative is to fail, or spend years in computation, I think 
heuristics of some kind are a must. Even random selection of subsets is better 
than rejecting the problem entirely -- anything is better than that I think.




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-05-28 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-44388751
  
I fully agree. 

I will give others a day or two to raise any concerns if they have any and 
then proceed to implement the two-step solution for multiclass classification 
that I mentioned above. The second step will be the O(k) algorithm (k is the 
number of categorical feature values) that will come up with k sorted 
categorical feature splits using the target variable entropy for ordering.

The O(n^2) algorithm looked promising at first but I think it might end up 
dominating the tree computation time.

In general, getting O(k) splits is more important than ensuring that they 
are sorted since we now have a way of dealing with unsorted splits with this 
PR. I currently don't have a good intuition on what makes a good subset of 
splits but we could keep adding more heuristics later.




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-05-27 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-44243946
  
@manishamde Yes, for categorical features with high cardinality, you don't 
want to consider all possible splits. I don't think having a cardinality of 30 
or 40 is that unusual though. Honestly I've always resented the fact that R 
simply can't handle more than 32!

There are heuristics however that work well while efficiently considering a 
number of splits linear in the number of values. For regression, it's 
apparently optimal to sort the categorical values by average value of the 
target variable, and then consider just prefixes of that list of values as the 
subsets to try. Google's PLANET paper claims that is optimal.

For classification, where the target itself is categorical, I don't know of 
a provably optimal way to do it. The heuristic I have used is to sort the 
categorical values by the entropy of the target value. This seems pretty OK.
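
Both heuristics above follow the same pattern: give each category a score, sort by it, and try only prefixes of the sorted list. A minimal sketch, assuming scores are precomputed per category (mean target for regression, entropy of the class counts for classification); the names are hypothetical and this is not the Oryx or MLlib code.

```scala
object SortedCategoricalSplits {
  // Shannon entropy (in bits) of a vector of class counts.
  def entropy(counts: Seq[Double]): Double = {
    val total = counts.sum
    counts.filter(_ > 0.0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2.0)
    }.sum
  }

  // Regression score: mean of the target values observed for one category.
  def mean(targets: Seq[Double]): Double = targets.sum / targets.size

  // Sort categories by score (ties broken by category id) and return the
  // proper, non-empty prefixes: k categories give k - 1 candidate subsets.
  def prefixSplits(scoreByCategory: Map[Int, Double]): Seq[Set[Int]] = {
    val ordered = scoreByCategory.toSeq
      .sortWith((a, b) => a._2 < b._2 || (a._2 == b._2 && a._1 < b._1))
      .map(_._1)
    (1 until ordered.size).map(i => ordered.take(i).toSet)
  }
}
```

The number of candidate splits is linear in the number of category values, instead of exponential when every subset is tried.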

There is some Java code for creating the decision rules to evaluate here, 
in `CategoricalDecision.java` and `NumericDecision.java`:

https://github.com/cloudera/oryx/tree/master/rdf-common/src/main/java/com/cloudera/oryx/rdf/common/rule

It's pretty easy to lift them and Scala-fy it. I'd really like to see 
functionality like this so MLlib RDF can be comparable and I can move to it.




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-05-27 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-44359841
  
@srowen It's good to know about the use case for cardinality on the order 
of tens.

The categorical feature ordering using the average value of the target 
variable works well for both binary classification and regression (section 
9.2.4 of Elements of Statistical Learning) and it's already implemented in 
MLlib decision tree. 

This PR handles the scenario where the 'ordering' assumption does not hold 
for multiclass classification. I like the suggestion of using entropy 
to sort the categories -- it will be great if we could also find a theoretical 
reference for it!

Here is what I propose for handling categorical features in multiclass 
classification:
1. If the bin constraints are met, we check all splits of the categorical 
variable.
2. If the bin constraints are not met, we use a sorting heuristic (like 
entropy of the target variable).
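
The two-step rule can be sketched as a simple decision on the split count. This assumes the bin constraint is that the number of unordered splits, 2^(m-1) - 1 for m categories, must be below maxBins; the `Choice` type and the object name are hypothetical.

```scala
object SplitStrategy {
  sealed trait Choice
  case object Exhaustive extends Choice
  case object SortedHeuristic extends Choice

  // Number of distinct binary partitions of m unordered categories: 2^(m-1) - 1.
  def numUnorderedSplits(m: Int): BigInt = {
    require(m >= 1, "need at least one category")
    BigInt(2).pow(m - 1) - 1
  }

  // Step 1: check every split when they fit in the bin budget.
  // Step 2: otherwise fall back to a sorting heuristic.
  def choose(numCategories: Int, maxBins: Int): Choice =
    if (numUnorderedSplits(numCategories) < BigInt(maxBins)) Exhaustive
    else SortedHeuristic
}
```

For example, 4 categories give 7 splits and stay exhaustive with a budget of 100 bins, while 30 categories give 536,870,911 splits and fall back to the heuristic.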

I think this might be the best tradeoff both from the theoretical and 
practical perspective and it will save the user a lot of data munging effort 
which is one of the main advantages of decision trees.




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-05-27 Thread etrain
Github user etrain commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-44360446
  
I am worried that exponential growth in the number of split possibilities
kills us when we check all splits once we get to even 20-30
categorical values. That's potentially a billion possible candidates to
check. I have a feeling that heuristics will be more practical (but I don't
have a reference!). We might add an option for checking all splits vs.
using an entropy-based heuristic, and automatically decide which to use at
some conservative, user-configurable threshold.


On Tue, May 27, 2014 at 7:41 PM, manishamde notificati...@github.com wrote:

> @srowen https://github.com/srowen It's good to know about the use-case
> for cardinality in the order of tens.
>
> The categorical feature ordering using the average value of the target
> variable works well for both binary classification and regression (section
> 9.2.4 of Elements of Statistical Learning) and it's already implemented in
> MLlib decision tree.
>
> This PR handles the scenario where the 'ordering' assumption does not hold
> true for the multiclass classification. I like the suggestion of using
> entropy to sort the categories -- it will be great if we could also find a
> theoretical reference for it!
>
> Here is what I propose for handling categorical features in multiclass
> classification:
> 1. We check for all splits of the categorical variable if the bin
> constraints are met.
> 2. If the bin constraints are not met, we can use a sorting heuristic
> (like entropy of the target variable)
>
> I think this might be the best tradeoff both from the theoretical and
> practical perspective and it will save the user a lot of data munging
> effort which is one of the main advantages of decision trees.
>
> --
> Reply to this email directly or view it on GitHub
> https://github.com/apache/spark/pull/886#issuecomment-44359841.





[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-05-26 Thread manishamde
GitHub user manishamde opened a pull request:

https://github.com/apache/spark/pull/886

SPARK-1536: multiclass classification support for decision tree

The ability to perform multiclass classification is a big advantage of 
decision trees and was a highly requested feature for MLlib. This pull 
request adds multiclass classification support to the MLlib decision tree. It 
also adds sample-weight support, via the WeightedLabeledPoint class, for 
handling unbalanced datasets during classification. This will also support 
algorithms such as AdaBoost, which require instances to be weighted.
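
To illustrate why instance weights matter for unbalanced data, here is a small 
sketch of a weighted Gini impurity. It is an assumption about how weights 
could enter the impurity calculation, not a description of how 
WeightedLabeledPoint is used in this patch; `weighted_gini` and the 
(label, weight) tuples are hypothetical.

```python
from collections import defaultdict


def weighted_gini(points):
    """Gini impurity where each (label, weight) pair counts by its weight."""
    totals = defaultdict(float)
    for label, weight in points:
        totals[label] += weight
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((w / total) ** 2 for w in totals.values())


# Three instances of class 0 and one of class 1, all with weight 1.
unweighted = weighted_gini([(0, 1.0), (0, 1.0), (0, 1.0), (1, 1.0)])
# Up-weighting the rare class makes the node look balanced.
reweighted = weighted_gini([(0, 1.0), (0, 1.0), (0, 1.0), (1, 3.0)])
print(unweighted, reweighted)
```

Up-weighting the minority class raises the impurity of a node that would 
otherwise look nearly pure, so splits that isolate the rare class become more 
attractive.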

It handles the special case where the categorical variables cannot be 
ordered for multiclass classification, so the optimizations used to speed up 
binary classification cannot be applied directly to multiclass 
classification with categorical variables. More specifically, for a 
categorical feature with m categories, it analyses all 2^(m-1) - 1 
categorical splits, provided that the number of splits is less than the 
maxBins provided in the input. This condition will not be met for features 
with a large number of categories; using decision trees is not recommended 
for such datasets in general, since the categorical features are favored over 
continuous features. Moreover, the user can combine several tricks 
(increasing maxBins, using binary encoding for categorical features, or using 
a one-vs-all classification strategy) to avoid these constraints.
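
As a quick check of how fast the 2^(m-1) - 1 count grows relative to maxBins, 
the following sketch computes the largest number of categories for which every 
unordered split still fits. The helper names are hypothetical and the value 
100 is an arbitrary example, not a statement about MLlib defaults.

```python
def num_unordered_splits(m):
    """Number of ways to split m categories into two non-empty groups."""
    return 2 ** (m - 1) - 1


def largest_exhaustive_arity(max_bins):
    """Largest m such that num_unordered_splits(m) < max_bins."""
    m = 1
    while num_unordered_splits(m + 1) < max_bins:
        m += 1
    return m


for m in (2, 4, 8, 16):
    print(m, num_unordered_splits(m))
print(largest_exhaustive_arity(100))
```

Because the count doubles with each added category, raising maxBins buys only 
about one extra category per doubling.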

The new code is accompanied by unit tests and has also been tested on the 
iris and covtype datasets.

cc: @mengxr, @etrain, @hirakendu, @atalwalkar, @srowen

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/manishamde/spark multiclass

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/886.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #886


commit 50b143a4385f209fbc1793f3e03134cab3ab9583
Author: Manish Amde manish...@gmail.com
Date:   2014-04-20T20:33:03Z

adding support for very deep trees

commit abc5a23bf80d792a345d723b44bff3ee217cd5ac
Author: Evan Sparks spa...@cs.berkeley.edu
Date:   2014-04-22T01:41:36Z

Parameterizing max memory.

commit 2f6072c12a1466d783da258d4aa1bde789e7e875
Author: manishamde manish...@gmail.com
Date:   2014-04-22T03:43:47Z

Merge pull request #5 from etrain/deep_tree

Parameterizing max memory.

commit 2f1e093c5187a1ed532f9c19b25f8a2a6a46e27a
Author: Manish Amde manish...@gmail.com
Date:   2014-04-22T03:49:46Z

minor: added doc for maxMemory parameter

commit 02877721328a560f210a7906061108ce5dd4bbbe
Author: Evan Sparks spa...@cs.berkeley.edu
Date:   2014-04-22T18:13:27Z

Fixing scalastyle issue.

commit fecf89a51d6efc9e2ff06e700338ea944a4dd580
Author: manishamde manish...@gmail.com
Date:   2014-04-22T18:15:57Z

Merge pull request #6 from etrain/deep_tree

Fixing scalastyle issue.

commit 719d0098bb08b50e523cec3e388115d5a206512b
Author: Manish Amde manish...@gmail.com
Date:   2014-04-24T00:04:05Z

updating user documentation

commit 9dbdabeeacc5fe5e0f1a729ce1ed8ab6ff399000
Author: Manish Amde manish...@gmail.com
Date:   2014-04-29T21:43:19Z

merge from master

commit 15171550fe83e42fcb707744c9035ed540fb78d1
Author: Manish Amde manish...@gmail.com
Date:   2014-04-29T21:45:34Z

updated documentation

commit 718506b2a0146a5794261a553847d363b7dfb932
Author: Manish Amde manish...@gmail.com
Date:   2014-04-30T23:29:24Z

added unit test

commit e0426ee74d5e233c1e7b14e29135015d09a0370c
Author: Manish Amde manish...@gmail.com
Date:   2014-05-01T00:36:47Z

renamed parameter

commit dad96523d740c2b7ced0f0d73ade66e528b64064
Author: Manish Amde manish...@gmail.com
Date:   2014-05-01T04:59:55Z

removed unused imports

commit cbd9f140fd8d43941c61acd6055636bad88b358d
Author: Manish Amde manish...@gmail.com
Date:   2014-05-03T16:16:42Z

modified scala.math to math

commit 5e822020ce50c6e1bdbdbb3d94d5cbc4c715731e
Author: Manish Amde manish...@gmail.com
Date:   2014-05-06T06:34:58Z

added documentation, fixed off by 1 error in max level calculation

commit 4731cda7b08fdcd365dd1b690ac04a26f6e85657
Author: Manish Amde manish...@gmail.com
Date:   2014-05-06T06:44:39Z

formatting

commit 5eca9e4fbd0e27e335d5cea0bf26b1a436be0457
Author: Manish Amde manish...@gmail.com
Date:   2014-05-06T06:47:14Z

grammar

commit 8053fed22249bc788ba988489caa22f732b6416d
Author: Manish Amde manish...@gmail.com
Date:   2014-05-06T06:48:02Z

more formatting

commit 426bb285f16c816b19e5c25518024ae4d2141c1a
Author: Manish Amde manish...@gmail.com
Date:   2014-05-06T07:16:02Z

programming guide blurb

commit b27ad2c20edb8a7bf0c0edd5d82a6a683b5d9ea2
Author: Manish Amde manish...@gmail.com
Date:   2014-05-06T07:19:10Z


[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-05-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-44228078
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-05-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-44228087
  
Merged build started. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-05-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-44228145
  
Merged build finished. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-05-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-44228146
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15215/




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-05-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-44228654
  
 Merged build triggered. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-05-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-44228663
  
Merged build started. 




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-05-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-44228716
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15216/




[GitHub] spark pull request: SPARK-1536: multiclass classification support ...

2014-05-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-44228715
  
Merged build finished. 

