[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...

sethah Wed, 21 Feb 2018 17:47:49 -0800

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20632#discussion_r169833178
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
    @@ -303,26 +303,6 @@ class DecisionTreeSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(split.threshold < 2020)
       }
     
    -  test("Multiclass classification stump with 10-ary (ordered) categorical features") {
    --- End diff --
    
    Regarding this test - it fails now for a silly reason. Because of the data,
    the tree that gets built winds up with a right node holding equal counts of
    labels 1.0 and 2.0. It breaks the tie by predicting 1.0, which the left node
    also predicts. Since both children then predict the same class, the split
    gets pruned. You can modify the data generating method to:
    
    ```scala
      def generateCategoricalDataPointsForMulticlassForOrderedFeatures():
        Array[LabeledPoint] = {
        val arr = new Array[LabeledPoint](3000)
        for (i <- 0 until 3000) {
          if (i < 1001) {
            // 1001 points: label 2.0, feature 0 in category 2.0
            arr(i) = new LabeledPoint(2.0, Vectors.dense(2.0, 2.0))
          } else if (i < 2000) {
            // 999 points: label 1.0, feature 0 in category 1.0
            arr(i) = new LabeledPoint(1.0, Vectors.dense(1.0, 2.0))
          } else {
            // 1000 points: label 1.0, feature 0 in category 2.0
            arr(i) = new LabeledPoint(1.0, Vectors.dense(2.0, 2.0))
          }
        }
        arr
      }
    ```
    so that the right node predicts 2.0. I slightly prefer this, assuming all
    other tests pass (I checked some of the suites). The less mostly unrelated
    stuff we move around in this change, the better.
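
    To make the tie concrete, here is a rough sketch that counts labels per
    category of feature 0 without needing Spark. `Point`, `TieBreakSketch` and
    the `cut` parameter are hypothetical names introduced for illustration; the
    rule that ties go to the smaller label is an assumption meant to mirror the
    behaviour described above, not a copy of the actual tree code. A `cut` of
    1000 is used only to reproduce an even split between the two labels.

    ```scala
    object TieBreakSketch {
      // Hypothetical stand-in for MLlib's LabeledPoint so the sketch runs without Spark.
      final case class Point(label: Double, features: Vector[Double])

      // Same layout as the generator above; `cut` is how many (2.0, 2.0) points get label 2.0.
      def generate(cut: Int): Vector[Point] =
        Vector.tabulate(3000) { i =>
          if (i < cut) Point(2.0, Vector(2.0, 2.0))
          else if (i < 2000) Point(1.0, Vector(1.0, 2.0))
          else Point(1.0, Vector(2.0, 2.0))
        }

      // Majority label; ties go to the smaller label (assumed tie-breaking rule).
      def majorityLabel(counts: Map[Double, Int]): Double = {
        val best = counts.values.max
        counts.collect { case (label, n) if n == best => label }
          .reduce((a, b) => math.min(a, b))
      }

      // Predicted label per category of feature 0, i.e. per child of a stump on that feature.
      def predictions(cut: Int): Map[Double, Double] =
        generate(cut).groupBy(_.features(0)).map { case (category, points) =>
          val counts = points.groupBy(_.label).map { case (label, ps) => label -> ps.size }
          category -> majorityLabel(counts)
        }
    }
    ```

    Under those assumptions, `TieBreakSketch.predictions(1000)` maps both
    categories to 1.0, while `TieBreakSketch.predictions(1001)` maps category
    2.0 to label 2.0 and category 1.0 to label 1.0, so the two children
    disagree again.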


