[GitHub] spark pull request #20362: [Spark-22886][ML][TESTS] ML test for structured s...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20362 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r170074167

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -586,6 +586,68 @@ class ALSSuite
      allModelParamSettings, checkModelData)
  }

+  private def checkNumericTypesALS(
--- End diff --

It's changed because one of the tests didn't do anything. Please take a look at the last commit; it contains a test bugfix, not only the transformation.
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r170072578

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -413,34 +411,36 @@ class ALSSuite
       .setSeed(0)
     val alpha = als.getAlpha
     val model = als.fit(training.toDF())
-    val predictions = model.transform(test.toDF()).select("rating", "prediction").rdd.map {
-      case Row(rating: Float, prediction: Float) =>
-        (rating.toDouble, prediction.toDouble)
+    testTransformerByGlobalCheckFunc[Rating[Int]](test.toDF(), model, "rating", "prediction") {
+      case rows: Seq[Row] =>
+        val predictions = rows.map(row => (row.getFloat(0).toDouble, row.getFloat(1).toDouble))
+
+        val rmse =
+          if (implicitPrefs) {
+            // TODO: Use a better (rank-based?) evaluation metric for implicit feedback.
+            // We limit the ratings and the predictions to interval [0, 1] and compute the
+            // weighted RMSE with the confidence scores as weights.
+            val (totalWeight, weightedSumSq) = predictions.map { case (rating, prediction) =>
+              val confidence = 1.0 + alpha * math.abs(rating)
+              val rating01 = math.max(math.min(rating, 1.0), 0.0)
+              val prediction01 = math.max(math.min(prediction, 1.0), 0.0)
+              val err = prediction01 - rating01
+              (confidence, confidence * err * err)
+            }.reduce[(Double, Double)] { case ((c0, e0), (c1, e1)) =>
+              (c0 + c1, e0 + e1)
+            }
+            math.sqrt(weightedSumSq / totalWeight)
+          } else {
+            val errorSquares = predictions.map { case (rating, prediction) =>
+              val err = rating - prediction
+              err * err
+            }
+            val mse = errorSquares.sum / errorSquares.length
+            math.sqrt(mse)
+          }
+        logInfo(s"Test RMSE is $rmse.")
+        assert(rmse < targetRMSE)
     }
-    val rmse =
--- End diff --

Mainly a move, but there was no mean function, so I implemented one.
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r170072392

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -586,6 +586,68 @@ class ALSSuite
      allModelParamSettings, checkModelData)
  }

+  private def checkNumericTypesALS(
--- End diff --

Mainly a move, but there was no mean function, so I implemented this one-liner.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r170046788

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -586,6 +586,68 @@ class ALSSuite
      allModelParamSettings, checkModelData)
  }

+  private def checkNumericTypesALS(
--- End diff --

Is this mostly just moving the code from the test class, or did it change too?
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r170047180

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -693,7 +766,7 @@ class ALSSuite
     val data = ratings.toDF
     val model = new ALS().fit(data)
     Seq("nan", "NaN", "Nan", "drop", "DROP", "Drop").foreach { s =>
-      model.setColdStartStrategy(s).transform(data)
+      testTransformer[Rating[Int]](data, model.setColdStartStrategy(s), "prediction") { _ => }
--- End diff --

What's the no-op function at the end for -- just because it requires an argument?
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r170046857

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -413,34 +411,36 @@ class ALSSuite
       .setSeed(0)
     val alpha = als.getAlpha
     val model = als.fit(training.toDF())
-    val predictions = model.transform(test.toDF()).select("rating", "prediction").rdd.map {
-      case Row(rating: Float, prediction: Float) =>
-        (rating.toDouble, prediction.toDouble)
+    testTransformerByGlobalCheckFunc[Rating[Int]](test.toDF(), model, "rating", "prediction") {
+      case rows: Seq[Row] =>
+        val predictions = rows.map(row => (row.getFloat(0).toDouble, row.getFloat(1).toDouble))
+
+        val rmse =
+          if (implicitPrefs) {
+            // TODO: Use a better (rank-based?) evaluation metric for implicit feedback.
+            // We limit the ratings and the predictions to interval [0, 1] and compute the
+            // weighted RMSE with the confidence scores as weights.
+            val (totalWeight, weightedSumSq) = predictions.map { case (rating, prediction) =>
+              val confidence = 1.0 + alpha * math.abs(rating)
+              val rating01 = math.max(math.min(rating, 1.0), 0.0)
+              val prediction01 = math.max(math.min(prediction, 1.0), 0.0)
+              val err = prediction01 - rating01
+              (confidence, confidence * err * err)
+            }.reduce[(Double, Double)] { case ((c0, e0), (c1, e1)) =>
+              (c0 + c1, e0 + e1)
+            }
+            math.sqrt(weightedSumSq / totalWeight)
+          } else {
+            val errorSquares = predictions.map { case (rating, prediction) =>
+              val err = rating - prediction
+              err * err
+            }
+            val mse = errorSquares.sum / errorSquares.length
+            math.sqrt(mse)
+          }
+        logInfo(s"Test RMSE is $rmse.")
+        assert(rmse < targetRMSE)
     }
-    val rmse =
--- End diff --

Is this change just a move, really? Or did something else change as well?
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r167005865

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -599,8 +598,11 @@ class ALSSuite
     (ex, act) =>
       ex.userFactors.first().getSeq[Float](1) === act.userFactors.first.getSeq[Float](1)
   } { (ex, act, _) =>
-    ex.transform(_: DataFrame).select("prediction").first.getDouble(0) ~==
-      act.transform(_: DataFrame).select("prediction").first.getDouble(0) absTol 1e-6
+    testTransformerByGlobalCheckFunc[Float](_: DataFrame, act, "prediction") {
+      case actRows: Seq[Row] =>
+        ex.transform(_: DataFrame).select("prediction").first.getDouble(0) ~==
+          actRows(0).getDouble(0) absTol 1e-6
+    }
--- End diff --

Whoa, I hadn't checked the original functionality at such depth. This is really dead code at runtime. Fixed.
Github user smurakozi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r165745445

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -599,8 +598,11 @@ class ALSSuite
     (ex, act) =>
       ex.userFactors.first().getSeq[Float](1) === act.userFactors.first.getSeq[Float](1)
   } { (ex, act, _) =>
-    ex.transform(_: DataFrame).select("prediction").first.getDouble(0) ~==
-      act.transform(_: DataFrame).select("prediction").first.getDouble(0) absTol 1e-6
+    testTransformerByGlobalCheckFunc[Float](_: DataFrame, act, "prediction") {
+      case actRows: Seq[Row] =>
+        ex.transform(_: DataFrame).select("prediction").first.getDouble(0) ~==
+          actRows(0).getDouble(0) absTol 1e-6
+    }
--- End diff --

I think this code does not check anything. `testTransformerByGlobalCheckFunc[Float](_: DataFrame, act, "prediction")` is just a partial application of `testTransformerByGlobalCheckFunc`. However, `checkNumericTypesALS` expects `check2: (ALSModel, ALSModel, DataFrame) => Unit`. It's happy to call the provided function, discard the partially applied function, and use `()` instead, so it will typecheck. As a consequence, the function doing the assert is never called, so the `~==` assertion never happens. You can check it, say, by asking for the 100th column of the first row - it will not produce an error.

This problem is not a result of your change; the original code had the same issue. It could probably be simplified a bit, but I think the original intent was to do a check like this:
```
{ (ex, act, df) =>
  ex.transform(df).selectExpr("cast(prediction as double)").first.getDouble(0) ~==
    act.transform(df).selectExpr("cast(prediction as double)").first.getDouble(0) absTol 1e-6
}
```
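The pitfall described above is plain Scala value discarding, not anything Spark-specific. The standalone sketch below (all names hypothetical, chosen only for illustration) shows a check body that evaluates to a partially applied function and is therefore silently discarded, so the assertion inside never runs:

```scala
object ValueDiscardingDemo {
  var checkRan = false

  // Stand-in for check2: (ALSModel, ALSModel, DataFrame) => Unit, here just (Int, Int) => Unit.
  def runCheck(check: (Int, Int) => Unit): Unit = check(1, 2)

  def main(args: Array[String]): Unit = {
    runCheck { (a, b) =>
      // This expression builds an Int => Unit function value but never applies it.
      // Because Unit is expected, Scala discards the value (value discarding),
      // so the assert below is dead code -- just like the discarded
      // testTransformerByGlobalCheckFunc partial application in the diff.
      (c: Int) => { checkRan = true; assert(a + b == c) }
    }
    println(s"checkRan = $checkRan") // prints "checkRan = false"
  }
}
```

Compiling this typically produces a "discarded non-Unit value" style warning, which is one way such dead checks can be caught.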
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r165396480

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -662,28 +676,32 @@ class ALSSuite
     val knownItem = data.select(max("item")).as[Int].first()
     val unknownItem = knownItem + 20
     val test = Seq(
-      (unknownUser, unknownItem),
-      (knownUser, unknownItem),
-      (unknownUser, knownItem),
-      (knownUser, knownItem)
-    ).toDF("user", "item")
+      (unknownUser, unknownItem, true),
+      (knownUser, unknownItem, true),
+      (unknownUser, knownItem, true),
+      (knownUser, knownItem, false)
+    ).toDF("user", "item", "expectedIsNaN")
     val als = new ALS().setMaxIter(1).setRank(1)
     // default is 'nan'
     val defaultModel = als.fit(data)
-    val defaultPredictions = defaultModel.transform(test).select("prediction").as[Float].collect()
-    assert(defaultPredictions.length == 4)
-    assert(defaultPredictions.slice(0, 3).forall(_.isNaN))
-    assert(!defaultPredictions.last.isNaN)
+    var defaultPredictionNotNaN = Float.NaN
--- End diff --

Fixed.
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r165392031

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -693,7 +711,9 @@ class ALSSuite
     val data = ratings.toDF
     val model = new ALS().fit(data)
     Seq("nan", "NaN", "Nan", "drop", "DROP", "Drop").foreach { s =>
-      model.setColdStartStrategy(s).transform(data)
+      testTransformer[Rating[Int]](data, model.setColdStartStrategy(s), "prediction") {
+        case _ =>
--- End diff --

Fixed.
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r165382596

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -653,6 +666,7 @@ class ALSSuite
   test("ALS cold start user/item prediction strategy") {
     val spark = this.spark
     import spark.implicits._
+
--- End diff --

Fixed.
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r165382316

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -628,18 +635,24 @@ class ALSSuite
     }
     withClue("transform should fail when ids exceed integer range. ") {
       val model = als.fit(df)
-      assert(intercept[SparkException] {
-        model.transform(df.select(df("user_big").as("user"), df("item"))).first
-      }.getMessage.contains(msg))
-      assert(intercept[SparkException] {
-        model.transform(df.select(df("user_small").as("user"), df("item"))).first
-      }.getMessage.contains(msg))
-      assert(intercept[SparkException] {
-        model.transform(df.select(df("item_big").as("item"), df("user"))).first
-      }.getMessage.contains(msg))
-      assert(intercept[SparkException] {
-        model.transform(df.select(df("item_small").as("item"), df("user"))).first
-      }.getMessage.contains(msg))
+      def testTransformIdExceedsIntRange[A : Encoder](dataFrame: DataFrame): Unit = {
+        assert(intercept[SparkException] {
+          model.transform(dataFrame).first
+        }.getMessage.contains(msg))
+        assert(intercept[StreamingQueryException] {
+          testTransformer[A](dataFrame, model, "prediction") {
+            case _ =>
--- End diff --

Partial function removed. This code part expects `StreamingQueryException`, which is quite close to this area. Not sure whether a comment would make it better.
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r165379876

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -599,8 +599,15 @@ class ALSSuite
     (ex, act) =>
       ex.userFactors.first().getSeq[Float](1) === act.userFactors.first.getSeq[Float](1)
   } { (ex, act, _) =>
-    ex.transform(_: DataFrame).select("prediction").first.getDouble(0) ~==
-      act.transform(_: DataFrame).select("prediction").first.getDouble(0) absTol 1e-6
+    testTransformerByGlobalCheckFunc[Float](_: DataFrame, ex, "prediction") {
+      case exRows: Seq[Row] =>
--- End diff --

Fixed.
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r165378573

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -566,6 +565,7 @@ class ALSSuite
   test("read/write") {
     val spark = this.spark
    import spark.implicits._
+
--- End diff --

Fixed.
Github user smurakozi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r165304423

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -566,6 +565,7 @@ class ALSSuite
   test("read/write") {
     val spark = this.spark
    import spark.implicits._
+
--- End diff --

nit: new line is not needed
Github user smurakozi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r165308129

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -628,18 +635,24 @@ class ALSSuite
     }
     withClue("transform should fail when ids exceed integer range. ") {
       val model = als.fit(df)
-      assert(intercept[SparkException] {
-        model.transform(df.select(df("user_big").as("user"), df("item"))).first
-      }.getMessage.contains(msg))
-      assert(intercept[SparkException] {
-        model.transform(df.select(df("user_small").as("user"), df("item"))).first
-      }.getMessage.contains(msg))
-      assert(intercept[SparkException] {
-        model.transform(df.select(df("item_big").as("item"), df("user"))).first
-      }.getMessage.contains(msg))
-      assert(intercept[SparkException] {
-        model.transform(df.select(df("item_small").as("item"), df("user"))).first
-      }.getMessage.contains(msg))
+      def testTransformIdExceedsIntRange[A : Encoder](dataFrame: DataFrame): Unit = {
+        assert(intercept[SparkException] {
+          model.transform(dataFrame).first
+        }.getMessage.contains(msg))
+        assert(intercept[StreamingQueryException] {
+          testTransformer[A](dataFrame, model, "prediction") {
+            case _ =>
--- End diff --

No need for a partial function here; you can simplify it to `{ _ => }`. I would also add a small comment to make it explicit that we intentionally do not check anything.
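For illustration, both literal forms discussed above satisfy a `Unit`-returning check parameter. The helper below is a hypothetical stand-in for the test infrastructure's check argument, just to show the two spellings side by side:

```scala
object NoOpCheckDemo {
  // Hypothetical stand-in: invokes the supplied check once, the way a test
  // helper might invoke it per result row.
  def runWithCheck(check: String => Unit): Unit = check("row")

  def main(args: Array[String]): Unit = {
    runWithCheck { _ => }      // plain no-op lambda: the simpler form
    runWithCheck { case _ => } // pattern-matching anonymous function: same effect, more ceremony
    println("both no-op checks accepted")
  }
}
```

The `{ case _ => }` form compiles because Scala converts a pattern-matching anonymous function to the expected function type, but when no pattern matching is needed the plain lambda reads better.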
Github user smurakozi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r165305989

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -599,8 +599,15 @@ class ALSSuite
     (ex, act) =>
       ex.userFactors.first().getSeq[Float](1) === act.userFactors.first.getSeq[Float](1)
   } { (ex, act, _) =>
-    ex.transform(_: DataFrame).select("prediction").first.getDouble(0) ~==
-      act.transform(_: DataFrame).select("prediction").first.getDouble(0) absTol 1e-6
+    testTransformerByGlobalCheckFunc[Float](_: DataFrame, ex, "prediction") {
+      case exRows: Seq[Row] =>
--- End diff --

I think it's ok to keep ex.transform here. This way the code will be a bit simpler.
Github user smurakozi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r165312057

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -662,28 +676,32 @@ class ALSSuite
     val knownItem = data.select(max("item")).as[Int].first()
     val unknownItem = knownItem + 20
     val test = Seq(
-      (unknownUser, unknownItem),
-      (knownUser, unknownItem),
-      (unknownUser, knownItem),
-      (knownUser, knownItem)
-    ).toDF("user", "item")
+      (unknownUser, unknownItem, true),
+      (knownUser, unknownItem, true),
+      (unknownUser, knownItem, true),
+      (knownUser, knownItem, false)
+    ).toDF("user", "item", "expectedIsNaN")
     val als = new ALS().setMaxIter(1).setRank(1)
     // default is 'nan'
     val defaultModel = als.fit(data)
-    val defaultPredictions = defaultModel.transform(test).select("prediction").as[Float].collect()
-    assert(defaultPredictions.length == 4)
-    assert(defaultPredictions.slice(0, 3).forall(_.isNaN))
-    assert(!defaultPredictions.last.isNaN)
+    var defaultPredictionNotNaN = Float.NaN
--- End diff --

I would get rid of this variable. In `testTransformer` it just adds overhead, and `assert(!defaultPredictionNotNaN.isNaN)` asserts something that was already checked in `testTransformer`, so its only use is in `testTransformerByGlobalCheckFunc`. Producing it is a bit convoluted, and it's not easy to understand why it's needed. I would make it clearer by doing a plain old transform using the `test` DF (or a smaller one containing only the knownUser, knownItem pair) and selecting the value. An alternative solution could be to use real expected values in the `test` DF instead of "isNaN" flags.
Github user smurakozi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r165312157

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -693,7 +711,9 @@ class ALSSuite
     val data = ratings.toDF
     val model = new ALS().fit(data)
     Seq("nan", "NaN", "Nan", "drop", "DROP", "Drop").foreach { s =>
-      model.setColdStartStrategy(s).transform(data)
+      testTransformer[Rating[Int]](data, model.setColdStartStrategy(s), "prediction") {
+        case _ =>
--- End diff --

Just like above, no need for a partial function.
Github user smurakozi commented on a diff in the pull request: https://github.com/apache/spark/pull/20362#discussion_r165308205

--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -653,6 +666,7 @@ class ALSSuite
   test("ALS cold start user/item prediction strategy") {
     val spark = this.spark
     import spark.implicits._
+
--- End diff --

nit: no need for empty line here
GitHub user gaborgsomogyi opened a pull request:

https://github.com/apache/spark/pull/20362

[Spark-22886][ML][TESTS] ML test for structured streaming: ml.recomme…

## What changes were proposed in this pull request?

Converting spark.ml.recommendation tests to also check code with structured streaming, using the ML testing infrastructure implemented in SPARK-22882.

## How was this patch tested?

Automated: Pass the Jenkins.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gaborgsomogyi/spark SPARK-22886

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20362.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20362

commit 33654c93c2fe240eb0c6a6932353239ab84b0ce0
Author: Gabor Somogyi
Date: 2018-01-18T20:27:08Z

    [Spark-22886][ML][TESTS] ML test for structured streaming: ml.recommendation