[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17373 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r133324363 --- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala --- @@ -527,9 +550,21 @@ private[ml] class FeedForwardModel private( override def predict(data: Vector): Vector = { val size = data.size -val result = forward(new BDM[Double](size, 1, data.toArray)) +val result = forward(new BDM[Double](size, 1, data.toArray), true) Vectors.dense(result.last.toArray) } + + override def predictRaw(data: Vector): Vector = { +val size = data.size +val result = forward(new BDM[Double](size, 1, data.toArray), false) +Vectors.dense(result(result.length - 2).toArray) + } + + override def raw2ProbabilityInPlace(data: Vector): Vector = { +val dataMatrix = new BDM[Double](data.size, 1, data.toArray) +layerModels.last.eval(dataMatrix, dataMatrix) --- End diff -- Ping: If this proposal sounds good, then can you please update accordingly? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r133322927 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala --- @@ -82,6 +83,49 @@ class MultilayerPerceptronClassifierSuite } } + test("strong dataset test") { +val layers = Array[Int](4, 5, 5, 2) + +val strongDataset = Seq( + (Vectors.dense(1, 2, 3, 4), 0d, Vectors.dense(1d, 0d)), + (Vectors.dense(4, 3, 2, 1), 1d, Vectors.dense(0d, 1d)), + (Vectors.dense(1, 1, 1, 1), 0d, Vectors.dense(.5, .5)), + (Vectors.dense(1, 1, 1, 1), 1d, Vectors.dense(.5, .5)) +).toDF("features", "label", "expectedProbability") +val trainer = new MultilayerPerceptronClassifier() + .setLayers(layers) + .setBlockSize(1) + .setSeed(123L) + .setMaxIter(100) + .setSolver("l-bfgs") +val model = trainer.fit(strongDataset) +val result = model.transform(strongDataset) +model.setProbabilityCol("probability") +MLTestingUtils.checkCopyAndUids(trainer, model) +// result.select("probability").show(false) +result.select("probability", "expectedProbability").collect().foreach { + case Row(p: Vector, e: Vector) => +assert(p ~== e absTol 1e-3) +} + } + + test("test model probability") { +val layers = Array[Int](2, 5, 2) +val trainer = new MultilayerPerceptronClassifier() + .setLayers(layers) + .setBlockSize(1) + .setSeed(123L) + .setMaxIter(100) + .setSolver("l-bfgs") +val model = trainer.fit(dataset) +model.setProbabilityCol("probability") --- End diff -- Ping --- this should not be necessary --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r133323889 --- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala --- @@ -374,6 +380,22 @@ private[ann] trait TopologyModel extends Serializable { def predict(data: Vector): Vector /** + * Raw prediction of the model + * + * @param data input data + * @return raw prediction + */ + def predictRaw(data: Vector): Vector --- End diff -- Ping: rename data -> features --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r133081809 --- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala --- @@ -527,9 +550,21 @@ private[ml] class FeedForwardModel private( override def predict(data: Vector): Vector = { val size = data.size -val result = forward(new BDM[Double](size, 1, data.toArray)) +val result = forward(new BDM[Double](size, 1, data.toArray), true) Vectors.dense(result.last.toArray) } + + override def predictRaw(data: Vector): Vector = { +val size = data.size +val result = forward(new BDM[Double](size, 1, data.toArray), false) +Vectors.dense(result(result.length - 2).toArray) + } + + override def raw2ProbabilityInPlace(data: Vector): Vector = { +val dataMatrix = new BDM[Double](data.size, 1, data.toArray) +layerModels.last.eval(dataMatrix, dataMatrix) --- End diff -- This assumes that the ```eval``` method can operate in-place. That is fine for the last layer for MLP (SoftmaxLayerModelWithCrossEntropyLoss), but not OK in general. More generally, these methods for classifiers should not go in the very general TopologyModel abstraction; that abstraction may be used in the future for regression as well. I'd be fine with putting this classification-specific logic in MLP itself; we do not need to generalize the logic until we add other Classifiers, which might take a long time. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r133079098 --- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala --- @@ -374,6 +380,22 @@ private[ann] trait TopologyModel extends Serializable { def predict(data: Vector): Vector /** + * Raw prediction of the model + * + * @param data input data + * @return raw prediction + */ + def predictRaw(data: Vector): Vector + + /** + * Probability of the model --- End diff -- This documentation does not add any information. Can you please link to ProbabilisticClassifier instead? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r133079087 --- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala --- @@ -374,6 +380,22 @@ private[ann] trait TopologyModel extends Serializable { def predict(data: Vector): Vector /** + * Raw prediction of the model --- End diff -- This documentation does not add any information. Can you please link to ProbabilisticClassifier instead? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r133079226 --- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala --- @@ -374,6 +380,22 @@ private[ann] trait TopologyModel extends Serializable { def predict(data: Vector): Vector /** + * Raw prediction of the model + * + * @param data input data + * @return raw prediction + */ + def predictRaw(data: Vector): Vector + + /** + * Probability of the model + * + * @param data input data --- End diff -- "input data" sounds like input feature values, which is not correct. Why not match the ProbabilisticClassifier API? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r133076755 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala --- @@ -82,6 +83,49 @@ class MultilayerPerceptronClassifierSuite } } + test("strong dataset test") { +val layers = Array[Int](4, 5, 5, 2) --- End diff -- Can you make this test faster by using a simpler network, e.g., by removing one of the middle layers? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r133075752 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala --- @@ -82,6 +83,49 @@ class MultilayerPerceptronClassifierSuite } } + test("strong dataset test") { +val layers = Array[Int](4, 5, 5, 2) + +val strongDataset = Seq( + (Vectors.dense(1, 2, 3, 4), 0d, Vectors.dense(1d, 0d)), + (Vectors.dense(4, 3, 2, 1), 1d, Vectors.dense(0d, 1d)), + (Vectors.dense(1, 1, 1, 1), 0d, Vectors.dense(.5, .5)), + (Vectors.dense(1, 1, 1, 1), 1d, Vectors.dense(.5, .5)) +).toDF("features", "label", "expectedProbability") +val trainer = new MultilayerPerceptronClassifier() + .setLayers(layers) + .setBlockSize(1) + .setSeed(123L) + .setMaxIter(100) + .setSolver("l-bfgs") +val model = trainer.fit(strongDataset) +val result = model.transform(strongDataset) +model.setProbabilityCol("probability") +MLTestingUtils.checkCopyAndUids(trainer, model) --- End diff -- checkCopyAndUids is a generic test which should only be run in a single test; it does not need to be run in each test. Please remove it from here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r133075802 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala --- @@ -82,6 +83,49 @@ class MultilayerPerceptronClassifierSuite } } + test("strong dataset test") { +val layers = Array[Int](4, 5, 5, 2) + +val strongDataset = Seq( + (Vectors.dense(1, 2, 3, 4), 0d, Vectors.dense(1d, 0d)), + (Vectors.dense(4, 3, 2, 1), 1d, Vectors.dense(0d, 1d)), + (Vectors.dense(1, 1, 1, 1), 0d, Vectors.dense(.5, .5)), + (Vectors.dense(1, 1, 1, 1), 1d, Vectors.dense(.5, .5)) +).toDF("features", "label", "expectedProbability") +val trainer = new MultilayerPerceptronClassifier() + .setLayers(layers) + .setBlockSize(1) + .setSeed(123L) + .setMaxIter(100) + .setSolver("l-bfgs") +val model = trainer.fit(strongDataset) +val result = model.transform(strongDataset) +model.setProbabilityCol("probability") +MLTestingUtils.checkCopyAndUids(trainer, model) +// result.select("probability").show(false) --- End diff -- remove old comment --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r133080046 --- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala --- @@ -361,9 +361,15 @@ private[ann] trait TopologyModel extends Serializable { * Forward propagation * * @param data input data + * @param includeLastLayer include last layer when computing. In MultilayerPerceptronClassifier, --- End diff -- This text is unclear. This phrasing is better: "Include the last layer in the output. In MultilayerPerceptronClassifier, the last layer is always softmax; the last layer of outputs is needed for class predictions, but not for rawPrediction." --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r133076328 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala --- @@ -82,6 +83,49 @@ class MultilayerPerceptronClassifierSuite } } + test("strong dataset test") { --- End diff -- Make the test title more descriptive so it is clear what it is testing. E.g. "Predicted class probabilities: calibration on toy dataset" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r133082415 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala --- @@ -82,6 +83,49 @@ class MultilayerPerceptronClassifierSuite } } + test("strong dataset test") { +val layers = Array[Int](4, 5, 5, 2) + +val strongDataset = Seq( + (Vectors.dense(1, 2, 3, 4), 0d, Vectors.dense(1d, 0d)), + (Vectors.dense(4, 3, 2, 1), 1d, Vectors.dense(0d, 1d)), + (Vectors.dense(1, 1, 1, 1), 0d, Vectors.dense(.5, .5)), + (Vectors.dense(1, 1, 1, 1), 1d, Vectors.dense(.5, .5)) +).toDF("features", "label", "expectedProbability") +val trainer = new MultilayerPerceptronClassifier() + .setLayers(layers) + .setBlockSize(1) + .setSeed(123L) + .setMaxIter(100) + .setSolver("l-bfgs") +val model = trainer.fit(strongDataset) +val result = model.transform(strongDataset) +model.setProbabilityCol("probability") +MLTestingUtils.checkCopyAndUids(trainer, model) +// result.select("probability").show(false) +result.select("probability", "expectedProbability").collect().foreach { + case Row(p: Vector, e: Vector) => +assert(p ~== e absTol 1e-3) +} + } + + test("test model probability") { +val layers = Array[Int](2, 5, 2) +val trainer = new MultilayerPerceptronClassifier() + .setLayers(layers) + .setBlockSize(1) + .setSeed(123L) + .setMaxIter(100) + .setSolver("l-bfgs") +val model = trainer.fit(dataset) +model.setProbabilityCol("probability") --- End diff -- That's the default already, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r133080187 --- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala --- @@ -527,9 +550,21 @@ private[ml] class FeedForwardModel private( override def predict(data: Vector): Vector = { val size = data.size -val result = forward(new BDM[Double](size, 1, data.toArray)) +val result = forward(new BDM[Double](size, 1, data.toArray), true) Vectors.dense(result.last.toArray) } + + override def predictRaw(data: Vector): Vector = { +val size = data.size --- End diff -- This temp val of "size" is only used once, so I recommend removing it to make the code clearer. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r133082607 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala --- @@ -82,6 +83,49 @@ class MultilayerPerceptronClassifierSuite } } + test("strong dataset test") { +val layers = Array[Int](4, 5, 5, 2) + +val strongDataset = Seq( + (Vectors.dense(1, 2, 3, 4), 0d, Vectors.dense(1d, 0d)), + (Vectors.dense(4, 3, 2, 1), 1d, Vectors.dense(0d, 1d)), + (Vectors.dense(1, 1, 1, 1), 0d, Vectors.dense(.5, .5)), + (Vectors.dense(1, 1, 1, 1), 1d, Vectors.dense(.5, .5)) +).toDF("features", "label", "expectedProbability") +val trainer = new MultilayerPerceptronClassifier() + .setLayers(layers) + .setBlockSize(1) + .setSeed(123L) + .setMaxIter(100) + .setSolver("l-bfgs") +val model = trainer.fit(strongDataset) +val result = model.transform(strongDataset) +model.setProbabilityCol("probability") +MLTestingUtils.checkCopyAndUids(trainer, model) +// result.select("probability").show(false) +result.select("probability", "expectedProbability").collect().foreach { + case Row(p: Vector, e: Vector) => +assert(p ~== e absTol 1e-3) +} + } + + test("test model probability") { +val layers = Array[Int](2, 5, 2) +val trainer = new MultilayerPerceptronClassifier() + .setLayers(layers) + .setBlockSize(1) + .setSeed(123L) + .setMaxIter(100) + .setSolver("l-bfgs") +val model = trainer.fit(dataset) +model.setProbabilityCol("probability") +val result = model.transform(dataset) +val features2prob = udf { features: Vector => model.mlpModel.predict(features) } +val cmpVec = udf { (v1: Vector, v2: Vector) => v1 ~== v2 relTol 1e-3 } +assert(result.select(cmpVec(features2prob(col("features")), col("probability"))) --- End diff -- If this test fails, it will not give much info. How about collecting the data and comparing on the driver? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17373: [SPARK-12664][ML] Expose probability in mlp model
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17373#discussion_r133079322 --- Diff: mllib/src/main/scala/org/apache/spark/ml/ann/Layer.scala --- @@ -374,6 +380,22 @@ private[ann] trait TopologyModel extends Serializable { def predict(data: Vector): Vector /** + * Raw prediction of the model + * + * @param data input data + * @return raw prediction + */ + def predictRaw(data: Vector): Vector --- End diff -- Please match the Classifier API here (data -> features) unless there's a reason to deviate --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org