This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 72af2c0fbc6 [SPARK-44585][MLLIB] Fix warning condition in MLLib RankingMetrics ndcgAk
72af2c0fbc6 is described below

commit 72af2c0fbc6673a5e49f1fd6693fe2c90141a84f
Author: Guilhem Vuillier <101632595+guilhem-de...@users.noreply.github.com>
AuthorDate: Fri Jul 28 17:29:47 2023 -0500

    [SPARK-44585][MLLIB] Fix warning condition in MLLib RankingMetrics ndcgAk

    ### What changes were proposed in this pull request?

    This PR fixes the condition to raise the following warning in MLLib's RankingMetrics ndcgAk function:
    "# of ground truth set and # of relevance value set should be equal, check input data"

    The logic for raising warnings is faulty at the moment: it raises a warning if the `rel` input is empty and `lab.size` and `rel.size` are not equal.

    The logic should be to raise a warning if `rel` input is **not empty** and `lab.size` and `rel.size` are not equal.

    This warning was added in the following PR: https://github.com/apache/spark/pull/36843

    ### Why are the changes needed?

    With the current logic, RankingMetrics will:
    - raise incorrect warning when a user is using it in the "binary" mode (i.e. no relevance values in the input)
    - not raise warning (that could be necessary) when the user is using it in the "non-binary" model (i.e. with relevance values in the input)

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    No change made to the test suite for RankingMetrics: https://github.com/uchiiii/spark/blob/a172172329cc78b50f716924f2a344517deb71fc/mllib/src/test/scala/org/apache/spark/mllib/evaluation/RankingMetricsSuite.scala

    Closes #42207 from guilhem-depop/patch-1.

    Authored-by: Guilhem Vuillier <101632595+guilhem-de...@users.noreply.github.com>
    Signed-off-by: Sean Owen <sro...@gmail.com>
---
 .../scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala     | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala
index 37e57736574..a3316d8a8fa 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala
@@ -140,6 +140,9 @@ class RankingMetrics[T: ClassTag] @Since("1.2.0") (predictionAndLabels: RDD[_ <:
    * and the NDCG is obtained by dividing the DCG value on the ground truth set. In the current
    * implementation, the relevance value is binary if the relevance value is empty.
 
+   * If the relevance value is not empty but its size doesn't match the ground truth set size,
+   * a log warning is generated.
+   *
    * If a query has an empty ground truth set, zero will be used as ndcg together with
    * a log warning.
    *
@@ -157,7 +160,7 @@ class RankingMetrics[T: ClassTag] @Since("1.2.0") (predictionAndLabels: RDD[_ <:
       val useBinary = rel.isEmpty
       val labSet = lab.toSet
       val relMap = Utils.toMap(lab, rel)
-      if (useBinary && lab.size != rel.size) {
+      if (!useBinary && lab.size != rel.size) {
         logWarning(
           "# of ground truth set and # of relevance value set should be equal, " +
             "check input data")

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
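
For readers who want to see the effect of the one-line change in isolation, here is a minimal sketch that compares the old and new conditions on plain arrays. It does not use Spark: `NdcgWarningSketch`, `oldShouldWarn`, and `newShouldWarn` are hypothetical names introduced only for illustration. Only the two boolean expressions are taken from the diff above.

```scala
// Hypothetical stand-alone helper, not part of Spark. It reproduces only the
// boolean condition that guards the logWarning call in the diff.
object NdcgWarningSketch {

  // Condition before the patch: warns only in binary mode (rel is empty).
  def oldShouldWarn[T](lab: Array[T], rel: Array[Double]): Boolean = {
    val useBinary = rel.isEmpty
    useBinary && lab.length != rel.length
  }

  // Condition after the patch: warns only when relevance values are supplied
  // and their count differs from the ground truth set size.
  def newShouldWarn[T](lab: Array[T], rel: Array[Double]): Boolean = {
    val useBinary = rel.isEmpty
    !useBinary && lab.length != rel.length
  }

  def main(args: Array[String]): Unit = {
    val lab = Array(1, 2, 3)

    // Binary mode: no relevance values supplied.
    println(oldShouldWarn(lab, Array.empty[Double])) // true  (spurious warning)
    println(newShouldWarn(lab, Array.empty[Double])) // false

    // Non-binary mode with a size mismatch.
    println(oldShouldWarn(lab, Array(3.0, 2.0)))     // false (warning missed)
    println(newShouldWarn(lab, Array(3.0, 2.0)))     // true

    // Non-binary mode with matching sizes: no warning either way.
    println(newShouldWarn(lab, Array(3.0, 2.0, 1.0))) // false
  }
}
```

The first two cases correspond to the two problems listed under "Why are the changes needed?": the old condition fired in binary mode and stayed silent on a genuine size mismatch.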