This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 1f85cd7  [SPARK-31671][ML] Wrong error message in VectorAssembler
1f85cd7 is described below

commit 1f85cd7504623b9b4e7957aab5856f72e981cbd9
Author: fan31415 <fan12356...@gmail.com>
AuthorDate: Mon May 11 18:23:23 2020 -0500

    [SPARK-31671][ML] Wrong error message in VectorAssembler

    ### What changes were proposed in this pull request?
    When input column lengths cannot be inferred and handleInvalid = "keep", VectorAssembler throws a runtime exception. However, the error message lists every input column instead of only the columns that lack size metadata. This change makes the message list only those columns.

    ### Why are the changes needed?
    This is a bug. Here is a simple example to reproduce it.

    ```
    // imports needed outside spark-shell; toDF also needs spark.implicits._ in scope
    import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
    import org.apache.spark.ml.linalg.Vectors

    // create a df without vector size
    val df = Seq(
      (Vectors.dense(1.0), Vectors.dense(2.0))
    ).toDF("n1", "n2")

    // only set vector size hint for n1 column
    val hintedDf = new VectorSizeHint()
      .setInputCol("n1")
      .setSize(1)
      .transform(df)

    // assemble n1, n2
    val output = new VectorAssembler()
      .setInputCols(Array("n1", "n2"))
      .setOutputCol("features")
      .setHandleInvalid("keep")
      .transform(hintedDf)

    // because only n1 has vector size, the error message should tell us to set vector size for n2 too
    output.show()
    ```

    Expected error message:

    ```
    Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
    ```

    Actual error message:

    ```
    Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
    ```

    This makes the exception hard to resolve, because the message does not show which column requires a VectorSizeHint. This is especially troublesome when there are a large number of columns to deal with.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Added a test in VectorAssemblerSuite.

    Closes #28487 from fan31415/SPARK-31671.

    Lead-authored-by: fan31415 <fan12356...@gmail.com>
    Co-authored-by: yijiefan <fany...@gmail.com>
    Signed-off-by: Sean Owen <sro...@gmail.com>
    (cherry picked from commit 64fb358a994d3fff651a742fa067c194b7455853)
    Signed-off-by: Sean Owen <sro...@gmail.com>
---
 .../scala/org/apache/spark/ml/feature/VectorAssembler.scala   |  2 +-
 .../org/apache/spark/ml/feature/VectorAssemblerSuite.scala    | 11 +++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
index 9192e72..994681a 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
@@ -228,7 +228,7 @@ object VectorAssembler extends DefaultParamsReadable[VectorAssembler] {
         getVectorLengthsFromFirstRow(dataset.na.drop(missingColumns), missingColumns)
       case (true, VectorAssembler.KEEP_INVALID) => throw new RuntimeException(
         s"""Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint
-           |to add metadata for columns: ${columns.mkString("[", ", ", "]")}."""
+           |to add metadata for columns: ${missingColumns.mkString("[", ", ", "]")}."""
           .stripMargin.replaceAll("\n", " "))
       case (_, _) => Map.empty
     }
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
index a4d388f..4957f6f 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
@@ -261,4 +261,15 @@ class VectorAssemblerSuite
     val output = vectorAssembler.transform(dfWithNullsAndNaNs)
     assert(output.select("a").limit(1).collect().head == Row(Vectors.sparse(0, Seq.empty)))
   }
+
+  test("SPARK-31671: should give explicit error message when can not infer column lengths") {
+    val df = Seq(
+      (Vectors.dense(1.0), Vectors.dense(2.0))
+    ).toDF("n1", "n2")
+    val hintedDf = new VectorSizeHint().setInputCol("n1").setSize(1).transform(df)
+    val assembler = new VectorAssembler()
+      .setInputCols(Array("n1", "n2")).setOutputCol("features")
+    assert(!intercept[RuntimeException](assembler.setHandleInvalid("keep").transform(hintedDf))
+      .getMessage.contains("n1"), "should only show no vector size columns' name")
+  }
 }

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org