[spark] branch branch-2.4 updated: [SPARK-31671][ML] Wrong error message in VectorAssembler

2020-05-11 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new 1f85cd7  [SPARK-31671][ML] Wrong error message in VectorAssembler
1f85cd7 is described below

commit 1f85cd7504623b9b4e7957aab5856f72e981cbd9
Author: fan31415 
AuthorDate: Mon May 11 18:23:23 2020 -0500

[SPARK-31671][ML] Wrong error message in VectorAssembler

### What changes were proposed in this pull request?
When input column lengths cannot be inferred and handleInvalid = "keep", 
VectorAssembler throws a runtime exception. However, the error message attached to 
this exception is misleading: it lists every input column instead of only the columns 
whose lengths could not be inferred. I changed the message so that it names only the 
offending columns.

### Why are the changes needed?
This is a bug. Here is a simple example to reproduce it.

```
// minimal repro, e.g. in spark-shell with an active SparkSession named `spark`
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._  // for .toDF

// create a DataFrame whose vector columns carry no size metadata
val df = Seq(
  (Vectors.dense(1.0), Vectors.dense(2.0))
).toDF("n1", "n2")

// set a vector size hint for the n1 column only
val hintedDf = new VectorSizeHint()
  .setInputCol("n1")
  .setSize(1)
  .transform(df)

// assemble n1 and n2
val output = new VectorAssembler()
  .setInputCols(Array("n1", "n2"))
  .setOutputCol("features")
  .setHandleInvalid("keep")
  .transform(hintedDf)

// only n1 has a known vector size, so the error message should tell us
// to set a vector size hint for n2 only
output.show()
```

Expected error message:

```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
```

Actual error message:

```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
```

This introduced difficulties when I tried to resolve the exception, because the message 
did not tell me which columns actually required a VectorSizeHint. This is especially 
troublesome when there are a large number of columns to deal with.
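
With the corrected message naming only the columns that lack size metadata, the 
follow-up fix is straightforward. A minimal sketch, reusing the n1/n2 columns from the 
example above and assuming the corrected message points at n2:

```
// hint the reported column (n2) as well, so every input column carries size metadata
val fullyHintedDf = new VectorSizeHint()
  .setInputCol("n2")
  .setSize(1)
  .transform(hintedDf)

val assembled = new VectorAssembler()
  .setInputCols(Array("n1", "n2"))
  .setOutputCol("features")
  .setHandleInvalid("keep")
  .transform(fullyHintedDf)

// no exception now: all lengths are read from column metadata
assembled.show()
```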

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added a test in VectorAssemblerSuite.

Closes #28487 from fan31415/SPARK-31671.

Lead-authored-by: fan31415 
Co-authored-by: yijiefan 
Signed-off-by: Sean Owen 
(cherry picked from commit 64fb358a994d3fff651a742fa067c194b7455853)
Signed-off-by: Sean Owen 
---
 .../scala/org/apache/spark/ml/feature/VectorAssembler.scala   |  2 +-
 .../org/apache/spark/ml/feature/VectorAssemblerSuite.scala| 11 +++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
index 9192e72..994681a 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
@@ -228,7 +228,7 @@ object VectorAssembler extends DefaultParamsReadable[VectorAssembler] {
         getVectorLengthsFromFirstRow(dataset.na.drop(missingColumns), missingColumns)
       case (true, VectorAssembler.KEEP_INVALID) => throw new RuntimeException(
         s"""Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint
-           |to add metadata for columns: ${columns.mkString("[", ", ", "]")}."""
+           |to add metadata for columns: ${missingColumns.mkString("[", ", ", "]")}."""
           .stripMargin.replaceAll("\n", " "))
       case (_, _) => Map.empty
     }
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
index a4d388f..4957f6f 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
@@ -261,4 +261,15 @@ class VectorAssemblerSuite
     val output = vectorAssembler.transform(dfWithNullsAndNaNs)
     assert(output.select("a").limit(1).collect().head == Row(Vectors.sparse(0, Seq.empty)))
   }
+
+  test("SPARK-31671: should give explicit error message when can not infer column lengths") {
+    val df = Seq(
+      (Vectors.dense(1.0), Vectors.dense(2.0))
+    ).toDF("n1", "n2")
+    val hintedDf = new VectorSizeHint().setInputCol("n1").setSize(1).transform(df)
+    val assembler = new VectorAssembler()
+      .setInputCols(Array("n1", "n2")).setOutputCol("features")
+    assert(!intercept[RuntimeException](assembler.setHandleInvalid("keep").transform(hintedDf))
+      .getMessage.contains("n1"), "should only show no vector size columns' name")
+  }
 }

