[jira] [Updated] (SPARK-31671) Wrong error message in VectorAssembler when column lengths can not be inferred
[ https://issues.apache.org/jira/browse/SPARK-31671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Holden Karau updated SPARK-31671:
---------------------------------
    Fix Version/s:     (was: 2.4.7)
                       2.4.6

> Wrong error message in VectorAssembler when column lengths can not be inferred
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-31671
>                 URL: https://issues.apache.org/jira/browse/SPARK-31671
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.4
>         Environment: Mac OS Catalina
>            Reporter: YijieFan
>            Assignee: YijieFan
>            Priority: Minor
>             Fix For: 2.4.6, 3.0.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> When VectorAssembler cannot infer the input column lengths and handleInvalid = "keep", it throws a runtime exception with a message like the one below:
>
> _Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [column1, column2]_
>
> However, even after you set a vector size hint for *column1*, the message stays the same; it does not narrow to *[column2]* only. This is inconsistent with what the error message describes.
> This makes the exception difficult to resolve, because the message does not tell me which columns actually still require a VectorSizeHint. It is especially troublesome when there is a large number of columns to deal with.
> Here is a simple example:
>
> {code:java}
> import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
> import org.apache.spark.ml.linalg.Vectors
>
> // create a df without vector size metadata
> val df = Seq(
>   (Vectors.dense(1.0), Vectors.dense(2.0))
> ).toDF("n1", "n2")
> // only set a vector size hint for the n1 column
> val hintedDf = new VectorSizeHint()
>   .setInputCol("n1")
>   .setSize(1)
>   .transform(df)
> // assemble n1, n2
> val output = new VectorAssembler()
>   .setInputCols(Array("n1", "n2"))
>   .setOutputCol("features")
>   .setHandleInvalid("keep")
>   .transform(hintedDf)
> // because only n1 has a vector size, the error message should tell us to set
> // a vector size for n2 only
> output.show()
> {code}
>
> Expected error message:
>
> {code:java}
> Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
> {code}
>
> Actual error message:
>
> {code:java}
> Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
> {code}
>
> I changed one line in VectorAssembler.scala so that it works as expected.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
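The one-line patch itself is not included in the report. A minimal, self-contained sketch of the behavior it asks for (not the actual Spark code) would filter the column list down to only those whose length is still unknown before building the message. The column names and the -1 "unknown length" sentinel below are illustrative assumptions:

{code:java}
// Hypothetical inferred lengths: n1 has size metadata, n2 does not (-1 = unknown).
val inferredLengths = Map("n1" -> 1, "n2" -> -1)

// Name only the columns whose length could not be inferred, instead of all inputs.
val missingColumns = inferredLengths.filter(_._2 == -1).keys.toSeq.sorted
if (missingColumns.nonEmpty) {
  throw new RuntimeException(
    s"""Can not infer column lengths with handleInvalid = "keep". """ +
      "Consider using VectorSizeHint to add metadata for columns: " +
      missingColumns.mkString("[", ", ", "]") + ".")
}
{code}

With this filtering, the example above would report [n2] rather than [n1, n2].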
[jira] [Updated] (SPARK-31671) Wrong error message in VectorAssembler when column lengths can not be inferred
[ https://issues.apache.org/jira/browse/SPARK-31671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen updated SPARK-31671:
---------------------------------
    Affects Version/s:     (was: 3.0.1)
               Labels:     (was: pull-request-available)

> Wrong error message in VectorAssembler when column lengths can not be inferred
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-31671
>                 URL: https://issues.apache.org/jira/browse/SPARK-31671
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.4
>            Reporter: YijieFan
>            Priority: Minor