[ https://issues.apache.org/jira/browse/SPARK-31671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen resolved SPARK-31671.
----------------------------------
    Fix Version/s: 2.4.7
                   3.0.0
         Assignee: YijieFan
       Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28487

> Wrong error message in VectorAssembler when column lengths can not be inferred
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-31671
>                 URL: https://issues.apache.org/jira/browse/SPARK-31671
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.4
>         Environment: macOS Catalina
>            Reporter: YijieFan
>            Assignee: YijieFan
>            Priority: Minor
>             Fix For: 3.0.0, 2.4.7
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In VectorAssembler, when the input column lengths cannot be inferred and
> handleInvalid = "keep", a runtime exception is thrown with a message like
> the one below:
>
> _Can not infer column lengths with handleInvalid = "keep". Consider using
> VectorSizeHint to add metadata for columns: [column1, column2]_
>
> However, even if you set a vector size hint for *column1*, the message stays
> the same and does not change to list *[column2]* only. This is inconsistent
> with what the error message itself says.
> This makes the exception hard to resolve, because I do not know which
> columns still require a VectorSizeHint. It is especially troublesome when
> there is a large number of columns to deal with.
> Here is a simple example:
>
> {code:java}
> import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
> import org.apache.spark.ml.linalg.Vectors
> import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell
>
> // create a df without vector size
> val df = Seq(
>   (Vectors.dense(1.0), Vectors.dense(2.0))
> ).toDF("n1", "n2")
> // only set vector size hint for n1 column
> val hintedDf = new VectorSizeHint()
>   .setInputCol("n1")
>   .setSize(1)
>   .transform(df)
> // assemble n1, n2
> val output = new VectorAssembler()
>   .setInputCols(Array("n1", "n2"))
>   .setOutputCol("features")
>   .setHandleInvalid("keep")
>   .transform(hintedDf)
> // because only n1 has vector size, the error message should tell us to set
> // vector size for n2 too
> output.show()
> {code}
> Expected error message:
>
> {code:java}
> Can not infer column lengths with handleInvalid = "keep". Consider using
> VectorSizeHint to add metadata for columns: [n2].
> {code}
> Actual error message:
> {code:java}
> Can not infer column lengths with handleInvalid = "keep". Consider using
> VectorSizeHint to add metadata for columns: [n1, n2].
> {code}
> I changed one line in VectorAssembler.scala so that the error message works
> as expected.
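
The one-line change itself is not shown in this message. As a rough, Spark-free sketch of the behaviour that the expected message implies, the message can be built from only those columns whose size is still unknown. Everything here is an assumption made for illustration: column sizes are modelled as a plain Map in which a missing entry or -1 means "no size metadata", and the names MissingSizeMessage, missingColumns and keepInvalidMessage are hypothetical and are not Spark APIs.

```scala
// Illustrative sketch only; this is not the code from VectorAssembler.scala.
object MissingSizeMessage {
  // Hypothetical marker for "no size metadata available".
  val UnknownSize: Int = -1

  // Columns whose vector size is unknown, in the order they were requested.
  def missingColumns(columns: Seq[String], sizes: Map[String, Int]): Seq[String] =
    columns.filter(c => sizes.getOrElse(c, UnknownSize) == UnknownSize)

  // Build the message so that it names only the columns that still need a hint.
  def keepInvalidMessage(missing: Seq[String]): String = {
    val listed = missing.mkString("[", ", ", "]")
    "Can not infer column lengths with handleInvalid = \"keep\". Consider using " +
      s"VectorSizeHint to add metadata for columns: ${listed}."
  }

  def main(args: Array[String]): Unit = {
    val columns = Seq("n1", "n2")
    val sizes = Map("n1" -> 1) // only n1 received a size hint
    println(keepInvalidMessage(missingColumns(columns, sizes)))
  }
}
```

With the inputs from the example above (only n1 hinted), the sketch names [n2] alone, which matches the expected error message. Listing every input column instead of the filtered ones would reproduce the reported [n1, n2] output.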