[ 
https://issues.apache.org/jira/browse/SPARK-31671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31671.
----------------------------------
    Fix Version/s: 2.4.7
                   3.0.0
         Assignee: YijieFan
       Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28487

> Wrong error message in VectorAssembler  when column lengths can not be 
> inferred
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-31671
>                 URL: https://issues.apache.org/jira/browse/SPARK-31671
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.4
>         Environment: Mac OS  catalina
>            Reporter: YijieFan
>            Assignee: YijieFan
>            Priority: Minor
>             Fix For: 3.0.0, 2.4.7
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In VectorAssembler when input column lengths can not be inferred and 
> handleInvalid = "keep", it will throw a runtime exception with message like 
> below
> _Can not infer column lengths with handleInvalid = "keep". Consider using 
> VectorSizeHint to add metadata for columns: [column1, column2]_
> However, even if you set a vector size hint for *column1*, the message stays 
> the same instead of narrowing to *[column2]* only. This is inconsistent with 
> the description in the error message.
> This makes the exception difficult to resolve, because I cannot tell which 
> columns still require a VectorSizeHint. It is especially troublesome when 
> there are a large number of columns to deal with.
> Here is a simple example:
>  
> {code:java}
> import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
> import org.apache.spark.ml.linalg.Vectors
> import spark.implicits._
> 
> // create a DataFrame whose vector columns carry no size metadata
> val df = Seq(
>   (Vectors.dense(1.0), Vectors.dense(2.0))
> ).toDF("n1", "n2")
> // set a vector size hint for the n1 column only
> val hintedDf = new VectorSizeHint()
>   .setInputCol("n1")
>   .setSize(1)
>   .transform(df)
> // assemble n1 and n2
> val output = new VectorAssembler()
>   .setInputCols(Array("n1", "n2"))
>   .setOutputCol("features")
>   .setHandleInvalid("keep")
>   .transform(hintedDf)
> // only n1 has a known vector size, so the error message should tell us to
> // set a vector size for n2 as well
> output.show()
> {code}
> Expected error message:
>  
> {code:java}
> Can not infer column lengths with handleInvalid = "keep". Consider using 
> VectorSizeHint to add metadata for columns: [n2].
> {code}
> Actual error message:
> {code:java}
> Can not infer column lengths with handleInvalid = "keep". Consider using 
> VectorSizeHint to add metadata for columns: [n1, n2].
> {code}
> I changed one line in VectorAssembler.scala so that it works as expected. 
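> The kind of one-line change involved can be sketched outside Spark: instead of 
> listing every input column in the message, keep only the columns whose length 
> is still unknown. The sketch below is hypothetical and self-contained (the 
> helper name pendingColumns and the Map-based signature are my own, not 
> Spark's internal API); the actual patch is in the pull request linked above.
> {code:java}
> // Hypothetical sketch: given each input column's (possibly unknown) vector
> // size, report only the columns that still need a VectorSizeHint.
> def pendingColumns(sizes: Map[String, Option[Int]]): Seq[String] =
>   sizes.collect { case (col, None) => col }.toSeq.sorted
> 
> def lengthError(sizes: Map[String, Option[Int]]): String =
>   s"""Can not infer column lengths with handleInvalid = "keep". """ +
>     s"Consider using VectorSizeHint to add metadata for columns: " +
>     s"[${pendingColumns(sizes).mkString(", ")}]."
> {code}
> With sizes = Map("n1" -> Some(1), "n2" -> None), lengthError lists only [n2], 
> matching the expected message above.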



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
