[ 
https://issues.apache.org/jira/browse/SPARK-23690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405174#comment-16405174
 ] 

yogesh garg commented on SPARK-23690:
-------------------------------------

In an offline discussion, [~mrbago] and I settled on the following behavior 
for `handleInvalid`. We need the lengths of the vector columns involved in the 
assembly. Ideally this information is present in the `attributeGroup` of each 
column, but that can return `size == -1`, in which case we previously used 
`d.select.first` to infer the size of these columns. That can raise an 
exception in the corner case where the first row itself has null values. We 
are abandoning the idea of getting this information by finding a non-null row 
in each such column, because that approach has complicated logic, terrible run 
time (O(#columns) distributed queries), and fewer guarantees for data we might 
see in the future (even if we can infer the size right now, there's no 
guarantee we can do it later, leading to an unexpected error).

1. *Error*: Find the remaining lengths from `d.select.first`
  * if we get a NullPointerException while iterating over the cells for sizes, 
throw an (early) error
  * if we get a NoSuchElementException while looking for the first row, give 
the columns size 0 and warn about incomplete metadata

2. *Skip*: Find the remaining lengths from `d.drop.first`
  * if we get a NoSuchElementException, warn about incomplete metadata
  * Note that we can't get a NullPointerException in this case (yay!)

3. *Keep*: If any column does not have attribute sizes, it's dangerous to infer 
sizes from the data, because even if we get the information from the current 
dataset, a future cut of the data is not guaranteed to be inferable. Thus, 
throw an error encouraging the use of `VectorSizeHint`
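
To make the three cases above concrete, here is a rough sketch of the decision logic in plain Python, with no Spark involved. Everything in it is an assumption for illustration only: `resolve_sizes`, `UNKNOWN`, and the dict-of-lists row representation are hypothetical names, not part of VectorAssembler, and the sketch assumes that *Skip* drops rows that are null in any column whose size is unknown.

```python
import warnings
from typing import Dict, List, Optional, Sequence

UNKNOWN = -1  # stands in for the `size == -1` case from the column metadata

Row = Dict[str, Optional[Sequence[float]]]


def resolve_sizes(metadata_sizes: Dict[str, int],
                  rows: List[Row],
                  handle_invalid: str) -> Dict[str, int]:
    """Hypothetical helper: fill in vector sizes missing from the metadata."""
    sizes = dict(metadata_sizes)
    missing = [col for col, size in sizes.items() if size == UNKNOWN]
    if not missing:
        return sizes  # metadata is complete, nothing to infer

    if handle_invalid == "keep":
        # Keep: never infer from data, ask for explicit sizes instead.
        raise ValueError(
            f"Unknown sizes for {missing}; consider VectorSizeHint")
    if handle_invalid == "skip":
        # Skip: only look at rows with no nulls in the columns of interest.
        candidates = [r for r in rows
                      if all(r[col] is not None for col in missing)]
    elif handle_invalid == "error":
        # Error: look at the data as-is.
        candidates = rows
    else:
        raise ValueError(f"Unsupported handleInvalid: {handle_invalid}")

    if not candidates:
        # Analogue of finding no first row at all.
        warnings.warn(f"Incomplete metadata for columns {missing}")
        if handle_invalid == "error":
            for col in missing:
                sizes[col] = 0
        return sizes

    first = candidates[0]
    for col in missing:
        if first[col] is None:
            # Only reachable for "error": a null in the first row.
            raise ValueError(f"Null in first row of column {col!r}")
        sizes[col] = len(first[col])
    return sizes
```

For example, with metadata `{"a": 3, "b": -1}` and a first row whose `b` is null, `"error"` raises early, `"skip"` infers the size of `b` from the first fully non-null row, and `"keep"` raises without looking at the data.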

Please share thoughts and feedback on this!

> VectorAssembler should have handleInvalid to handle columns with null values
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-23690
>                 URL: https://issues.apache.org/jira/browse/SPARK-23690
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: yogesh garg
>            Priority: Major
>
> VectorAssembler only takes numeric columns (and vectors (of numeric?)) as 
> input and returns the assembled vector. It currently throws an error if it 
> sees a null value in any column. This behavior also affects `RFormula`, which 
> uses VectorAssembler to assemble numeric columns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
