[ https://issues.apache.org/jira/browse/SPARK-23690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405174#comment-16405174 ]
yogesh garg commented on SPARK-23690:
-------------------------------------

In an offline discussion with [~mrbago], we discussed the following behavior for `handleInvalid`.

We need the lengths of the vector columns involved in the assembly. Ideally this information is present in the `attributeGroup` of each column, but that may return `size == -1`, in which case we previously used `d.select.first` to infer the sizes of these columns. That can raise an exception in the corner case where the first row itself contains null values. We are abandoning the idea of getting this information by finding a non-null row in each such column, because that approach has complicated logic, terrible run time (O(#columns) distributed queries), and weaker guarantees for data we might see in the future (even if we can infer the size now, there is no guarantee we can do so later, which would lead to an unexpected error).

1. *Error*: Find the remaining lengths from `d.select.first`
   * if we get a NullPointerException while iterating over the cells for sizes, throw an (early) error
   * if we get a NoSuchElementException while looking for the first row, assign those columns size 0 and warn about incomplete metadata
2. *Skip*: Find the remaining lengths from `d.drop.first`
   * if we get a NoSuchElementException, warn about incomplete metadata
   * Note that we can't get a NullPointerException in this case (yay!)
3. *Keep*: If any column does not have attribute sizes, it is dangerous to infer sizes from the data, because even if we get the information from the current dataset, a future cut of the data is not guaranteed to be inferable. Thus, throw an error encouraging `VectorSizeHint`.

Please share thoughts and feedback on this!
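
To make the three modes easier to compare, here is a minimal pure-Python sketch of the decision logic described above. It is not Spark code and not the proposed implementation: `infer_vector_sizes`, the `Row` alias, and the dict-based rows are hypothetical stand-ins for the Dataset, the `attributeGroup` sizes, and the `first` calls. It also makes two assumptions the comment does not state: *Skip* drops a row if any modeled column is null, and *Skip* leaves an unknown size at `-1` when no rows remain.

```python
import warnings
from typing import Dict, List, Optional, Sequence

# Hypothetical stand-in for a row: column name -> vector, or None for null.
Row = Dict[str, Optional[Sequence[float]]]


def infer_vector_sizes(rows: List[Row],
                       metadata_sizes: Dict[str, int],
                       handle_invalid: str) -> Dict[str, int]:
    """Fill in sizes that metadata reports as -1, per the three modes."""
    sizes = dict(metadata_sizes)  # copy so the caller's metadata is untouched
    missing = [col for col, size in sizes.items() if size == -1]
    if not missing:
        return sizes  # metadata is complete, no data access needed

    if handle_invalid == "keep":
        # Keep: never infer from data; ask the user for explicit sizes.
        raise ValueError(
            f"unknown sizes for {missing}; consider VectorSizeHint")
    if handle_invalid == "error":
        candidates = rows  # models d.select.first
    elif handle_invalid == "skip":
        # Models d.drop.first (assumption: any null in a row drops it).
        candidates = [r for r in rows
                      if all(r[c] is not None for c in sizes)]
    else:
        raise ValueError(f"unknown handle_invalid: {handle_invalid!r}")

    if not candidates:
        # Models the "no first row" case: warn about incomplete metadata.
        warnings.warn(f"no rows to infer sizes for {missing}")
        if handle_invalid == "error":
            for col in missing:
                sizes[col] = 0  # Error mode assigns size 0
        return sizes  # Skip mode keeps -1 here (assumption)

    first = candidates[0]
    for col in missing:
        vector = first[col]
        if vector is None:
            # Only reachable in Error mode: null in the first row.
            raise ValueError(f"null value in first row of column {col!r}")
        sizes[col] = len(vector)
    return sizes
```

With metadata `{"a": 2, "b": -1}` and a first row whose `b` is null, `"error"` raises early, `"skip"` infers the size of `b` from the first fully non-null row, and `"keep"` raises without looking at the data at all.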
> VectorAssembler should have handleInvalid to handle columns with null values
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-23690
>                 URL: https://issues.apache.org/jira/browse/SPARK-23690
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: yogesh garg
>            Priority: Major
>
> VectorAssembler only takes in numeric (and vectors (of numeric?)) columns as
> an input and returns the assembled vector. It currently throws an error if it
> sees a null value in any column. This behavior also affects `RFormula` that
> uses VectorAssembler to assemble numeric columns.