Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20829#discussion_r177501836
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
    @@ -49,32 +53,57 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       @Since("1.4.0")
       def setOutputCol(value: String): this.type = set(outputCol, value)
     
    +  /** @group setParam */
    +  @Since("2.4.0")
    +  def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
    +
    +  /**
    +   * Param for how to handle invalid data (NULL values). Options are 'skip' (filter out rows with
    --- End diff --
    
    I'd recommend we deal with NaNs now. This PR is already handling some NaN cases: Dataset.na.drop handles NaNs in NumericType columns (but not in VectorUDT columns).
    
    I'm OK with postponing handling of incorrect vector lengths until later, or doing it now, since that work will be more separate.
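    
    For illustration, a minimal sketch of the na.drop behavior described above (the SparkSession setup, data, and column names are placeholders, not taken from the PR): NaNs in a NumericType column cause the row to be dropped, while NaNs inside a VectorUDT column do not.
    
        import org.apache.spark.ml.linalg.Vectors
        import org.apache.spark.sql.SparkSession
        
        val spark = SparkSession.builder().master("local[*]").appName("NaDropSketch").getOrCreate()
        
        val df = spark.createDataFrame(Seq(
          (1.0, Vectors.dense(1.0, 2.0)),
          (Double.NaN, Vectors.dense(3.0, 4.0)),   // NaN in a DoubleType column
          (5.0, Vectors.dense(Double.NaN, 6.0))    // NaN inside the Vector
        )).toDF("num", "vec")
        
        // Drops only the second row; the third row survives because na.drop
        // does not look inside the VectorUDT column.
        df.na.drop().show(false)
        
        spark.stop()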


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
