[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

jkbradley Fri, 21 Oct 2016 16:57:06 -0700

GitHub user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15428#discussion_r84561834
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
    @@ -73,15 +74,51 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
       @Since("1.4.0")
       def setOutputCol(value: String): this.type = set(outputCol, value)
     
    +  /**
    +   * Param for how to handle NaN entries. Options are skip (which will filter out rows with
    +   * NaN values), or error (which will throw an error), or keep (which will make the NaN
    +   * values an extra bucket). More options may be added later.
    +   *
    +   * @group param
    +   */
    +  @Since("2.1.0")
    +  val handleNaN: Param[String] = new Param[String](this, "handleNaNs", "how to handle NaN" +
    +    "entries. Options are skip (which will filter out rows with NaN values), or error" +
    +    "(which will throw an error), or keep (which will make the NaN values an extra bucket)." +
    +    "More options may be added later", ParamValidators.inArray(Array("skip", "error", "keep")))
    +
    +  /** @group getParam */
    +  @Since("2.1.0")
    +  def getHandleNaN: Option[Boolean] = $(handleNaN) match {
    +    case "keep" => Some(true)
    +    case "skip" => Some(false)
    +    case _ => None
    +  }
    +
    +  /** @group setParam */
    +  @Since("2.1.0")
    +  def setHandleNaN(value: String): this.type = set(handleNaN, value)
    +  setDefault(handleNaN, "error")
    +
       @Since("2.0.0")
       override def transform(dataset: Dataset[_]): DataFrame = {
         transformSchema(dataset.schema)
    -    val bucketizer = udf { feature: Double =>
    -      Bucketizer.binarySearchForBuckets($(splits), feature)
    +
    +    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
    --- End diff --
    
    Ah, sorry, one more comment.  I'm not quite sure how closure capture behaves currently, but it might be good to define local vals for ```$(splits)``` and ```getHandleNaN.isDefined && getHandleNaN.get```.  Since both expressions call methods on the Bucketizer instance, I believe the UDF may capture the whole Bucketizer instance instead of just those values.
    
    After you define them as local vals here, you can use those vals inside this UDF.
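
To illustrate the concern, here is a minimal sketch in plain Scala with no Spark dependency. `Holder`, `getSplits`, `capturingThis`, `capturingLocal`, and `serializes` are hypothetical names made up for this example; they are not part of Bucketizer or Spark. `Holder` is deliberately not `Serializable` so that a captured `this` shows up as a Java serialization failure. Bucketizer itself is serializable, so in the real case the likely cost would be shipping the whole object with the UDF rather than an exception. The behaviour described assumes Scala 2.12 or later, where function literals are serializable lambdas.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a class like Bucketizer.
// Deliberately NOT Serializable, so capturing `this` becomes observable.
class Holder(splits: Array[Double]) {
  def getSplits: Array[Double] = splits

  // The closure body calls an instance method, so the closure keeps a reference to `this`.
  def capturingThis: Double => Int =
    (x: Double) => getSplits.indexWhere(x < _)

  // The value is copied into a local val first; the closure references only that array.
  def capturingLocal: Double => Int = {
    val localSplits = getSplits
    (x: Double) => localSplits.indexWhere(x < _)
  }
}

object ClosureCaptureDemo {
  // Returns true if Java serialization of `f` succeeds, false if it hits a
  // non-serializable object in the captured state.
  def serializes(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val h = new Holder(Array(0.0, 1.0, 2.0))
    println(s"closure calling a method on this serializes: ${serializes(h.capturingThis)}")
    println(s"closure using a local val serializes:        ${serializes(h.capturingLocal)}")
  }
}
```

Under those assumptions, the first closure should fail to serialize because it drags the enclosing `Holder` along, while the second should succeed because it holds only the array. The same pattern applies to the UDF here: evaluate the two expressions once into local vals before the `udf { ... }` block and reference only those vals inside it.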


