[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user crackcell commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r162935486

--- Diff: docs/ml-guide.md ---
@@ -122,6 +122,8 @@ There are no deprecations.
 * [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027): We are now setting the default parallelism used in `OneVsRest` to be 1 (i.e. serial), in 2.2 and earlier version, the `OneVsRest` parallelism would be parallelism of the default threadpool in scala.
+* [SPARK-19781](https://issues.apache.org/jira/browse/SPARK-19781):
+  `Bucketizer` handles NULL values the same way as NaN when handleInvalid is skip or keep.
--- End diff --

Yep, you are right. :-p

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r162910933

--- Diff: docs/ml-guide.md ---
@@ -122,6 +122,8 @@ There are no deprecations.
 * [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027): We are now setting the default parallelism used in `OneVsRest` to be 1 (i.e. serial), in 2.2 and earlier version, the `OneVsRest` parallelism would be parallelism of the default threadpool in scala.
+* [SPARK-19781](https://issues.apache.org/jira/browse/SPARK-19781):
+  `Bucketizer` handles NULL values the same way as NaN when handleInvalid is skip or keep.
--- End diff --

Hmm, I think for skip, `dataset.na.drop` drops NULL. We didn't change its behavior.
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user crackcell commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r162885968

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -53,7 +53,8 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
  * Values at -inf, inf must be explicitly provided to cover all Double values;
  * otherwise, values outside the splits specified will be treated as errors.
  *
- * See also [[handleInvalid]], which can optionally create an additional bucket for NaN values.
+ * See also [[handleInvalid]], which can optionally create an additional bucket for NaN/NULL
--- End diff --

@viirya done.
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r162874801

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -53,7 +53,8 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
  * Values at -inf, inf must be explicitly provided to cover all Double values;
  * otherwise, values outside the splits specified will be treated as errors.
  *
- * See also [[handleInvalid]], which can optionally create an additional bucket for NaN values.
+ * See also [[handleInvalid]], which can optionally create an additional bucket for NaN/NULL
--- End diff --

This sounds like a behavior change; we should add an item to the migration guide in the ML docs.
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r162703711

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -171,23 +176,23 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
    * Binary searching in several buckets to place each data point.
    * @param splits array of split points
    * @param feature data point
-   * @param keepInvalid NaN flag.
-   *                    Set "true" to make an extra bucket for NaN values;
-   *                    Set "false" to report an error for NaN values
+   * @param keepInvalid NaN/NULL flag.
+   *                    Set "true" to make an extra bucket for NaN/NULL values;
+   *                    Set "false" to report an error for NaN/NULL values
    * @return bucket for each data point
    * @throws SparkException if a feature is < splits.head or > splits.last
    */
  private[feature] def binarySearchForBuckets(
      splits: Array[Double],
-     feature: Double,
+     feature: java.lang.Double,
--- End diff --

Also change to `Option[Double]` here.
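To make the `Option[Double]` suggestion concrete, here is a minimal standalone sketch of what a `binarySearchForBuckets` taking an `Option[Double]` could look like. `BucketizerSketch` is hypothetical, not the actual Spark implementation, and it throws a plain `RuntimeException` where the real code throws `SparkException`:

```scala
object BucketizerSketch {
  // Hypothetical variant of Bucketizer.binarySearchForBuckets where the
  // feature is an Option[Double]: None (NULL) and NaN share the invalid path.
  def binarySearchForBuckets(
      splits: Array[Double],
      feature: Option[Double],
      keepInvalid: Boolean): Double = feature match {
    case fv if fv.getOrElse(Double.NaN).isNaN => // NULL or NaN
      if (keepInvalid) {
        splits.length - 1 // extra bucket for invalid values
      } else {
        // the real code throws SparkException here
        throw new RuntimeException("Bucketizer encountered NaN/NULL values.")
      }
    case Some(v) if v == splits.last =>
      splits.length - 2 // the last split is inclusive on the right
    case Some(v) =>
      val idx = java.util.Arrays.binarySearch(splits, v)
      // binarySearch returns -(insertionPoint) - 1 when the value is not an
      // exact split, so -idx - 2 recovers the bucket to the left
      if (idx >= 0) idx else -idx - 2
  }
}
```

With splits `[-inf, 0.0, 10.0, +inf]`, a value of `5.0` lands in bucket 1, while both `None` and `Some(Double.NaN)` land in the extra bucket (index 3) when `keepInvalid` is true.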
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r162703633

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
     transformSchema(dataset.schema)
     val (filteredDataset, keepInvalid) = {
       if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-        // "skip" NaN option is set, will filter out NaN values in the dataset
+        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
         (dataset.na.drop().toDF(), false)
       } else {
         (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
       }
     }
-    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
--- End diff --

As @cloud-fan suggested, `Option[Double]` is better. :-)
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user crackcell commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r106363022

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
     transformSchema(dataset.schema)
     val (filteredDataset, keepInvalid) = {
       if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-        // "skip" NaN option is set, will filter out NaN values in the dataset
+        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
         (dataset.na.drop().toDF(), false)
       } else {
         (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
       }
     }
-    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
+    val bucketizer: UserDefinedFunction = udf { (row: Row) =>
--- End diff --

Thanks for pointing out the performance problem. Maybe my original approach of using java.lang.Double instead of Scala's Double to hold NULLs would work better.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user imatiach-msft commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r106339981

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -171,34 +173,34 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
    * Binary searching in several buckets to place each data point.
    * @param splits array of split points
    * @param feature data point
-   * @param keepInvalid NaN flag.
-   *                    Set "true" to make an extra bucket for NaN values;
-   *                    Set "false" to report an error for NaN values
+   * @param keepInvalid NaN/NULL flag.
+   *                    Set "true" to make an extra bucket for NaN/NULL values;
+   *                    Set "false" to report an error for NaN/NULL values
    * @return bucket for each data point
    * @throws SparkException if a feature is < splits.head or > splits.last
    */
  private[feature] def binarySearchForBuckets(
      splits: Array[Double],
-     feature: Double,
+     feature: Option[Double],
      keepInvalid: Boolean): Double = {
-    if (feature.isNaN) {
+    if (feature.getOrElse(Double.NaN).isNaN) {
--- End diff --

I think you can equivalently write this as:

    if (feature.isEmpty) {
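A quick standalone check of the two predicates discussed above shows they agree for `None`, but only the `getOrElse` form also flags a value that is present and NaN, which is exactly the case this patch needs to catch:

```scala
// Compare the two null/NaN checks from the review thread.
val none: Option[Double] = None
val nan: Option[Double]  = Some(Double.NaN)
val ok: Option[Double]   = Some(1.5)

// Both forms treat a missing value as invalid.
assert(none.getOrElse(Double.NaN).isNaN && none.isEmpty)
// Only getOrElse(Double.NaN).isNaN also treats a present NaN as invalid;
// feature.isEmpty alone would let it through.
assert(nan.getOrElse(Double.NaN).isNaN && !nan.isEmpty)
// An ordinary value passes both checks.
assert(!ok.getOrElse(Double.NaN).isNaN && !ok.isEmpty)
```

So `feature.isEmpty` is equivalent only for the NULL case; the NaN case still needs its own check.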
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user imatiach-msft commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r106339731

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
     transformSchema(dataset.schema)
     val (filteredDataset, keepInvalid) = {
       if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-        // "skip" NaN option is set, will filter out NaN values in the dataset
+        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
         (dataset.na.drop().toDF(), false)
       } else {
         (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
       }
     }
-    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
+    val bucketizer: UserDefinedFunction = udf { (row: Row) =>
--- End diff --

I believe you should try to avoid using a udf on a `Row`, because the serialization costs will be more expensive. Hmm, how could we make this perform well and still handle nulls? Does it work with `Option[Double]` instead of `Row`?
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user crackcell commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r104572696

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
     transformSchema(dataset.schema)
     val (filteredDataset, keepInvalid) = {
       if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-        // "skip" NaN option is set, will filter out NaN values in the dataset
+        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
         (dataset.na.drop().toDF(), false)
       } else {
         (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
       }
     }
-    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
--- End diff --

Thanks a lot. `Option[Double]` is much better. :-)
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r104524886

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
     transformSchema(dataset.schema)
     val (filteredDataset, keepInvalid) = {
       if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-        // "skip" NaN option is set, will filter out NaN values in the dataset
+        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
         (dataset.na.drop().toDF(), false)
       } else {
         (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
       }
     }
-    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
--- End diff --

See the documentation of `ScalaUDF`; if you don't like mixing Java and Scala types, you can use `Option[Double]`.
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user crackcell commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r104434899

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
     transformSchema(dataset.schema)
     val (filteredDataset, keepInvalid) = {
       if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-        // "skip" NaN option is set, will filter out NaN values in the dataset
+        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
         (dataset.na.drop().toDF(), false)
       } else {
         (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
       }
     }
-    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
--- End diff --

Using both Java and Scala types seems less graceful. Instead, would it be better to pass a `Row` to `bucketizer()` and then check NULLs with `isNullAt()` and `getDouble()`?
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r104362536

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
     transformSchema(dataset.schema)
     val (filteredDataset, keepInvalid) = {
       if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-        // "skip" NaN option is set, will filter out NaN values in the dataset
+        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
         (dataset.na.drop().toDF(), false)
       } else {
         (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
       }
     }
-    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
--- End diff --

Actually, can we just use `java.lang.Double` as the type for `feature`? Then we don't need to change https://github.com/apache/spark/pull/17123/files#diff-37f2c93b88c73b91cdc9e40fc8c45fc5R121
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r104224870

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -105,20 +106,24 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
     transformSchema(dataset.schema)
     val (filteredDataset, keepInvalid) = {
       if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-        // "skip" NaN option is set, will filter out NaN values in the dataset
+        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
         (dataset.na.drop().toDF(), false)
       } else {
         (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
       }
     }
-    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
-      Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
+    val bucketizer: UserDefinedFunction = udf { (row: Row) =>
+      Bucketizer.binarySearchForBuckets(
+        $(splits),
+        row.getAs[java.lang.Double]($(inputCol)),
--- End diff --

Ideally we should use `row.getDouble(index)` and `row.isNullAt(index)` together to get values for primitive types, but technically `Row` is just an `Array[Object]`, so there is no performance penalty in using `java.lang.Double`. (This may change in the future; if possible we should prefer `isNullAt` and `getDouble`.)
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user crackcell commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r103955065

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -171,23 +176,23 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
    * Binary searching in several buckets to place each data point.
    * @param splits array of split points
    * @param feature data point
-   * @param keepInvalid NaN flag.
-   *                    Set "true" to make an extra bucket for NaN values;
-   *                    Set "false" to report an error for NaN values
+   * @param keepInvalid NaN/NULL flag.
+   *                    Set "true" to make an extra bucket for NaN/NULL values;
+   *                    Set "false" to report an error for NaN/NULL values
    * @return bucket for each data point
    * @throws SparkException if a feature is < splits.head or > splits.last
    */
  private[feature] def binarySearchForBuckets(
      splits: Array[Double],
-     feature: Double,
+     feature: java.lang.Double,
      keepInvalid: Boolean): Double = {
-    if (feature.isNaN) {
+    if (feature == null || feature.isNaN) {
      if (keepInvalid) {
        splits.length - 1
      } else {
-        throw new SparkException("Bucketizer encountered NaN value. To handle or skip NaNs," +
-          " try setting Bucketizer.handleInvalid.")
+        throw new SparkException("Bucketizer encountered NaN/NULL values. " +
+          "To handle or skip NaNs/NULLs, try setting Bucketizer.handleInvalid.")
      }
    } else if (feature == splits.last) {
--- End diff --

My fault! I'll do it now!
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user crackcell commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r103954857

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -105,20 +106,24 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
     transformSchema(dataset.schema)
     val (filteredDataset, keepInvalid) = {
       if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-        // "skip" NaN option is set, will filter out NaN values in the dataset
+        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
         (dataset.na.drop().toDF(), false)
       } else {
         (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
       }
     }
-    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
-      Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
+    val bucketizer: UserDefinedFunction = udf { (row: Row) =>
+      Bucketizer.binarySearchForBuckets(
+        $(splits),
+        row.getAs[java.lang.Double]($(inputCol)),
--- End diff --

Hi, Scala's Double will convert null to zero. Say:

    scala> val a: Double = null.asInstanceOf[Double]
    a: Double = 0.0

So I use Java's Double instead to hold NULLs. It feels a little ugly; is there a better way?
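The observation above can be reproduced in plain Scala: unboxing `null` into the primitive `scala.Double` silently yields `0.0`, while `java.lang.Double` can actually hold the null.

```scala
// Casting null to the primitive Double silently unboxes it to 0.0,
// so the NULL information is lost.
val primitive: Double = null.asInstanceOf[Double]
assert(primitive == 0.0)

// The boxed java.lang.Double preserves the missing value.
val boxed: java.lang.Double = null
assert(boxed == null)
```

This is why a UDF over the primitive `Double` cannot see NULL inputs at all; it needs either a boxed type or `Option[Double]`.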
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user imatiach-msft commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r103949549

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -171,23 +176,23 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
    * Binary searching in several buckets to place each data point.
    * @param splits array of split points
    * @param feature data point
-   * @param keepInvalid NaN flag.
-   *                    Set "true" to make an extra bucket for NaN values;
-   *                    Set "false" to report an error for NaN values
+   * @param keepInvalid NaN/NULL flag.
+   *                    Set "true" to make an extra bucket for NaN/NULL values;
+   *                    Set "false" to report an error for NaN/NULL values
    * @return bucket for each data point
    * @throws SparkException if a feature is < splits.head or > splits.last
    */
  private[feature] def binarySearchForBuckets(
      splits: Array[Double],
-     feature: Double,
+     feature: java.lang.Double,
--- End diff --

`Double` here as well.
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user imatiach-msft commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r103949274

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -105,20 +106,24 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
     transformSchema(dataset.schema)
     val (filteredDataset, keepInvalid) = {
       if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-        // "skip" NaN option is set, will filter out NaN values in the dataset
+        // "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset
         (dataset.na.drop().toDF(), false)
       } else {
         (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
       }
     }
-    val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
-      Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
+    val bucketizer: UserDefinedFunction = udf { (row: Row) =>
+      Bucketizer.binarySearchForBuckets(
+        $(splits),
+        row.getAs[java.lang.Double]($(inputCol)),
--- End diff --

Can you use `Double` instead of `java.lang.Double`? It should be the Scala `Double` type.
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user imatiach-msft commented on a diff in the pull request:
https://github.com/apache/spark/pull/17123#discussion_r103949478

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala ---
@@ -171,23 +176,23 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] {
    * Binary searching in several buckets to place each data point.
    * @param splits array of split points
    * @param feature data point
-   * @param keepInvalid NaN flag.
-   *                    Set "true" to make an extra bucket for NaN values;
-   *                    Set "false" to report an error for NaN values
+   * @param keepInvalid NaN/NULL flag.
+   *                    Set "true" to make an extra bucket for NaN/NULL values;
+   *                    Set "false" to report an error for NaN/NULL values
    * @return bucket for each data point
    * @throws SparkException if a feature is < splits.head or > splits.last
    */
  private[feature] def binarySearchForBuckets(
      splits: Array[Double],
-     feature: Double,
+     feature: java.lang.Double,
      keepInvalid: Boolean): Double = {
-    if (feature.isNaN) {
+    if (feature == null || feature.isNaN) {
      if (keepInvalid) {
        splits.length - 1
      } else {
-        throw new SparkException("Bucketizer encountered NaN value. To handle or skip NaNs," +
-          " try setting Bucketizer.handleInvalid.")
+        throw new SparkException("Bucketizer encountered NaN/NULL values. " +
+          "To handle or skip NaNs/NULLs, try setting Bucketizer.handleInvalid.")
      }
    } else if (feature == splits.last) {
--- End diff --

Could you please add some tests to validate that NULL values, in addition to NaN values, can now be handled by the bucketizer?
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
GitHub user crackcell opened a pull request:

    https://github.com/apache/spark/pull/17123

    [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucketizer when handleInvalid is on

## What changes were proposed in this pull request?

The original Bucketizer can put NaNs into a special bucket when handleInvalid is on, but leaves NULLs untouched. This PR unifies the behaviours of processing NULLs and NaNs.

BTW, this is my first commit to Spark code. I'm not sure whether my code or the way of doing things is appropriate. Please point it out if I'm doing anything wrong. :-)

## How was this patch tested?

Manual tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/crackcell/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17123.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17123

----
commit 2b0751428cad5280df47f96412608967b71a7360
Author: Menglong TAN
Date:   2017-03-01T12:37:33Z

    add support for null values in Bucketizer

commit b3f98b66e63c9c61c69a1429819feb236fad56c7
Author: Menglong TAN
Date:   2017-03-01T15:10:05Z

    fix a typo