[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user crackcell commented on a diff in the pull request: https://github.com/apache/spark/pull/17123#discussion_r162935486 --- Diff: docs/ml-guide.md --- @@ -122,6 +122,8 @@ There are no deprecations. * [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027): We are now setting the default parallelism used in `OneVsRest` to be 1 (i.e. serial), in 2.2 and earlier version, the `OneVsRest` parallelism would be parallelism of the default threadpool in scala. +* [SPARK-19781](https://issues.apache.org/jira/browse/SPARK-19781): + `Bucketizer` handles NULL values the same way as NaN when handleInvalid is skip or keep. --- End diff -- Yep, you are right. :-p --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
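The migration note quoted above can be illustrated without Spark. The sketch below mimics the documented semantics on a plain Scala collection; the splits, bucket edges, and names here are illustrative, not the `Bucketizer` API.

```scala
// Plain-Scala sketch of the documented semantics: with handleInvalid = "skip"
// the NULL/NaN values are dropped; with "keep" they all land in one extra
// bucket whose index equals the number of real buckets.
val splits = Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)
val col: Seq[Option[Double]] = Seq(Some(-1.0), Some(2.0), None, Some(Double.NaN))

// Two real buckets: (-inf, 0.0) -> 0.0 and [0.0, +inf) -> 1.0
def bucket(v: Double): Double = if (v < 0.0) 0.0 else 1.0

val skipped = col.flatten.filterNot(_.isNaN).map(bucket)      // "skip"
val kept = col.map {                                          // "keep"
  case Some(v) if !v.isNaN => bucket(v)
  case _                   => (splits.length - 1).toDouble    // extra bucket
}
```

With the sample column above, "skip" yields two bucketed rows while "keep" yields four, with NULL and NaN both mapped to the extra bucket index 2.0.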
[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...
Github user crackcell commented on the issue: https://github.com/apache/spark/pull/17123 @WeichenXu123 I have finished my work, please review it. Any suggestion is welcome. :-)
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user crackcell commented on a diff in the pull request: https://github.com/apache/spark/pull/17123#discussion_r162885968 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -53,7 +53,8 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String * Values at -inf, inf must be explicitly provided to cover all Double values; * otherwise, values outside the splits specified will be treated as errors. * - * See also [[handleInvalid]], which can optionally create an additional bucket for NaN values. + * See also [[handleInvalid]], which can optionally create an additional bucket for NaN/NULL --- End diff -- @viirya done.
[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...
Github user crackcell commented on the issue: https://github.com/apache/spark/pull/17123 @WeichenXu123 sorry to have missed the message for two days; I'm working on it.
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user crackcell commented on a diff in the pull request: https://github.com/apache/spark/pull/17123#discussion_r106363022 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String transformSchema(dataset.schema) val (filteredDataset, keepInvalid) = { if (getHandleInvalid == Bucketizer.SKIP_INVALID) { -// "skip" NaN option is set, will filter out NaN values in the dataset +// "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset (dataset.na.drop().toDF(), false) } else { (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID) } } -val bucketizer: UserDefinedFunction = udf { (feature: Double) => +val bucketizer: UserDefinedFunction = udf { (row: Row) => --- End diff -- Thanks for pointing out the performance problem. Perhaps my original code would work better using java.lang.Double instead of Scala's Double to hold NULLs. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark issue #17233: [SPARK-11569][ML] Fix StringIndexer to handle null value...
Github user crackcell commented on the issue: https://github.com/apache/spark/pull/17233 @jkbradley Hi, I have made some updates according to your comments, please review it again. :-)
[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...
Github user crackcell commented on a diff in the pull request: https://github.com/apache/spark/pull/17233#discussion_r105820314 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala --- @@ -188,35 +189,45 @@ class StringIndexerModel ( transformSchema(dataset.schema, logging = true) val filteredLabels = getHandleInvalid match { - case StringIndexer.KEEP_UNSEEN_LABEL => labels :+ "__unknown" + case StringIndexer.KEEP_INVALID => labels :+ "__unknown" case _ => labels } val metadata = NominalAttribute.defaultAttr .withName($(outputCol)).withValues(filteredLabels).toMetadata() // If we are skipping invalid records, filter them out. val (filteredDataset, keepInvalid) = getHandleInvalid match { - case StringIndexer.SKIP_UNSEEN_LABEL => + case StringIndexer.SKIP_INVALID => val filterer = udf { label: String => labelToIndex.contains(label) } -(dataset.where(filterer(dataset($(inputCol, false) - case _ => (dataset, getHandleInvalid == StringIndexer.KEEP_UNSEEN_LABEL) + (dataset.na.drop(Array($(inputCol))).where(filterer(dataset($(inputCol, false) + case _ => (dataset, getHandleInvalid == StringIndexer.KEEP_INVALID) } -val indexer = udf { label: String => - if (labelToIndex.contains(label)) { -labelToIndex(label) - } else if (keepInvalid) { -labels.length +val indexer = udf { row: Row => --- End diff -- got it
[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...
Github user crackcell commented on a diff in the pull request: https://github.com/apache/spark/pull/17233#discussion_r105820279 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -122,6 +122,86 @@ class StringIndexerSuite assert(output === expected) } + test("StringIndexer with a string input column with NULLs") { +val data: Seq[java.lang.String] = Seq("a", "b", "b", null) +val data2: Seq[java.lang.String] = Seq("a", "b", null) +val expectedSkip = Array(1.0, 0.0) +val expectedKeep = Array(1.0, 0.0, 2.0) +val df = data.toDF("label") +val df2 = data2.toDF("label") + +val indexer = new StringIndexer() + .setInputCol("label") + .setOutputCol("labelIndex") + +withClue("StringIndexer should throw error when setHandleValid=error when given NULL values") { + intercept[SparkException] { +indexer.setHandleInvalid("error") +indexer.fit(df).transform(df2).collect() + } +} + +indexer.setHandleInvalid("skip") +val transformedSkip = indexer.fit(df).transform(df2) +val attrSkip = Attribute + .fromStructField(transformedSkip.schema("labelIndex")) + .asInstanceOf[NominalAttribute] +assert(attrSkip.values.get === Array("b", "a")) +assert(transformedSkip.select("labelIndex").rdd.map { r => + r.getDouble(0) +}.collect() === expectedSkip) + +indexer.setHandleInvalid("keep") +val transformedKeep = indexer.fit(df).transform(df2) +val attrKeep = Attribute + .fromStructField(transformedKeep.schema("labelIndex")) + .asInstanceOf[NominalAttribute] +assert(attrKeep.values.get === Array("b", "a", "__unknown")) +assert(transformedKeep.select("labelIndex").rdd.map { r => + r.getDouble(0) +}.collect() === expectedKeep) + } + + test("StringIndexer with a numeric input column with NULLs") { --- End diff -- OK, I'll remove the numeric test.
[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...
Github user crackcell commented on a diff in the pull request: https://github.com/apache/spark/pull/17233#discussion_r105820283 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -122,6 +122,86 @@ class StringIndexerSuite assert(output === expected) } + test("StringIndexer with a string input column with NULLs") { +val data: Seq[java.lang.String] = Seq("a", "b", "b", null) +val data2: Seq[java.lang.String] = Seq("a", "b", null) +val expectedSkip = Array(1.0, 0.0) +val expectedKeep = Array(1.0, 0.0, 2.0) +val df = data.toDF("label") +val df2 = data2.toDF("label") + +val indexer = new StringIndexer() + .setInputCol("label") + .setOutputCol("labelIndex") + +withClue("StringIndexer should throw error when setHandleValid=error when given NULL values") { + intercept[SparkException] { +indexer.setHandleInvalid("error") +indexer.fit(df).transform(df2).collect() + } +} + +indexer.setHandleInvalid("skip") +val transformedSkip = indexer.fit(df).transform(df2) +val attrSkip = Attribute + .fromStructField(transformedSkip.schema("labelIndex")) + .asInstanceOf[NominalAttribute] +assert(attrSkip.values.get === Array("b", "a")) +assert(transformedSkip.select("labelIndex").rdd.map { r => + r.getDouble(0) +}.collect() === expectedSkip) --- End diff -- roger
[GitHub] spark issue #17233: [SPARK-11569][ML] Fix StringIndexer to handle null value...
Github user crackcell commented on the issue: https://github.com/apache/spark/pull/17233 cc @srowen @cloud-fan @MLnick
[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...
Github user crackcell commented on the issue: https://github.com/apache/spark/pull/17123 @cloud-fan Would you please review my code again? I'm now using `Option` to handle NULLs. :-)
[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...
GitHub user crackcell opened a pull request: https://github.com/apache/spark/pull/17233 [SPARK-11569][ML] Fix StringIndexer to handle null value properly ## What changes were proposed in this pull request? This PR enhances StringIndexer with NULL value handling. Before the PR, StringIndexer would throw an exception when it encountered NULL values. With this PR: - handleInvalid=error: Throw an exception as before - handleInvalid=skip: Skip null values as well as unseen labels - handleInvalid=keep: Give null values an additional index as well as unseen labels BTW, I noticed someone was trying to solve the same problem ( #9920 ), but it seems to have made no progress or received no response for a long time. Would you mind giving me a chance to solve it? ## How was this patch tested? new unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/crackcell/spark 11569_StringIndexer_NULL Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17233.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17233 commit 75e3975597aa6271f4f8ab688922edda88b03045 Author: Menglong TAN Date: 2017-03-08T03:50:17Z Merge pull request #1 from apache/master merge master to my repo commit 79d706085e8371fb1724ce73377767c38d551e5d Author: Menglong TAN Date: 2017-03-10T04:45:56Z Enhance StringIndexer with NULL values commit 0cb121c65f592b9623bdeef2746d7c2a3c281ae1 Author: Menglong TAN Date: 2017-03-10T04:52:30Z filter out NULLs when transform dataset
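The three handleInvalid modes described in the PR above can be sketched without Spark, treating NULL like an unseen label. The names `labelToIndex` and `index` below are illustrative, not the `StringIndexerModel` API.

```scala
// Labels as fitted on Seq("a", "b", "b", null): ordered by descending frequency.
val labels = Seq("b", "a")
val labelToIndex: Map[String, Int] = labels.zipWithIndex.toMap

// Returns Some(index) for a row to keep, or None for a row dropped under "skip".
def index(label: String, handleInvalid: String): Option[Double] =
  (Option(label).flatMap(labelToIndex.get), handleInvalid) match {
    case (Some(i), _)   => Some(i.toDouble)               // seen label
    case (None, "skip") => None                           // drop NULL/unseen
    case (None, "keep") => Some(labels.length.toDouble)   // "__unknown" bucket
    case (None, _)      => throw new RuntimeException(
      s"StringIndexer encountered NULL or unseen label: $label")
  }
```

`Option(label)` turns a NULL input into `None` before the map lookup, so NULL and unseen labels naturally share the skip/keep/error paths.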
[GitHub] spark issue #16883: [SPARK-17498][ML] StringIndexer enhancement for handling...
Github user crackcell commented on the issue: https://github.com/apache/spark/pull/16883 Nice work! I'm just planning to improve `StringIndexer` exactly the same way as yours. Now I can have a rest. :-)
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user crackcell commented on a diff in the pull request: https://github.com/apache/spark/pull/17123#discussion_r104572696 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String transformSchema(dataset.schema) val (filteredDataset, keepInvalid) = { if (getHandleInvalid == Bucketizer.SKIP_INVALID) { -// "skip" NaN option is set, will filter out NaN values in the dataset +// "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset (dataset.na.drop().toDF(), false) } else { (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID) } } -val bucketizer: UserDefinedFunction = udf { (feature: Double) => --- End diff -- Thanks a lot. `Option[Double]` is much better. :-)
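For reference, a Spark-free sketch of the core of the `Option[Double]` approach adopted here: when a Scala UDF parameter is declared as `Option[Double]`, Spark passes `None` for a SQL NULL, so NULL and NaN can share one invalid-value path. `bucketIndex` is an illustrative name, not the PR's final API, and the scan below is simplified.

```scala
val splits = Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity)

// None models SQL NULL; NaN and None share the invalid path.
def bucketIndex(feature: Option[Double], keepInvalid: Boolean): Double =
  feature match {
    case Some(f) if !f.isNaN =>
      // simplified linear scan; the real Bucketizer binary-searches the splits
      math.min(splits.lastIndexWhere(_ <= f), splits.length - 2).toDouble
    case _ if keepInvalid =>
      (splits.length - 1).toDouble              // extra NaN/NULL bucket
    case _ =>
      throw new RuntimeException("Bucketizer encountered NaN/NULL values. " +
        "To handle or skip them, try setting Bucketizer.handleInvalid.")
  }
```

With four splits there are three real buckets (indices 0 to 2), and NaN/NULL values go to the extra bucket at index 3 when keepInvalid is set.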
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user crackcell commented on a diff in the pull request: https://github.com/apache/spark/pull/17123#discussion_r104434899 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String transformSchema(dataset.schema) val (filteredDataset, keepInvalid) = { if (getHandleInvalid == Bucketizer.SKIP_INVALID) { -// "skip" NaN option is set, will filter out NaN values in the dataset +// "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset (dataset.na.drop().toDF(), false) } else { (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID) } } -val bucketizer: UserDefinedFunction = udf { (feature: Double) => --- End diff -- Using both Java and Scala types seems less graceful. Instead, would it be better to pass a `Row` to `bucketizer()` and then check NULLs with `isNullAt()` and `getDouble()`?
[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...
Github user crackcell commented on the issue: https://github.com/apache/spark/pull/17123 @imatiach-msft @cloud-fan I updated the code, replacing java.lang.Double with isNullAt() and getDouble().
[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...
Github user crackcell commented on the issue: https://github.com/apache/spark/pull/17123 @srowen @cloud-fan Please review my code. Thanks. :-)
[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...
Github user crackcell commented on the issue: https://github.com/apache/spark/pull/17123 @imatiach-msft Hi, Ilya. I have added two tests based on the original tests for NaN data. Please review my code again. Thanks for your time. :-)
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user crackcell commented on a diff in the pull request: https://github.com/apache/spark/pull/17123#discussion_r103955065 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -171,23 +176,23 @@ object Bucketizer extends DefaultParamsReadable[Bucketizer] { * Binary searching in several buckets to place each data point. * @param splits array of split points * @param feature data point - * @param keepInvalid NaN flag. - *Set "true" to make an extra bucket for NaN values; - *Set "false" to report an error for NaN values + * @param keepInvalid NaN/NULL flag. + *Set "true" to make an extra bucket for NaN/NULL values; + *Set "false" to report an error for NaN/NULL values * @return bucket for each data point * @throws SparkException if a feature is < splits.head or > splits.last */ private[feature] def binarySearchForBuckets( splits: Array[Double], - feature: Double, + feature: java.lang.Double, keepInvalid: Boolean): Double = { -if (feature.isNaN) { +if (feature == null || feature.isNaN) { if (keepInvalid) { splits.length - 1 } else { -throw new SparkException("Bucketizer encountered NaN value. To handle or skip NaNs," + - " try setting Bucketizer.handleInvalid.") +throw new SparkException("Bucketizer encountered NaN/NULL values. " + + "To handle or skip NaNs/NULLs, try setting Bucketizer.handleInvalid.") } } else if (feature == splits.last) { --- End diff -- My fault! I'll do it now!
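A self-contained sketch of the diffed method above, using `java.util.Arrays.binarySearch` in place of Spark's internal binary-search helper and a plain `RuntimeException` in place of `SparkException`; it omits the out-of-range check the real method performs.

```scala
import java.util.Arrays

// A boxed java.lang.Double lets the null test run before any unboxing;
// keepInvalid routes both NaN and NULL to the extra last bucket.
def binarySearchForBuckets(
    splits: Array[Double],
    feature: java.lang.Double,
    keepInvalid: Boolean): Double = {
  if (feature == null || feature.isNaN) {
    if (keepInvalid) {
      splits.length - 1                 // extra bucket for NaN/NULL
    } else {
      throw new RuntimeException("Bucketizer encountered NaN/NULL values. " +
        "To handle or skip NaNs/NULLs, try setting Bucketizer.handleInvalid.")
    }
  } else if (feature.doubleValue == splits.last) {
    splits.length - 2                   // last bucket is closed on the right
  } else {
    val idx = Arrays.binarySearch(splits, feature.doubleValue)
    if (idx >= 0) idx else -idx - 2     // insertion point -> bucket index
  }
}
```

`Arrays.binarySearch` returns `-(insertionPoint) - 1` for a missing key, so `-idx - 2` recovers the index of the bucket whose left edge precedes the feature.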
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
Github user crackcell commented on a diff in the pull request: https://github.com/apache/spark/pull/17123#discussion_r103954857 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -105,20 +106,24 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String transformSchema(dataset.schema) val (filteredDataset, keepInvalid) = { if (getHandleInvalid == Bucketizer.SKIP_INVALID) { -// "skip" NaN option is set, will filter out NaN values in the dataset +// "skip" NaN/NULL option is set, will filter out NaN/NULL values in the dataset (dataset.na.drop().toDF(), false) } else { (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID) } } -val bucketizer: UserDefinedFunction = udf { (feature: Double) => - Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid) +val bucketizer: UserDefinedFunction = udf { (row: Row) => + Bucketizer.binarySearchForBuckets( +$(splits), +row.getAs[java.lang.Double]($(inputCol)), --- End diff -- Hi, Scala's Double will convert null to zero. Say: > scala> val a: Double = null.asInstanceOf[Double] > a: Double = 0.0 So I use Java's Double instead to hold NULLs. It feels a little ugly; is there a better way?
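The behaviour quoted above is easy to reproduce. A minimal demonstration of why primitive `scala.Double` cannot carry a NULL, and of the two alternatives discussed in this thread:

```scala
// Unboxing null into a primitive scala.Double silently yields 0.0,
// which would make a NULL feature indistinguishable from a real zero.
val unboxed: Double = null.asInstanceOf[Double]

// java.lang.Double is a reference type, so it can actually hold null...
val boxed: java.lang.Double = null

// ...and Option[Double] models the absence explicitly, which is the
// approach this PR eventually settled on.
val absent: Option[Double]  = Option(boxed).map(_.doubleValue)
val present: Option[Double] = Option(java.lang.Double.valueOf(1.5)).map(_.doubleValue)
```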
[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...
Github user crackcell commented on the issue: https://github.com/apache/spark/pull/17123 Fixed style errors during the unit tests.
[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...
GitHub user crackcell opened a pull request: https://github.com/apache/spark/pull/17123 [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucketizer when handleInvalid is on ## What changes were proposed in this pull request? The original Bucketizer can put NaNs into a special bucket when handleInvalid is on, but leaves NULLs untouched. This PR unifies the handling of NULLs and NaNs. BTW, this is my first commit to Spark code. I'm not sure whether my code or the way of doing things is appropriate. Please point it out if I'm doing anything wrong. :-) ## How was this patch tested? manual tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/crackcell/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17123.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17123 commit 2b0751428cad5280df47f96412608967b71a7360 Author: Menglong TAN Date: 2017-03-01T12:37:33Z add support for null values in Bucketizer commit b3f98b66e63c9c61c69a1429819feb236fad56c7 Author: Menglong TAN Date: 2017-03-01T15:10:05Z fix a typo