[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2018-01-22 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r162935486
  
--- Diff: docs/ml-guide.md ---
@@ -122,6 +122,8 @@ There are no deprecations.
 * [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027):
  We are now setting the default parallelism used in `OneVsRest` to be 1 
(i.e. serial), in 2.2 and earlier version,
  the `OneVsRest` parallelism would be parallelism of the default 
threadpool in scala.
+* [SPARK-19781](https://issues.apache.org/jira/browse/SPARK-19781):
+ `Bucketizer` handles NULL values the same way as NaN when handleInvalid 
is skip or keep.
--- End diff --

Yep, you are right. :-p
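The behaviour described in the doc note above can be sketched in plain Scala (a simplified, hypothetical model of the documented semantics, not Spark's actual DataFrame-based implementation; `isInvalid`, `skipInvalid`, and `keepInvalid` are illustrative names):

```scala
// Simplified model of the documented behaviour: an "invalid" value is
// either null or NaN, and "skip"/"keep" treat both identically.
def isInvalid(v: java.lang.Double): Boolean = v == null || v.isNaN

// "skip": invalid values are filtered out before bucketizing
def skipInvalid(values: Seq[java.lang.Double]): Seq[Double] =
  values.filterNot(isInvalid).map(_.doubleValue)

// "keep": invalid values all land in one extra bucket index
def keepInvalid(values: Seq[java.lang.Double], extraBucket: Double): Seq[Double] =
  values.map(v => if (isInvalid(v)) extraBucket else v.doubleValue)
```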


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

2018-01-22 Thread crackcell
Github user crackcell commented on the issue:

https://github.com/apache/spark/pull/17123
  
@WeichenXu123 I have finished my work, please review it. Any suggestions are 
welcome. :-)


---




[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2018-01-22 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r162885968
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -53,7 +53,8 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") 
override val uid: String
* Values at -inf, inf must be explicitly provided to cover all Double 
values;
* otherwise, values outside the splits specified will be treated as 
errors.
*
-   * See also [[handleInvalid]], which can optionally create an additional 
bucket for NaN values.
+   * See also [[handleInvalid]], which can optionally create an additional 
bucket for NaN/NULL
--- End diff --

@viirya done.


---




[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

2018-01-21 Thread crackcell
Github user crackcell commented on the issue:

https://github.com/apache/spark/pull/17123
  
@WeichenXu123 sorry for missing the message for two days; I'm working on it.


---




[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-16 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r106363022
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") 
(@Since("1.4.0") override val uid: String
 transformSchema(dataset.schema)
 val (filteredDataset, keepInvalid) = {
   if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-// "skip" NaN option is set, will filter out NaN values in the 
dataset
+// "skip" NaN/NULL option is set, will filter out NaN/NULL values 
in the dataset
 (dataset.na.drop().toDF(), false)
   } else {
 (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
   }
 }
 
-val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
+val bucketizer: UserDefinedFunction = udf { (row: Row) =>
--- End diff --

Thanks for pointing out the performance problem. Maybe my original code, which 
used java.lang.Double instead of Scala's Double to hold NULLs, would work better.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---




[GitHub] spark issue #17233: [SPARK-11569][ML] Fix StringIndexer to handle null value...

2017-03-13 Thread crackcell
Github user crackcell commented on the issue:

https://github.com/apache/spark/pull/17233
  
@jkbradley Hi, I have made some updates according to your comments, please 
review it again. :-)


---



[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...

2017-03-13 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17233#discussion_r105820314
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -188,35 +189,45 @@ class StringIndexerModel (
 transformSchema(dataset.schema, logging = true)
 
 val filteredLabels = getHandleInvalid match {
-  case StringIndexer.KEEP_UNSEEN_LABEL => labels :+ "__unknown"
+  case StringIndexer.KEEP_INVALID => labels :+ "__unknown"
   case _ => labels
 }
 
 val metadata = NominalAttribute.defaultAttr
   .withName($(outputCol)).withValues(filteredLabels).toMetadata()
 // If we are skipping invalid records, filter them out.
 val (filteredDataset, keepInvalid) = getHandleInvalid match {
-  case StringIndexer.SKIP_UNSEEN_LABEL =>
+  case StringIndexer.SKIP_INVALID =>
 val filterer = udf { label: String =>
   labelToIndex.contains(label)
 }
-(dataset.where(filterer(dataset($(inputCol)))), false)
-  case _ => (dataset, getHandleInvalid == StringIndexer.KEEP_UNSEEN_LABEL)
+(dataset.na.drop(Array($(inputCol))).where(filterer(dataset($(inputCol)))), false)
+  case _ => (dataset, getHandleInvalid == StringIndexer.KEEP_INVALID)
 }
 
-val indexer = udf { label: String =>
-  if (labelToIndex.contains(label)) {
-labelToIndex(label)
-  } else if (keepInvalid) {
-labels.length
+val indexer = udf { row: Row =>
--- End diff --

got it


---



[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...

2017-03-13 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17233#discussion_r105820279
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
@@ -122,6 +122,86 @@ class StringIndexerSuite
 assert(output === expected)
   }
 
+  test("StringIndexer with a string input column with NULLs") {
+val data: Seq[java.lang.String] = Seq("a", "b", "b", null)
+val data2: Seq[java.lang.String] = Seq("a", "b", null)
+val expectedSkip = Array(1.0, 0.0)
+val expectedKeep = Array(1.0, 0.0, 2.0)
+val df = data.toDF("label")
+val df2 = data2.toDF("label")
+
+val indexer = new StringIndexer()
+  .setInputCol("label")
+  .setOutputCol("labelIndex")
+
+withClue("StringIndexer should throw error when setHandleValid=error 
when given NULL values") {
+  intercept[SparkException] {
+indexer.setHandleInvalid("error")
+indexer.fit(df).transform(df2).collect()
+  }
+}
+
+indexer.setHandleInvalid("skip")
+val transformedSkip = indexer.fit(df).transform(df2)
+val attrSkip = Attribute
+  .fromStructField(transformedSkip.schema("labelIndex"))
+  .asInstanceOf[NominalAttribute]
+assert(attrSkip.values.get === Array("b", "a"))
+assert(transformedSkip.select("labelIndex").rdd.map { r =>
+  r.getDouble(0)
+}.collect() === expectedSkip)
+
+indexer.setHandleInvalid("keep")
+val transformedKeep = indexer.fit(df).transform(df2)
+val attrKeep = Attribute
+  .fromStructField(transformedKeep.schema("labelIndex"))
+  .asInstanceOf[NominalAttribute]
+assert(attrKeep.values.get === Array("b", "a", "__unknown"))
+assert(transformedKeep.select("labelIndex").rdd.map { r =>
+  r.getDouble(0)
+}.collect() === expectedKeep)
+  }
+
+  test("StringIndexer with a numeric input column with NULLs") {
--- End diff --

OK, I'll remove the numeric test.


---



[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...

2017-03-13 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17233#discussion_r105820283
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
@@ -122,6 +122,86 @@ class StringIndexerSuite
 assert(output === expected)
   }
 
+  test("StringIndexer with a string input column with NULLs") {
+val data: Seq[java.lang.String] = Seq("a", "b", "b", null)
+val data2: Seq[java.lang.String] = Seq("a", "b", null)
+val expectedSkip = Array(1.0, 0.0)
+val expectedKeep = Array(1.0, 0.0, 2.0)
+val df = data.toDF("label")
+val df2 = data2.toDF("label")
+
+val indexer = new StringIndexer()
+  .setInputCol("label")
+  .setOutputCol("labelIndex")
+
+withClue("StringIndexer should throw error when setHandleValid=error 
when given NULL values") {
+  intercept[SparkException] {
+indexer.setHandleInvalid("error")
+indexer.fit(df).transform(df2).collect()
+  }
+}
+
+indexer.setHandleInvalid("skip")
+val transformedSkip = indexer.fit(df).transform(df2)
+val attrSkip = Attribute
+  .fromStructField(transformedSkip.schema("labelIndex"))
+  .asInstanceOf[NominalAttribute]
+assert(attrSkip.values.get === Array("b", "a"))
+assert(transformedSkip.select("labelIndex").rdd.map { r =>
+  r.getDouble(0)
+}.collect() === expectedSkip)
--- End diff --

roger


---



[GitHub] spark issue #17233: [SPARK-11569][ML] Fix StringIndexer to handle null value...

2017-03-10 Thread crackcell
Github user crackcell commented on the issue:

https://github.com/apache/spark/pull/17233
  
cc @srowen @cloud-fan @MLnick


---



[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

2017-03-09 Thread crackcell
Github user crackcell commented on the issue:

https://github.com/apache/spark/pull/17123
  
@cloud-fan Would you please review my code again? I'm now using `Option` to 
handle NULLs. :-)


---



[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...

2017-03-09 Thread crackcell
GitHub user crackcell opened a pull request:

https://github.com/apache/spark/pull/17233

[SPARK-11569][ML] Fix StringIndexer to handle null value properly

## What changes were proposed in this pull request?

This PR enhances StringIndexer with NULL value handling.

Before this PR, StringIndexer throws an exception when it encounters NULL 
values.
With this PR:
- handleInvalid=error: Throw an exception as before
- handleInvalid=skip: Skip null values as well as unseen labels
- handleInvalid=keep: Give null values an additional index as well as 
unseen labels
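The three modes above can be sketched in plain Scala (a hypothetical model for illustration; `labelToIndex` and the extra-index convention are assumptions, not Spark's internals):

```scala
// Labels learned by fit(), frequency-ordered; index = position in this map.
val labelToIndex = Map("b" -> 0.0, "a" -> 1.0)

// Returns Some(index) for a row that survives, None for a skipped row.
def index(label: String, handleInvalid: String): Option[Double] =
  if (label != null && labelToIndex.contains(label)) Some(labelToIndex(label))
  else handleInvalid match {
    case "error" => throw new RuntimeException("encountered a NULL or unseen label")
    case "skip"  => None                              // drop the row
    case "keep"  => Some(labelToIndex.size.toDouble)  // extra index for NULL/unseen
  }
```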

BTW, I noticed someone was trying to solve the same problem ( #9920 ), but it 
seems to have made no progress and received no response for a long time. Would 
you mind giving me a chance to solve it?

## How was this patch tested?

new unit tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/crackcell/spark 11569_StringIndexer_NULL

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17233.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17233


commit 75e3975597aa6271f4f8ab688922edda88b03045
Author: Menglong TAN <tanmengl...@gmail.com>
Date:   2017-03-08T03:50:17Z

Merge pull request #1 from apache/master

merge master to my repo

commit 79d706085e8371fb1724ce73377767c38d551e5d
Author: Menglong TAN <tanmengl...@renrenche.com>
Date:   2017-03-10T04:45:56Z

Enhance StringIndexer with NULL values

commit 0cb121c65f592b9623bdeef2746d7c2a3c281ae1
Author: Menglong TAN <tanmengl...@renrenche.com>
Date:   2017-03-10T04:52:30Z

filter out NULLs when transform dataset




---



[GitHub] spark issue #16883: [SPARK-17498][ML] StringIndexer enhancement for handling...

2017-03-07 Thread crackcell
Github user crackcell commented on the issue:

https://github.com/apache/spark/pull/16883
  
Nice work! I was just planning to improve `StringIndexer` in exactly the same 
way. Now I can have a rest. :-) 


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-06 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r104572696
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") 
(@Since("1.4.0") override val uid: String
 transformSchema(dataset.schema)
 val (filteredDataset, keepInvalid) = {
   if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-// "skip" NaN option is set, will filter out NaN values in the 
dataset
+// "skip" NaN/NULL option is set, will filter out NaN/NULL values 
in the dataset
 (dataset.na.drop().toDF(), false)
   } else {
 (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
   }
 }
 
-val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
--- End diff --

Thanks a lot. `Option[Double]` is much better. :-)
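For context, `Option[Double]` works well here because a nullable double can be surfaced as `None`, so null and NaN fall out of one pattern match. A standalone sketch, runnable without Spark (the `invalidBucket` parameter is an illustrative stand-in for the real extra-bucket index):

```scala
// Sketch: with Option[Double], null arrives as None and NaN stays inspectable,
// so both invalid cases are handled uniformly.
def bucketizeOpt(feature: Option[Double], keepInvalid: Boolean, invalidBucket: Double): Double =
  feature match {
    case Some(f) if !f.isNaN => f  // real code would binary-search the splits here
    case _ if keepInvalid    => invalidBucket
    case _                   => sys.error("encountered NaN/NULL with handleInvalid=error")
  }
```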


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-06 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r104434899
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") 
(@Since("1.4.0") override val uid: String
 transformSchema(dataset.schema)
 val (filteredDataset, keepInvalid) = {
   if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-// "skip" NaN option is set, will filter out NaN values in the 
dataset
+// "skip" NaN/NULL option is set, will filter out NaN/NULL values 
in the dataset
 (dataset.na.drop().toDF(), false)
   } else {
 (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
   }
 }
 
-val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
--- End diff --

Using both Java and Scala types seems less graceful. Instead, would it be 
better to pass a `Row` to `bucketizer()` and then check for NULLs with 
`isNullAt()` and `getDouble()`?


---



[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

2017-03-04 Thread crackcell
Github user crackcell commented on the issue:

https://github.com/apache/spark/pull/17123
  
@imatiach-msft @cloud-fan I updated the code, replacing java.lang.Double 
with isNullAt() and getDouble().


---



[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

2017-03-03 Thread crackcell
Github user crackcell commented on the issue:

https://github.com/apache/spark/pull/17123
  
@srowen @cloud-fan Please review my code. Thanks. :-)


---



[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

2017-03-03 Thread crackcell
Github user crackcell commented on the issue:

https://github.com/apache/spark/pull/17123
  
@imatiach-msft Hi, Ilya. I have added two tests based on the original tests 
for NaN data. Please review my code again. Thanks for your time. :-)


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-02 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r103955065
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -171,23 +176,23 @@ object Bucketizer extends 
DefaultParamsReadable[Bucketizer] {
* Binary searching in several buckets to place each data point.
* @param splits array of split points
* @param feature data point
-   * @param keepInvalid NaN flag.
-   *Set "true" to make an extra bucket for NaN values;
-   *Set "false" to report an error for NaN values
+   * @param keepInvalid NaN/NULL flag.
+   *Set "true" to make an extra bucket for NaN/NULL 
values;
+   *Set "false" to report an error for NaN/NULL values
* @return bucket for each data point
* @throws SparkException if a feature is < splits.head or > splits.last
*/
 
   private[feature] def binarySearchForBuckets(
   splits: Array[Double],
-  feature: Double,
+  feature: java.lang.Double,
   keepInvalid: Boolean): Double = {
-if (feature.isNaN) {
+if (feature == null || feature.isNaN) {
   if (keepInvalid) {
 splits.length - 1
   } else {
-throw new SparkException("Bucketizer encountered NaN value. To 
handle or skip NaNs," +
-  " try setting Bucketizer.handleInvalid.")
+throw new SparkException("Bucketizer encountered NaN/NULL values. 
" +
+  "To handle or skip NaNs/NULLs, try setting 
Bucketizer.handleInvalid.")
   }
 } else if (feature == splits.last) {
--- End diff --

My fault! I'll do it now!
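A standalone re-sketch of the patched helper discussed above, runnable without Spark (it assumes, per the diff, that the extra NaN/NULL bucket index is `splits.length - 1` and that the splits cover all feature values):

```scala
import java.util.Arrays

def binarySearchForBuckets(
    splits: Array[Double],
    feature: java.lang.Double,
    keepInvalid: Boolean): Double = {
  if (feature == null || feature.isNaN) {
    if (keepInvalid) splits.length - 1   // the extra bucket for invalid values
    else throw new RuntimeException(
      "Bucketizer encountered NaN/NULL values; set handleInvalid to handle or skip them.")
  } else if (feature == splits.last) {
    splits.length - 2                    // last split belongs to the last regular bucket
  } else {
    val idx = Arrays.binarySearch(splits, feature)
    if (idx >= 0) idx else -idx - 2      // insertion point -> bucket on its left
  }
}
```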


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-02 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r103954857
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -105,20 +106,24 @@ final class Bucketizer @Since("1.4.0") 
(@Since("1.4.0") override val uid: String
 transformSchema(dataset.schema)
 val (filteredDataset, keepInvalid) = {
   if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-// "skip" NaN option is set, will filter out NaN values in the 
dataset
+// "skip" NaN/NULL option is set, will filter out NaN/NULL values 
in the dataset
 (dataset.na.drop().toDF(), false)
   } else {
 (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
   }
 }
 
-val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
-  Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
+val bucketizer: UserDefinedFunction = udf { (row: Row) =>
+  Bucketizer.binarySearchForBuckets(
+$(splits),
+row.getAs[java.lang.Double]($(inputCol)),
--- End diff --

Hi, Scala's Double will convert null to zero. Say:
> scala> val a: Double = null.asInstanceOf[Double]
> a: Double = 0.0
So I use Java's Double instead to hold NULLs. It feels a little ugly; is there 
any better way?
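The point above can be verified in plain Scala (a standalone illustration):

```scala
// Scala's primitive Double cannot represent null: unboxing null yields 0.0,
// so a missing value becomes indistinguishable from a genuine zero.
val a: Double = null.asInstanceOf[Double]   // a is 0.0

// java.lang.Double is a reference type and can hold null,
// so the missing value remains detectable.
val b: java.lang.Double = null              // b is null
```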


---



[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

2017-03-01 Thread crackcell
Github user crackcell commented on the issue:

https://github.com/apache/spark/pull/17123
  
Fixed style errors during the unit tests.


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-01 Thread crackcell
GitHub user crackcell opened a pull request:

https://github.com/apache/spark/pull/17123

[SPARK-19781][ML] Handle NULLs as well as NaNs in Bucketizer when 
handleInvalid is on

## What changes were proposed in this pull request?

The original Bucketizer can put NaNs into a special bucket when 
handleInvalid is on, but it leaves NULLs untouched.
This PR unifies the handling of NULLs and NaNs.

BTW, this is my first commit to the Spark codebase. I'm not sure whether my 
code or my way of doing things is appropriate. Please point it out if I'm doing 
anything wrong. :-)

## How was this patch tested?

manual tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/crackcell/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17123.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17123


commit 2b0751428cad5280df47f96412608967b71a7360
Author: Menglong TAN <tanmengl...@gmail.com>
Date:   2017-03-01T12:37:33Z

add support for null values in Bucketizer

commit b3f98b66e63c9c61c69a1429819feb236fad56c7
Author: Menglong TAN <tanmengl...@gmail.com>
Date:   2017-03-01T15:10:05Z

fix a typo




---