[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2018-01-22 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r162935486
  
--- Diff: docs/ml-guide.md ---
@@ -122,6 +122,8 @@ There are no deprecations.
 * [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027):
  We are now setting the default parallelism used in `OneVsRest` to be 1 
(i.e. serial), in 2.2 and earlier version,
  the `OneVsRest` parallelism would be parallelism of the default 
threadpool in scala.
+* [SPARK-19781](https://issues.apache.org/jira/browse/SPARK-19781):
+ `Bucketizer` handles NULL values the same way as NaN when handleInvalid 
is skip or keep.
--- End diff --

Yep, you are right. :-p


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2018-01-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r162910933
  
--- Diff: docs/ml-guide.md ---
@@ -122,6 +122,8 @@ There are no deprecations.
 * [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027):
  We are now setting the default parallelism used in `OneVsRest` to be 1 
(i.e. serial), in 2.2 and earlier version,
  the `OneVsRest` parallelism would be parallelism of the default 
threadpool in scala.
+* [SPARK-19781](https://issues.apache.org/jira/browse/SPARK-19781):
+ `Bucketizer` handles NULL values the same way as NaN when handleInvalid 
is skip or keep.
--- End diff --

hmm, I think for skip, `dataset.na.drop` drops NULL. We didn't change its 
behavior. 


---




[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2018-01-22 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r162885968
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -53,7 +53,8 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") 
override val uid: String
* Values at -inf, inf must be explicitly provided to cover all Double 
values;
* otherwise, values outside the splits specified will be treated as 
errors.
*
-   * See also [[handleInvalid]], which can optionally create an additional 
bucket for NaN values.
+   * See also [[handleInvalid]], which can optionally create an additional 
bucket for NaN/NULL
--- End diff --

@viirya done.


---




[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2018-01-22 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r162874801
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -53,7 +53,8 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") 
override val uid: String
* Values at -inf, inf must be explicitly provided to cover all Double 
values;
* otherwise, values outside the splits specified will be treated as 
errors.
*
-   * See also [[handleInvalid]], which can optionally create an additional 
bucket for NaN values.
+   * See also [[handleInvalid]], which can optionally create an additional 
bucket for NaN/NULL
--- End diff --

This sounds like a behavior change; we should add an item to the migration 
guide in the ML docs.


---




[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2018-01-19 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r162703711
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -171,23 +176,23 @@ object Bucketizer extends 
DefaultParamsReadable[Bucketizer] {
* Binary searching in several buckets to place each data point.
* @param splits array of split points
* @param feature data point
-   * @param keepInvalid NaN flag.
-   *Set "true" to make an extra bucket for NaN values;
-   *Set "false" to report an error for NaN values
+   * @param keepInvalid NaN/NULL flag.
+   *Set "true" to make an extra bucket for NaN/NULL 
values;
+   *Set "false" to report an error for NaN/NULL values
* @return bucket for each data point
* @throws SparkException if a feature is < splits.head or > splits.last
*/
 
   private[feature] def binarySearchForBuckets(
   splits: Array[Double],
-  feature: Double,
+  feature: java.lang.Double,
--- End diff --

Also change to `Option[Double]` here.
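For context, the lookup under discussion can be sketched in standalone Scala with the suggested `Option[Double]` signature. This is a hedged sketch, not the actual Spark source: `java.util.Arrays.binarySearch` stands in for Spark's internal search, and `RuntimeException` stands in for `SparkException`.

```scala
import java.util.Arrays

// Sketch of bucket lookup with an Option[Double] feature. None (NULL) and
// Some(NaN) are both "invalid" and share an extra bucket when keepInvalid is set.
object BucketizerSketch {
  def binarySearchForBuckets(
      splits: Array[Double],
      feature: Option[Double],
      keepInvalid: Boolean): Double = {
    if (feature.getOrElse(Double.NaN).isNaN) {
      if (keepInvalid) {
        splits.length - 1 // extra bucket reserved for NaN/NULL
      } else {
        throw new RuntimeException(
          "encountered NaN/NULL value; set handleInvalid to keep or skip")
      }
    } else {
      val v = feature.get
      require(v >= splits.head && v <= splits.last,
        s"feature $v is outside the splits range")
      if (v == splits.last) {
        splits.length - 2 // the last bucket is right-closed
      } else {
        val idx = Arrays.binarySearch(splits, v)
        if (idx >= 0) idx else -idx - 2 // insertion point -> bucket index
      }
    }
  }
}
```

With `splits = [0.0, 1.0, 2.0]` there are two regular buckets plus one invalid bucket, so `None` and `Some(Double.NaN)` both land in bucket 2 when `keepInvalid` is true.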


---




[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2018-01-19 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r162703633
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") 
(@Since("1.4.0") override val uid: String
 transformSchema(dataset.schema)
 val (filteredDataset, keepInvalid) = {
   if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-// "skip" NaN option is set, will filter out NaN values in the 
dataset
+// "skip" NaN/NULL option is set, will filter out NaN/NULL values 
in the dataset
 (dataset.na.drop().toDF(), false)
   } else {
 (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
   }
 }
 
-val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
--- End diff --

As @cloud-fan suggested, `Option[Double]` is better. :-)


---




[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-16 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r106363022
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") 
(@Since("1.4.0") override val uid: String
 transformSchema(dataset.schema)
 val (filteredDataset, keepInvalid) = {
   if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-// "skip" NaN option is set, will filter out NaN values in the 
dataset
+// "skip" NaN/NULL option is set, will filter out NaN/NULL values 
in the dataset
 (dataset.na.drop().toDF(), false)
   } else {
 (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
   }
 }
 
-val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
+val bucketizer: UserDefinedFunction = udf { (row: Row) =>
--- End diff --

Thanks for pointing out the performance problem. Maybe my original approach of 
using java.lang.Double instead of Scala's Double to hold NULLs would work better.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---




[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-15 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r106339981
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -171,34 +173,34 @@ object Bucketizer extends 
DefaultParamsReadable[Bucketizer] {
* Binary searching in several buckets to place each data point.
* @param splits array of split points
* @param feature data point
-   * @param keepInvalid NaN flag.
-   *Set "true" to make an extra bucket for NaN values;
-   *Set "false" to report an error for NaN values
+   * @param keepInvalid NaN/NULL flag.
+   *Set "true" to make an extra bucket for NaN/NULL 
values;
+   *Set "false" to report an error for NaN/NULL values
* @return bucket for each data point
* @throws SparkException if a feature is < splits.head or > splits.last
*/
 
   private[feature] def binarySearchForBuckets(
   splits: Array[Double],
-  feature: Double,
+  feature: Option[Double],
   keepInvalid: Boolean): Double = {
-if (feature.isNaN) {
+if (feature.getOrElse(Double.NaN).isNaN) {
--- End diff --

I think you can equivalently write this as:
if (feature.isEmpty) { 
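To spell out the difference in a small standalone check (plain Scala, not Spark code): `feature.getOrElse(Double.NaN).isNaN` treats both `None` and `Some(Double.NaN)` as invalid, whereas `feature.isEmpty` alone only catches `None`, so the NaN case needs its own test to stay equivalent.

```scala
// Two ways of flagging invalid (NULL or NaN) features held in an Option[Double].
object InvalidFeatureCheck {
  // Form used in the diff: None falls back to NaN, so one isNaN test covers both.
  def viaGetOrElse(feature: Option[Double]): Boolean =
    feature.getOrElse(Double.NaN).isNaN

  // isEmpty catches only None; Some(NaN) requires the extra exists check.
  def viaIsEmpty(feature: Option[Double]): Boolean =
    feature.isEmpty || feature.exists(_.isNaN)
}
```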


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-15 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r106339731
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") 
(@Since("1.4.0") override val uid: String
 transformSchema(dataset.schema)
 val (filteredDataset, keepInvalid) = {
   if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-// "skip" NaN option is set, will filter out NaN values in the 
dataset
+// "skip" NaN/NULL option is set, will filter out NaN/NULL values 
in the dataset
 (dataset.na.drop().toDF(), false)
   } else {
 (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
   }
 }
 
-val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
+val bucketizer: UserDefinedFunction = udf { (row: Row) =>
--- End diff --

I believe you should avoid using a UDF on a `Row`, because the serialization 
costs will be higher... hmm, how could we make this perform well and still 
handle nulls? Does it work with `Option[Double]` instead of `Row`?


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-06 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r104572696
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") 
(@Since("1.4.0") override val uid: String
 transformSchema(dataset.schema)
 val (filteredDataset, keepInvalid) = {
   if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-// "skip" NaN option is set, will filter out NaN values in the 
dataset
+// "skip" NaN/NULL option is set, will filter out NaN/NULL values 
in the dataset
 (dataset.na.drop().toDF(), false)
   } else {
 (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
   }
 }
 
-val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
--- End diff --

Thanks a lot. `Option[Double]` is much better. :-)


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r104524886
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") 
(@Since("1.4.0") override val uid: String
 transformSchema(dataset.schema)
 val (filteredDataset, keepInvalid) = {
   if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-// "skip" NaN option is set, will filter out NaN values in the 
dataset
+// "skip" NaN/NULL option is set, will filter out NaN/NULL values 
in the dataset
 (dataset.na.drop().toDF(), false)
   } else {
 (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
   }
 }
 
-val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
--- End diff --

See the documentation of `ScalaUDF`; if you don't like mixing Java and Scala 
types, you can use `Option[Double]`.
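A minimal illustration (plain Scala, no Spark required; the helper name is made up for this sketch) of why `Option[Double]` avoids the mixed-type awkwardness: a nullable `java.lang.Double`, the kind of value `row.getAs` would hand back for a NULL cell, converts cleanly because `Option(null)` is `None`.

```scala
// Wrapping a nullable boxed Double in Option removes the explicit null check.
object NullableToOption {
  def toOption(boxed: java.lang.Double): Option[Double] =
    Option(boxed).map(_.doubleValue)
}
```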


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-06 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r104434899
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") 
(@Since("1.4.0") override val uid: String
 transformSchema(dataset.schema)
 val (filteredDataset, keepInvalid) = {
   if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-// "skip" NaN option is set, will filter out NaN values in the 
dataset
+// "skip" NaN/NULL option is set, will filter out NaN/NULL values 
in the dataset
 (dataset.na.drop().toDF(), false)
   } else {
 (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
   }
 }
 
-val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
--- End diff --

Using both Java and Scala types seems less graceful. Instead, would it be 
better to pass a `Row` to `bucketizer()` and then check NULLs with `isNullAt()` 
and `getDouble()`?


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r104362536
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -105,20 +106,21 @@ final class Bucketizer @Since("1.4.0") 
(@Since("1.4.0") override val uid: String
 transformSchema(dataset.schema)
 val (filteredDataset, keepInvalid) = {
   if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-// "skip" NaN option is set, will filter out NaN values in the 
dataset
+// "skip" NaN/NULL option is set, will filter out NaN/NULL values 
in the dataset
 (dataset.na.drop().toDF(), false)
   } else {
 (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
   }
 }
 
-val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
--- End diff --

actually, can we just use `java.lang.Double` as the type for `feature`? 
Then we don't need to change 
https://github.com/apache/spark/pull/17123/files#diff-37f2c93b88c73b91cdc9e40fc8c45fc5R121


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r104224870
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -105,20 +106,24 @@ final class Bucketizer @Since("1.4.0") 
(@Since("1.4.0") override val uid: String
 transformSchema(dataset.schema)
 val (filteredDataset, keepInvalid) = {
   if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-// "skip" NaN option is set, will filter out NaN values in the 
dataset
+// "skip" NaN/NULL option is set, will filter out NaN/NULL values 
in the dataset
 (dataset.na.drop().toDF(), false)
   } else {
 (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
   }
 }
 
-val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
-  Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
+val bucketizer: UserDefinedFunction = udf { (row: Row) =>
+  Bucketizer.binarySearchForBuckets(
+$(splits),
+row.getAs[java.lang.Double]($(inputCol)),
--- End diff --

Ideally we should use `row.getDouble(index)` and `row.isNullAt(index)` 
together to get values for primitive types, but technically a `Row` is just an 
`Array[Object]`, so there is no performance penalty in using 
`java.lang.Double`. (This may change in the future; where possible we should 
prefer `isNullAt` and `getDouble`.)


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-02 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r103955065
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -171,23 +176,23 @@ object Bucketizer extends 
DefaultParamsReadable[Bucketizer] {
* Binary searching in several buckets to place each data point.
* @param splits array of split points
* @param feature data point
-   * @param keepInvalid NaN flag.
-   *Set "true" to make an extra bucket for NaN values;
-   *Set "false" to report an error for NaN values
+   * @param keepInvalid NaN/NULL flag.
+   *Set "true" to make an extra bucket for NaN/NULL 
values;
+   *Set "false" to report an error for NaN/NULL values
* @return bucket for each data point
* @throws SparkException if a feature is < splits.head or > splits.last
*/
 
   private[feature] def binarySearchForBuckets(
   splits: Array[Double],
-  feature: Double,
+  feature: java.lang.Double,
   keepInvalid: Boolean): Double = {
-if (feature.isNaN) {
+if (feature == null || feature.isNaN) {
   if (keepInvalid) {
 splits.length - 1
   } else {
-throw new SparkException("Bucketizer encountered NaN value. To 
handle or skip NaNs," +
-  " try setting Bucketizer.handleInvalid.")
+throw new SparkException("Bucketizer encountered NaN/NULL values. 
" +
+  "To handle or skip NaNs/NULLs, try setting 
Bucketizer.handleInvalid.")
   }
 } else if (feature == splits.last) {
--- End diff --

My fault! I'll do it now!


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-02 Thread crackcell
Github user crackcell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r103954857
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -105,20 +106,24 @@ final class Bucketizer @Since("1.4.0") 
(@Since("1.4.0") override val uid: String
 transformSchema(dataset.schema)
 val (filteredDataset, keepInvalid) = {
   if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-// "skip" NaN option is set, will filter out NaN values in the 
dataset
+// "skip" NaN/NULL option is set, will filter out NaN/NULL values 
in the dataset
 (dataset.na.drop().toDF(), false)
   } else {
 (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
   }
 }
 
-val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
-  Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
+val bucketizer: UserDefinedFunction = udf { (row: Row) =>
+  Bucketizer.binarySearchForBuckets(
+$(splits),
+row.getAs[java.lang.Double]($(inputCol)),
--- End diff --

Hi, Scala's Double will convert null to zero:

> scala> val a: Double = null.asInstanceOf[Double]
> a: Double = 0.0

So I use Java's Double instead to hold NULLs. I feel it is a little ugly; is 
there a better way?
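The unboxing pitfall is easy to check in plain Scala (a standalone snippet, not code from the PR): casting `null` to the primitive-backed `scala.Double` silently yields `0.0`, erasing the distinction between a missing value and a real zero, while `java.lang.Double` keeps the `null`.

```scala
// scala.Double is a primitive: unboxing null produces 0.0, not an error.
object NullDoubleDemo {
  val unboxed: Double = null.asInstanceOf[Double]  // silently becomes 0.0
  val boxed: java.lang.Double = null               // null survives
}
```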


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-02 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r103949549
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -171,23 +176,23 @@ object Bucketizer extends 
DefaultParamsReadable[Bucketizer] {
* Binary searching in several buckets to place each data point.
* @param splits array of split points
* @param feature data point
-   * @param keepInvalid NaN flag.
-   *Set "true" to make an extra bucket for NaN values;
-   *Set "false" to report an error for NaN values
+   * @param keepInvalid NaN/NULL flag.
+   *Set "true" to make an extra bucket for NaN/NULL 
values;
+   *Set "false" to report an error for NaN/NULL values
* @return bucket for each data point
* @throws SparkException if a feature is < splits.head or > splits.last
*/
 
   private[feature] def binarySearchForBuckets(
   splits: Array[Double],
-  feature: Double,
+  feature: java.lang.Double,
--- End diff --

Double here as well


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-02 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r103949274
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -105,20 +106,24 @@ final class Bucketizer @Since("1.4.0") 
(@Since("1.4.0") override val uid: String
 transformSchema(dataset.schema)
 val (filteredDataset, keepInvalid) = {
   if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
-// "skip" NaN option is set, will filter out NaN values in the 
dataset
+// "skip" NaN/NULL option is set, will filter out NaN/NULL values 
in the dataset
 (dataset.na.drop().toDF(), false)
   } else {
 (dataset.toDF(), getHandleInvalid == Bucketizer.KEEP_INVALID)
   }
 }
 
-val bucketizer: UserDefinedFunction = udf { (feature: Double) =>
-  Bucketizer.binarySearchForBuckets($(splits), feature, keepInvalid)
+val bucketizer: UserDefinedFunction = udf { (row: Row) =>
+  Bucketizer.binarySearchForBuckets(
+$(splits),
+row.getAs[java.lang.Double]($(inputCol)),
--- End diff --

can you use Double instead of java.lang.Double?  It should be the scala 
Double type.


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-02 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/17123#discussion_r103949478
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala 
---
@@ -171,23 +176,23 @@ object Bucketizer extends 
DefaultParamsReadable[Bucketizer] {
* Binary searching in several buckets to place each data point.
* @param splits array of split points
* @param feature data point
-   * @param keepInvalid NaN flag.
-   *Set "true" to make an extra bucket for NaN values;
-   *Set "false" to report an error for NaN values
+   * @param keepInvalid NaN/NULL flag.
+   *Set "true" to make an extra bucket for NaN/NULL 
values;
+   *Set "false" to report an error for NaN/NULL values
* @return bucket for each data point
* @throws SparkException if a feature is < splits.head or > splits.last
*/
 
   private[feature] def binarySearchForBuckets(
   splits: Array[Double],
-  feature: Double,
+  feature: java.lang.Double,
   keepInvalid: Boolean): Double = {
-if (feature.isNaN) {
+if (feature == null || feature.isNaN) {
   if (keepInvalid) {
 splits.length - 1
   } else {
-throw new SparkException("Bucketizer encountered NaN value. To 
handle or skip NaNs," +
-  " try setting Bucketizer.handleInvalid.")
+throw new SparkException("Bucketizer encountered NaN/NULL values. 
" +
+  "To handle or skip NaNs/NULLs, try setting 
Bucketizer.handleInvalid.")
   }
 } else if (feature == splits.last) {
--- End diff --

could you please add some tests to validate that NULL values can now be 
handled in addition to NaN values by the bucketizer?


---



[GitHub] spark pull request #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in...

2017-03-01 Thread crackcell
GitHub user crackcell opened a pull request:

https://github.com/apache/spark/pull/17123

[SPARK-19781][ML] Handle NULLs as well as NaNs in Bucketizer when 
handleInvalid is on

## What changes were proposed in this pull request?

The original Bucketizer can put NaNs into a special bucket when handleInvalid 
is on, but it leaves NULLs untouched. This PR unifies the handling of NULLs 
and NaNs.

BTW, this is my first commit to Spark code. I'm not sure whether my code or my 
approach is appropriate. Please point it out if I'm doing anything wrong. :-)

## How was this patch tested?

Manual tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/crackcell/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17123.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17123


commit 2b0751428cad5280df47f96412608967b71a7360
Author: Menglong TAN 
Date:   2017-03-01T12:37:33Z

add support for null values in Bucketizer

commit b3f98b66e63c9c61c69a1429819feb236fad56c7
Author: Menglong TAN 
Date:   2017-03-01T15:10:05Z

fix a typo




---