[jira] [Comment Edited] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-24 Thread Bill Chambers (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883125#comment-15883125 ]

Bill Chambers edited comment on SPARK-19714 at 2/24/17 5:15 PM:


The thing is, QuantileDiscretizer and Bucketizer do fundamentally different 
things, so there are different use cases for each (quantiles vs. actual values). 
It's more of a nuisance than anything: the parameter is unclear and seems to 
imply things that are not actually the case.

Here's where it *really* falls apart: if I have a Bucketizer and I provide one 
split, how many buckets do I have?

In Bucketizer I have none! That makes little sense. "Splits" is not the correct 
word here either, because they aren't splits; they're bucket boundaries. I think 
this is more than a documentation issue, even though the docs aren't very clear 
themselves.

> Parameter for mapping continuous features into buckets. With n+1 splits, 
> there are n buckets. A bucket defined by splits x,y holds values in the range 
> [x,y) except the last bucket, which also includes y. Splits should be of 
> length greater than or equal to 3 and strictly increasing. Values at -inf, 
> inf must be explicitly provided to cover all Double values; otherwise, values 
> outside the splits specified will be treated as errors.
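To make the splits-to-buckets arithmetic concrete, here is a minimal plain-Scala sketch of the documented semantics (a hypothetical stand-in, not Spark's actual implementation): n+1 splits yield n buckets, each half-open [x, y) except the last, which also includes its upper bound.

```scala
// Hypothetical stand-in for the documented Bucketizer lookup semantics.
// With n+1 splits there are n buckets; bucket i covers [splits(i), splits(i+1)),
// and the final bucket also includes the last split value.
def bucketIndex(splits: Array[Double], value: Double): Option[Int] = {
  require(splits.length >= 3 && splits.sliding(2).forall(p => p(0) < p(1)),
    "splits must have length >= 3 and be strictly increasing")
  if (value < splits.head || value > splits.last) None   // "invalid": outside all buckets
  else if (value == splits.last) Some(splits.length - 2) // last bucket is closed on the right
  else Some(splits.lastIndexWhere(_ <= value))
}

bucketIndex(Array(5.0, 10.0, 250.0, 500.0), 7.0)   // => Some(0)
bucketIndex(Array(5.0, 10.0, 250.0, 500.0), 500.0) // => Some(2)
bucketIndex(Array(5.0, 10.0, 250.0, 500.0), 0.0)   // => None
```

Note how four splits produce only three buckets, which is exactly the off-by-one naming complaint above.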

I also realize I'm being a pain here :) and that this stuff is always 
difficult. I empathize with that; it's just that this method doesn't seem to 
use correct terminology or a conceptually relevant implementation for what it 
aims to do.


was (Author: bill_chambers):
The thing is QuantileDiscretizer and Bucketizer do fundamentally different 
things so there are different use cases there (quantiles vs actual values). 
It's more of a nuisance than anything and an unclear parameter that seems to 
imply things that are not actually the case.

Here's where it *really* falls apart, if I have a bucket and I provide one 
split, how many buckets do I have?

In Bucketizer I have none! That makes no sense. Splits is not the correct word 
here either because they aren't splits! They're bounds or containers or buckets 
themselves. I think this is more than a documentation issue, even though those 
aren't very clear themselves.

> Parameter for mapping continuous features into buckets. With n+1 splits, 
> there are n buckets. A bucket defined by splits x,y holds values in the range 
> [x,y) except the last bucket, which also includes y. Splits should be of 
> length greater than or equal to 3 and strictly increasing. Values at -inf, 
> inf must be explicitly provided to cover all Double values; otherwise, values 
> outside the splits specified will be treated as errors.



> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> import org.apache.spark.ml.feature.Bucketizer
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid values. However, it fails:
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code}
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts, anyone?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-24 Thread Nick Pentreath (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882216#comment-15882216 ]

Nick Pentreath edited comment on SPARK-19714 at 2/24/17 8:35 AM:

I agree that the parameter naming is perhaps misleading. At least the doc 
should be updated because "invalid" here actually means {{NaN}} or {{null}}. 

However {{Bucketizer}} is doing what you tell it to as the splits are specified 
by you. Note that if you used {{QuantileDiscretizer}} to construct the 
{{Bucketizer}} then it adds {{+/- Infinity}} as the lower/upper bounds of the 
splits. So you can do the same if you want anything below the lower bound or 
above the upper bound to be "valid". You will then have 2 more buckets.
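The padding trick can be sketched the same way, using a plain-Scala stand-in for the documented Bucketizer lookup (an illustration under that assumption, not Spark's internal code): extending the splits with +/- Infinity means no Double value is out of bounds, and two catch-all buckets appear at the ends.

```scala
// Padding the user's splits with +/- Infinity, as QuantileDiscretizer does,
// adds two catch-all buckets so no Double value is "out of bounds" anymore.
val userSplits = Array(5.0, 10.0, 250.0, 500.0)
val padded = Double.NegativeInfinity +: userSplits :+ Double.PositiveInfinity

// Same lookup rule as the Bucketizer docs: [x, y) per bucket, last bucket closed.
def bucket(splits: Array[Double], v: Double): Int =
  if (v == splits.last) splits.length - 2
  else splits.lastIndexWhere(_ <= v)

bucket(padded, 0.0)    // => 0  (below the old lower bound: first catch-all bucket)
bucket(padded, 7.0)    // => 1  (the old first bucket, shifted by one)
bucket(padded, 1000.0) // => 4  (above the old upper bound: last catch-all bucket)
```

With 6 splits there are now 5 buckets: the original 3, plus one below 5.0 and one above 500.0.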


was (Author: mlnick):
I agree that the parameter naming is perhaps misleading. At least the doc 
should be updated because "invalid" here actually means {{NaN}} or {{null}}. 

However {{Bucketizer}} is doing what you tell it to as the splits are specified 
by you. Note that if you used {{QuantileDiscretizer}} to construct the 
{{Bucketizer}} then it adds {{+/- Infinity}} as the lower/upper bounds of the 
splits. So you can do the same if you want anything below the lower bound to be 
included in the first bucket, and above the upper bound to be included in the 
last bucket.



