[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883125#comment-15883125 ]
Bill Chambers edited comment on SPARK-19714 at 2/24/17 5:15 PM:
----------------------------------------------------------------

The thing is, QuantileDiscretizer and Bucketizer do fundamentally different things, so there are different use cases for each (quantiles vs. actual values). It's more of a nuisance than anything: an unclear parameter that seems to imply things that are not actually the case. Here's where it *really* falls apart: if I have a bucket and I provide one split, how many buckets do I have? In Bucketizer I have none! That makes little sense. "Splits" is not the correct word here either, because they aren't splits! They're bucket boundaries. I think this is more than a documentation issue, even though the docs aren't very clear themselves:

> Parameter for mapping continuous features into buckets. With n+1 splits,
> there are n buckets. A bucket defined by splits x,y holds values in the range
> [x,y) except the last bucket, which also includes y. Splits should be of
> length greater than or equal to 3 and strictly increasing. Values at -inf,
> inf must be explicitly provided to cover all Double values; otherwise, values
> outside the splits specified will be treated as errors.

I also realize I'm being a pain here :) and that this stuff is always difficult. I empathize with that; it's just that this method doesn't seem to use correct terminology or a conceptually relevant implementation for what it aims to do.

was (Author: bill_chambers):
The thing is, QuantileDiscretizer and Bucketizer do fundamentally different things, so there are different use cases for each (quantiles vs. actual values). It's more of a nuisance than anything: an unclear parameter that seems to imply things that are not actually the case. Here's where it *really* falls apart: if I have a bucket and I provide one split, how many buckets do I have? In Bucketizer I have none! That makes no sense. "Splits" is not the correct word here either, because they aren't splits!
They're bounds or containers or buckets themselves. I think this is more than a documentation issue, even though the docs aren't very clear themselves:

> Parameter for mapping continuous features into buckets. With n+1 splits,
> there are n buckets. A bucket defined by splits x,y holds values in the range
> [x,y) except the last bucket, which also includes y. Splits should be of
> length greater than or equal to 3 and strictly increasing. Values at -inf,
> inf must be explicitly provided to cover all Double values; otherwise, values
> outside the splits specified will be treated as errors.

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---------------------------------------------------
>
>                 Key: SPARK-19714
>                 URL: https://issues.apache.org/jira/browse/SPARK-19714
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.1.0
>            Reporter: Bill Chambers
>
> {code}
> import org.apache.spark.ml.feature.Bucketizer
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid buckets. However, it fails:
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of
> Bucketizer bounds [5.0, 500.0]. Check your features, or loosen the
> lower/upper bound constraints.
> {code}
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
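[Editor's note] Since the whole thread turns on what n splits actually buy you, here is a minimal plain-Python sketch (no Spark needed) of the semantics the quoted doc describes: n+1 splits define n buckets, each bucket is half-open [x, y) except the last, and anything outside the outermost splits is "invalid". The helper `bucket_for` is hypothetical, written only to illustrate the documented rule; it is not Bucketizer's actual API.

```python
def bucket_for(splits, value):
    """Return the bucket index for value, or None if it falls outside all splits.

    Sketch of the documented rule: n+1 splits -> n buckets; bucket i covers
    [splits[i], splits[i+1]), except the last bucket also includes its upper bound.
    """
    if value < splits[0] or value > splits[-1]:
        return None  # out of range: the "invalid" case the bug report hits
    if value == splits[-1]:
        return len(splits) - 2  # last bucket also includes its upper bound
    for i in range(len(splits) - 1):
        if splits[i] <= value < splits[i + 1]:
            return i

splits = [5.0, 10.0, 250.0, 500.0]  # 4 splits -> only 3 buckets
print(bucket_for(splits, 7.0))    # 0: falls in [5.0, 10.0)
print(bucket_for(splits, 500.0))  # 2: the last bucket includes 500.0
print(bucket_for(splits, 0.0))    # None: below the lowest split, so "invalid"
```

This also makes the naming complaint concrete: with one "split" the sketch yields zero buckets, and values like 0.0 in the bug report are invalid unless -inf/inf are supplied as explicit boundaries.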