[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883125#comment-15883125 ]
Bill Chambers edited comment on SPARK-19714 at 2/24/17 5:15 PM:
----------------------------------------------------------------

The thing is, QuantileDiscretizer and Bucketizer do fundamentally different things, so there are different use cases for each (quantiles vs. actual values). It's more of a nuisance than anything: an unclear parameter that seems to imply things that are not actually the case. Here's where it *really* falls apart: if I have a bucket and I provide one split, how many buckets do I have? In Bucketizer I have none! That makes little sense. "Splits" is not the correct word here either, because they aren't splits! They're bucket boundaries. I think this is more than a documentation issue, even though the docs aren't very clear themselves:

> Parameter for mapping continuous features into buckets. With n+1 splits,
> there are n buckets. A bucket defined by splits x,y holds values in the range
> [x,y) except the last bucket, which also includes y. Splits should be of
> length greater than or equal to 3 and strictly increasing. Values at -inf,
> inf must be explicitly provided to cover all Double values; otherwise, values
> outside the splits specified will be treated as errors.

I also realize I'm being a pain here :) and that this stuff is always difficult. I empathize with that; it's just that this method doesn't seem to use correct terminology or a conceptually relevant implementation for what it aims to do.

was (Author: bill_chambers):
The thing is, QuantileDiscretizer and Bucketizer do fundamentally different things, so there are different use cases for each (quantiles vs. actual values). It's more of a nuisance than anything: an unclear parameter that seems to imply things that are not actually the case. Here's where it *really* falls apart: if I have a bucket and I provide one split, how many buckets do I have? In Bucketizer I have none! That makes no sense. "Splits" is not the correct word here either, because they aren't splits!
They're bounds or containers or buckets themselves. I think this is more than a documentation issue, even though the docs aren't very clear themselves:

> Parameter for mapping continuous features into buckets. With n+1 splits,
> there are n buckets. A bucket defined by splits x,y holds values in the range
> [x,y) except the last bucket, which also includes y. Splits should be of
> length greater than or equal to 3 and strictly increasing. Values at -inf,
> inf must be explicitly provided to cover all Double values; otherwise, values
> outside the splits specified will be treated as errors.

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---------------------------------------------------
>
>                 Key: SPARK-19714
>                 URL: https://issues.apache.org/jira/browse/SPARK-19714
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.1.0
>            Reporter: Bill Chambers
>
> {code}
> import org.apache.spark.ml.feature.Bucketizer
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid buckets. However, it fails:
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of
> Bucketizer bounds [5.0, 500.0]. Check your features, or loosen the
> lower/upper bound constraints.
> {code}
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
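[Editor's note] Since the whole thread turns on what n splits actually buy you, here is a minimal plain-Python sketch (no Spark needed) of the semantics the quoted doc describes: n+1 splits define n buckets, each bucket is half-open [x, y) except the last, and anything outside the outermost splits is "invalid". The helper `bucket_for` is hypothetical, written only to illustrate the documented rule; it is not Bucketizer's actual API.

```python
def bucket_for(splits, value):
    """Return the bucket index for value, or None if it falls outside all splits.

    Sketch of the documented rule: n+1 splits -> n buckets; bucket i covers
    [splits[i], splits[i+1]), except the last bucket also includes its upper bound.
    """
    if value < splits[0] or value > splits[-1]:
        return None  # out of range: the "invalid" case the bug report hits
    if value == splits[-1]:
        return len(splits) - 2  # last bucket also includes its upper bound
    for i in range(len(splits) - 1):
        if splits[i] <= value < splits[i + 1]:
            return i

splits = [5.0, 10.0, 250.0, 500.0]  # 4 splits -> only 3 buckets
print(bucket_for(splits, 7.0))    # 0: falls in [5.0, 10.0)
print(bucket_for(splits, 500.0))  # 2: the last bucket includes 500.0
print(bucket_for(splits, 0.0))    # None: below the lowest split, so "invalid"
```

This also makes the naming complaint concrete: with one "split" the sketch yields zero buckets, and values like 0.0 in the bug report are invalid unless -inf/inf are supplied as explicit boundaries.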