[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426270#comment-15426270 ]
Sean Owen commented on SPARK-17086:
-----------------------------------

I suppose it depends on the desired semantics of QuantileDiscretizer. It sounds like it would already return fewer buckets than requested. (That could or should be documented.) It makes it sound like it tries to make the buckets match quantiles of the input, even if it doesn't guarantee it. The bins you describe here would result in pretty lopsided binning, but any consistent scheme would be the same.

OK, I think I would agree with matching the 1.6.2 behavior then, and documenting that the number of buckets may be smaller than requested, rather than returning buckets some of which will always be empty. Let's just document it and add a test for it. I don't think the test should involve the number of distinct input elements (which could be expensive to compute); you just want to collapse adjacent splits that are equal, right? That will cover more cases too.

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given
> invalid value) on valid data
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17086
>                 URL: https://issues.apache.org/jira/browse/SPARK-17086
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which
> I believe is 2.1.0). This used to work fine when running spark 1.6.2.
> I have a dataframe with an "intData" column that has values like
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce
> equal weight splits like this
> {code}
> new QuantileDiscretizer()
>   .setInputCol("intData")
>   .setOutputCol("intData_bin")
>   .setNumBuckets(10)
>   .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0,
> 3.0, Infinity]
> {code}
> I don't think that there should be duplicate splits generated should there be?
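
The idea in the comment of collapsing adjacent splits that are equal, without counting distinct input values, could look roughly like the sketch below. This is only an illustration in plain Scala: `DistinctSplits` and `collapseAdjacentDuplicates` are hypothetical names, not part of Spark, and the sketch assumes the splits array is already sorted in ascending order, as in the error message from the report.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper, not Spark's actual implementation.
object DistinctSplits {
  // Assumes `splits` is sorted ascending, so equal values are always adjacent.
  // Keeps the first value of each run of equal splits and drops the rest.
  def collapseAdjacentDuplicates(splits: Array[Double]): Array[Double] = {
    if (splits.isEmpty) {
      splits
    } else {
      val kept = ArrayBuffer(splits.head)
      // Only compare each split with the last one kept; no distinct count needed.
      for (s <- splits.tail if s != kept.last) kept += s
      kept.toArray
    }
  }
}
```

Applied to the splits from the report, [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, Infinity], this yields [-Infinity, 1.0, 2.0, 3.0, Infinity], i.e. four buckets rather than the ten requested, which matches the proposal to document that fewer buckets may be returned.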