[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426259#comment-15426259 ]
Sean Owen commented on SPARK-17086:
-----------------------------------

You are right, some buckets will get no values. If the input has so few distinct (integer) values, 10 buckets is too many and there's no way 3 distinct values can ever fall into more than 3 distinct buckets. That much is actually fine IMHO.

But yes it's ambiguous because 2 can logically go into [2,2) or [2,3). If it were consistently mapped into one of them, like the first matching bucket, I think it would still be valid output. If there's no easy way to make this mapping consistently then I agree it should probably be an error to end up with splits like this.

(I'm also not sure why 8 splits are output in the example but 10 buckets were requested. It just defines 7 buckets.)

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given invalid value) on valid data
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17086
>                 URL: https://issues.apache.org/jira/browse/SPARK-17086
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which I believe is 2.1.0). This used to work fine when running spark 1.6.2.
> I have a dataframe with an "intData" column that has values like
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce equal weight splits like this
> {code}
> new QuantileDiscretizer()
>   .setInputCol("intData")
>   .setOutputCol("intData_bin")
>   .setNumBuckets(10)
>   .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, Infinity]
> {code}
> I don't think that there should be duplicate splits generated should there be?
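
For illustration only, here is a minimal plain-Scala sketch (no Spark dependency) of the two ideas in the comment: checking whether the reported splits are strictly increasing, and mapping a value to the first matching half-open bucket [lo, hi). The names `strictlyIncreasing`, `deduped` and `firstBucket` are hypothetical helpers made up for this sketch. They are not Spark APIs and are not meant to describe how QuantileDiscretizer or Bucketizer are actually implemented.

```scala
// Splits copied from the error message in the report, duplicates included.
val splits: Array[Double] = Array(
  Double.NegativeInfinity, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, Double.PositiveInfinity)

// Hypothetical check: true only if every split is strictly greater than the previous one.
def strictlyIncreasing(xs: Array[Double]): Boolean =
  xs.sliding(2).forall {
    case Array(a, b) => a < b
    case _           => true // fewer than two splits: nothing to compare
  }

// Hypothetical fix-up: drop repeated split points while keeping their order.
val deduped: Array[Double] = splits.distinct

// Hypothetical lookup: index of the first bucket [lo, hi) that contains x, or -1 if none does.
def firstBucket(xs: Array[Double], x: Double): Int =
  xs.sliding(2).indexWhere {
    case Array(lo, hi) => lo <= x && x < hi
    case _             => false
  }
```

Under these assumptions, `strictlyIncreasing(splits)` is false and `strictlyIncreasing(deduped)` is true. With the splits as reported, `firstBucket(splits, 2.0)` returns index 4, the [2,3) bucket, because the half-open bucket [2,2) is empty and can never match. After removing duplicates there are 5 splits and 4 buckets, and the same value lands at index 2.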