[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429951#comment-15429951 ]
Qian Huang commented on SPARK-17086:
------------------------------------

I also agree that it does not make sense to put the same continuous value into different categorical buckets. But the original exception is confusing and difficult to understand; we could add a clearer exception, like R does:

{code}
> x <- c(1,1,1,1,1,1,1,1,4,5,10)
> quantile(x)
  0%  25%  50%  75% 100%
 1.0  1.0  1.0  2.5 10.0
> a <- quantile(x)
> cut(x, a)
Error in cut.default(x, a) : 'breaks' are not unique
{code}

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given
> invalid value) on valid data
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17086
>                 URL: https://issues.apache.org/jira/browse/SPARK-17086
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which
> I believe is 2.1.0). This used to work fine when running Spark 1.6.2.
> I have a dataframe with an "intData" column that has values like
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce
> equal-weight splits like this
> {code}
> new QuantileDiscretizer()
>   .setInputCol("intData")
>   .setOutputCol("intData_bin")
>   .setNumBuckets(10)
>   .fit(df)
> {code}
> But when that gets run, it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0,
> 3.0, Infinity]
> {code}
> I don't think there should be duplicate splits generated, should there?
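
To make the suggestion concrete, here is a rough sketch (not the actual Spark implementation) of the kind of pre-check this would mean: compute the candidate splits with approxQuantile, which is, as far as I know, what QuantileDiscretizer relies on internally, and fail with a cut()-style message when the quantiles are not unique. The object name DistinctSplitsCheck and the exact error wording are made up here for illustration:

{code}
import org.apache.spark.sql.SparkSession

object DistinctSplitsCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("quantile-splits-check")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The same data as in the issue report.
    val df = Seq(1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0, 2.0, 1.0, 3.0).toDF("intData")

    // Deciles, as QuantileDiscretizer with numBuckets = 10 would request.
    val probabilities = (0 to 10).map(_ / 10.0).toArray
    val rawSplits = df.stat.approxQuantile("intData", probabilities, 0.0)

    // Mirror R's cut(): refuse non-unique breaks with an explicit message
    // instead of letting the downstream Bucketizer reject the splits later.
    val distinctSplits = rawSplits.distinct.sorted
    if (distinctSplits.length != rawSplits.length) {
      throw new IllegalArgumentException(
        s"Quantile splits are not unique: ${rawSplits.mkString("[", ", ", "]")}. " +
          "The column has too few distinct values for the requested number of buckets.")
    }

    println(s"Usable splits: ${distinctSplits.mkString("[", ", ", "]")}")
    spark.stop()
  }
}
{code}

On the data above the deciles collapse to just 1.0, 2.0 and 3.0, so this check should fire and show the duplicate boundaries directly, which seems easier to act on than the current "parameter splits given invalid value" message.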