[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435236#comment-15435236 ]
Barry Becker commented on SPARK-17086:
--------------------------------------

Thanks. BTW, I hope there are some test cases where the column to bin has NaN values (for nulls). I seem to recall duplicate NaN split points being added for every occurrence of NaN in the data (or something like that). I will open a separate issue on this if I can nail down the specifics and make a simple test case.

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given
> invalid value) on valid data
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17086
>                 URL: https://issues.apache.org/jira/browse/SPARK-17086
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.0.0
>            Reporter: Barry Becker
>            Assignee: Vincent
>             Fix For: 2.0.1, 2.1.0
>
>
> I discovered this bug when working with a build from the master branch (which
> I believe is 2.1.0). This used to work fine when running Spark 1.6.2.
> I have a dataframe with an "intData" column that has values like
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce
> equal weight splits like this
> {code}
> new QuantileDiscretizer()
>   .setInputCol("intData")
>   .setOutputCol("intData_bin")
>   .setNumBuckets(10)
>   .fit(df)
> {code}
> But when that gets run it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0,
> 3.0, Infinity]
> {code}
> I don't think that there should be duplicate splits generated, should there be?
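
For illustration, here is a minimal plain-Scala sketch of why the reported splits array would be rejected and what a deduplicated version looks like. It assumes the splits are required to be strictly increasing, which is how I understand the Bucketizer splits parameter to be validated. The object `SplitsSketch` and the helpers `isStrictlyIncreasing` and `dedupe` are hypothetical names made up for this example; this is not Spark's code and not necessarily how the fix was implemented.

```scala
// Hypothetical helpers for illustration only; not Spark's implementation.
object SplitsSketch {
  // True when every split is strictly greater than the one before it.
  def isStrictlyIncreasing(splits: Seq[Double]): Boolean =
    splits.zip(splits.drop(1)).forall { case (a, b) => a < b }

  // Drop repeated values, keeping the first occurrence of each
  // (the input is assumed to be sorted already).
  def dedupe(splits: Seq[Double]): Seq[Double] = splits.distinct

  // The splits array quoted in the error message of the issue.
  val reported: Seq[Double] = Seq(
    Double.NegativeInfinity, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0,
    Double.PositiveInfinity)

  // The same candidates with the repeated 1.0, 2.0 and 3.0 removed.
  val cleaned: Seq[Double] = dedupe(reported)
}
```

Under that assumption, `SplitsSketch.isStrictlyIncreasing(SplitsSketch.reported)` is false because of the repeated values, while the deduplicated sequence passes the same check. The NaN case mentioned in the comment above is not covered by this sketch.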