[ https://issues.apache.org/jira/browse/SPARK-17086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426270#comment-15426270 ]

Sean Owen commented on SPARK-17086:
-----------------------------------

I suppose it depends on the desired semantics of QuantileDiscretizer. It sounds 
like it already would return fewer buckets than requested. (That could or should 
be documented.) 

That makes it sound like it tries to make the buckets match quantiles of the 
input, even if it doesn't guarantee it. The bins you describe here would result 
in pretty lopsided binning, but any consistent scheme would be the same.
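
To make the lopsidedness concrete, here is a rough illustration in plain Scala 
(naive empirical quantiles, not Spark's approxQuantile): asking for 10 buckets 
over 12 values with only 3 distinct values necessarily repeats split points.
{code}
// The reporter's data, sorted: four 1s, five 2s, three 3s.
val data = Seq(1, 3, 2, 1, 1, 2, 3, 2, 2, 2, 1, 3).map(_.toDouble).sorted

// Naive empirical quantiles at probabilities 0.1, 0.2, ..., 0.9.
val splits = (1 until 10).map { i =>
  data(((i / 10.0) * (data.length - 1)).round.toInt)
}

println(splits.mkString(", "))
// 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0 -- duplicates are unavoidable
{code}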

OK, I think I would agree with matching the 1.6.2 behavior then, and documenting 
that the number of buckets may be smaller than requested, rather than returning 
buckets some of which will always be empty. Let's just document it and add a 
test for it.

I don't think the test should involve the number of distinct input elements 
(which could be expensive to compute); you just want to collapse adjacent 
splits that are equal, right? That will cover more cases too.
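
A minimal sketch of that collapse (a hypothetical helper, not the actual 
patch):
{code}
// Drop adjacent duplicate split points so the result is strictly increasing,
// which is what Bucketizer requires of its splits.
def collapseSplits(splits: Array[Double]): Array[Double] =
  splits.foldLeft(Vector.empty[Double]) { (acc, s) =>
    if (acc.nonEmpty && acc.last == s) acc else acc :+ s
  }.toArray

collapseSplits(Array(Double.NegativeInfinity, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0,
  Double.PositiveInfinity))
// Array(-Infinity, 1.0, 2.0, 3.0, Infinity)
{code}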

> QuantileDiscretizer throws InvalidArgumentException (parameter splits given 
> invalid value) on valid data
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17086
>                 URL: https://issues.apache.org/jira/browse/SPARK-17086
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Barry Becker
>
> I discovered this bug when working with a build from the master branch (which 
> I believe is 2.1.0). This used to work fine when running Spark 1.6.2.
> I have a dataframe with an "intData" column that has values like 
> {code}
> 1 3 2 1 1 2 3 2 2 2 1 3
> {code}
> I have a stage in my pipeline that uses the QuantileDiscretizer to produce 
> equal-weight splits, like this:
> {code}
> new QuantileDiscretizer()
>         .setInputCol("intData")
>         .setOutputCol("intData_bin")
>         .setNumBuckets(10)
>         .fit(df)
> {code}
> But when that gets run, it (incorrectly) throws this error:
> {code}
> parameter splits given invalid value [-Infinity, 1.0, 1.0, 2.0, 2.0, 3.0, 
> 3.0, Infinity]
> {code}
> I don't think there should be duplicate splits generated, should there be?
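
For context on the error itself: the splits param is validated to be strictly 
increasing (and of length >= 3), so any duplicate split point is rejected. A 
hypothetical restatement of that check, not Spark's actual validator:
{code}
def validSplits(splits: Array[Double]): Boolean =
  splits.length >= 3 && splits.sliding(2).forall { case Array(a, b) => a < b }

validSplits(Array(Double.NegativeInfinity, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0,
  Double.PositiveInfinity))
// false -- the duplicated 1.0, 2.0, and 3.0 trip the validation
{code}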


