[ 
https://issues.apache.org/jira/browse/SPARK-13444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13444.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.2
                   2.0.0

Issue resolved by pull request 11319
[https://github.com/apache/spark/pull/11319]

> QuantileDiscretizer chooses bad splits on large DataFrames
> ----------------------------------------------------------
>
>                 Key: SPARK-13444
>                 URL: https://issues.apache.org/jira/browse/SPARK-13444
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.6.0
>            Reporter: Oliver Pierson
>             Fix For: 2.0.0, 1.6.2
>
>
> In certain circumstances, QuantileDiscretizer fails to calculate the correct 
> splits and will instead split data into two bins regardless of the value 
> specified in numBuckets.
> For example, suppose dataset.count is 200 million and we run:
> val discretizer = new QuantileDiscretizer().setNumBuckets(10)
>   ... set output and input columns ...
> val dataWithBins = discretizer.fit(dataset).transform(dataset)
> In this case, dataWithBins will have only two distinct bins versus the 
> expected 10.
> The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be 
> fixed by changing line 113 like so:
> before: val requiredSamples = math.max(numBins * numBins, 10000)
> after: val requiredSamples = math.max(numBins * numBins, 10000.0)
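
Why the one-character change matters can be sketched as follows. The report
does not quote line 114, so this sketch assumes that line divides
requiredSamples by the Long row count to get a sampling fraction; under that
assumption an Int numerator gives integer division and a fraction of 0 for any
dataset larger than requiredSamples. The names SampleFraction, fractionBefore
and fractionAfter are hypothetical and exist only for illustration; they are
not part of Spark.

```scala
object SampleFraction {
  // Pre-fix shape (assumed): requiredSamples is an Int, so Int / Long is
  // integer division and truncates to 0 when totalSamples > requiredSamples.
  def fractionBefore(numBins: Int, totalSamples: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000) // Int
    math.min((requiredSamples / totalSamples).toDouble, 1.0)
  }

  // Post-fix shape: the literal 10000.0 makes requiredSamples a Double,
  // so the division is floating-point and keeps the small fraction.
  def fractionAfter(numBins: Int, totalSamples: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0) // Double
    math.min(requiredSamples / totalSamples, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val rows = 200000000L // the 200 million rows from the report
    println(s"before: ${fractionBefore(10, rows)}")
    println(s"after:  ${fractionAfter(10, rows)}")
  }
}
```

If the fraction is 0, the sample used to compute quantiles would be empty,
which is consistent with the two-bin result described in the report.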


