Hello,

  I've discovered a bug in the QuantileDiscretizer estimator.  Specifically, 
for large DataFrames QuantileDiscretizer will only create one split (i.e. two 
bins).


The error happens in lines 113 and 114 of QuantileDiscretizer.scala:


    val requiredSamples = math.max(numBins * numBins, 10000)

    val fraction = math.min(requiredSamples / dataset.count(), 1.0)


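After the first line, requiredSamples is an Int, so the division in the second 
line is integer division.  Therefore, if requiredSamples < dataset.count() 
(i.e. for any sufficiently large DataFrame), the quotient truncates to 0 and 
fraction is always 0.0.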


The problem can be fixed simply by replacing the first line with the following, 
which makes requiredSamples a Double so that the division is floating-point:


    val requiredSamples = math.max(numBins * numBins, 10000.0)
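
To illustrate the difference, here is a minimal standalone sketch (not the 
actual Spark source).  The object name, the method names and the count 
parameter, which stands in for dataset.count(), are made up for this example:

```scala
object FractionDemo {
  // Mirrors the two quoted lines. requiredSamples is an Int and count is a
  // Long (hypothetical stand-in for dataset.count()), so `/` is integer
  // division and truncates toward zero before math.min ever sees it.
  def buggyFraction(numBins: Int, count: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000)
    math.min(requiredSamples / count, 1.0)
  }

  // Same logic with the proposed change: the 10000.0 literal makes
  // requiredSamples a Double, so `/` is floating-point division.
  def fixedFraction(numBins: Int, count: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0)
    math.min(requiredSamples / count, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val count = 1000000L                   // a "large" dataset
    println(buggyFraction(10, count))      // 0.0
    println(fixedFraction(10, count))      // 0.01
    println(buggyFraction(10, 5000L))      // 1.0 (small datasets are unaffected)
  }
}
```

With one million rows the original expression yields a sampling fraction of 
0.0, while the Double version yields 0.01 as intended.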


I've implemented this change in my fork and all tests passed (except for docker 
integration, but I think that's another issue).  I'm happy to submit a PR if it 
will ease someone else's workload.  However, I'm unsure of how to create a 
JIRA.  I've created an account on the issue tracker (issues.apache.org) but 
when I try to create an issue it asks me to choose a "Service Desk".  Which one 
should I be choosing?


Thanks much,

Oliver Pierson

