Hello, I've discovered a bug in the QuantileDiscretizer estimator. Specifically, for large DataFrames QuantileDiscretizer will only create one split (i.e. two bins).
The error happens in lines 113 and 114 of QuantileDiscretizer.scala:

    val requiredSamples = math.max(numBins * numBins, 10000)
    val fraction = math.min(requiredSamples / dataset.count(), 1.0)

After the first line, requiredSamples is an Int, so the division in the second line is integer division. Therefore, if requiredSamples < dataset.count(), the quotient is truncated and fraction is always 0.0.

The problem can be fixed simply by replacing the first line with:

    val requiredSamples = math.max(numBins * numBins, 10000.0)

I've implemented this change in my fork and all tests passed (except for docker integration, but I think that's another issue). I'm happy to submit a PR if it will ease someone else's workload.

However, I'm unsure of how to create a JIRA. I've created an account on the issue tracker (issues.apache.org), but when I try to create an issue it asks me to choose a "Service Desk". Which one should I be choosing?

Thanks much,
Oliver Pierson
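
P.S. For anyone who wants to see the truncation in isolation, here is a minimal standalone sketch that needs no Spark. The object name FractionDemo, the helper names buggyFraction and fixedFraction, and the row count of 1,000,000 are made up for illustration; only the two expressions are taken from the lines of QuantileDiscretizer.scala quoted above, with dataset.count() stood in for by a plain Long.

```scala
object FractionDemo {
  // Mirrors the current code: requiredSamples is an Int, so Int / Long is
  // integer division and the quotient is truncated before math.min sees it.
  def buggyFraction(numBins: Int, count: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000) // Int
    math.min(requiredSamples / count, 1.0)
  }

  // Proposed change: the 10000.0 literal makes requiredSamples a Double,
  // so the division is floating-point.
  def fixedFraction(numBins: Int, count: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0) // Double
    math.min(requiredSamples / count, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val count = 1000000L // hypothetical stand-in for dataset.count()
    println(buggyFraction(10, count)) // prints 0.0
    println(fixedFraction(10, count)) // prints 0.01
  }
}
```

With a small count (say 5,000 rows) both versions return 1.0, which is presumably why the problem only shows up on large DataFrames.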