Barry Becker created SPARK-21986:
------------------------------------

             Summary: QuantileDiscretizer picks wrong split point for data with lots of 0's
                 Key: SPARK-21986
                 URL: https://issues.apache.org/jira/browse/SPARK-21986
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.1.1
            Reporter: Barry Becker

I have some simple test cases to help illustrate (see below). I discovered this with data that had 96,000 rows, but I can reproduce it with much smaller data that has roughly the same distribution of values.

If I have data like
Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
and ask for 3 buckets, then it does the right thing and yields splits of
Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity)

However, if I add just one more zero, such that I have data like
Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
then it does the wrong thing and gives splits of
Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)

I'm not bothered that it gave fewer buckets than asked for (that is to be expected), but I am bothered that it picked 0.0 instead of 40 as the one split point. As it stands, I have one bucket with all of the data and a second with none of it. Am I interpreting something wrong?

Here are my 2 test cases in Scala:
{code}
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.scalatest.FunSuite

// SPARK_SESSION is a SparkSession supplied by the surrounding test setup.
class QuantileDiscretizerSuite extends FunSuite {

  // Records the current result: 0.0 is the only split point.
  test("Quantile discretizer on data with lots of 0") {
    verify(Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
      Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
  }

  test("Quantile discretizer on data with one less 0") {
    verify(Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
      Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity))
  }

  def verify(data: Seq[Int], expectedSplits: Seq[Double]): Unit = {
    // Pair each value with an unused second column so every row is a tuple.
    val theData: Seq[(Int, Double)] = data.map(x => (x, 0.0))
    val df = SPARK_SESSION.sqlContext.createDataFrame(theData).toDF("rawCol", "unused")
    val qb = new QuantileDiscretizer()
      .setInputCol("rawCol")
      .setOutputCol("binnedColumn")
      .setRelativeError(0.0)
      .setNumBuckets(3)
      .fit(df)
    assertResult(expectedSplits) { qb.getSplits }
  }
}
{code}
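
One possible reading of these results, sketched in plain Scala without Spark. The index rule used here (take the sorted element at 0-based index ceil(i * n / numBuckets)) is an assumption chosen only because it reproduces both outputs reported above; it is not taken from Spark's implementation. `SplitSketch`, `quantile`, `splits`, and `bucketCounts` are hypothetical names.

```scala
object SplitSketch {
  // Hypothetical exact-quantile rule (an assumption, NOT Spark's code):
  // the i-th of k quantiles is the sorted element at 0-based index
  // ceil(i * n / k), computed in integer arithmetic and capped at n - 1.
  def quantile(sorted: Seq[Double], i: Int, k: Int): Double = {
    val n = sorted.length
    val idx = math.min((i * n + k - 1) / k, n - 1)
    sorted(idx)
  }

  // Probe the interior quantiles 1/k .. (k-1)/k, drop duplicate values,
  // and pad the result with the two infinities.
  def splits(data: Seq[Int], numBuckets: Int): Seq[Double] = {
    val sorted = data.map(_.toDouble).sorted
    val interior = (1 until numBuckets).map(i => quantile(sorted, i, numBuckets))
    Double.NegativeInfinity +: interior.distinct :+ Double.PositiveInfinity
  }

  // Count the values that fall in each half-open bucket [lower, upper).
  def bucketCounts(data: Seq[Int], splits: Seq[Double]): Seq[Int] =
    splits.sliding(2).map { case Seq(lo, hi) =>
      data.count(x => x >= lo && x < hi)
    }.toSeq
}
```

Under this assumed rule, the 12-value data has 9 zeros, so both the 1/3 and the 2/3 probes land on 0 and removing duplicates leaves 0.0 as the only split. `bucketCounts` then gives `Seq(0, 12)` for that split, whereas a single split at 40.0 would give `Seq(9, 3)`.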