[ https://issues.apache.org/jira/browse/SPARK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-21986:
------------------------------
      Priority: Minor  (was: Major)
    Issue Type: Improvement  (was: Bug)

It's an approximate algorithm, and this is a tiny amount of data. I think it's at best a potential improvement, if it's doing slightly the wrong thing in a corner case.

However, is this wrong? You're asking for the 33%/66%-tiles. In both cases, at least 66% of the values are <= 0. I suppose it finds 40 in the first case as it's a bit approximate, but in the second case, it's far off.

> QuantileDiscretizer picks wrong split point for data with lots of 0's
> ---------------------------------------------------------------------
>
>                 Key: SPARK-21986
>                 URL: https://issues.apache.org/jira/browse/SPARK-21986
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Barry Becker
>            Priority: Minor
>
> I have some simple test cases to help illustrate (see below).
> I discovered this with data that had 96,000 rows, but can reproduce with much
> smaller data that has roughly the same distribution of values.
> If I have data like
> Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> and ask for 3 buckets, then it does the right thing and yields splits of
> Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity)
> However, if I add just one more zero, such that I have data like
> Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> then it will do the wrong thing and give splits of
> Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)
> I'm not bothered that it gave fewer buckets than asked for (that is to be
> expected), but I am bothered that it picked 0.0 instead of 40 as the one
> split point.
> The way it did it, now I have 1 bucket with all the data, and a second with
> none of the data.
> Am I interpreting something wrong?
> Here are my 2 test cases in scala:
> {code}
> import org.apache.spark.ml.feature.QuantileDiscretizer
> import org.scalatest.FunSuite
>
> class QuantileDiscretizerSuite extends FunSuite {
>
>   test("Quantile discretizer on data with lots of 0") {
>     verify(Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>       Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
>   }
>
>   test("Quantile discretizer on data with one less 0") {
>     verify(Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>       Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity))
>   }
>
>   // Fits a 3-bucket QuantileDiscretizer on the data and compares its splits.
>   def verify(data: Seq[Int], expectedSplits: Seq[Double]): Unit = {
>     // Pair each value with an unused second column so a DataFrame can be built.
>     val theData: Seq[(Int, Double)] = data.map {
>       case x: Int => (x, 0.0)
>       case _ => (0, 0.0)
>     }
>     // SPARK_SESSION is a SparkSession defined elsewhere in the test setup.
>     val df = SPARK_SESSION.sqlContext.createDataFrame(theData).toDF("rawCol",
>       "unused")
>     val qb = new QuantileDiscretizer()
>       .setInputCol("rawCol")
>       .setOutputCol("binnedColumn")
>       .setRelativeError(0.0)
>       .setNumBuckets(3)
>       .fit(df)
>     assertResult(expectedSplits) {qb.getSplits}
>   }
> }
> {code}
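
To make the 33%/66% argument in the comment concrete, here is a minimal sketch in plain Scala (no Spark) that computes the fraction of values <= 0 and the exact quantiles for both data sets. The object QuantileCheck, its method names, and the nearest-rank quantile definition are assumptions made for illustration. This is not the approximate algorithm QuantileDiscretizer uses, so it only shows what an exact calculation would give, not what Spark returns.

```scala
// Illustrative only: exact nearest-rank quantiles, not Spark's approximate algorithm.
object QuantileCheck {

  // Fraction of the data that is <= v.
  def fractionAtMost(data: Seq[Int], v: Int): Double =
    data.count(_ <= v).toDouble / data.size

  // Interior quantiles for numBuckets buckets under the nearest-rank definition:
  // the k-th quantile is the value at rank ceil(k * n / numBuckets) in sorted order.
  def exactQuantiles(data: Seq[Int], numBuckets: Int): Seq[Int] = {
    val sorted = data.sorted
    val n = sorted.size
    (1 until numBuckets).map { k =>
      val rank = (k * n + numBuckets - 1) / numBuckets // integer ceiling
      sorted(rank - 1)
    }
  }

  def main(args: Array[String]): Unit = {
    val eleven = Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
    val twelve = Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
    for (data <- Seq(eleven, twelve)) {
      val frac = fractionAtMost(data, 0)
      val qs = exactQuantiles(data, 3).mkString(", ")
      println(s"n=${data.size} fraction<=0: $frac exact 1/3 and 2/3 quantiles: $qs")
    }
  }
}
```

Under this definition both data sets have 0 as the 1/3 and the 2/3 quantile, since 8 of 11 and 9 of 12 values are 0. The question of whether 40 would make a more useful single split point is separate from what the quantiles themselves are.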