[ 
https://issues.apache.org/jira/browse/SPARK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21986:
------------------------------
      Priority: Minor  (was: Major)
    Issue Type: Improvement  (was: Bug)

It's an approximate algorithm, and this is a tiny amount of data. I think it's 
at best a potential improvement, if it's doing slightly the wrong thing in a 
corner case.

However, is this wrong? You're asking for the 33%/66%-tiles. In both cases, at 
least 66% of the values are <= 0. I suppose it finds 40 in the first case as 
it's a bit approximate, but in the second case, it's far off.
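
To make the arithmetic concrete, here is a minimal sketch in plain Scala (no 
Spark) that checks both data sets from the report. It uses the textbook 
nearest-rank definition of a quantile, which is an assumption for illustration 
and not the approximate algorithm Spark runs. The names ExactQuantiles, 
quantile, fractionAtMost and bucketCounts are hypothetical helpers, and the 
half-open [lo, hi) bucket convention in bucketCounts is likewise an assumption.

```scala
object ExactQuantiles {
  // The two data sets from the issue description.
  val elevenRows: Seq[Int] = Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
  val twelveRows: Seq[Int] = Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)

  // Nearest-rank quantile: the smallest value v in the data such that
  // at least a fraction p of the values are <= v.
  def quantile(data: Seq[Int], p: Double): Int = {
    val sorted = data.sorted
    val rank = math.max(math.ceil(p * sorted.length).toInt, 1)
    sorted(rank - 1)
  }

  // Fraction of the values that are <= v.
  def fractionAtMost(data: Seq[Int], v: Int): Double =
    data.count(_ <= v).toDouble / data.length

  // Rows per bucket, assuming each bucket is the half-open range [lo, hi).
  def bucketCounts(data: Seq[Int], splits: Seq[Double]): Seq[Int] =
    splits.sliding(2).map {
      case Seq(lo, hi) => data.count(x => x >= lo && x < hi)
    }.toSeq

  def main(args: Array[String]): Unit = {
    for (data <- Seq(elevenRows, twelveRows)) {
      println(s"n = ${data.length}")
      println(s"  fraction <= 0: ${fractionAtMost(data, 0)}")
      println(s"  1/3 quantile:  ${quantile(data, 1.0 / 3)}")
      println(s"  2/3 quantile:  ${quantile(data, 2.0 / 3)}")
    }
    val inf = Double.PositiveInfinity
    println(bucketCounts(twelveRows, Seq(-inf, 0.0, inf)))
    println(bucketCounts(twelveRows, Seq(-inf, 40.0, inf)))
  }
}
```

Under these assumptions both exact quantiles come out as 0 for either data 
set, since 8 of 11 and 9 of 12 values are zero. A single split at 0.0 leaves 
all twelve rows in one bucket, while a split at 40.0 separates them 9 to 3.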

> QuantileDiscretizer picks wrong split point for data with lots of 0's
> ---------------------------------------------------------------------
>
>                 Key: SPARK-21986
>                 URL: https://issues.apache.org/jira/browse/SPARK-21986
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Barry Becker
>            Priority: Minor
>
> I have some simple test cases to help illustrate (see below).
> I discovered this with data that had 96,000 rows, but can reproduce with much 
> smaller data that has roughly the same distribution of values.
> If I have data like
>   Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> and ask for 3 buckets, then it does the right thing and yields splits of 
> Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity)
> However, if I add just one more zero, such that I have data like
>  Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> then it will do the wrong thing and give splits of 
>   Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)
> I'm not bothered that it gave fewer buckets than asked for (that is to be 
> expected), but I am bothered that it picked 0.0 instead of 40 as the one 
> split point.
> The way it did it, now I have 1 bucket with all the data, and a second with 
> none of the data.
> Am I interpreting something wrong?
> Here are my 2 test cases in scala:
> {code}
> class QuantileDiscretizerSuite extends FunSuite {
>   test("Quantile discretizer on data with lots of 0") {
>     verify(Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>       Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
>   }
>   test("Quantile discretizer on data with one less 0") {
>     verify(Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>       Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity))
>   }
>   
>   def verify(data: Seq[Int], expectedSplits: Seq[Double]): Unit = {
>     // Pair each value with a dummy second column to build DataFrame rows.
>     val theData: Seq[(Int, Double)] = data.map(x => (x, 0.0))
>     val df = SPARK_SESSION.sqlContext.createDataFrame(theData)
>       .toDF("rawCol", "unused")
>     val qb = new QuantileDiscretizer()
>       .setInputCol("rawCol")
>       .setOutputCol("binnedColumn")
>       .setRelativeError(0.0)
>       .setNumBuckets(3)
>       .fit(df)
>     assertResult(expectedSplits) {qb.getSplits}
>   }
> }
> {code}


