[ https://issues.apache.org/jira/browse/SPARK-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley updated SPARK-17455: -------------------------------------- Shepherd: Joseph K. Bradley > IsotonicRegression takes non-polynomial time for some inputs > ------------------------------------------------------------ > > Key: SPARK-17455 > URL: https://issues.apache.org/jira/browse/SPARK-17455 > Project: Spark > Issue Type: Bug > Components: MLlib > Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0 > Reporter: Nic Eggert > > The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently > in MLlib can take O(N!) time for certain inputs, when it should have > worst-case complexity of O(N^2). > To reproduce this, I pulled the private method poolAdjacentViolators out of > mllib.regression.IsotonicRegression and into a benchmarking harness. > Given this input > {code} > val x = (1 to length).toArray.map(_.toDouble) > val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 > else yi} > val w = Array.fill(length)(1d) > val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, > x), w) => (y, x, w)} > {code} > I vary the length of the input to get these timings: > || Input Length || Time (us) || > | 100 | 1.35 | > | 200 | 3.14 | > | 400 | 116.10 | > | 800 | 2134225.90 | > (tests were performed using > https://github.com/sirthias/scala-benchmarking-template) > I can also confirm that I run into this issue on a real dataset I'm working > on when trying to calibrate random forest probability output. Some partitions > take > 12 hours to run. This isn't a skew issue, since the largest partitions > finish in minutes. I can only assume that some partitions cause something > approaching this worst-case complexity. > I'm working on a patch that borrows the implementation that is used in > scikit-learn and the R "iso" package, both of which handle this particular > input in linear time and are quadratic in the worst case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org