[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/15018

Looks like the build completed successfully. Somehow Jenkins didn't post the message.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/15018

My thought was that there's nothing in the computation that prevents it (unlike 0 weights, which cause NaNs). Why not err on the side of maximum flexibility to the end user? Just because there's no use case for it now doesn't mean someone won't find one in the future. Just my 2 cents. I don't feel strongly about it and will defer to committers.
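To make the zero-weight failure mode mentioned above concrete, here is a tiny hypothetical Scala snippet (not the PR's code) showing how the weighted-average pooling step divides by a zero total weight:

```scala
// Pooling a block whose weights are all zero: the weighted mean
// becomes 0.0 / 0.0, which is NaN in IEEE-754 arithmetic.
val block = Array((1.0, 0.0), (2.0, 0.0)) // hypothetical (label, weight) pairs
val pooled = block.map { case (y, w) => y * w }.sum / block.map(_._2).sum
// pooled is NaN, which then propagates through the rest of the fit
```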
[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/15018

Your changes sound fine. I don't have strong feelings one way or another, although I think we should at least throw a warning if we're discarding points. I do want to point out that disallowing negative weights also breaks the API, since they were previously supported (there's even a test for them).
[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...
Github user neggert commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r94641912

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

```diff
@@ -328,74 +336,80 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       return Array.empty
     }

-    // Pools sub array within given bounds assigning weighted average value to all elements.
-    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-      val poolSubArray = input.slice(start, end + 1)
-      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-      val weight = poolSubArray.map(_._3).sum
+    // Keeps track of the start and end indices of the blocks. if [i, j] is a valid block from
+    // input(i) to input(j) (inclusive), then blockBounds(i) = j and blockBounds(j) = i
+    val blockBounds = Array.range(0, input.length) // Initially, each data point is its own block

-      var i = start
-      while (i <= end) {
-        input(i) = (weightedSum / weight, input(i)._2, input(i)._3)
-        i = i + 1
-      }
+    // Keep track of the sum of weights and sum of weight * y for each block. weights(start)
+    // gives the values for the block. Entries that are not at the start of a block
+    // are meaningless.
+    val weights: Array[(Double, Double)] = input.map { case (y, _, weight) =>
+      require(weight != 0.0)
```

--- End diff --

It's a weird thing to do, but I don't think there's any reason we can't support it. The problem still has a unique solution.
[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/15018

Done. It was surprisingly tricky to test, since the exception is thrown in the executor.
[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/15018

Sorry, I've been on vacation. I'll make the requested changes in the next day or two.
[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/15018

@viirya Better to remove them, or throw an error? Personally, I'd rather be alerted that I'm passing invalid input, rather than have it "fixed" for me.
[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...
Github user neggert commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r9328

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

```diff
@@ -328,74 +336,69 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       return Array.empty
     }

-    // Pools sub array within given bounds assigning weighted average value to all elements.
-    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-      val poolSubArray = input.slice(start, end + 1)
-      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-      val weight = poolSubArray.map(_._3).sum
+    // Keeps track of the start and end indices of the blocks. blockBounds(start) gives the
+    // index of the end of the block and blockBounds(end) gives the index of the start of the
+    // block. Entries that are not the start or end of the block are meaningless. The idea is that
```

--- End diff --

It relies on knowing ahead of time whether `x` is a start or end index¹. If it's a start index, `blockBounds(x)` gives the ending index of that block. If it's an end index, `blockBounds(x)` gives the starting index of the block. So yes, the implementations of `blockStart` and `blockEnd` are identical; I just have two different functions because it makes the code more readable. Maybe the comment from scikit-learn (where I borrowed this idea from) explains it better? (their `target` = my `blockBounds`)

> target describes a list of blocks. At any time, if [i..j] (inclusive) is
> an active block, then target[i] := j and target[j] := i.

The trick is just in maintaining the array so that the above property is always true. At initialization, it's trivially true because all blocks have only one element and `blockBounds(x)` = `x`. After initialization, `blockBounds` is only modified by the merge function, which is set up to modify `blockBounds` so that this property is preserved, then return the starting index of the newly-merged block. This is admittedly a bit tricky, but it's a lot faster than the implementation where I created a doubly-linked list of `Block` case classes. I'm open to suggestions on how to explain it better.

¹ You could actually figure out whether you have a start or an end index by comparing `blockBounds(x)` to `x`. The lesser value will be the start index.
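The bookkeeping described above can be sketched in a few lines of standalone Scala. The names echo the PR's `blockBounds` and merge function, but this is an illustrative toy under those assumptions, not the actual Spark code:

```scala
// Initially every data point is its own block, so blockBounds(i) == i
// trivially satisfies: for any block [i, j], blockBounds(i) == j and
// blockBounds(j) == i.
def initBlockBounds(n: Int): Array[Int] = Array.range(0, n)

// Merge the block starting at `start1` with the block immediately after it,
// preserving the invariant above, and return the merged block's start index.
// (Assumes a block actually starts at end1 + 1; real code would check bounds.)
def merge(blockBounds: Array[Int], start1: Int): Int = {
  val end1 = blockBounds(start1) // end of the first block
  val start2 = end1 + 1          // the adjacent block starts right after it
  val end2 = blockBounds(start2) // end of the second block
  blockBounds(start1) = end2     // merged block's start now points at its end
  blockBounds(end2) = start1     // merged block's end points back at its start
  start1
}
```

Only the entries at block starts and ends are ever read, so the stale interior entries are harmless, which is what makes this O(1) per merge.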
[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/15018

@srowen Addressed your comments and fixed some style issues. Some updated timings:

Alternating decreasing input:

```scala
val x = (1 to length).toArray.map(_.toDouble)
val y = x.reverse.zipWithIndex.map { case (yi, i) => if (i % 2 == 1) yi - 1.5 else yi }
```

Before:

| Input Length | Time (us) |
| --: | --: |
| 100 | 1.4 |
| 200 | 3.0 |
| 400 | 101.8 |
| 800 | 2,010,511.7 |

After:

| Input Length | Time (us) |
| --: | --: |
| 100 | 2.5 |
| 200 | 5.8 |
| 400 | 10.1 |
| 800 | 20.7 |

Pathological input from scikit-learn benchmarks:

```scala
val y = ((0 until length) ++ (-(length - 1) until length) ++ (-(length - 1) to 0)).toArray.map(_.toDouble)
val x = (1 to y.length).toArray.map(_.toDouble)
```

Before:

| Input Length | Time (us) |
| --: | --: |
| 40 | 2.2 |
| 80 | 4.7 |
| 160 | 2140.0 |
| 320 | 3,076,508.1 |

After:

| Input Length | Time (us) |
| --: | --: |
| 40 | 5.3 |
| 80 | 10.4 |
| 160 | 21.3 |
| 320 | 41.9 |
[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/15018

Alright, the PAV algorithm has been completely re-written to follow what's outlined in "Minimizing Separable Convex Functions Subject to Simple Chain Constraints". I've tested it with a bunch of different inputs that caused the previous version of this algorithm to go non-polynomial. It stays linear for all of them. I will note that it's slightly slower on very small datasets (< 5000 points or so), but those still finish in less than a millisecond on my laptop, so I'm not too concerned.

The only caveat is that there was one test that I couldn't get passing, so I removed it. It involved passing input data with 0 weights. I'd argue that the isotonic regression problem doesn't even have a unique solution in that case, so we shouldn't support it. Still, it is a slight change in behavior.

Looks like Jenkins is pointing out some style issues, so I'll get to work on those.
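For readers following the thread, the linear-time pool-adjacent-violators idea can be sketched with a stack of (mean, weight) blocks. This is a simplified stand-in for, not a copy of, the PR's `blockBounds`-based implementation:

```scala
// Weighted PAV sketch: push each point as a one-element block, then merge
// the top two blocks (into their weighted mean) while they violate
// monotonicity. Each point is pushed once and merged at most once, so the
// whole pass is O(n).
def pava(y: Array[Double], w: Array[Double]): Array[Double] = {
  require(y.length == w.length)
  val means = new Array[Double](y.length)   // block means (stack)
  val weights = new Array[Double](y.length) // block total weights (stack)
  val counts = new Array[Int](y.length)     // block sizes (stack)
  var top = -1
  for (k <- y.indices) {
    top += 1
    means(top) = y(k); weights(top) = w(k); counts(top) = 1
    // merge while the last two blocks are out of order
    while (top > 0 && means(top - 1) > means(top)) {
      val wSum = weights(top - 1) + weights(top)
      means(top - 1) = (means(top - 1) * weights(top - 1) + means(top) * weights(top)) / wSum
      weights(top - 1) = wSum
      counts(top - 1) += counts(top)
      top -= 1
    }
  }
  // expand the blocks back into per-element fitted values
  val out = new Array[Double](y.length)
  var idx = 0
  for (b <- 0 to top; _ <- 0 until counts(b)) { out(idx) = means(b); idx += 1 }
  out
}
```

Note the division by `wSum` is exactly where a zero total weight would produce NaNs, which is why disallowing 0 weights is discussed above.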
[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/15018

Found another input that triggers non-polynomial time with the code in this PR. I'm again borrowing from scikit-learn. I think this is the case they found that led them to re-write their implementation.

```scala
val y = ((0 until length) ++ (-(length - 1) until length) ++ (-(length - 1) to 0)).toArray.map(_.toDouble)
val x = (1 to y.length).toArray.map(_.toDouble)
```

| Input Length | Time (ns) |
| --: | --: |
| 40 | 2059 |
| 80 | 4604 |
| 160 | 1974269 |
| 320 | 3246603433 |

I'm now working on implementing what's described in the Best papers. This should give O(n), even in the worst case. Should I close this and open a new PR with the new algorithm, or just add it here and you can squash when you merge?
[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/15018

Looks like the new scikit-learn implementation suffers from a similar problem to the one that the original Spark implementation had. I left them a note pointing it out.

Right now I'm working on implementing what's in "Minimizing Separable Convex Functions Subject to Simple Chain Constraints", by Best, Chakravarti, and Ubhaya. Running down the reference trail, it looks like the original algorithm is in "Projections onto Order Simplexes" by Grotzinger and Witzgall, and this particular formulation of it is originally in "Active set algorithms for isotonic regression; A unifying framework" by Best and Chakravarti.
[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/15018

Given that the scikit-learn implementation I based this on has changed (https://github.com/scikit-learn/scikit-learn/pull/7444) since I submitted the PR, and @jkbradley has pointed out a reference I hadn't seen, I'd like to take a little bit of time to see if there's a still better algorithm. This one is an improvement over what we had, but as long as we're changing it, we may as well get it right.
[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...
Github user neggert commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r92432636

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

```diff
@@ -344,27 +344,30 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
     }

     var i = 0
-    val len = input.length
-    while (i < len) {
-      var j = i
-
-      // Find monotonicity violating sequence, if any.
-      while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-        j = j + 1
-      }
+    val n = input.length - 1
+    var notFinished = true
+
+    while (notFinished) {
+      i = 0
+      notFinished = false
+
+      // Iterate through the data, fix any monotonicity violations we find
+      // We may need to do this multiple times, as pooling can introduce violations
+      // at locations that were previously fine.
+      while (i < n) {
+        var j = i
+
+        // Find next monotonicity violating sequence, if any.
+        while (j < n && input(j)._1 >= input(j + 1)._1) {
```

--- End diff --

I hadn't actually seen that paper. I just worked from the scikit-learn implementation (which, incidentally, has changed since I submitted this PR). I believe that either way will get you correct results. Doing `>=` should be faster in some cases, because it's greedier about pooling, which should lead to fewer iterations of the outer loop.
[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...
Github user neggert commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r92428442

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

```diff
@@ -344,27 +344,30 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
     }

     var i = 0
-    val len = input.length
-    while (i < len) {
-      var j = i
-
-      // Find monotonicity violating sequence, if any.
-      while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-        j = j + 1
-      }
+    val n = input.length - 1
+    var notFinished = true
+
+    while (notFinished) {
+      i = 0
+      notFinished = false
+
+      // Iterate through the data, fix any monotonicity violations we find
+      // We may need to do this multiple times, as pooling can introduce violations
+      // at locations that were previously fine.
+      while (i < n) {
+        var j = i
+
+        // Find next monotonicity violating sequence, if any.
+        while (j < n && input(j)._1 >= input(j + 1)._1) {
+          j = j + 1
+        }

-      // If monotonicity was not violated, move to next data point.
-      if (i == j) {
-        i = i + 1
-      } else {
-        // Otherwise pool the violating sequence
-        // and check if pooling caused monotonicity violation in previously processed points.
-        while (i >= 0 && input(i)._1 > input(i + 1)._1) {
+        // Pool the violating sequence with the data point preceding it
+        if (input(i)._1 != input(j)._1) {
```

--- End diff --

Correct. Any violations that are introduced will be fixed in the next iteration.
[GitHub] spark issue #15396: [SPARK-14804][Spark][Graphx] Fix checkpointing of Vertex...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/15396

Any update on this? Would love to see it in an upcoming release.
[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/15018

Anyone? @srowen?
[GitHub] spark issue #12576: [SPARK-14804][Spark][Graphx] Fix Graph vertexRDD/EdgeRDD...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/12576

Not sure what's going on regarding the multiple PRs for this issue, but I cherry-picked this PR on top of 1.6.2 and it fixed the problem for me.
[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...
Github user neggert commented on the issue: https://github.com/apache/spark/pull/15018

Could someone please take a look at this? @mengxr maybe? Looks like you merged the initial implementation.
[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...
GitHub user neggert opened a pull request: https://github.com/apache/spark/pull/15018

[SPARK-17455][MLlib] Improve PAVA implementation in IsotonicRegression

## What changes were proposed in this pull request?

New implementation of the Pool Adjacent Violators Algorithm (PAVA) in mllib.IsotonicRegression. The previous implementation could have factorial complexity in the worst case. This implementation, which closely follows those in scikit-learn and the R `iso` package, runs in quadratic time in the worst case.

## How was this patch tested?

Existing unit tests in both `mllib` and `ml` passed before and after this patch. Scaling properties were tested by running the `poolAdjacentViolators` method in [scala-benchmarking-template](https://github.com/sirthias/scala-benchmarking-template) with the input generated by

```scala
val x = (1 to length).toArray.map(_.toDouble)
val y = x.reverse.zipWithIndex.map { case (yi, i) => if (i % 2 == 1) yi - 1.5 else yi }
val w = Array.fill(length)(1d)
val input: Array[(Double, Double, Double)] = (y zip x zip w) map { case ((y, x), w) => (y, x, w) }
```

Before this patch:

| Input Length | Time (us) |
| --: | --: |
| 100 | 1.35 |
| 200 | 3.14 |
| 400 | 116.10 |
| 800 | 2134225.90 |

After this patch:

| Input Length | Time (us) |
| --: | --: |
| 100 | 1.25 |
| 200 | 2.53 |
| 400 | 5.86 |
| 800 | 10.55 |

Benchmarking was also performed with randomly-generated y values, with similar results.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/neggert/spark SPARK-17455-isoreg-algo

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15018.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15018

commit 5608ae1f4c9089dd3a299eef14af59bed6c96192
Author: z001qdp <nicholas.egg...@target.com>
Date: 2016-09-08T21:43:24Z

    New PAVA implementation
[GitHub] spark pull request #14140: [SPARK-16426][MLlib] Fix bug that caused NaNs in ...
Github user neggert commented on a diff in the pull request: https://github.com/apache/spark/pull/14140#discussion_r70818220

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

```diff
@@ -408,9 +409,11 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
    */
   private def parallelPoolAdjacentViolators(
       input: RDD[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
-    val parallelStepResult = input
-      .sortBy(x => (x._2, x._1))
-      .glom()
+    val keyedInput = input.keyBy(_._2)
+    val parallelStepResult = keyedInput
+      .partitionBy(new RangePartitioner(keyedInput.getNumPartitions, keyedInput))
+      .values
+      .mapPartitions( p => Iterator(p.toSeq.sortBy(x => (x._2, x._1)).toArray))
```

--- End diff --

Actually, the `Iterator` is necessary. Since we removed the `glom`, we need to return an `RDD[Array[T]]`. So the function passed to `mapPartitions` needs to be `Iterator[T] => Iterator[Array[T]]`. Removed the extraneous `toSeq`, though.
[GitHub] spark pull request #14140: [SPARK-16426][MLlib] Fix bug that caused NaNs in ...
Github user neggert commented on a diff in the pull request: https://github.com/apache/spark/pull/14140#discussion_r70816925

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

```diff
@@ -408,9 +409,11 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
    */
   private def parallelPoolAdjacentViolators(
       input: RDD[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
-    val parallelStepResult = input
-      .sortBy(x => (x._2, x._1))
-      .glom()
+    val keyedInput = input.keyBy(_._2)
+    val parallelStepResult = keyedInput
+      .partitionBy(new RangePartitioner(keyedInput.getNumPartitions, keyedInput))
+      .values
+      .mapPartitions( p => Iterator(p.toSeq.sortBy(x => (x._2, x._1)).toArray))
```

--- End diff --

`keyedInput` is separated out because it needs to be used several times. Will clean up the sorting.
[GitHub] spark pull request #14140: [SPARK-16426][MLlib] Fix bug that caused NaNs in ...
Github user neggert commented on a diff in the pull request: https://github.com/apache/spark/pull/14140#discussion_r70451628

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

```diff
@@ -408,8 +409,12 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
    */
   private def parallelPoolAdjacentViolators(
       input: RDD[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
-    val parallelStepResult = input
-      .sortBy(x => (x._2, x._1))
+    val keyedInput = input
```

--- End diff --

Also, doesn't the `glom` load the whole partition into memory anyway?
[GitHub] spark pull request #14140: [SPARK-16426][MLlib] Fix bug that caused NaNs in ...
Github user neggert commented on a diff in the pull request: https://github.com/apache/spark/pull/14140#discussion_r70450795

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

```diff
@@ -408,8 +409,12 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
    */
   private def parallelPoolAdjacentViolators(
       input: RDD[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
-    val parallelStepResult = input
-      .sortBy(x => (x._2, x._1))
+    val keyedInput = input
```

--- End diff --

`repartitionAndSortWithinPartitions` requires the partition key and the sortBy key to be the same. We want to partition by feature, then sort by feature *and* label. So it would still require a second step to sort the partitions, although they'd be mostly sorted already. Maybe there could be a speed-up by using an insertion or Timsort on the mostly-sorted partition...
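The partition-by-feature-then-sort scheme under discussion can be mimicked without a cluster. This pure-Scala sketch is illustrative only (the function name and the naive range split are assumptions, not Spark's `RangePartitioner`); it shows the key property that all rows sharing a feature value land in the same partition, and each partition is then sorted by (feature, label):

```scala
// Rows are (label, feature, weight) triples, as in IsotonicRegression.
def partitionAndSort(
    rows: Seq[(Double, Double, Double)],
    numPartitions: Int): Seq[Array[(Double, Double, Double)]] = {
  // Range-partition on the feature key so that equal features co-locate.
  val features = rows.map(_._2).distinct.sorted
  val perPart = math.max(1, math.ceil(features.size.toDouble / numPartitions).toInt)
  val ranges = features.grouped(perPart).map(_.toSet).toSeq
  // Within each partition, sort by (feature, label) as the diff does.
  ranges.map { keys =>
    rows.filter(r => keys(r._2)).sortBy(r => (r._2, r._1)).toArray
  }
}
```

The point of the change above is exactly this co-location guarantee: with a plain `sortBy`, rows with the same feature value could straddle a partition boundary, which is what produced the NaNs.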
[GitHub] spark pull request #14140: [SPARK-16426][MLlib] Fixed bug that caused NaNs i...
GitHub user neggert opened a pull request: https://github.com/apache/spark/pull/14140

[SPARK-16426][MLlib] Fixed bug that caused NaNs in IsotonicRegression

## What changes were proposed in this pull request?

Fixed a bug that caused `NaN`s in `IsotonicRegression`. The problem occurs when training rows with the same feature value but different labels end up on different partitions. This patch changes a `sortBy` call to a `partitionBy(RangePartitioner)` followed by a `mapPartitions(sortBy)` in order to ensure that all rows with the same feature value end up on the same partition.

## How was this patch tested?

Added a unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/neggert/spark SPARK-16426-isotonic-nan

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14140.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14140

commit 3f431164920fb6bd5dc7116748eb178493750136
Author: z001qdp <nicholas.egg...@target.com>
Date: 2016-07-11T20:26:17Z

    Fixed bug that caused NaNs in IsotonicRegression