Github user neggert commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r93266668 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- @@ -328,74 +336,69 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali return Array.empty } - // Pools sub array within given bounds assigning weighted average value to all elements. - def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = { - val poolSubArray = input.slice(start, end + 1) - val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum - val weight = poolSubArray.map(_._3).sum + // Keeps track of the start and end indices of the blocks. blockBounds(start) gives the + // index of the end of the block and blockBounds(end) gives the index of the start of the + // block. Entries that are not the start or end of the block are meaningless. The idea is that --- End diff -- It relies on knowing ahead of time wether `x` is the start or end index<sup>1</sup>. If it's a start index, `blockBounds(x)` gives the ending index of that block. If it's an end index, `blockBounds(x)` gives the starting index of the block. So yes, the implementations of `blockStart` and `blockEnd` are identical. I just have two different functions because it makes the code more readable. Maybe the comment from scikit-learn (where I borrowed this idea from) explains it better? (their `target` = my `blockBounds`) > target describes a list of blocks. At any time, if [i..j] (inclusive) is > an active block, then blockBounds[i] := j and target[j] := i. The trick is just in maintaining the array so that the above property is always true. At initialization, it's trivially true because all blocks have only one element, and `blockBounds(x)` = `x`. After initialization, `blockBounds` is only modified by the merge function, which is set up to modify `blockBounds` so that this property is preserved, then return the starting index of the newly-merged block. This is admittedly a bit tricky, but it's a lot faster than the implementation I did where created a doubly-linked list of `Block` case classes. I'm open to suggestions on how to explain it better. <sup>1</sup> You could actually figure out whether you have a start or an end index by comparing `blockBounds(x)` to `x`. The lesser value will be the start index.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org