[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15018

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r95914526

Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:

```diff
@@ -312,90 +313,120 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
   }

   /**
-   * Performs a pool adjacent violators algorithm (PAV).
-   * Uses approach with single processing of data where violators
-   * in previously processed data created by pooling are fixed immediately.
-   * Uses optimization of discovering monotonicity violating sequences (blocks).
+   * Performs a pool adjacent violators algorithm (PAV). Implements the algorithm originally
+   * described in [1], using the formulation from [2, 3]. Uses an array to keep track of start
+   * and end indices of blocks.
    *
-   * @param input Input data of tuples (label, feature, weight).
+   * [1] Grotzinger, S. J., and C. Witzgall. "Projections onto order simplexes." Applied
+   *     mathematics and Optimization 12.1 (1984): 247-270.
+   *
+   * [2] Best, Michael J., and Nilotpal Chakravarti. "Active set algorithms for isotonic
+   *     regression; a unifying framework." Mathematical Programming 47.1-3 (1990): 425-439.
+   *
+   * [3] Best, Michael J., Nilotpal Chakravarti, and Vasant A. Ubhaya. "Minimizing separable convex
+   *     functions subject to simple chain constraints." SIAM Journal on Optimization 10.3 (2000):
+   *     658-672.
+   *
+   * @param input Input data of tuples (label, feature, weight). Weights must
+   *              be non-negative.
    * @return Result tuples (label, feature, weight) where labels were updated
    *         to form a monotone sequence as per isotonic regression definition.
    */
   private def poolAdjacentViolators(
       input: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
-    if (input.isEmpty) {
-      return Array.empty
+    val cleanInput = input.flatMap{ case (y, x, weight) =>
+      require(weight >= 0.0, "weights must be non-negative")
+      if (weight == 0.0) {
+        logWarning(s"Dropping zero-weight point ($y, $x, $weight)")
+        Array[(Double, Double, Double)]()
+      } else {
+        Array((y, x, weight))
+      }
     }
-    // Pools sub array within given bounds assigning weighted average value to all elements.
-    def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-      val poolSubArray = input.slice(start, end + 1)
+    if (cleanInput.isEmpty) {
+      return Array.empty
+    }
-      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-      val weight = poolSubArray.map(_._3).sum
+    // Keeps track of the start and end indices of the blocks. if [i, j] is a valid block from
+    // cleanInput(i) to cleanInput(j) (inclusive), then blockBounds(i) = j and blockBounds(j) = i
+    // Initially, each data point is its own block.
+    val blockBounds = Array.range(0, cleanInput.length)
-      var i = start
-      while (i <= end) {
-        input(i) = (weightedSum / weight, input(i)._2, input(i)._3)
-        i = i + 1
-      }
+    // Keep track of the sum of weights and sum of weight * y for each block. weights(start)
+    // gives the values for the block. Entries that are not at the start of a block
+    // are meaningless.
+    val weights: Array[(Double, Double)] = cleanInput.map { case (y, _, weight) =>
+      (weight, weight * y)
     }
-    var i = 0
-    val len = input.length
-    while (i < len) {
-      var j = i
+    // a few convenience functions to make the code more readable
-      // Find monotonicity violating sequence, if any.
-      while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-        j = j + 1
-      }
+    // blockStart and blockEnd have identical implementations. We create two different
+    // functions to make the code more expressive
+    def blockEnd(start: Int): Int = blockBounds(start)
+    def blockStart(end: Int): Int = blockBounds(end)
-      // If monotonicity was not violated, move to next data point.
-      if (i == j) {
-        i = i + 1
-      } else {
-        // Otherwise pool the violating sequence
-        // and check if pooling caused monotonicity violation in previously processed points.
-        while (i >= 0 && input(i)._1 > input(i + 1)._1) {
-          pool(input, i, j)
-          i = i - 1
-        }
+    // the next block starts at the index after the end of this block
+    def nextBlock(start: Int): Int = blockEnd(start) + 1
-        i = j
-      }
+    // the previous block ends at the index before the start of this block
```
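The array-based formulation shown in the diff can be sketched end-to-end as a plain Scala function. This is a hedged, standalone approximation of the approach, not the Spark source: `blockBounds`, `blockEnd`, `blockStart`, `nextBlock`, `prevBlock`, and `merge` mirror the names in the diff, while `pava`, `mean`, and the final write-out loop are filled in by analogy and may differ from the actual PR.

```scala
// Standalone sketch of array-based PAVA. For each block [i, j]:
// blockBounds(i) == j and blockBounds(j) == i; weights(i) holds the block's
// (sum of w, sum of w * y), so merging two adjacent blocks is O(1).
def pava(input: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
  val clean = input.filter { case (_, _, w) => w > 0.0 }
  if (clean.isEmpty) return Array.empty

  val blockBounds = Array.range(0, clean.length) // each point starts as its own block
  val weights = clean.map { case (y, _, w) => (w, w * y) }

  def blockEnd(start: Int): Int = blockBounds(start)
  def blockStart(end: Int): Int = blockBounds(end)
  def nextBlock(start: Int): Int = blockEnd(start) + 1
  def prevBlock(start: Int): Int = blockStart(start - 1)
  def mean(start: Int): Double = weights(start)._2 / weights(start)._1

  // Merge the block starting at block2 into the block starting at block1,
  // keeping the blockBounds invariant; return the merged block's start.
  def merge(block1: Int, block2: Int): Int = {
    val (w1, wy1) = weights(block1)
    val (w2, wy2) = weights(block2)
    val newEnd = blockEnd(block2)
    weights(block1) = (w1 + w2, wy1 + wy2)
    blockBounds(block1) = newEnd
    blockBounds(newEnd) = block1
    block1
  }

  var i = 0
  while (nextBlock(i) < clean.length) {
    if (mean(i) >= mean(nextBlock(i))) {
      // Pool the violating pair, then repair any violation this creates upstream.
      var curr = merge(i, nextBlock(i))
      while (curr > 0 && mean(prevBlock(curr)) >= mean(curr)) {
        curr = merge(prevBlock(curr), curr)
      }
      i = curr
    } else {
      i = nextBlock(i)
    }
  }

  // Assign each block's pooled mean back to every point it covers.
  val out = clean.clone()
  var start = 0
  while (start < clean.length) {
    val m = mean(start)
    var k = start
    while (k <= blockEnd(start)) {
      out(k) = (m, clean(k)._2, clean(k)._3)
      k += 1
    }
    start = nextBlock(start)
  }
  out
}
```

For example, labels (1, 2, 3, 1) with unit weights pool the violating tail (3, 1) into a block of mean 2, which then absorbs the preceding 2, giving (1, 2, 2, 2).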
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r95914244

Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:

```diff
   private def poolAdjacentViolators(
       input: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
-    if (input.isEmpty) {
-      return Array.empty
+    val cleanInput = input.flatMap{ case (y, x, weight) =>
```

--- End diff --

filter would be simpler than flatMap
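The suggested simplification might look like the following sketch. The sample `input` data is made up for illustration, and the non-negativity check is assumed to move into a separate validation pass:

```scala
// Sketch: once negative weights are rejected up front, dropping zero-weight
// rows is a plain predicate, so filter reads more directly than flatMap.
val input = Array((1.0, 0.0, 1.0), (2.0, 1.0, 0.0), (3.0, 2.0, 2.0))

input.foreach { case (y, x, weight) =>
  require(weight >= 0.0, s"Negative weight at point ($y, $x, $weight)")
}
val cleanInput = input.filter { case (_, _, weight) => weight != 0.0 }
```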
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r95914074

Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:

```diff
+    val cleanInput = input.flatMap{ case (y, x, weight) =>
+      require(weight >= 0.0, "weights must be non-negative")
+      if (weight == 0.0) {
+        logWarning(s"Dropping zero-weight point ($y, $x, $weight)")
```

--- End diff --

I disagree about logging for zero-weight points. If an instance has zero weight, then it should be ignored, and that's not a problem. If we really want to log this, then let's do it once per dataset, not once per row, which could make the logs explode. I'd also say we should lower the logging level to logInfo.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r95914179

Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:

```diff
+    val cleanInput = input.flatMap{ case (y, x, weight) =>
+      require(weight >= 0.0, "weights must be non-negative")
```

--- End diff --

State the invalid (y, x, weight) values in the error message.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r95775162

Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:

```diff
+ * [3] Best, Michael J., Nilotpal Chakravarti, and Vasant A. Ubhaya. "Minimizing separable convex
+ *     functions subject to simple chain constraints." SIAM Journal on Optimization 10.3 (2000):
+ *     658-672.
  *
  * @param input Input data of tuples (label, feature, weight).
```

--- End diff --

Can you update the param doc to indicate that negative weights are now prohibited?
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r95094551

Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:

```diff
+    // Keep track of the sum of weights and sum of weight * y for each block. weights(start)
+    // gives the values for the block. Entries that are not at the start of a block
+    // are meaningless.
+    val weights: Array[(Double, Double)] = input.map { case (y, _, weight) =>
+      require(weight != 0.0)
```

--- End diff --

Please add an informative error message.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r95094550

Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:

```diff
+    // Merge two adjacent blocks, updating blockBounds and weights to reflect the merge
+    // Return the start index of the merged block
+    def merge(block1: Int, block2: Int): Int = {
+      assert(blockEnd(block1) + 1 == block2, "attempting to merge non-consecutive blocks")
```

--- End diff --

Please make the error message more informative:
* Make it clear that this indicates an internal bug within IsotonicRegression
* State the invalid values
Github user neggert commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r94641912

Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:

```diff
+    val weights: Array[(Double, Double)] = input.map { case (y, _, weight) =>
+      require(weight != 0.0)
```

--- End diff --

It's a weird thing to do, but I don't think there's any reason we can't support it. The problem still has a unique solution.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r94640811

Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:

```diff
+    val weights: Array[(Double, Double)] = input.map { case (y, _, weight) =>
+      require(weight != 0.0)
```

--- End diff --

Are negative weights OK? (You don't need a type on `weight`, but it hardly matters.)
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r94571308

Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:

```diff
+    val weights: Array[(Double, Double)] = input.map {
+      case (_, _, weight) if weight == 0d =>
```

--- End diff --

I'd always write 0.0 instead of 0d for clarity. I also think we want an `IllegalArgumentException`, so maybe:

```
val weights = input.map { case (y, _, weight) =>
  require(weight > 0.0)
  (weight, weight * y)
}
```
Github user neggert commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r9328

Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:

```diff
+    // Keeps track of the start and end indices of the blocks. blockBounds(start) gives the
+    // index of the end of the block and blockBounds(end) gives the index of the start of the
+    // block. Entries that are not the start or end of the block are meaningless. The idea is that
```

--- End diff --

It relies on knowing ahead of time whether `x` is a start or an end index [1]. If it's a start index, `blockBounds(x)` gives the ending index of that block. If it's an end index, `blockBounds(x)` gives the starting index of the block. So yes, the implementations of `blockStart` and `blockEnd` are identical; I just have two different functions because it makes the code more readable.

Maybe the comment from scikit-learn (where I borrowed this idea from) explains it better? (Their `target` = my `blockBounds`.)

> target describes a list of blocks. At any time, if [i..j] (inclusive) is an active block, then target[i] := j and target[j] := i.

The trick is just in maintaining the array so that the above property is always true. At initialization, it's trivially true because all blocks have only one element, so `blockBounds(x) == x`. After initialization, `blockBounds` is only modified by the merge function, which updates `blockBounds` so that the property is preserved, then returns the starting index of the newly-merged block.

This is admittedly a bit tricky, but it's a lot faster than my earlier implementation, which created a doubly-linked list of `Block` case classes. I'm open to suggestions on how to explain it better.

[1] You could actually figure out whether you have a start or an end index by comparing `blockBounds(x)` to `x`; the lesser value will be the start index.
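The invariant described above can be exercised in isolation. A sketch of just the `blockBounds` bookkeeping (names follow the diff; the surrounding PAVA loop and the weight sums are omitted, and the `merge` body here is an illustrative reconstruction):

```scala
// Invariant: for every active block [i, j], blockBounds(i) == j and
// blockBounds(j) == i. Entries strictly inside a block are meaningless.
val n = 5
val blockBounds = Array.range(0, n) // every point its own block: trivially holds

def blockEnd(start: Int): Int = blockBounds(start)
def blockStart(end: Int): Int = blockBounds(end)

// Merge the block starting at block2 into the one starting at block1,
// preserving the invariant; returns the merged block's start index.
def merge(block1: Int, block2: Int): Int = {
  require(blockEnd(block1) + 1 == block2,
    s"merge called on non-consecutive blocks $block1 and $block2")
  val newEnd = blockEnd(block2)
  blockBounds(block1) = newEnd // start now points at the new end...
  blockBounds(newEnd) = block1 // ...and the new end points back at the start
  block1
}

merge(0, 1) // blocks: [0,1] [2] [3] [4]
merge(0, 2) // blocks: [0,2] [3] [4]
```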
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r93229282

Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:

```diff
+    // Keeps track of the start and end indices of the blocks. blockBounds(start) gives the
+    // index of the end of the block and blockBounds(end) gives the index of the start of the
+    // block. Entries that are not the start or end of the block are meaningless. The idea is that
```

--- End diff --

I'm still not sure about this comment -- how can `blockBounds(x)` be both the start and the end of a block? The implementation below is identical.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r92920572

Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala:

```diff
+    /*
+    Keep track of the sum of weights and sum of weight * y for each block. weights(start)
+    gives the values for the block. Entries that are not at the start of a block
+    are meaningless.
+    */
+    val weights: Array[(Double, Double)] = input.map(x => (x._3, x._3 * x._1)) // (weight, weight * y)
```

--- End diff --

Up to you; might be clearer as `input.map { case (y, _, weight) => (weight, weight * y) }`
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r92920619

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

+    /*
+    Keeps track of the start and end indices of the blocks. blockBounds(start) gives the
+    index of the end of the block and blockBounds(end) gives the index of the start of the

--- End diff --

blockBounds(start) is the _end_ of a block? This is reflected in your helper functions below, but it seems reversed. Can you double-check and elaborate a bit?
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r92920879

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

+    // a few convenience functions to make the code more readable
+    def blockEnd(start: Int) = blockBounds(start)
+    def blockStart(end: Int) = blockBounds(end)
+    def nextBlock(start: Int) = blockEnd(start) + 1
+    def prevBlock(start: Int) = blockStart(start - 1)
+    def merge(block1: Int, block2: Int): Int = {
+      assert(blockEnd(block1) + 1 == block2, "attempting to merge non-consecutive blocks")
+      blockBounds(block1) = blockEnd(block2)
+      blockBounds(blockEnd(block2)) = block1
+      val w1 = weights(block1)
+      val w2 = weights(block2)
+      weights(block1) = (w1._1 + w2._1, w1._2 + w2._2)
+      block1
+    }
+    def average(start: Int) = weights(start)._2 / weights(start)._1
+
+    /*
+    Implement Algorithm PAV from [3].
+    Merge on >= instead of > because it eliminates adjacent blocks with the same average,
+    and we want to compress our output as much as possible. Both give correct results.
+    */
     var i = 0
-    val len = input.length
-    while (i < len) {
-      var j = i
-
-      // Find monotonicity violating sequence, if any.
-      while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-        j = j + 1
-      }
-
-      // If monotonicity was not violated, move to next data point.
-      if (i == j) {
-        i = i + 1
-      } else {
-        // Otherwise pool the violating sequence
-        // and check if pooling caused monotonicity violation in previously processed points.
-        while (i >= 0 && input(i)._1 > input(i + 1)._1) {
-          pool(input, i, j)
-          i = i - 1
+    while (nextBlock(i) < input.length) {
+      if (average(i) >= average(nextBlock(i))) {
+        merge(i, nextBlock(i))
+        while((i > 0) && (average(prevBlock(i)) >= average(i))) {

--- End diff --

Super nit: space before `while` here and below. You can probably nix the extra parens around each of the two terms, but no big deal.
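The block bookkeeping being reviewed above can be sketched outside of Scala. Below is a hedged Python transliteration of the same idea (the names `pava`, `block_bounds`, etc. are illustrative, not from the Spark source): `block_bounds[start]` stores the block's end index, `block_bounds[end]` stores its start, and `weights[start]` holds the block's (sum of weights, sum of weight * y).

```python
def pava(y, w):
    """Pool Adjacent Violators via block merging; returns per-point fitted values."""
    n = len(y)
    if n == 0:
        return []
    block_bounds = list(range(n))                       # each point starts as its own block
    weights = [(w[i], w[i] * y[i]) for i in range(n)]   # (weight, weight * y) per block

    def block_end(start): return block_bounds[start]
    def block_start(end): return block_bounds[end]
    def next_block(start): return block_end(start) + 1
    def prev_block(start): return block_start(start - 1)

    def merge(b1, b2):
        # merge the consecutive blocks starting at b1 and b2; returns the merged start
        assert block_end(b1) + 1 == b2, "attempting to merge non-consecutive blocks"
        block_bounds[b1] = block_end(b2)
        block_bounds[block_end(b2)] = b1
        w1, wy1 = weights[b1]
        w2, wy2 = weights[b2]
        weights[b1] = (w1 + w2, wy1 + wy2)
        return b1

    def average(start):
        return weights[start][1] / weights[start][0]

    i = 0
    while next_block(i) < n:
        if average(i) >= average(next_block(i)):        # out-of-order (or equal): pool
            merge(i, next_block(i))
            while i > 0 and average(prev_block(i)) >= average(i):
                i = merge(prev_block(i), i)             # backtrack-merge to the left
        else:
            i = next_block(i)

    # expand block averages back into per-point fitted values
    out = [0.0] * n
    i = 0
    while i < n:
        avg = average(i)
        for j in range(i, block_end(i) + 1):
            out[j] = avg
        i = next_block(i)
    return out
```

Because merges only touch the entries at block boundaries, each data point is merged at most once, which is what gives the linear-time behavior discussed in this thread.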
Github user neggert commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r92432636

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

@@ -344,27 +344,30 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
     }

     var i = 0
-    val len = input.length
-    while (i < len) {
-      var j = i
-
-      // Find monotonicity violating sequence, if any.
-      while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-        j = j + 1
-      }
+    val n = input.length - 1
+    var notFinished = true
+
+    while (notFinished) {
+      i = 0
+      notFinished = false
+
+      // Iterate through the data, fix any monotonicity violations we find
+      // We may need to do this multiple times, as pooling can introduce violations
+      // at locations that were previously fine.
+      while (i < n) {
+        var j = i
+
+        // Find next monotonicity violating sequence, if any.
+        while (j < n && input(j)._1 >= input(j + 1)._1) {

--- End diff --

I hadn't actually seen that paper. I just worked from the scikit-learn implementation (which, incidentally, has changed since I submitted this PR). I believe that either way will get you correct results. Doing `>=` should be faster in some cases, because it's greedier about pooling, which should lead to fewer iterations of the outer loop.
Github user neggert commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r92428442

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

-      // If monotonicity was not violated, move to next data point.
-      if (i == j) {
-        i = i + 1
-      } else {
-        // Otherwise pool the violating sequence
-        // and check if pooling caused monotonicity violation in previously processed points.
-        while (i >= 0 && input(i)._1 > input(i + 1)._1) {
+        // Pool the violating sequence with the data point preceding it
+        if (input(i)._1 != input(j)._1) {

--- End diff --

Correct. Any violations that are introduced will be fixed in the next iteration.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r92331409

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

+        // Find next monotonicity violating sequence, if any.
+        while (j < n && input(j)._1 >= input(j + 1)._1) {

--- End diff --

I think the original one, i.e., `input(j)._1 > input(j + 1)._1`, is correct. Here it is going to select out-of-order blocks. Quoted from the paper:

> We refer to two blocks [p, q] and [q + 1, r] as consecutive. We refer to two consecutive blocks [p, q] and [q + 1, r] as in-order if theta_{p,q} <= theta_{q+1,r} and out-of-order otherwise.

Lemma 1 points out how a merged block is also a single-valued block.
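The paper's in-order / out-of-order terminology can be made concrete with a small illustration on hypothetical data (the values below are invented for the example, not from the thread):

```python
# Illustrative data: labels y with unit weights w.
y = [1.0, 3.0, 1.0, 2.0]
w = [1.0, 1.0, 1.0, 1.0]

def theta(p, q):
    # weighted average of the block [p, q], the theta_{p,q} of the paper
    tot_w = sum(w[p:q + 1])
    return sum(yi * wi for yi, wi in zip(y[p:q + 1], w[p:q + 1])) / tot_w

# The consecutive blocks [0, 1] and [2, 3] have theta_{0,1} = 2.0 and
# theta_{2,3} = 1.5, so they are out-of-order (2.0 > 1.5) and PAV would
# merge them into a single block [0, 3] with theta_{0,3} = 1.75.
```

Checking the three averages confirms the classification: the merged block's average always lies between the two originals, which is why merging never creates a new violation to the right.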
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r92322282

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

-        // Otherwise pool the violating sequence
-        // and check if pooling caused monotonicity violation in previously processed points.
-        while (i >= 0 && input(i)._1 > input(i + 1)._1) {
+        // Pool the violating sequence with the data point preceding it

--- End diff --

Is this comment correct? It looks like you pool the violating sequence [i, j] only. The preceding data points are checked for pooling in the next outer loop.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r92298473

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

+        // Pool the violating sequence with the data point preceding it
+        if (input(i)._1 != input(j)._1) {

--- End diff --

I believe the logic is:
* If equality holds here, then either (a) i = j or (b) the block [i, j] has equal elements.
* A violation could be introduced but will be caught on the next iteration of the outer loop.
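The "pooling can introduce a violation to the left" point discussed above is easy to see on a three-point example. The following Python sketch (the `pool` helper mirrors the removed Scala `pool`, but is illustrative, not the Spark source) shows a freshly pooled block undercutting an untouched point on its left:

```python
def pool(vals, i, j):
    # replace vals[i..j] with their (unweighted) average, as the old pool() did
    avg = sum(vals[i:j + 1]) / (j - i + 1)
    for k in range(i, j + 1):
        vals[k] = avg
    return vals

vals = [4.0, 5.0, 1.0]
pool(vals, 1, 2)     # fix the violation 5 > 1; vals becomes [4.0, 3.0, 3.0]
# A new violation (4.0 > 3.0) now exists at the left edge; it is caught on
# the next pass of the outer loop (or by the block formulation's inner
# backtracking loop), which pools everything to the global average.
pool(vals, 0, 2)
```

After the second pooling pass all three values equal 10/3, which is the correct isotonic fit for this input.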
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r92298204

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

+        // Find next monotonicity violating sequence, if any.
+        while (j < n && input(j)._1 >= input(j + 1)._1) {

--- End diff --

That should be correct. See Lemma 1 from "A Fast Scaling Algorithm for Minimizing Separable Convex Functions Subject to Chain Constraints".
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r92232757

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

+        // Pool the violating sequence with the data point preceding it
+        if (input(i)._1 != input(j)._1) {

--- End diff --

This condition is "if there was any decrease, not just == from i to j", right? And `pool` makes everything in [i, j] non-violating. Comments from the previous code note that this could introduce a violation -- is this still possible? It wasn't obvious why it can't happen now.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r92230452

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

+        // Find next monotonicity violating sequence, if any.
+        while (j < n && input(j)._1 >= input(j + 1)._1) {
+          j = j + 1

--- End diff --

Nit: `j += 1`, and likewise elsewhere.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15018#discussion_r92231712

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---

+        // Find next monotonicity violating sequence, if any.
+        while (j < n && input(j)._1 >= input(j + 1)._1) {

--- End diff --

The condition changed to `>=` here on purpose, right? You're now looking for a region that's non-increasing, not just strictly decreasing. Just checking.
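The difference between the strict and non-strict scan shows up exactly when there are ties. A small illustrative Python helper (names are hypothetical, not from the Spark source) makes the distinction concrete:

```python
def violating_run(vals, i, strict):
    # scan forward from i over the (strictly decreasing / non-increasing) run
    j = i
    n = len(vals) - 1
    while j < n and (vals[j] > vals[j + 1] if strict else vals[j] >= vals[j + 1]):
        j += 1
    return j

vals = [2.0, 2.0, 1.0]
# With the strict `>`, the scan stops immediately at index 0 (2 > 2 is false),
# so the tie and the following decrease are handled on separate passes.
# With `>=`, the whole non-increasing run [2, 2, 1] is found in one scan.
```

Both variants yield correct results, but the non-strict scan pools more greedily, which is the "fewer outer-loop iterations" point made earlier in the thread.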
GitHub user neggert opened a pull request: https://github.com/apache/spark/pull/15018

[SPARK-17455][MLlib] Improve PAVA implementation in IsotonicRegression

## What changes were proposed in this pull request?

New implementation of the Pool Adjacent Violators Algorithm (PAVA) in mllib.IsotonicRegression. The previous implementation could have factorial complexity in the worst case. This implementation, which closely follows those in scikit-learn and the R `iso` package, runs in quadratic time in the worst case.

## How was this patch tested?

Existing unit tests in both `mllib` and `ml` passed before and after this patch. Scaling properties were tested by running the `poolAdjacentViolators` method in [scala-benchmarking-template](https://github.com/sirthias/scala-benchmarking-template) with the input generated by

```scala
val x = (1 to length).toArray.map(_.toDouble)
val y = x.reverse.zipWithIndex.map { case (yi, i) => if (i % 2 == 1) yi - 1.5 else yi }
val w = Array.fill(length)(1d)
val input: Array[(Double, Double, Double)] = (y zip x zip w) map { case ((y, x), w) => (y, x, w) }
```

Before this patch:

| Input Length | Time (us) |
| :----------- | --------: |
| 100          | 1.35 |
| 200          | 3.14 |
| 400          | 116.10 |
| 800          | 2134225.90 |

After this patch:

| Input Length | Time (us) |
| :----------- | --------: |
| 100          | 1.25 |
| 200          | 2.53 |
| 400          | 5.86 |
| 800          | 10.55 |

Benchmarking was also performed with randomly-generated y values, with similar results.
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/neggert/spark SPARK-17455-isoreg-algo

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15018.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15018

commit 5608ae1f4c9089dd3a299eef14af59bed6c96192
Author: z001qdp
Date: 2016-09-08T21:43:24Z

    New PAVA implementation
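The benchmark input quoted in the PR description can be rendered in Python for inspection; this sketch (a transliteration of the Scala snippet above, not from the Spark source) builds the alternating pattern in which every odd-indexed label dips 1.5 below a descending sequence, the shape that triggered the pathological backtracking in the old implementation:

```python
def zigzag_input(length):
    # x ascending, y descending with every odd index lowered by 1.5, unit weights
    x = [float(i) for i in range(1, length + 1)]
    y = [yi - 1.5 if i % 2 == 1 else yi
         for i, yi in enumerate(reversed(x))]
    w = [1.0] * length
    return list(zip(y, x, w))   # (label, feature, weight) tuples
```

For `length = 4` the labels come out as `[4.0, 1.5, 2.0, -0.5]`: each pooled pair's average again violates monotonicity against its neighbors, so the old pool-and-backtrack scheme repeatedly reprocessed earlier points.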