[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...

2017-01-16 Thread neggert
Github user neggert commented on the issue:

https://github.com/apache/spark/pull/15018
  
Looks like the build completed successfully. Somehow Jenkins didn't post 
the message.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...

2017-01-13 Thread neggert
Github user neggert commented on the issue:

https://github.com/apache/spark/pull/15018
  
My thought was that there's nothing in the computation that prevents it 
(unlike 0 weights, which cause NaNs). Why not err on the side of maximum 
flexibility to the end user? Just because there's no use case for it now 
doesn't mean someone won't find one in the future.

Just my 2 cents. I don't feel strongly about it and will defer to committers.





[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...

2017-01-09 Thread neggert
Github user neggert commented on the issue:

https://github.com/apache/spark/pull/15018
  
Your changes sound fine. I don't have strong feelings one way or another, 
although I think we should at least throw a warning if we're discarding points. 
I do want to point out that disallowing negative weights also breaks the API, 
since they were previously supported (there's even a test for them).





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2017-01-04 Thread neggert
Github user neggert commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r94641912
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -328,74 +336,80 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
   return Array.empty
 }

-// Pools sub array within given bounds assigning weighted average value to all elements.
-def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-  val poolSubArray = input.slice(start, end + 1)

-  val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-  val weight = poolSubArray.map(_._3).sum
+// Keeps track of the start and end indices of the blocks. If [i, j] is a valid block from
+// input(i) to input(j) (inclusive), then blockBounds(i) = j and blockBounds(j) = i
+val blockBounds = Array.range(0, input.length) // Initially, each data point is its own block

-  var i = start
-  while (i <= end) {
-    input(i) = (weightedSum / weight, input(i)._2, input(i)._3)
-    i = i + 1
-  }
+// Keep track of the sum of weights and sum of weight * y for each block. weights(start)
+// gives the values for the block. Entries that are not at the start of a block
+// are meaningless.
+val weights: Array[(Double, Double)] = input.map { case (y, _, weight) =>
+  require(weight != 0.0)
--- End diff --

It's a weird thing to do, but I don't think there's any reason we can't 
support it. The problem still has a unique solution.





[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...

2017-01-04 Thread neggert
Github user neggert commented on the issue:

https://github.com/apache/spark/pull/15018
  
Done. It was surprisingly tricky to test, since the exception is thrown in 
the executor.





[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...

2017-01-03 Thread neggert
Github user neggert commented on the issue:

https://github.com/apache/spark/pull/15018
  
Sorry, I've been on vacation. I'll make the requested changes in the next 
day or two.





[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...

2016-12-20 Thread neggert
Github user neggert commented on the issue:

https://github.com/apache/spark/pull/15018
  
@viirya Better to remove them, or throw an error? Personally, I'd rather be 
alerted that I'm passing invalid input, rather than have it "fixed" for me.





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-20 Thread neggert
Github user neggert commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r9328
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -328,74 +336,69 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
   return Array.empty
 }

-// Pools sub array within given bounds assigning weighted average value to all elements.
-def pool(input: Array[(Double, Double, Double)], start: Int, end: Int): Unit = {
-  val poolSubArray = input.slice(start, end + 1)

-  val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
-  val weight = poolSubArray.map(_._3).sum
+// Keeps track of the start and end indices of the blocks. blockBounds(start) gives the
+// index of the end of the block and blockBounds(end) gives the index of the start of the
+// block. Entries that are not the start or end of the block are meaningless. The idea is that
--- End diff --

It relies on knowing ahead of time whether `x` is a start or an end index.¹ If it's a start index, `blockBounds(x)` gives the ending index of that block. If it's an end index, `blockBounds(x)` gives the starting index of the block. So yes, the implementations of `blockStart` and `blockEnd` are identical. I just have two different functions because it makes the code more readable.

Maybe the comment from scikit-learn (where I borrowed this idea from) explains it better? (their `target` = my `blockBounds`)

> target describes a list of blocks.  At any time, if [i..j] (inclusive) is
> an active block, then target[i] := j and target[j] := i.

The trick is just in maintaining the array so that the above property is always true. At initialization, it's trivially true because all blocks have only one element, so `blockBounds(x) = x`. After initialization, `blockBounds` is only modified by the merge function, which updates `blockBounds` so that this property is preserved, then returns the starting index of the newly-merged block.

This is admittedly a bit tricky, but it's a lot faster than the implementation I did where I created a doubly-linked list of `Block` case classes. I'm open to suggestions on how to explain it better.

¹ You could actually figure out whether you have a start or an end index by comparing `blockBounds(x)` to `x`. The lesser value will be the start index.
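The bookkeeping can be sketched in a few lines (a hypothetical illustration with my own naming, not the actual PR code):

```scala
// Sketch of the invariant: if [i, j] is an active block, then
// blockBounds(i) == j and blockBounds(j) == i. Interior entries are meaningless.
object BlockBoundsSketch {
  // Identical lookups; two names exist only for readability.
  def blockEnd(blockBounds: Array[Int], start: Int): Int = blockBounds(start)
  def blockStart(blockBounds: Array[Int], end: Int): Int = blockBounds(end)

  // Merge the block starting at i1 with the adjacent block starting at i2
  // (i2 == blockEnd(i1) + 1), restore the invariant, and return the start
  // index of the newly merged block.
  def merge(blockBounds: Array[Int], i1: Int, i2: Int): Int = {
    val end = blockEnd(blockBounds, i2)
    blockBounds(i1) = end
    blockBounds(end) = i1
    i1
  }
}
```

At initialization `blockBounds(x) == x`, so every singleton block trivially satisfies the invariant, and `merge` is the only mutator afterwards.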





[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...

2016-12-19 Thread neggert
Github user neggert commented on the issue:

https://github.com/apache/spark/pull/15018
  
@srowen Addressed your comments and fixed some style issues.

Some updated timings:

Alternating decreasing input

```
val x = (1 to length).toArray.map(_.toDouble)
val y = x.reverse.zipWithIndex.map { case (yi, i) => if (i % 2 == 1) yi - 1.5 else yi }
```

Before: 

| Input Length | Time (us) |
|-:|--:|
| 100 | 1.4 |
| 200 | 3.0 |
| 400 | 101.8 |
| 800 | 2,010,511.7 |

After:

| Input Length | Time (us) |
|-:|--:|
| 100 | 2.5 |
| 200 | 5.8 |
| 400 | 10.1 |
| 800 | 20.7 |

Pathological input from scikit-learn benchmarks:

```
val y = ((0 until length) ++ (-(length - 1) until length) ++ (-(length - 1) to 0)).toArray.map(_.toDouble)
val x = (1 to y.length).toArray.map(_.toDouble)
```

Before:

| Input Length | Time (us) |
|-:|--:|
| 40 | 2.2 |
| 80 | 4.7 |
| 160 | 2140.0 |
| 320 | 3,076,508.1 |

After:

| Input Length | Time (us) |
|-:|--:|
| 40 | 5.3 |
| 80 | 10.4 |
| 160 | 21.3 |
| 320 | 41.9 |





[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...

2016-12-15 Thread neggert
Github user neggert commented on the issue:

https://github.com/apache/spark/pull/15018
  
Alright, the PAV algorithm has been completely re-written to follow what's 
outlined in "Minimizing Separable Convex Functions Subject to Simple Chain 
Constraints". I've tested it with a bunch of different inputs that caused the 
previous version of this algorithm to go non-polynomial. It stays linear for 
all of them. I will note that it's slightly slower on very small datasets (< 
5000 points or so), but those still finish in less than a millisecond on my 
laptop, so I'm not too concerned.
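
For reference, the core pool-adjacent-violators idea can be sketched as a single stack-based pass (a minimal illustration under my own naming, not the actual Spark code, which uses the blockBounds bookkeeping discussed elsewhere in this thread):

```scala
// Minimal weighted PAVA sketch: one left-to-right pass over (y, weight) pairs,
// maintaining a stack of pooled blocks. Each point is pushed and popped at
// most once, so the whole pass is O(n).
object PavaSketch {
  def pava(y: Array[Double], w: Array[Double]): Array[Double] = {
    val mean = new Array[Double](y.length)
    val weight = new Array[Double](y.length)
    val size = new Array[Int](y.length)
    var top = -1
    for (i <- y.indices) {
      top += 1
      mean(top) = y(i); weight(top) = w(i); size(top) = 1
      // Pool while the top two blocks violate monotonicity.
      while (top > 0 && mean(top - 1) > mean(top)) {
        val wSum = weight(top - 1) + weight(top)
        mean(top - 1) = (mean(top - 1) * weight(top - 1) + mean(top) * weight(top)) / wSum
        weight(top - 1) = wSum
        size(top - 1) += size(top)
        top -= 1
      }
    }
    // Expand block means back to per-point fitted values.
    val out = new Array[Double](y.length)
    var idx = 0
    for (b <- 0 to top; _ <- 0 until size(b)) { out(idx) = mean(b); idx += 1 }
    out
  }
}
```

Note the division by the pooled weight, which is why zero total weight within a block is problematic.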

The only caveat is that there was one test that I couldn't get passing, so 
I removed it. It involved passing input data with 0 weights. I'd argue that the 
isotonic regression problem doesn't even have a unique solution in that case, 
so we shouldn't support it. Still, it is a slight change in behavior.

Looks like Jenkins is pointing out some style issues, so I'll get to work 
on those.





[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...

2016-12-15 Thread neggert
Github user neggert commented on the issue:

https://github.com/apache/spark/pull/15018
  
Found another input that triggers non-polynomial time with the code in this 
PR. I'm again borrowing from scikit-learn. I think this is the case they found 
that led them to re-write their implementation.

```
val y = ((0 until length) ++ (-(length - 1) until length) ++ (-(length - 1) to 0)).toArray.map(_.toDouble)
val x = (1 to y.length).toArray.map(_.toDouble)
```

| Input Length | Time (ns) |
| --: | --: |
| 40 | 2059 |
| 80 | 4604 |
| 160 | 1974269 |
| 320 | 3246603433 |

I'm now working on implementing what's described in the Best papers. This 
should give O(n), even in the worst case. 

Should I close this and open a new PR with the new algorithm, or just add 
it here and you can squash when you merge?





[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...

2016-12-14 Thread neggert
Github user neggert commented on the issue:

https://github.com/apache/spark/pull/15018
  
Looks like the new scikit-learn implementation suffers from a similar 
problem to the one that the original Spark implementation had. I left them a 
note pointing it out.

Right now I'm working on implementing what's in "Minimizing Separable 
Convex Functions Subject to Simple Chain Constraints", by Best, Chakravarti, 
and Ubhaya. Running down the reference trail, it looks like the original 
algorithm is in "Projections onto Order Simplexes" by Grotzinger and Witzgall, 
and this particular formulation of it is originally in "Active set algorithms 
for isotonic regression; A unifying framework" by Best and Chakravarti.





[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...

2016-12-14 Thread neggert
Github user neggert commented on the issue:

https://github.com/apache/spark/pull/15018
  
Given that the scikit-learn implementation I based this on has changed 
(https://github.com/scikit-learn/scikit-learn/pull/7444) since I submitted the 
PR, and @jkbradley has pointed out a reference I hadn't seen, I'd like to take 
a little bit of time to see if there's yet a better algorithm.

This one is an improvement over what we had, but as long as we're changing 
it, we may as well get it right.





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-14 Thread neggert
Github user neggert commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r92432636
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -344,27 +344,30 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
 }

 var i = 0
-val len = input.length
-while (i < len) {
-  var j = i
-
-  // Find monotonicity violating sequence, if any.
-  while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-    j = j + 1
-  }
+val n = input.length - 1
+var notFinished = true
+
+while (notFinished) {
+  i = 0
+  notFinished = false
+
+  // Iterate through the data, fix any monotonicity violations we find
+  // We may need to do this multiple times, as pooling can introduce violations
+  // at locations that were previously fine.
+  while (i < n) {
+    var j = i
+
+    // Find next monotonicity violating sequence, if any.
+    while (j < n && input(j)._1 >= input(j + 1)._1) {
--- End diff --

I hadn't actually seen that paper. I just worked from the scikit-learn 
implementation (which, incidentally, has changed since I submitted this PR).

I believe that either way will get you correct results. Doing `>=` should 
be faster in some cases, because it's greedier about pooling, which should lead 
to fewer iterations of the outer loop.
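
Concretely, for a run like (2.0, 2.0, 1.0) it makes no difference whether the equal pair is pooled eagerly (`>=`) or only after the true violation (`>`): the fit for the whole run is its weighted mean either way, so `>=` just reaches that fixed point in fewer outer-loop passes. A toy illustration (my naming, not the PR code):

```scala
// Fitted value for a single monotonicity-violating run: every point in the
// run gets the run's weighted mean, regardless of pooling order.
object TiePoolingSketch {
  def poolRun(y: Array[Double], w: Array[Double]): Array[Double] = {
    val mean = y.zip(w).map { case (yi, wi) => yi * wi }.sum / w.sum
    Array.fill(y.length)(mean)
  }
}
```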





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-12-14 Thread neggert
Github user neggert commented on a diff in the pull request:

https://github.com/apache/spark/pull/15018#discussion_r92428442
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -344,27 +344,30 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
 }

 var i = 0
-val len = input.length
-while (i < len) {
-  var j = i
-
-  // Find monotonicity violating sequence, if any.
-  while (j < len - 1 && input(j)._1 > input(j + 1)._1) {
-    j = j + 1
-  }
+val n = input.length - 1
+var notFinished = true
+
+while (notFinished) {
+  i = 0
+  notFinished = false
+
+  // Iterate through the data, fix any monotonicity violations we find
+  // We may need to do this multiple times, as pooling can introduce violations
+  // at locations that were previously fine.
+  while (i < n) {
+    var j = i
+
+    // Find next monotonicity violating sequence, if any.
+    while (j < n && input(j)._1 >= input(j + 1)._1) {
+      j = j + 1
+    }

-  // If monotonicity was not violated, move to next data point.
-  if (i == j) {
-    i = i + 1
-  } else {
-    // Otherwise pool the violating sequence
-    // and check if pooling caused monotonicity violation in previously processed points.
-    while (i >= 0 && input(i)._1 > input(i + 1)._1) {
+    // Pool the violating sequence with the data point preceding it
+    if (input(i)._1 != input(j)._1) {
--- End diff --

Correct. Any violations that are introduced will be fixed in the next 
iteration.





[GitHub] spark issue #15396: [SPARK-14804][Spark][Graphx] Fix checkpointing of Vertex...

2016-11-11 Thread neggert
Github user neggert commented on the issue:

https://github.com/apache/spark/pull/15396
  
Any update on this? Would love to see it in an upcoming release.





[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...

2016-10-20 Thread neggert
Github user neggert commented on the issue:

https://github.com/apache/spark/pull/15018
  
Anyone? @srowen?





[GitHub] spark issue #12576: [SPARK-14804][Spark][Graphx] Fix Graph vertexRDD/EdgeRDD...

2016-10-11 Thread neggert
Github user neggert commented on the issue:

https://github.com/apache/spark/pull/12576
  
Not sure what's going on regarding the multiple PRs for this issue, but I 
cherry-picked this PR on top of 1.6.2 and it fixed the problem for me.





[GitHub] spark issue #15018: [SPARK-17455][MLlib] Improve PAVA implementation in Isot...

2016-10-04 Thread neggert
Github user neggert commented on the issue:

https://github.com/apache/spark/pull/15018
  
Could someone please take a look at this? @mengxr maybe? Looks like you 
merged the initial implementation.





[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

2016-09-08 Thread neggert
GitHub user neggert opened a pull request:

https://github.com/apache/spark/pull/15018

[SPARK-17455][MLlib] Improve PAVA implementation in IsotonicRegression

## What changes were proposed in this pull request?

New implementation of the Pool Adjacent Violators Algorithm (PAVA) in 
mllib.IsotonicRegression. The previous implementation could have factorial 
complexity in the worst case. This implementation, which closely follows those 
in scikit-learn and the R `iso` package, runs in quadratic time in the worst 
case.

## How was this patch tested?

Existing unit tests in both `mllib` and `ml` passed before and after this 
patch. Scaling properties were tested by running the `poolAdjacentViolators` 
method in 
[scala-benchmarking-template](https://github.com/sirthias/scala-benchmarking-template)
 with the input generated by

```{scala}
val x = (1 to length).toArray.map(_.toDouble)
val y = x.reverse.zipWithIndex.map { case (yi, i) => if (i % 2 == 1) yi - 1.5 else yi }
val w = Array.fill(length)(1d)

val input: Array[(Double, Double, Double)] = (y zip x zip w) map { case ((y, x), w) => (y, x, w) }
```

Before this patch:

| Input Length | Time (us) |
| -: | -: |
|100   |   1.35|
|200   |   3.14|
|400   | 116.10|
|800   | 2134225.90|

After this patch:

| Input Length | Time (us) |
| -: | -: |
|100   |   1.25|
|200   |   2.53|
|400   |   5.86|
|800   |  10.55|

Benchmarking was also performed with randomly-generated y values, with 
similar results.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/neggert/spark SPARK-17455-isoreg-algo

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15018.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15018


commit 5608ae1f4c9089dd3a299eef14af59bed6c96192
Author: z001qdp <nicholas.egg...@target.com>
Date:   2016-09-08T21:43:24Z

New PAVA implementation







[GitHub] spark pull request #14140: [SPARK-16426][MLlib] Fix bug that caused NaNs in ...

2016-07-14 Thread neggert
Github user neggert commented on a diff in the pull request:

https://github.com/apache/spark/pull/14140#discussion_r70818220
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -408,9 +409,11 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
 */
   private def parallelPoolAdjacentViolators(
   input: RDD[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
-val parallelStepResult = input
-  .sortBy(x => (x._2, x._1))
-  .glom()
+val keyedInput = input.keyBy(_._2)
+val parallelStepResult = keyedInput
+  .partitionBy(new RangePartitioner(keyedInput.getNumPartitions, keyedInput))
+  .values
+  .mapPartitions( p => Iterator(p.toSeq.sortBy(x => (x._2, x._1)).toArray))
--- End diff --

Actually, the `Iterator` is necessary. Since we removed the `glom`, we need 
to return an `RDD[Array[T]]`. So the function passed to `mapPartitions` needs 
to be `Iterator[T] => Iterator[Array[T]]`. Removed the extraneous `toSeq`, 
though.
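
A plain-Scala analogue of that typing constraint (hypothetical, no Spark dependency; names are mine) shows why the wrapper is needed — the per-partition function must return an `Iterator`, so the single sorted array is wrapped in a one-element `Iterator`:

```scala
// mapPartitions expects Iterator[T] => Iterator[U]. To emit one Array per
// partition (U = Array[T]), the sorted array is wrapped in Iterator(...).
object IteratorWrapperSketch {
  type Row = (Double, Double, Double) // (label, feature, weight)

  val sortPartition: Iterator[Row] => Iterator[Array[Row]] =
    p => Iterator(p.toArray.sortBy(x => (x._2, x._1)))
}
```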





[GitHub] spark pull request #14140: [SPARK-16426][MLlib] Fix bug that caused NaNs in ...

2016-07-14 Thread neggert
Github user neggert commented on a diff in the pull request:

https://github.com/apache/spark/pull/14140#discussion_r70816925
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -408,9 +409,11 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
 */
   private def parallelPoolAdjacentViolators(
   input: RDD[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
-val parallelStepResult = input
-  .sortBy(x => (x._2, x._1))
-  .glom()
+val keyedInput = input.keyBy(_._2)
+val parallelStepResult = keyedInput
+  .partitionBy(new RangePartitioner(keyedInput.getNumPartitions, keyedInput))
+  .values
+  .mapPartitions( p => Iterator(p.toSeq.sortBy(x => (x._2, x._1)).toArray))
--- End diff --

`keyedInput` is separated out because it needs to be used several times. 
Will clean up the sorting.





[GitHub] spark pull request #14140: [SPARK-16426][MLlib] Fix bug that caused NaNs in ...

2016-07-12 Thread neggert
Github user neggert commented on a diff in the pull request:

https://github.com/apache/spark/pull/14140#discussion_r70451628
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -408,8 +409,12 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
 */
   private def parallelPoolAdjacentViolators(
   input: RDD[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
-val parallelStepResult = input
-  .sortBy(x => (x._2, x._1))
+val keyedInput = input
--- End diff --

Also, doesn't the `glom` load the whole partition into memory anyway?





[GitHub] spark pull request #14140: [SPARK-16426][MLlib] Fix bug that caused NaNs in ...

2016-07-12 Thread neggert
Github user neggert commented on a diff in the pull request:

https://github.com/apache/spark/pull/14140#discussion_r70450795
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala ---
@@ -408,8 +409,12 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
 */
   private def parallelPoolAdjacentViolators(
   input: RDD[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
-val parallelStepResult = input
-  .sortBy(x => (x._2, x._1))
+val keyedInput = input
--- End diff --

`repartitionAndSortWithinPartitions` requires the partition key and the 
sortBy key to be the same. We want to partition by feature, then sort by 
feature *and* label. So it would still require a second step to sort the 
partitions, although they'd be mostly sorted already. Maybe there could be a 
speed-up by using an insertion or Timsort on the mostly-sorted partition...
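
A plain-Scala model of that two-step approach (hypothetical; the bucketing stands in for a `RangePartitioner` and the data are made up): partition by feature range only, then fully sort each partition by (feature, label):

```scala
object PartitionThenSortSketch {
  type Row = (Double, Double, Double) // (label, feature, weight)

  // Step 1: bucket rows by a feature-range boundary, so all rows with equal
  // feature values land in the same bucket.
  // Step 2: sort each bucket by (feature, label) -- the secondary sort that
  // partitioning (or repartitionAndSortWithinPartitions) alone cannot provide.
  def partitionAndSort(rows: Seq[Row], boundary: Double): Map[Int, Seq[Row]] =
    rows.groupBy { case (_, feature, _) => if (feature < boundary) 0 else 1 }
      .map { case (p, part) => p -> part.sortBy(r => (r._2, r._1)) }
}
```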





[GitHub] spark pull request #14140: [SPARK-16426][MLlib] Fixed bug that caused NaNs i...

2016-07-11 Thread neggert
GitHub user neggert opened a pull request:

https://github.com/apache/spark/pull/14140

[SPARK-16426][MLlib] Fixed bug that caused NaNs in IsotonicRegression

## What changes were proposed in this pull request?

Fixed a bug that caused `NaN`s in `IsotonicRegression`. The problem occurs 
when training rows with the same feature value but different labels end up on 
different partitions. This patch changes a `sortBy` call to a 
`partitionBy(RangePartitioner)` followed by a `mapPartitions(sortBy)` in order 
to ensure that all rows with the same feature value end up on the same 
partition.

## How was this patch tested?

Added a unit test.




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/neggert/spark SPARK-16426-isotonic-nan

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14140.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14140


commit 3f431164920fb6bd5dc7116748eb178493750136
Author: z001qdp <nicholas.egg...@target.com>
Date:   2016-07-11T20:26:17Z

Fixed bug that caused NaNs in IsotonicRegression



