subject:"\[GitHub\] spark pull request #15593\: \[SPARK\-18060\]\[ML\] Avoid unnecessary computation f..."

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-11 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15593


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-11 Thread MLnick

Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r87551541
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -489,13 +485,14 @@ class LogisticRegression @Since("1.2.0") (
   val initialCoefWithInterceptArray = 
initialCoefficientsWithIntercept.toArray
--- End diff --

By the way, for LiR or binary LoR, row- or col- major will both loop "in 
the direction of features" since whether the coefficients are a "row vector" or 
a "column vector" doesn't matter, no?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-11 Thread MLnick

Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r87550060
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1486,57 +1504,65 @@ private class LogisticAggregator(
 var marginOfLabel = 0.0
 var maxMargin = Double.NegativeInfinity
 
-val margins = Array.tabulate(numClasses) { i =>
-  var margin = 0.0
-  features.foreachActive { (index, value) =>
-if (localFeaturesStd(index) != 0.0 && value != 0.0) {
-  margin += localCoefficients(i * numFeaturesPlusIntercept + 
index) *
-value / localFeaturesStd(index)
-}
+val margins = new Array[Double](numClasses)
+features.foreachActive { (index, value) =>
+  val stdValue = value / localFeaturesStd(index)
+  var j = 0
+  while (j < numClasses) {
+margins(j) += localCoefficients(index * numClasses + j) * stdValue
+j += 1
   }
-
+}
+var i = 0
+while (i < numClasses) {
   if (fitIntercept) {
-margin += localCoefficients(i * numFeaturesPlusIntercept + 
numFeatures)
+margins(i) += localCoefficients(numClasses * numFeatures + i)
   }
-  if (i == label.toInt) marginOfLabel = margin
-  if (margin > maxMargin) {
-maxMargin = margin
+  if (i == label.toInt) marginOfLabel = margins(i)
+  if (margins(i) > maxMargin) {
+maxMargin = margins(i)
   }
-  margin
+  i += 1
 }
 
 /**
  * When maxMargin > 0, the original formula could cause overflow.
  * We address this by subtracting maxMargin from all the margins, so 
it's guaranteed
  * that all of the new margins will be smaller than zero to prevent 
arithmetic overflow.
  */
+val multipliers = new Array[Double](numClasses)
 val sum = {
   var temp = 0.0
-  if (maxMargin > 0) {
-for (i <- 0 until numClasses) {
-  margins(i) -= maxMargin
-  temp += math.exp(margins(i))
-}
-  } else {
-for (i <- 0 until numClasses) {
-  temp += math.exp(margins(i))
-}
+  var i = 0
+  while (i < numClasses) {
+if (maxMargin > 0) margins(i) -= maxMargin
+val exp = math.exp(margins(i))
+temp += exp
+multipliers(i) = exp
+i += 1
   }
   temp
 }
 
-for (i <- 0 until numClasses) {
-  val multiplier = math.exp(margins(i)) / sum - {
-if (label == i) 1.0 else 0.0
-  }
-  features.foreachActive { (index, value) =>
-if (localFeaturesStd(index) != 0.0 && value != 0.0) {
-  localGradientArray(i * numFeaturesPlusIntercept + index) +=
-weight * multiplier * value / localFeaturesStd(index)
+margins.indices.foreach { i =>
+  multipliers(i) = multipliers(i) / sum - (if (label == i) 1.0 else 
0.0)
+}
+features.foreachActive { (index, value) =>
+  if (localFeaturesStd(index) != 0.0 && value != 0.0) {
+val stdValue = value / localFeaturesStd(index)
+var j = 0
+while (j < numClasses) {
+  localGradientArray(index * numClasses + j) +=
+weight * multipliers(j) * stdValue
+  j += 1
 }
   }
-  if (fitIntercept) {
-localGradientArray(i * numFeaturesPlusIntercept + numFeatures) += 
weight * multiplier
+}
+if (fitIntercept) {
+  var i = 0
+  while (i < numClasses) {
+localGradientArray(numFeatures * numClasses + i) += weight * 
multipliers(i)
+i += 1
   }
 }
 
--- End diff --

My (perhaps incorrect) understanding of what you are saying:

"The aggregator should internally convert to col-major during training & 
in-place update and from `gradient` will return a Matrix that can be used 
during L2 update (cycle through using `foreachActive`) in L1678-L1716."

But to me that would require swapping `coeffs.foreachActive` for 
`gradient.foreachActive` in L1684. Since `coeff` is itself a vector (which we 
can't really change due to `BreezeDiffFunction`, and we laid it out col-major 
for training) we would need to index back into that using the same col-major 
indexing anyway, no? So I don't see that we gain anything for L2 reg. Also, if 
we change back and forth every iteration that seems like it would be 
inefficient.

For the L1 reg function, that is purely index-based so `val isIntercept = 
$(fitIntercept) && ((index + 1) % numFeaturesPlusIntercept == 0)` doesn't seem 
any more or less complex than `val isIntercept =

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-10 Thread MLnick

Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r87547519
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1486,57 +1504,65 @@ private class LogisticAggregator(
 var marginOfLabel = 0.0
 var maxMargin = Double.NegativeInfinity
 
-val margins = Array.tabulate(numClasses) { i =>
-  var margin = 0.0
-  features.foreachActive { (index, value) =>
-if (localFeaturesStd(index) != 0.0 && value != 0.0) {
-  margin += localCoefficients(i * numFeaturesPlusIntercept + 
index) *
-value / localFeaturesStd(index)
-}
+val margins = new Array[Double](numClasses)
+features.foreachActive { (index, value) =>
+  val stdValue = value / localFeaturesStd(index)
+  var j = 0
+  while (j < numClasses) {
+margins(j) += localCoefficients(index * numClasses + j) * stdValue
+j += 1
   }
-
+}
+var i = 0
+while (i < numClasses) {
   if (fitIntercept) {
-margin += localCoefficients(i * numFeaturesPlusIntercept + 
numFeatures)
+margins(i) += localCoefficients(numClasses * numFeatures + i)
   }
-  if (i == label.toInt) marginOfLabel = margin
-  if (margin > maxMargin) {
-maxMargin = margin
+  if (i == label.toInt) marginOfLabel = margins(i)
+  if (margins(i) > maxMargin) {
+maxMargin = margins(i)
   }
-  margin
+  i += 1
 }
 
 /**
  * When maxMargin > 0, the original formula could cause overflow.
  * We address this by subtracting maxMargin from all the margins, so 
it's guaranteed
  * that all of the new margins will be smaller than zero to prevent 
arithmetic overflow.
  */
+val multipliers = new Array[Double](numClasses)
 val sum = {
   var temp = 0.0
-  if (maxMargin > 0) {
-for (i <- 0 until numClasses) {
-  margins(i) -= maxMargin
-  temp += math.exp(margins(i))
-}
-  } else {
-for (i <- 0 until numClasses) {
-  temp += math.exp(margins(i))
-}
+  var i = 0
+  while (i < numClasses) {
+if (maxMargin > 0) margins(i) -= maxMargin
+val exp = math.exp(margins(i))
+temp += exp
+multipliers(i) = exp
+i += 1
   }
   temp
 }
 
-for (i <- 0 until numClasses) {
-  val multiplier = math.exp(margins(i)) / sum - {
-if (label == i) 1.0 else 0.0
-  }
-  features.foreachActive { (index, value) =>
-if (localFeaturesStd(index) != 0.0 && value != 0.0) {
-  localGradientArray(i * numFeaturesPlusIntercept + index) +=
-weight * multiplier * value / localFeaturesStd(index)
+margins.indices.foreach { i =>
+  multipliers(i) = multipliers(i) / sum - (if (label == i) 1.0 else 
0.0)
+}
+features.foreachActive { (index, value) =>
+  if (localFeaturesStd(index) != 0.0 && value != 0.0) {
+val stdValue = value / localFeaturesStd(index)
+var j = 0
+while (j < numClasses) {
+  localGradientArray(index * numClasses + j) +=
+weight * multipliers(j) * stdValue
+  j += 1
 }
   }
-  if (fitIntercept) {
-localGradientArray(i * numFeaturesPlusIntercept + numFeatures) += 
weight * multiplier
+}
+if (fitIntercept) {
+  var i = 0
+  while (i < numClasses) {
+localGradientArray(numFeatures * numClasses + i) += weight * 
multipliers(i)
+i += 1
   }
 }
 
--- End diff --

I'm not sure I fully get where you intend to use `foreachActive` over the 
gradient matrix? Maybe it's the location of this comment that is confusing me 
... 

... but here in `multinomialUpdateInPlace`, we are iterating over features 
using `foreachActive`, then for each feature iterating over `numClasses`. If we 
iterate over the gradient using `foreachActive` how will that work? Won't it be 
super inefficient? Perhaps I am missing something about what you intend, could 
you clarify with an example?





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-10 Thread dbtsai

Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r87518789
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1486,57 +1504,65 @@ private class LogisticAggregator(
 var marginOfLabel = 0.0
 var maxMargin = Double.NegativeInfinity
 
-val margins = Array.tabulate(numClasses) { i =>
-  var margin = 0.0
-  features.foreachActive { (index, value) =>
-if (localFeaturesStd(index) != 0.0 && value != 0.0) {
-  margin += localCoefficients(i * numFeaturesPlusIntercept + 
index) *
-value / localFeaturesStd(index)
-}
+val margins = new Array[Double](numClasses)
+features.foreachActive { (index, value) =>
+  val stdValue = value / localFeaturesStd(index)
+  var j = 0
+  while (j < numClasses) {
+margins(j) += localCoefficients(index * numClasses + j) * stdValue
+j += 1
   }
-
+}
+var i = 0
+while (i < numClasses) {
   if (fitIntercept) {
-margin += localCoefficients(i * numFeaturesPlusIntercept + 
numFeatures)
+margins(i) += localCoefficients(numClasses * numFeatures + i)
   }
-  if (i == label.toInt) marginOfLabel = margin
-  if (margin > maxMargin) {
-maxMargin = margin
+  if (i == label.toInt) marginOfLabel = margins(i)
+  if (margins(i) > maxMargin) {
+maxMargin = margins(i)
   }
-  margin
+  i += 1
 }
 
 /**
  * When maxMargin > 0, the original formula could cause overflow.
  * We address this by subtracting maxMargin from all the margins, so 
it's guaranteed
  * that all of the new margins will be smaller than zero to prevent 
arithmetic overflow.
  */
+val multipliers = new Array[Double](numClasses)
 val sum = {
   var temp = 0.0
-  if (maxMargin > 0) {
-for (i <- 0 until numClasses) {
-  margins(i) -= maxMargin
-  temp += math.exp(margins(i))
-}
-  } else {
-for (i <- 0 until numClasses) {
-  temp += math.exp(margins(i))
-}
+  var i = 0
+  while (i < numClasses) {
+if (maxMargin > 0) margins(i) -= maxMargin
+val exp = math.exp(margins(i))
+temp += exp
+multipliers(i) = exp
+i += 1
   }
   temp
 }
 
-for (i <- 0 until numClasses) {
-  val multiplier = math.exp(margins(i)) / sum - {
-if (label == i) 1.0 else 0.0
-  }
-  features.foreachActive { (index, value) =>
-if (localFeaturesStd(index) != 0.0 && value != 0.0) {
-  localGradientArray(i * numFeaturesPlusIntercept + index) +=
-weight * multiplier * value / localFeaturesStd(index)
+margins.indices.foreach { i =>
+  multipliers(i) = multipliers(i) / sum - (if (label == i) 1.0 else 
0.0)
+}
+features.foreachActive { (index, value) =>
+  if (localFeaturesStd(index) != 0.0 && value != 0.0) {
+val stdValue = value / localFeaturesStd(index)
+var j = 0
+while (j < numClasses) {
+  localGradientArray(index * numClasses + j) +=
+weight * multipliers(j) * stdValue
+  j += 1
 }
   }
-  if (fitIntercept) {
-localGradientArray(i * numFeaturesPlusIntercept + numFeatures) += 
weight * multiplier
+}
+if (fitIntercept) {
+  var i = 0
+  while (i < numClasses) {
+localGradientArray(numFeatures * numClasses + i) += weight * 
multipliers(i)
+i += 1
   }
 }
 
--- End diff --

You can make `def gradient: Vector` returning `Matrix`, and for MLOR, the 
implementation can be col major matrix so when we use `foreachActive` we don't 
need to worry about the underlying implementation. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-10 Thread dbtsai

Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r87516339
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -489,13 +485,14 @@ class LogisticRegression @Since("1.2.0") (
   val initialCoefWithInterceptArray = 
initialCoefficientsWithIntercept.toArray
--- End diff --

When we are looping the coefficients array in LiR or BLOR in a flatten 
array, people think about it by looping through in the direction of features. 
In this PR, the logic is changed, and can lead to some maintenance issues when 
developers work on those code later. For example, I have difficulty to 
understand the code at the first glance. I think what we can do alternatively 
when we do L1/L2 update logic and initial coefficients things, we can put them 
into col major matrix, and use foreachActive with explicit col and row index so 
reader can easily understand the code logic.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-10 Thread sethah

Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r87504228
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -489,13 +485,14 @@ class LogisticRegression @Since("1.2.0") (
   val initialCoefWithInterceptArray = 
initialCoefficientsWithIntercept.toArray
--- End diff --

Are there other linear models that use a matrix of coefficients? I agree 
that thinking about indexing when flattening a matrix into an array is a pain, 
but I don't really see how column major is _more_ difficult than row major. The 
only place this would make much difference is that we wouldn't have to modify 
the L2 reg update logic and the initial coefficients, but we still need to swap 
between layouts in every iteration. I guess I don't see that this is obviously 
simpler. 

And for the review of this PR, I think we can be confident of the 
correctness due to the robustness of the LogisticRegression test suite - which 
has extensive correctness tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-10 Thread dbtsai

Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r87501621
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -489,13 +485,14 @@ class LogisticRegression @Since("1.2.0") (
   val initialCoefWithInterceptArray = 
initialCoefficientsWithIntercept.toArray
--- End diff --

It's harder to think about we are using column major in MLOR, and not 
consistent with the rest of linear models. Do you think we can still use row 
major as much as possible so we don't need to touch code here, and only do the 
column major thing in `LogisticAggregator` internally, and when `gradient` of 
`LogisticAggregator` is called, we convert it back from column major to row 
major?   


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-09 Thread sethah

Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r87275543
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -489,13 +485,14 @@ class LogisticRegression @Since("1.2.0") (
   val initialCoefWithInterceptArray = 
initialCoefficientsWithIntercept.toArray
--- End diff --

I added a comment, not sure if it's exactly what you had in mind. What do 
you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-09 Thread thunterdb

Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r87267275
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -489,13 +485,14 @@ class LogisticRegression @Since("1.2.0") (
   val initialCoefWithInterceptArray = 
initialCoefficientsWithIntercept.toArray
--- End diff --

can you document on line 465 the layout of the data, to help future 
developers (all the coefficients, in column major order, maybe followed by a 
last column that contain the intercept)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-04 Thread MLnick

Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r86497058
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1486,57 +1489,75 @@ private class LogisticAggregator(
 var marginOfLabel = 0.0
 var maxMargin = Double.NegativeInfinity
 
-val margins = Array.tabulate(numClasses) { i =>
-  var margin = 0.0
-  features.foreachActive { (index, value) =>
-if (localFeaturesStd(index) != 0.0 && value != 0.0) {
-  margin += localCoefficients(i * numFeaturesPlusIntercept + 
index) *
-value / localFeaturesStd(index)
-}
+val margins = new Array[Double](numClasses)
--- End diff --

I think a slightly more detailed comment would be good for the aggregator, 
something like the following (please feel free to make it more clear):

```
In order to avoid unnecessary computation during calculation of the 
gradient updates,
we lay out the coefficients in column major order during training. This 
allows us to 
perform feature standardization once, while still retaining sequential 
memory access 
for speed. We convert back to row-major order when we create the model, 
since this form is optimal for the matrix operations used for prediction.
```

Here we can amend slightly to say:
```
Additionally, since the coefficients were laid out in column major order 
during training to
avoid extra computation, we convert them back to row major before passing 
them to the model.
```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-03 Thread sethah

Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r86434451
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -489,13 +485,14 @@ class LogisticRegression @Since("1.2.0") (
   val initialCoefWithInterceptArray = 
initialCoefficientsWithIntercept.toArray
   val providedCoef = optInitialModel.get.coefficientMatrix
   providedCoef.foreachActive { (row, col, value) =>
-val flatIndex = row * numFeaturesPlusIntercept + col
+// convert matrix to column major for training
--- End diff --

There are unit tests for it. Also, this is used in the old mllib code, 
which exposed an initial model but wraps ML now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-03 Thread sethah

Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r86433115
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1486,57 +1489,75 @@ private class LogisticAggregator(
 var marginOfLabel = 0.0
 var maxMargin = Double.NegativeInfinity
 
-val margins = Array.tabulate(numClasses) { i =>
-  var margin = 0.0
-  features.foreachActive { (index, value) =>
-if (localFeaturesStd(index) != 0.0 && value != 0.0) {
-  margin += localCoefficients(i * numFeaturesPlusIntercept + 
index) *
-value / localFeaturesStd(index)
-}
+val margins = new Array[Double](numClasses)
--- End diff --

Ok, I added a comment inside of `LogisticRegression.train` above. Do you 
think it needs to be moved or just duplicated inside `LogisticAggregator`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-03 Thread sethah

Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r86436188
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1486,57 +1489,75 @@ private class LogisticAggregator(
 var marginOfLabel = 0.0
 var maxMargin = Double.NegativeInfinity
 
-val margins = Array.tabulate(numClasses) { i =>
-  var margin = 0.0
-  features.foreachActive { (index, value) =>
-if (localFeaturesStd(index) != 0.0 && value != 0.0) {
-  margin += localCoefficients(i * numFeaturesPlusIntercept + 
index) *
-value / localFeaturesStd(index)
-}
+val margins = new Array[Double](numClasses)
+features.foreachActive { (index, value) =>
+  val stdValue = value / localFeaturesStd(index)
+  var j = 0
+  while (j < numClasses) {
+margins(j) += localCoefficients(index * numClasses + j) * stdValue
+j += 1
   }
-
+}
+var i = 0
+while (i < numClasses) {
   if (fitIntercept) {
-margin += localCoefficients(i * numFeaturesPlusIntercept + 
numFeatures)
+margins(i) += localCoefficients(numClasses * numFeatures + i)
   }
-  if (i == label.toInt) marginOfLabel = margin
-  if (margin > maxMargin) {
-maxMargin = margin
+  if (i == label.toInt) marginOfLabel = margins(i)
+  if (margins(i) > maxMargin) {
+maxMargin = margins(i)
   }
-  margin
+  i += 1
 }
 
 /**
  * When maxMargin > 0, the original formula could cause overflow.
  * We address this by subtracting maxMargin from all the margins, so 
it's guaranteed
  * that all of the new margins will be smaller than zero to prevent 
arithmetic overflow.
  */
+val multipliers = new Array[Double](numClasses)
 val sum = {
   var temp = 0.0
   if (maxMargin > 0) {
-for (i <- 0 until numClasses) {
+var i = 0
--- End diff --

Yep, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-03 Thread MLnick

Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r86421340
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1486,57 +1489,75 @@ private class LogisticAggregator(
 var marginOfLabel = 0.0
 var maxMargin = Double.NegativeInfinity
 
-val margins = Array.tabulate(numClasses) { i =>
-  var margin = 0.0
-  features.foreachActive { (index, value) =>
-if (localFeaturesStd(index) != 0.0 && value != 0.0) {
-  margin += localCoefficients(i * numFeaturesPlusIntercept + 
index) *
-value / localFeaturesStd(index)
-}
+val margins = new Array[Double](numClasses)
--- End diff --

It may be good to just add a general comment (perhaps in 
LogisticAggregator) about the training being done using col major order, and 
that this is converted to row major once training is done, etc? And perhaps a 
little detail on why it is (was) necessary.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-03 Thread MLnick

Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r86420744
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -489,13 +485,14 @@ class LogisticRegression @Since("1.2.0") (
   val initialCoefWithInterceptArray = 
initialCoefficientsWithIntercept.toArray
   val providedCoef = optInitialModel.get.coefficientMatrix
   providedCoef.foreachActive { (row, col, value) =>
-val flatIndex = row * numFeaturesPlusIntercept + col
+// convert matrix to column major for training
--- End diff --

Is this code path ever tested? Since the current `initialModel` is just a 
sort of dummy placeholder, always `None`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-03 Thread MLnick

Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15593#discussion_r86420010
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1486,57 +1489,75 @@ private class LogisticAggregator(
 var marginOfLabel = 0.0
 var maxMargin = Double.NegativeInfinity
 
-val margins = Array.tabulate(numClasses) { i =>
-  var margin = 0.0
-  features.foreachActive { (index, value) =>
-if (localFeaturesStd(index) != 0.0 && value != 0.0) {
-  margin += localCoefficients(i * numFeaturesPlusIntercept + 
index) *
-value / localFeaturesStd(index)
-}
+val margins = new Array[Double](numClasses)
+features.foreachActive { (index, value) =>
+  val stdValue = value / localFeaturesStd(index)
+  var j = 0
+  while (j < numClasses) {
+margins(j) += localCoefficients(index * numClasses + j) * stdValue
+j += 1
   }
-
+}
+var i = 0
+while (i < numClasses) {
   if (fitIntercept) {
-margin += localCoefficients(i * numFeaturesPlusIntercept + 
numFeatures)
+margins(i) += localCoefficients(numClasses * numFeatures + i)
   }
-  if (i == label.toInt) marginOfLabel = margin
-  if (margin > maxMargin) {
-maxMargin = margin
+  if (i == label.toInt) marginOfLabel = margins(i)
+  if (margins(i) > maxMargin) {
+maxMargin = margins(i)
   }
-  margin
+  i += 1
 }
 
 /**
  * When maxMargin > 0, the original formula could cause overflow.
  * We address this by subtracting maxMargin from all the margins, so 
it's guaranteed
  * that all of the new margins will be smaller than zero to prevent 
arithmetic overflow.
  */
+val multipliers = new Array[Double](numClasses)
 val sum = {
   var temp = 0.0
   if (maxMargin > 0) {
-for (i <- 0 until numClasses) {
+var i = 0
--- End diff --

Can this not be:

```scala
val sum = {
  var temp = 0.0
  var i = 0
  while (i < numClasses) { 
if (maxMargin > 0) margins(i) -= maxMargin
val exp = math.exp(margins(i))
temp += exp
multipliers(i) = exp
i += 1
  }
  temp
}


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-10-21 Thread sethah

GitHub user sethah opened a pull request:

https://github.com/apache/spark/pull/15593

[SPARK-18060][ML] Avoid unnecessary computation for MLOR

## What changes were proposed in this pull request?

Before this patch, the gradient updates for multinomial logistic regression 
were computed by an outer loop over the number of classes and an inner loop 
over the number of features. Inside the inner loop, we standardized the feature 
value (`value / featuresStd(index)`), which means we performed the computation 
`numFeatures * numClasses` times. We only need to perform that computation 
`numFeatures` times, however. If we re-order the inner and outer loop, we can 
avoid this, but then we lose sequential memory access. In this patch, we 
instead lay out the coefficients in column major order while we train, so that 
we can avoid the extra computation and retain sequential memory access.

We convert back to row-major order when we create the model, since the 
vector matrix multiply required by predict will access the coefficients in 
row-major order.

## How was this patch tested?

This is an implementation detail only, so the original behavior should be 
maintained. All tests pass. I ran some performance tests to verify speedups. 
The results are below, and show significant speedups.

## Performance Tests

**Setup**

3 node bare-metal cluster
120 cores total
384 gb RAM total


**Results**

||   numPoints |   numFeatures |   numClasses |   regParam |   
elasticNetParam |   currentMasterTime (sec) |   thisPatchTime (sec) |   
pctSpeedup |

||-|---|--||---|---|---|--|
|  0 |   1e+07 |   100 |  500 |   0.5  |
 0 |90 |18 |   80 |
|  1 |   1e+08 |   100 |   50 |   0.5  |
 0 |90 |19 |   78 |
|  2 |   1e+08 |   100 |   50 |   0.05 |
 1 |72 |19 |   73 |
|  3 |   1e+06 |   100 | 5000 |   0.5  |
 0 |93 |53 |   43 |
|  4 |   1e+07 |   100 | 5000 |   0.5  |
 0 |   900 |   390 |   56 |
|  5 |   1e+08 |   100 |  500 |   0.5  |
 0 |   840 |   174 |   79 |
|  6 |   1e+08 |   100 |  200 |   0.5  |
 0 |   360 |72 |   80 |

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sethah/spark MLOR_PERF_COL_MAJOR_COEF

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15593.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15593


commit 4c19abebe0b78bcd26fc142ef6787517e1e4482d
Author: sethah 
Date:   2016-10-21T17:19:50Z

tests pass except initial model

commit fcab96a3d608ca49d8a8963f79a277163d87ddce
Author: sethah 
Date:   2016-10-21T19:49:39Z

initialModel passes

commit 07fd1504136ad7b1ce37f443e26f407b07345991
Author: sethah 
Date:   2016-10-21T23:01:27Z

clean up and refactoring exp op in log agg




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

18 matches

Site Navigation

Mail list logo

Footer information