subject:"\[GitHub\] spark pull request #17078\: \[SPARK\-19746\]\[ML\] Faster indexing for logistic ag..."

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

2017-02-27 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17078


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

2017-02-27 Thread sethah

Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17078#discussion_r103343872
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1431,7 +1431,12 @@ private class LogisticAggregator(
   private var weightSum = 0.0
   private var lossSum = 0.0
 
-  private val gradientSumArray = Array.fill[Double](coefficientSize)(0.0D)
+  @transient private lazy val coefficientsArray = bcCoefficients.value 
match {
--- End diff --

Yeah, I'll update it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

2017-02-27 Thread dbtsai

Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17078#discussion_r103342591
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -456,6 +456,32 @@ class LogisticRegressionSuite
 assert(blrModel.intercept !== 0.0)
   }
 
+  test("sparse coefficients in LogisticAggregator") {
+val bcCoefficientsBinary = 
spark.sparkContext.broadcast(Vectors.sparse(2, Array(0), Array(1.0)))
+val bcFeaturesStd = spark.sparkContext.broadcast(Array(1.0))
+val binaryAgg = new LogisticAggregator(bcCoefficientsBinary, 
bcFeaturesStd, 2,
+  fitIntercept = true, multinomial = false)
+val thrownBinary = withClue("binary logistic aggregator cannot handle 
sparse coefficients") {
--- End diff --

I think we should handle sparse coefficients for further performance 
improvement. But not in this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

2017-02-27 Thread dbtsai

Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17078#discussion_r103342093
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1447,7 +1447,7 @@ private class LogisticAggregator(
   label: Double): Unit = {
 
 val localFeaturesStd = bcFeaturesStd.value
-val localCoefficients = bcCoefficients.value
+val localCoefficients = bcCoefficients.value.toArray
--- End diff --

My concern is that if `coefficients` is sparse, we are not just doing the 
pointer indirection but creating a new dense array from sparse matrix. I know 
we always pass in a dense matrix so this will not be an issue now, but being 
said that, in the following code, if we call `compress` in the `coefficients`, 
we may be able to broadcast a smaller object when L1 is applied or in the 
initial iteration that most of the elements in `coefficients` are zero.


https://github.com/sethah/spark/blob/3bea389f6780e1fd0385fbe26954fa4f59b69e37/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L1674


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

2017-02-27 Thread dbtsai

Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17078#discussion_r103342317
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1431,7 +1431,12 @@ private class LogisticAggregator(
   private var weightSum = 0.0
   private var lossSum = 0.0
 
-  private val gradientSumArray = Array.fill[Double](coefficientSize)(0.0D)
+  @transient private lazy val coefficientsArray = bcCoefficients.value 
match {
--- End diff --

Can you have the type of `coefficientsArray` so people can clear know that 
it's a primitive array? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

2017-02-27 Thread sethah

Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17078#discussion_r103239934
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1447,7 +1447,7 @@ private class LogisticAggregator(
   label: Double): Unit = {
 
 val localFeaturesStd = bcFeaturesStd.value
-val localCoefficients = bcCoefficients.value
+val localCoefficients = bcCoefficients.value.toArray
--- End diff --

Another thought is that by tagging the class member as `@transient lazy 
val` we are at least making this hidden problem more explicit. I think using 
the transient method makes it a bit less likely that someone will come along 
and make a change in the future that serializes the coefficients. I'll plan to 
update it here with the transient tag then.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

2017-02-27 Thread sethah

Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17078#discussion_r103236865
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1447,7 +1447,7 @@ private class LogisticAggregator(
   label: Double): Unit = {
 
 val localFeaturesStd = bcFeaturesStd.value
-val localCoefficients = bcCoefficients.value
+val localCoefficients = bcCoefficients.value.toArray
--- End diff --

The above check got us into trouble because if we don't add `@transient 
lazy val` then we'll serialize the coefficients. The call to `toArray` is 
really just a small bit of pointer indirection, and while I agree it is not 
_great_ to call it every time, the extra function call should pale in 
comparison to the `O(numClasses * numFeatures)` ops we do in the method. 

That said, I'm ok with either solution, just wanted to point out the 
pros/cons of each. Let me know what you think, thanks for reviewing!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

2017-02-27 Thread dbtsai

Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17078#discussion_r103154658
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1447,7 +1447,7 @@ private class LogisticAggregator(
   label: Double): Unit = {
 
 val localFeaturesStd = bcFeaturesStd.value
-val localCoefficients = bcCoefficients.value
+val localCoefficients = bcCoefficients.value.toArray
--- End diff --

In the first version of LOR, we have the following code which avoid this 
issue you pointed out. 

```scala
  private val weightsArray = weights match {
 case dv: DenseVector => dv.values
 case _ =>
   throw new IllegalArgumentException(
 s"weights only supports dense vector but got type 
${weights.getClass}.")
   }
```

I think order approach will be more efficient since `toArray` is only 
called once (you can add the case for sparse), and for sparse initial 
coefficients, we will not convert from sparse to dense again and again. 

This can be a future work. With L1 applied, the coefficients can be very 
sparse, so we can compress the coefficients for each iteration, and have 
specialized implementation for `UpdateInPlace`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

2017-02-26 Thread sethah

GitHub user sethah opened a pull request:

https://github.com/apache/spark/pull/17078

[SPARK-19746][ML] Faster indexing for logistic aggregator

## What changes were proposed in this pull request?

JIRA: [SPARK-19746](https://issues.apache.org/jira/browse/SPARK-19746)

The following code is inefficient:

scala
val localCoefficients: Vector = bcCoefficients.value

features.foreachActive { (index, value) =>
  val stdValue = value / localFeaturesStd(index)
  var j = 0
  while (j < numClasses) {
margins(j) += localCoefficients(index * numClasses + j) * stdValue
j += 1
  }
}


`localCoefficients(index * numClasses + j)` calls `Vector.apply` which 
creates a new Breeze vector and indexes that. Even if it is not that slow to 
create the object, we will generate a lot of extra garbage that may result in 
longer GC pauses. This is a hot inner loop, so we should optimize wherever 
possible.

## How was this patch tested?

I don't think there's a great way to test this patch. It's purely 
performance related, so unit tests should guarantee that we haven't made any 
unwanted changes. Empirically I observed between 10-40% speedups just running 
short local tests. I suspect the big differences will be seen when large 
data/coefficient sizes have to pause for GC more often. I welcome other ideas 
for testing.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sethah/spark logistic_agg_indexing

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17078.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17078


commit 3bea389f6780e1fd0385fbe26954fa4f59b69e37
Author: sethah 
Date:   2017-02-27T05:22:09Z

better indexing for logistic agg




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

[GitHub] spark pull request #17078: [SPARK-19746][ML] Faster indexing for logistic ag...

9 matches

Site Navigation

Mail list logo

Footer information