[GitHub] spark pull request #15023: Backport [SPARK-5847] Allow for configuring Metri...

2016-09-09 Thread AnthonyTruchet
GitHub user AnthonyTruchet opened a pull request:

https://github.com/apache/spark/pull/15023

Backport [SPARK-5847] Allow for configuring MetricsSystem's prefix

This is a backport of #14270. Because the spark.internal.config system
does not exist in branch 1.6, a simpler substitution scheme for ${} references
in the spark.metrics.namespace value, using only the Spark configuration, had
to be added to preserve the behaviour discussed and tested in the tickets.

This backport is contributed by Criteo SA under the Apache v2 license.

## What changes were proposed in this pull request?

Adding a new property to SparkConf called spark.metrics.namespace that
allows users to set a custom namespace for executor and driver metrics in
the metrics systems.

By default, the root namespace used for driver or executor metrics is
the value of `spark.app.id`. However, users often want to track metrics
across apps for driver and executor metrics, which is hard to do with the
application ID (i.e. `spark.app.id`) since it changes with every invocation
of the app. For such use cases, users can set the `spark.metrics.namespace`
property to any given value, or to a reference to another Spark configuration
key like `${spark.app.name}`, which is then used to populate the root
namespace of the metrics system (with the app name in our example). In other
words, `spark.metrics.namespace` can be set to any arbitrary Spark property
key, whose value is used as the root namespace of the metrics system.
Metrics other than driver and executor metrics are never prefixed with
`spark.app.id`, nor does the `spark.metrics.namespace` property have any
such effect on them.
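
For illustration, a minimal sketch of how a user might set this property, assuming the backported substitution behaves like the master PR #14270 (the app name below is a made-up example):

```
import org.apache.spark.SparkConf

// Sketch: report driver/executor metrics under the (stable) app name
// rather than the per-run spark.app.id.
val conf = new SparkConf()
  .setAppName("my-recurring-job")
  // "${spark.app.name}" is looked up in the Spark configuration, so the
  // metrics root namespace becomes "my-recurring-job" on every run.
  .set("spark.metrics.namespace", "${spark.app.name}")
```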


## How was this patch tested?

Added new unit tests, modified existing unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/criteo-forks/spark backport-SPARK-5847

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15023.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15023


commit c1c57fbb8593346410995578668cc7bff79f77e0
Author: Anthony Truchet 
Date:   2016-09-01T09:37:37Z

[SPARK-5847][CORE][BRANCH-1.6] Allow for configuring MetricsSystem's use of 
app ID to namespace all metrics

This is a backport of #14270. Because the spark.internal.config system
does not exist in branch 1.6, a simpler substitution scheme for ${} references
in the spark.metrics.namespace value, using only the Spark configuration, had
to be added to preserve the behaviour discussed and tested in the tickets.

This backport is contributed by Criteo SA under the Apache v2 license.

Adding a new property to SparkConf called spark.metrics.namespace that
allows users to set a custom namespace for executor and driver metrics in
the metrics systems.

By default, the root namespace used for driver or executor metrics is
the value of `spark.app.id`. However, users often want to track metrics
across apps for driver and executor metrics, which is hard to do with the
application ID (i.e. `spark.app.id`) since it changes with every invocation
of the app. For such use cases, users can set the `spark.metrics.namespace`
property to any given value, or to a reference to another Spark configuration
key like `${spark.app.name}`, which is then used to populate the root
namespace of the metrics system (with the app name in our example). In other
words, `spark.metrics.namespace` can be set to any arbitrary Spark property
key, whose value is used as the root namespace of the metrics system.
Metrics other than driver and executor metrics are never prefixed with
`spark.app.id`, nor does the `spark.metrics.namespace` property have any
such effect on them.

Added new unit tests, modified existing unit tests.







[GitHub] spark issue #14270: [SPARK-5847][CORE] Allow for configuring MetricsSystem's...

2016-09-09 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/14270
  
Hello @markgrover @vanzin! As this is just a backport of your work, would
you please consider reviewing it?





[GitHub] spark issue #15023: Backport [SPARK-5847] Allow for configuring MetricsSyste...

2016-09-12 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/15023
  
I'm aware that features are not generally back-ported. The point is, for us
this is a bug that prevents a production deployment. We thus back-ported the
fix internally and now propose to share it with the community, as the work is
already done.





[GitHub] spark issue #15023: Backport [SPARK-5847] Allow for configuring MetricsSyste...

2016-09-13 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/15023
  
This is perfectly understandable, just as we err on the side of caution by
not switching to the 2.0 branch for production just now :-)





[GitHub] spark pull request #15905: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-11-16 Thread AnthonyTruchet
GitHub user AnthonyTruchet opened a pull request:

https://github.com/apache/spark/pull/15905

[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0

## What changes were proposed in this pull request?

CostFun used to send a dense vector of zeroes as part of a closure in a
treeAggregate call. To avoid that, we replace treeAggregate with
mapPartitions + treeReduce, creating the zero vector in place inside the
mapPartitions block.
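
As a rough sketch of that pattern (hypothetical helper, not the actual patch; `addGradient` stands in for `localGradient.compute`):

```
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Sketch: allocate the dense zero inside each partition so only `n` is
// captured in the closure, instead of an n-dimensional vector of zeroes.
def gradientSums(data: RDD[(Double, Vector)], n: Int)(
    addGradient: (Double, Vector, Vector) => Double): (Vector, Double) =
  data.mapPartitions { iter =>
    val grad = Vectors.zeros(n) // created on the executor, never serialized
    var loss = 0.0
    iter.foreach { case (label, features) =>
      loss += addGradient(label, features, grad) // accumulates into grad
    }
    Iterator.single((grad, loss))
  }.treeReduce { (a, b) =>
    val ((g1, l1), (g2, l2)) = (a, b)
    val d1 = g1.toDense
    // In-place add of g2 into d1, standing in for mllib's internal BLAS.axpy.
    g2.foreachActive((i, v) => d1.values(i) += v)
    (d1, l1 + l2)
  }
```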

## How was this patch tested?

Tests run by hand locally (setting up local infrastructure to run the
official Spark tests is in progress).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AnthonyTruchet/spark ENG-17719-lbfgs-only

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15905.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15905


commit e0a88bd716feae54f5ff0c235c7bb566fefcc7bd
Author: Eugene Kharitonov 
Date:   2016-10-14T11:25:34Z

[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0

CostFun used to send a dense vector of zeroes as part of a closure in a
treeAggregate call. To avoid that, we replace treeAggregate with
mapPartitions + treeReduce, creating the zero vector in place inside the
mapPartitions block.







[GitHub] spark pull request #15905: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-11-17 Thread AnthonyTruchet
Github user AnthonyTruchet commented on a diff in the pull request:

https://github.com/apache/spark/pull/15905#discussion_r88409682
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +241,28 @@ object LBFGS extends Logging {
       val bcW = data.context.broadcast(w)
       val localGradient = gradient

-      val (gradientSum, lossSum) = data.treeAggregate((Vectors.zeros(n), 0.0))(
-          seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
-            val l = localGradient.compute(
-              features, label, bcW.value, grad)
-            (grad, loss + l)
-          },
-          combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
-            axpy(1.0, grad2, grad1)
-            (grad1, loss1 + loss2)
-          })
+      /** Given (current accumulated gradient, current loss) and (label, features)
+       * tuples, updates the current gradient and current loss
+       */
+      val seqOp = (c: (Vector, Double), v: (Double, Vector)) => {
+        (c, v) match { case ((grad, loss), (label, features)) =>
--- End diff --

OK, will do it.





[GitHub] spark issue #15905: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-17 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/15905
  
By the way, do you think this should be addressed in core or just in each
ML-specific use?





[GitHub] spark pull request #15905: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-11-17 Thread AnthonyTruchet
Github user AnthonyTruchet closed the pull request at:

https://github.com/apache/spark/pull/15905





[GitHub] spark issue #15905: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-17 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/15905
  
I missed part of my company guidelines. Closing this PR and creating a new 
one shortly from my company account. Sorry for the noise.





[GitHub] spark pull request #15931: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-11-18 Thread AnthonyTruchet
GitHub user AnthonyTruchet opened a pull request:

https://github.com/apache/spark/pull/15931

[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0

## What changes were proposed in this pull request?

The zero for the aggregation used to be shipped in a closure, which is
highly problematic when this zero is big (100s of MB is typical for ML).
This change introduces a new overload of treeAggregate which only ships a
function able to generate this zero.
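
A sketch of the shape such an overload could take (assumed name and signature, not the merged API):

```
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sketch: the caller ships a cheap zeroGen thunk; each partition
// materializes its own zero, so no large zero value is serialized
// into the task closure.
def treeAggregateWithZeroGenerator[T, U: ClassTag](rdd: RDD[T])(
    zeroGen: () => U)(
    seqOp: (U, T) => U,
    combOp: (U, U) => U,
    depth: Int = 2): U =
  rdd.mapPartitions { iter =>
    Iterator.single(iter.foldLeft(zeroGen())(seqOp)) // zero built on executor
  }.treeReduce(combOp, depth)
```

L-BFGS could then call it with `() => (Vectors.zeros(n), 0.0)`, capturing only `n` in the closure.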

## How was this patch tested?

Unit tests for the core module run locally.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AnthonyTruchet/spark ENG-17719-lbfgs-only

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15931.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15931


commit d5d110b8fc0d7165cce116c1e8342ce7aca2467b
Author: Eugene Kharitonov 
Date:   2016-10-14T11:25:34Z

[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0

CostFun used to send a dense vector of zeroes as part of a closure in a
treeAggregate call. To avoid that, we replace treeAggregate with
mapPartitions + treeReduce, creating the zero vector in place inside the
mapPartitions block.







[GitHub] spark pull request #15931: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-11-18 Thread AnthonyTruchet
Github user AnthonyTruchet closed the pull request at:

https://github.com/apache/spark/pull/15931





[GitHub] spark issue #15931: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-18 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/15931
  
Sure, sorry, I mistakenly pushed a badly rebased version.





[GitHub] spark pull request #15963: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-11-21 Thread AnthonyTruchet
GitHub user AnthonyTruchet opened a pull request:

https://github.com/apache/spark/pull/15963

[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0

## What changes were proposed in this pull request?

CostFun used to send a dense vector of zeroes as part of a closure in a
treeAggregate call. To avoid that, we replace treeAggregate with
mapPartitions + treeReduce, creating the zero vector in place inside the
mapPartitions block.

## How was this patch tested?

Unit tests for the mllib module run locally for correctness.

As for performance, we ran a heavy optimization on our production data (50
iterations on 128 MB weight vectors) and saw a significant decrease both in
runtime and in containers being killed for lack of off-heap memory.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/criteo-forks/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15963.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15963


commit d5d110b8fc0d7165cce116c1e8342ce7aca2467b
Author: Eugene Kharitonov 
Date:   2016-10-14T11:25:34Z

[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0

CostFun used to send a dense vector of zeroes as part of a closure in a
treeAggregate call. To avoid that, we replace treeAggregate with
mapPartitions + treeReduce, creating the zero vector in place inside the
mapPartitions block.

commit 7095bc2c568564ebc4584c1101cb94801079d1bd
Author: Anthony Truchet 
Date:   2016-11-18T12:36:00Z

Style fix according to reviewers' feedback

commit 011a8d77cac26f820c3cecda1c28e623a64f803b
Author: Anthony Truchet 
Date:   2016-11-21T12:26:10Z

Merge pull request #11 from AnthonyTruchet/ENG-17719-lbfgs-only

[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0







[GitHub] spark issue #15963: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-21 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/15963
  
@srowen Here is at last the real PR for SPARK-18471. 

Sorry for the noise due to GitHub fiddling...





[GitHub] spark issue #15963: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-27 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/15963
  
Hello @srowen, sorry not to have updated this lately; I've been taken up by
other emergencies at work. I'll update this on Monday. Actually, I'll submit a
variant of treeAggregate in core that we will be able to use for other similar
use cases in ML(lib).

I appreciate your care for these patches and will try to be reasonably
responsive too :-)





[GitHub] spark pull request #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-11-28 Thread AnthonyTruchet
GitHub user AnthonyTruchet opened a pull request:

https://github.com/apache/spark/pull/16037

[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0

## What changes were proposed in this pull request?

CostFun used to send a dense vector of zeroes as part of a closure in a
treeAggregate call. To avoid that, we replace treeAggregate with
mapPartitions + treeReduce, creating the zero vector in place inside the
mapPartitions block.

## How was this patch tested?

Unit tests for the mllib module run locally for correctness.

As for performance, we ran a heavy optimization on our production data (50
iterations on 128 MB weight vectors) and saw a significant decrease both in
runtime and in containers being killed for lack of off-heap memory.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/criteo-forks/spark ENG-17719-lbfgs-only

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16037.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16037


commit 0e924ec09554acf3191dea51c5a13b024a0c643f
Author: Eugene Kharitonov 
Date:   2016-10-14T11:25:34Z

[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0

CostFun used to send a dense vector of zeroes as part of a closure in a
treeAggregate call. To avoid that, we replace treeAggregate with
mapPartitions + treeReduce, creating the zero vector in place inside the
mapPartitions block.

commit af4b1c96ec5a5759e8b66117794d2a011f6313c3
Author: Anthony Truchet 
Date:   2016-11-18T12:36:00Z

Style fix according to reviewers' feedback

commit 0a0571dc9557e94a909ab8d8ef62c7dca474a1c0
Author: Anthony Truchet 
Date:   2016-11-28T15:39:16Z

Code style fixes according to reviewers' feedback







[GitHub] spark pull request #15963: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-11-28 Thread AnthonyTruchet
Github user AnthonyTruchet closed the pull request at:

https://github.com/apache/spark/pull/15963





[GitHub] spark issue #15963: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-28 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/15963
  
Once more (for the last time, hopefully) I mistakenly fiddled with the PR.
Closing this one and replacing it with #16037.
The code style review above has been taken into account in the new PR.





[GitHub] spark pull request #16038: [SPARK-18471][CORE] New treeAggregate overload fo...

2016-11-28 Thread AnthonyTruchet
GitHub user AnthonyTruchet opened a pull request:

https://github.com/apache/spark/pull/16038

[SPARK-18471][CORE] New treeAggregate overload for big large aggregators

## What changes were proposed in this pull request?

The zero for the aggregation used to be shipped in a closure, which is
highly problematic when this zero is big (100s of MB is typical for ML).
This change introduces a new overload of treeAggregate which only ships a
function able to generate this zero.

## How was this patch tested?

Test suites for the core and mllib modules run locally.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/criteo-forks/spark SPARK-18471

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16038.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16038


commit f6e23650fcaf943f058494ddd13282f37f9452f5
Author: Anthony Truchet 
Date:   2016-11-17T15:20:19Z

[SPARK-18471][CORE] New treeAggregate overload for big large aggregators

The zero for the aggregation used to be shipped in a closure, which is
highly problematic when this zero is big (100s of MB is typical for ML).
This change introduces a new overload of treeAggregate which only ships a
function able to generate this zero.







[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-28 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
@srowen here is the companion PR to #16037.





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-28 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
It is related: if this PR gets accepted, then the fix for PR #16037
becomes trivial: replace treeAggregate with a call to
`treeAggregateWithZeroGenerator(() => (Vectors.zeros(n), 0.0))`, in which case
only `n` gets caught in the closure, if I'm not mistaken. The fix then becomes
much easier to port to other similar use cases in MLlib.







[GitHub] spark pull request #16040: [SPARK-18612][MLLIB] Delete broadcasted variable ...

2016-11-28 Thread AnthonyTruchet
GitHub user AnthonyTruchet opened a pull request:

https://github.com/apache/spark/pull/16040

[SPARK-18612][MLLIB] Delete broadcasted variable in LBFGS CostFun

## What changes were proposed in this pull request?

Fix a broadcasted variable leak occurring at each invocation of CostFun in 
L-BFGS.
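
A minimal sketch of the shape of the fix (hypothetical simplified cost evaluation, not the actual patch):

```
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch: broadcast the weights for one cost evaluation, then destroy the
// broadcast so the driver-side copy is freed too; without destroy(), each
// CostFun invocation leaks one broadcast variable.
def evaluateLoss(data: RDD[(Double, Vector)], w: Vector)(
    pointLoss: (Vector, Double, Vector) => Double): Double = {
  val bcW = data.context.broadcast(w)
  val lossSum = data.map { case (label, features) =>
    pointLoss(bcW.value, label, features)
  }.sum()
  bcW.destroy() // releases executor copies and the driver copy
  lossSum
}
```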

## How was this patch tested?

Unit tests, plus a check that this fixes the fatal memory consumption on
Criteo's use cases.

This contribution is made on behalf of Criteo S.A.
(http://labs.criteo.com/) under the terms of the Apache v2 License.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/criteo-forks/spark SPARK-18612-lbfgs-cost-fun

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16040.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16040


commit 7aa7976c695ba9519164c2e5f76ff2faed373960
Author: Anthony Truchet 
Date:   2016-11-28T16:22:23Z

[SPARK-18612][MLLIB] Delete broadcasted variable in LBFGS CostFun

This contribution is made on behalf of Criteo S.A.
(http://labs.criteo.com/) under the terms of the Apache v2 License.







[GitHub] spark issue #16040: [SPARK-18612][MLLIB] Delete broadcasted variable in LBFG...

2016-11-28 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16040
  
Agreed, and we might also consider making the broadcasted variable
compatible with some automatic resource management to avoid such leaks (like
https://github.com/jsuereth/scala-arm).





[GitHub] spark pull request #16042: Fix of dev scripts and new, Criteo specific, ones...

2016-11-28 Thread AnthonyTruchet
GitHub user AnthonyTruchet opened a pull request:

https://github.com/apache/spark/pull/16042

Fix of dev scripts and new, Criteo specific, ones WIP




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AnthonyTruchet/spark dev-tools

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16042.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16042


commit 18661a2bb527adbd01e98158696a16f6d8162411
Author: Tommy YU 
Date:   2016-02-12T02:38:49Z

[SPARK-13153][PYSPARK] ML persistence failed when handle no default value 
parameter

Fix this defect by checking whether a default value exists or not.

yanboliang Please help to review.

Author: Tommy YU 

Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.

(cherry picked from commit d3e2e202994e063856c192e9fdd0541777b88e0e)
Signed-off-by: Xiangrui Meng 

commit 93a55f3df3c9527ecf4143cb40ac7212bc3a975a
Author: markpavey 
Date:   2016-02-13T08:39:43Z

[SPARK-13142][WEB UI] Problem accessing Web UI /logPage/ on Microsoft 
Windows

Due to being on a Windows platform I have been unable to run the tests as 
described in the "Contributing to Spark" instructions. As the change is only to 
two lines of code in the Web UI, which I have manually built and tested, I am 
submitting this pull request anyway. I hope this is OK.

Is it worth considering also including this fix in any future 1.5.x 
releases (if any)?

I confirm this is my own original work and license it to the Spark project 
under its open source license.

Author: markpavey 

Closes #11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix.

(cherry picked from commit 374c4b2869fc50570a68819cf0ece9b43ddeb34b)
Signed-off-by: Sean Owen 

commit 107290c94312524bfc4560ebe0de268be4ca56af
Author: Liang-Chi Hsieh 
Date:   2016-02-13T23:56:20Z

[SPARK-12363][MLLIB] Remove setRun and fix PowerIterationClustering failed 
test

JIRA: https://issues.apache.org/jira/browse/SPARK-12363

This issue was pointed out by yanboliang. When `setRuns` is removed from
PowerIterationClustering, one of the tests fails. I found that some
`dstAttr`s of the normalized graph are not correct values but 0.0. By setting
`TripletFields.All` in `mapTriplets` it can work.

Author: Liang-Chi Hsieh 
Author: Xiangrui Meng 

Closes #10539 from viirya/fix-poweriter.

(cherry picked from commit e3441e3f68923224d5b576e6112917cf1fe1f89a)
Signed-off-by: Xiangrui Meng 

commit ec40c5a59fe45e49496db6e0082ddc65c937a857
Author: Amit Dev 
Date:   2016-02-14T11:41:27Z

[SPARK-13300][DOCUMENTATION] Added pygments.rb dependency

Looks like the pygments.rb gem is also required for the jekyll build to work.
At least on Ubuntu/RHEL I could not build without this dependency, so I added
this to the steps.

Author: Amit Dev 

Closes #11180 from amitdev/master.

(cherry picked from commit 331293c30242dc43e54a25171ca51a1c9330ae44)
Signed-off-by: Sean Owen 

commit 71f53edc0e39bc907755153b9603be8c6fcc1d93
Author: JeremyNixon 
Date:   2016-02-15T09:25:13Z

[SPARK-13312][MLLIB] Update java train-validation-split example in ml-guide

Response to JIRA https://issues.apache.org/jira/browse/SPARK-13312.

This contribution is my original work and I license the work to this 
project.

Author: JeremyNixon 

Closes #11199 from JeremyNixon/update_train_val_split_example.

(cherry picked from commit adb548365012552e991d51740bfd3c25abf0adec)
Signed-off-by: Sean Owen 

commit d95089190d714e3e95579ada84ac42d463f824b5
Author: Miles Yucht 
Date:   2016-02-16T13:01:21Z

Correct SparseVector.parse documentation

There's a small typo in the SparseVector.parse docstring (which says that 
it returns a DenseVector rather than a SparseVector), which seems to be 
incorrect.

Author: Miles Yucht 

Closes #11213 from mgyucht/fix-sparsevector-docs.

(cherry picked from commit 827ed1c06785692d14857bd41f1fd94a0853874a)
Signed-off-by: Sean Owen 

commit 98354cae984e3719a49050e7a6aa75dae78b12bb
Author: Sital Kedia 
Date:   2016-02-17T06:27:34Z

[SPARK-13279] Remove O(n^2) operation from scheduler.

This commit removes an unnecessary duplicate check in addPendingTask that 
meant
that scheduling a task set took time proportional to (# tasks)^2.

Author: Sital Kedia 

Closes #11175 from sitalkedia/fix_stuck_driver.

(cherry picked from commit 1e1e31e03df14f2e7a9654e640fb2796cf059fe0)
Signed-off-by: Kay Ousterhout 

commit 66106a660149607348b8e51994eb2ce29d67abc0
Author: Christopher C. Aycock 
Date:   2016-02-17T19:24:18Z

[SPARK-13350][D

[GitHub] spark issue #16042: Fix of dev scripts and new, Criteo specific, ones WIP

2016-11-28 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16042
  
Sorry, created by mistake.





[GitHub] spark pull request #16042: Fix of dev scripts and new, Criteo specific, ones...

2016-11-28 Thread AnthonyTruchet
Github user AnthonyTruchet closed the pull request at:

https://github.com/apache/spark/pull/16042





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
Maybe we could have a live talk/chat this evening Paris time / morning West
Coast time?

If I'm not mistaken, both #16037 and this change fix the issue at two
different levels:
* #16037 considers that the issue is specific to ML and that we can and
should work around it in MLlib.
* This change considers that the root cause is an inefficiency of
treeAggregate in core, whose fix would enable a better fix in MLlib. I
updated this PR to reflect this.

I might have misunderstood your comment
https://github.com/apache/spark/pull/15905#pullrequestreview-8844854





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
No, this does not do the trick, as the result of the aggregation IS dense,
and the zero in (tree)aggregate has the same type as the result. Said
otherwise, in L-BFGS we aggregate vectors that are each pretty sparse but
whose aggregation is dense, as they have different supports. So taking a
dense vector as the zero of the aggregator makes perfect sense: adding a
sparse contribution to a dense aggregator yields a dense aggregator, and this
is what is desired. We just need not waste bandwidth by sending this huge
zero to the executors.

If you do not want to change or add a public API in core, we might
contribute a treeAggregateWithZeroGenerator to MLlib using a third way to
solve the issue: use an Option[DenseVector] as the aggregate type and unwrap
it as a dense zero by wrapping the seqOp and combOp; we have a draft of that
locally...
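
A sketch of that Option-based wrapping (assumed names; only a local draft, not merged code):

```
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sketch: None stands for the zero, so treeAggregate's zeroValue costs
// nothing to ship; the real (dense) zero is generated lazily on executors.
def treeAggregateWithZeroGenerator[T, U: ClassTag](rdd: RDD[T])(
    zeroGen: () => U)(
    seqOp: (U, T) => U,
    combOp: (U, U) => U): U = {
  val wrappedSeqOp = (acc: Option[U], v: T) =>
    Some(seqOp(acc.getOrElse(zeroGen()), v)) // unwrap, or materialize the zero
  val wrappedCombOp = (a: Option[U], b: Option[U]) => (a, b) match {
    case (Some(x), Some(y)) => Some(combOp(x, y))
    case _                  => a.orElse(b) // at most one side is defined
  }
  rdd.treeAggregate(Option.empty[U])(wrappedSeqOp, wrappedCombOp)
    .getOrElse(zeroGen()) // empty RDD: fall back to a fresh zero
}
```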





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
Yes, sure, the zero Vector is very sparse. But I do not get your suggestion:
I see no way to pass a SparseVector as the zero and have the vector type
change along the way to DenseVector using only operations on sparse vectors.

What do you propose? We do want to accumulate the SparseVector built from
each data point into a DenseVector. And because the type of the zero is the
type of the result, we need to use a dense vector as the zero, don't we?
You might have seen a way out of this that I haven't, have you?
 





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
The big doubt I have is about *Of course, the first call to axpy should
produce a dense vector, but, that's already on the executor*: the other
operand is sparse and has to be sparse, and this axpy call between two sparse
vectors has to return a sparse vector. And we do not want to cast at the end,
because the performance penalty of using sparse instead of dense will already
have been paid.

Fully exploring the trade-off between sparse and dense aggregation is out of
scope and likely very application specific. I'll have a new look at the
problem, trying to manage a workaround along the lines of sparse vectors, but
this will not solve the typing issue as far as I can see for now.





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
Actually, aggregating (and thus sending over the network) quite dense
SparseVectors with tens of millions of elements is not to be taken lightly.
This would require serious benchmarking. What I am saying is that we cannot
do that lightly, because we know that aggregating `acc += partial` when
`partial` is sparse and `acc` is dense is efficient locally (i.e. within a
partition), and that after a partition has been aggregated the vector is
already quite dense (enough so that the dense and sparse representations have
similar sizes).

Anyway, if you consider this is not worth a new API and an optimization in
core, the discussion should probably continue on the MLlib-only PR #16037,
for which we already have a specific workaround.
I just naively assumed that aggregating over big accumulators was typical
enough to get some support from core; if this is not the case, I'll fall back
on an MLlib-only workaround.

Feel free to close this PR if you deem it irrelevant.





[GitHub] spark pull request #16078: [SPARK-18471][MLLIB] Fix huge vectors of zero sen...

2016-11-30 Thread AnthonyTruchet
GitHub user AnthonyTruchet opened a pull request:

https://github.com/apache/spark/pull/16078

[SPARK-18471][MLLIB] Fix huge vectors of zero sent in closure in L-BFGS

## What changes were proposed in this pull request?

Introduced the util TreeAggregatorWithZeroGenerator and used it to avoid
sending a huge zero vector in L-BFGS or similar aggregations.

## How was this patch tested?

Ran a custom L-BFGS using this aggregator instead of treeAggregate, with
significantly increased performance and stability.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/criteo-forks/spark ENG-17719-wrapper

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16078.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16078


commit b4333a3b89362952e900acd7824e5d40500e3b9f
Author: Anthony Truchet 
Date:   2016-11-29T18:20:38Z

[SPARK-18471][MLLIB] Fix huge vectors of zero sent in closure in L-BFGS

Introduced the util TreeAggregatorWithZeroGenerator to avoid sending a huge
zero vector in L-BFGS or similar aggregations, as only the size of the zero
value to be generated is captured in the closure.







[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-30 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
@mridulm we don't know how to monitor the size of the serialized task. Sure,
it would not be 10MB, due to all those zeros. But we nonetheless measured a
significant increase in performance and (more importantly) in stability when
using the workaround and our custom TreeAggregatorWithZeroGenerator.

@srowen when the density of a SparseVector increases it becomes *very*
inefficient, not only because it uses almost twice as much memory, but
mainly because you fragment memory accesses and overload the GC :(

See #16078 for a generic wrapper around treeAggregate which uses Option[U]
to denote the zero of U by None and generates the full representation on
demand when calling seqOp or combOp.





[GitHub] spark issue #16078: [SPARK-18471][MLLIB] Fix huge vectors of zero send in cl...

2016-11-30 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16078
  
It is necessary to address it in L-BFGS at least. We propose a solution in
core, which can be legitimately rejected as not relevant for core, and two
solutions in MLlib: one provides a reusable util to aggregate large vectors,
the other is a specific hack in L-BFGS. At least one of them deserves to be
worked on, I believe, as sparse vectors are not an option for performance
reasons.





[GitHub] spark pull request #16038: [SPARK-18471][CORE] New treeAggregate overload fo...

2016-11-30 Thread AnthonyTruchet
Github user AnthonyTruchet closed the pull request at:

https://github.com/apache/spark/pull/16038





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-30 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
Agreed, this is too ML-specific to deserve a fix in core.





[GitHub] spark issue #16078: [SPARK-18471][MLLIB] Fix huge vectors of zero send in cl...

2016-11-30 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16078
  
In order to reduce the mess around multiple PRs I'll close this one and
rebase the change into #16037 as you requested.

By the way, what is the right way, convenient for you, to share a proposal
without creating a PR?





[GitHub] spark pull request #16078: [SPARK-18471][MLLIB] Fix huge vectors of zero sen...

2016-11-30 Thread AnthonyTruchet
Github user AnthonyTruchet closed the pull request at:

https://github.com/apache/spark/pull/16078





[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-30 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16037
  
Hello @MLnick,

Sorry that my push of a new version just crossed with your suggestion.

I'll test your suggestion. Thanks for having written it; I didn't get
that this was what Sean's hint meant.





[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-30 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16037
  
@MLnick I have to convert the denseGrad returned by seqOp from DenseVector
to Vector. The only ways I could find to do it with the API are
`Vectors.fromBreeze(denseGrad.asBreeze)` or
`Vectors.dense(denseGrad.values)`; I'm a bit concerned about the potential
copy of a 128MB piece of memory...
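
(As an aside, an upcast may be enough here, since `DenseVector` already is a `Vector`; illustrative sketch:)

```
import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}

// A type ascription upcasts without touching the backing array,
// so no element of the (potentially 128MB) array is copied.
val denseGrad: DenseVector = Vectors.zeros(4).toDense
val grad: Vector = denseGrad
```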






[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-12-01 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16037
  
Scala style checks fixed, and spurious cast removed.





[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-12-01 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16037
  
This seems very related to the change. I know nothing about the way PySpark
interfaces Python and Scala, and I'm surprised that changing an internal of
the Scala lib causes this. Ready to learn and help though :-)

```
File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/classification.py", line 155, in __main__.LogisticRegressionModel
Failed example:
    mcm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3)
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib64/python2.6/doctest.py", line 1253, in __run
        compileflags, 1) in test.globs
      File "", line 1, in 
        mcm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3)
      [...]
    Py4JJavaError: An error occurred while calling o173.trainLogisticRegressionModelWithLBFGS.
    [...]
    Caused by: java.lang.IllegalArgumentException: axpy only supports adding to a dense vector but got type class org.apache.spark.mllib.linalg.SparseVector.
        at org.apache.spark.mllib.linalg.BLAS$.axpy(BLAS.scala:58)
        at org.apache.spark.mllib.optimization.LBFGS$CostFun$$anonfun$2.apply(LBFGS.scala:257)
        at org.apache.spark.mllib.optimization.LBFGS$CostFun$$anonfun$2.apply(LBFGS.scala:255)
```





[GitHub] spark pull request #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-12-01 Thread AnthonyTruchet
Github user AnthonyTruchet commented on a diff in the pull request:

https://github.com/apache/spark/pull/16037#discussion_r90422375
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +241,27 @@ object LBFGS extends Logging {
       val bcW = data.context.broadcast(w)
       val localGradient = gradient

-      val (gradientSum, lossSum) = data.treeAggregate((Vectors.zeros(n), 0.0))(
-          seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
-            val l = localGradient.compute(
-              features, label, bcW.value, grad)
-            (grad, loss + l)
-          },
-          combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
-            axpy(1.0, grad2, grad1)
-            (grad1, loss1 + loss2)
-          })
+      // Given (current accumulated gradient, current loss) and (label, features)
+      // tuples, updates the current gradient and current loss
+      val seqOp = (c: (Vector, Double), v: (Double, Vector)) =>
+        (c, v) match {
+          case ((grad, loss), (label, features)) =>
+            val denseGrad = grad.toDense
+            val l = localGradient.compute(features, label, bcW.value, denseGrad)
+            (denseGrad, loss + l)
+        }
+
+      // Adds two (gradient, loss) tuples
+      val combOp = (c1: (Vector, Double), c2: (Vector, Double)) =>
+        (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
+          val denseGrad1 = grad1.toDense
+          val denseGrad2 = grad2.toDense
+          axpy(1.0, denseGrad2, denseGrad1)
+          (grad1, loss1 + loss2)
--- End diff --

done, thanks for spotting.
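
(Presumably the spot: `combOp` should return the densified accumulator that `axpy` actually updated, i.e. `(denseGrad1, loss1 + loss2)`, rather than the original, possibly sparse `grad1`.)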





[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-12-01 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16037
  
@MLnick there it is, all tests passing :-)





[GitHub] spark pull request #14268: [SPARK-16440][MLlib] Destroy broadcasted variable...

2016-07-19 Thread AnthonyTruchet
GitHub user AnthonyTruchet opened a pull request:

https://github.com/apache/spark/pull/14268

[SPARK-16440][MLlib] Destroy broadcasted variables even on driver

## What changes were proposed in this pull request?
Forgotten broadcast variables were taken care of in a previous PR (#14153). 
This PR turns those `unpersist()` calls into `destroy()` so that memory is freed even 
on the driver. A minimal sketch of the difference follows.
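
For context, a sketch of what `destroy()` buys over `unpersist()` (hypothetical `bigModel`, `rdd` and `score`; only the `Broadcast` calls are the real API):

```scala
val bc = sc.broadcast(bigModel)
try {
  // Executors read bc.value; unpersist() would only drop these
  // executor-side copies while keeping the driver-side copy alive.
  rdd.map(x => score(bc.value, x)).count()
} finally {
  // destroy() also releases the driver-side copy, so the memory is
  // reclaimed on the driver too.
  bc.destroy()
}
```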


## How was this patch tested?
Unit Tests in Word2VecSuite were run locally.


This contribution is done on behalf of Criteo, according to the
terms of the Apache license 2.0.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/criteo-forks/spark-1 SPARK-16440

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14268.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14268


commit 4ad38360290d59fdf25a009bc65823553cea9b10
Author: Anthony Truchet 
Date:   2016-07-08T12:54:24Z

[SPARK-16440][MLlib] Destroy broadcasted variables even on driver

This contribution is done on behalf of Criteo, according to the
terms of the Apache license.







[GitHub] spark pull request #14299: Ensure broadcasted variables are destroyed even i...

2016-07-21 Thread AnthonyTruchet
GitHub user AnthonyTruchet opened a pull request:

https://github.com/apache/spark/pull/14299

Ensure broadcasted variables are destroyed even in case of exception

## What changes were proposed in this pull request?

Ensure broadcast variables are destroyed even in case of exception

## How was this patch tested?

Word2VecSuite was run locally



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/criteo-forks/spark SPARK-16440

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14299.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14299


commit 4ad38360290d59fdf25a009bc65823553cea9b10
Author: Anthony Truchet 
Date:   2016-07-08T12:54:24Z

[SPARK-16440][MLlib] Destroy broadcasted variables even on driver

This contribution is done on behalf of Criteo, according to the
terms of the Apache license.

commit 53911e04f82f58ee936edededea1d8e72bcb4ea8
Author: Anthony Truchet 
Date:   2016-07-19T17:42:26Z

[SPARK-16440][MLlib] Destroy broadcasted variables in a try finally

This contribution is done on behalf of Criteo, according to the
terms of the Apache license.

commit 568e4915f6b1c3cd30c1b9796764f543e27f91fc
Author: Anthony Truchet 
Date:   2016-07-21T08:45:44Z

Merge remote-tracking branch 'apache/master' into SPARK-16440







[GitHub] spark pull request #14299: Ensure broadcasted variables are destroyed even i...

2016-07-21 Thread AnthonyTruchet
Github user AnthonyTruchet commented on a diff in the pull request:

https://github.com/apache/spark/pull/14299#discussion_r71723751
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -313,133 +313,139 @@ class Word2Vec extends Serializable with Logging {
     val expTable = sc.broadcast(createExpTable())
     val bcVocab = sc.broadcast(vocab)
     val bcVocabHash = sc.broadcast(vocabHash)
-    // each partition is a collection of sentences,
-    // will be translated into arrays of Index integer
-    val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
-      // Each sentence will map to 0 or more Array[Int]
-      sentenceIter.flatMap { sentence =>
-        // Sentence of words, some of which map to a word index
-        val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
-        // break wordIndexes into trunks of maxSentenceLength when has more
-        wordIndexes.grouped(maxSentenceLength).map(_.toArray)
+
+    try {
+      // each partition is a collection of sentences,
+      // will be translated into arrays of Index integer
+      val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
+        // Each sentence will map to 0 or more Array[Int]
+        sentenceIter.flatMap { sentence =>
+          // Sentence of words, some of which map to a word index
+          val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
+          // break wordIndexes into trunks of maxSentenceLength when has more
+          wordIndexes.grouped(maxSentenceLength).map(_.toArray)
+        }
       }
-    }
 
-    val newSentences = sentences.repartition(numPartitions).cache()
-    val initRandom = new XORShiftRandom(seed)
+      val newSentences = sentences.repartition(numPartitions).cache()
+      val initRandom = new XORShiftRandom(seed)
 
-    if (vocabSize.toLong * vectorSize >= Int.MaxValue) {
-      throw new RuntimeException("Please increase minCount or decrease vectorSize in Word2Vec" +
-        " to avoid an OOM. You are highly recommended to make your vocabSize*vectorSize, " +
-        "which is " + vocabSize + "*" + vectorSize + " for now, less than `Int.MaxValue`.")
-    }
+      if (vocabSize.toLong * vectorSize >= Int.MaxValue) {
+        throw new RuntimeException("Please increase minCount or decrease vectorSize in Word2Vec" +
+          " to avoid an OOM. You are highly recommended to make your vocabSize*vectorSize, " +
+          "which is " + vocabSize + "*" + vectorSize + " for now, less than `Int.MaxValue`.")
+      }
 
-    val syn0Global =
-      Array.fill[Float](vocabSize * vectorSize)((initRandom.nextFloat() - 0.5f) / vectorSize)
-    val syn1Global = new Array[Float](vocabSize * vectorSize)
-    var alpha = learningRate
-
-    for (k <- 1 to numIterations) {
-      val bcSyn0Global = sc.broadcast(syn0Global)
-      val bcSyn1Global = sc.broadcast(syn1Global)
-      val partial = newSentences.mapPartitionsWithIndex { case (idx, iter) =>
-        val random = new XORShiftRandom(seed ^ ((idx + 1) << 16) ^ ((-k - 1) << 8))
-        val syn0Modify = new Array[Int](vocabSize)
-        val syn1Modify = new Array[Int](vocabSize)
-        val model = iter.foldLeft((bcSyn0Global.value, bcSyn1Global.value, 0L, 0L)) {
-          case ((syn0, syn1, lastWordCount, wordCount), sentence) =>
-            var lwc = lastWordCount
-            var wc = wordCount
-            if (wordCount - lastWordCount > 1) {
-              lwc = wordCount
-              // TODO: discount by iteration?
-              alpha =
-                learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))
-              if (alpha < learningRate * 0.0001) alpha = learningRate * 0.0001
-              logInfo("wordCount = " + wordCount + ", alpha = " + alpha)
-            }
-            wc += sentence.length
-            var pos = 0
-            while (pos < sentence.length) {
-              val word = sentence(pos)
-              val b = random.nextInt(window)
-              // Train Skip-gram
-              var a = b
-              while (a < window * 2 + 1 - b) {
-                if (a != window) {
-                  val c = pos - window + a
-                  if (c >= 0 && c < sentence.length) {
-                    val lastWord = sentence(c)
-                    val l1 = lastWord * vectorSize
-                    val neu1e = new Array[Float](vectorSize)
-                    // Hier

[GitHub] spark issue #14299: Ensure broadcasted variables are destroyed even in case ...

2016-07-31 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/14299
  
@pzz2011 Nothing _that_ bad if your application is just learning once and 
saving / using the embeddings. But if you run hundreds or thousands of 
learnings, you'll just crash because of uncontrolled memory usage... and that's 
bad!





[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-12-02 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16037
  
L-BFGS is the only optimizer I've used so far. I'm not sure how much time I 
can free up to take care of the other ones, but I'll try :-)

Regarding the benchmark, I'll check whether we have archived the results; otherwise 
I'll relaunch it next week





[GitHub] spark pull request #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-12-06 Thread AnthonyTruchet
Github user AnthonyTruchet commented on a diff in the pull request:

https://github.com/apache/spark/pull/16037#discussion_r91036283
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +241,27 @@ object LBFGS extends Logging {
       val bcW = data.context.broadcast(w)
       val localGradient = gradient
 
-      val (gradientSum, lossSum) = data.treeAggregate((Vectors.zeros(n), 0.0))(
-          seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
-            val l = localGradient.compute(
-              features, label, bcW.value, grad)
-            (grad, loss + l)
-          },
-          combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
-            axpy(1.0, grad2, grad1)
-            (grad1, loss1 + loss2)
-          })
+      // Given (current accumulated gradient, current loss) and (label, features)
+      // tuples, updates the current gradient and current loss
+      val seqOp = (c: (Vector, Double), v: (Double, Vector)) =>
+        (c, v) match {
+          case ((grad, loss), (label, features)) =>
+            val denseGrad = grad.toDense
+            val l = localGradient.compute(features, label, bcW.value, denseGrad)
+            (denseGrad, loss + l)
+        }
+
+      // Adds two (gradient, loss) tuples
--- End diff --

OK





[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-12-06 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16037
  
@sethah Agreed, that's why I suggested adding a dedicated treeAggregate 
wrapper to MLlib Utils which would take care of that without fiddling with 
sparsity in each seqOp and combOp. See the closed #16078 for the idea...
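
For the record, a minimal sketch of the wrapper idea (names and signature assumed, not the actual #16078 code):

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper: callers pass a zero *generator* instead of a zero
// value, so the (possibly large, dense) zero is created on the executors
// and never shipped with the closure; seqOp/combOp stay sparsity-agnostic.
def treeAggregateWithZeroGenerator[T, U: ClassTag](rdd: RDD[T])(
    zeroGen: () => U)(seqOp: (U, T) => U, combOp: (U, U) => U): U =
  rdd.mapPartitions { iter =>
    Iterator(iter.foldLeft(zeroGen())(seqOp))
  }.treeReduce(combOp)
```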





[GitHub] spark pull request #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-12-08 Thread AnthonyTruchet
Github user AnthonyTruchet commented on a diff in the pull request:

https://github.com/apache/spark/pull/16037#discussion_r91493798
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +241,24 @@ object LBFGS extends Logging {
       val bcW = data.context.broadcast(w)
       val localGradient = gradient
 
-      val (gradientSum, lossSum) = data.treeAggregate((Vectors.zeros(n), 0.0))(
-          seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
-            val l = localGradient.compute(
-              features, label, bcW.value, grad)
-            (grad, loss + l)
-          },
-          combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
-            axpy(1.0, grad2, grad1)
-            (grad1, loss1 + loss2)
-          })
+      val seqOp = (c: (Vector, Double), v: (Double, Vector)) =>
+        (c, v) match {
+          case ((grad, loss), (label, features)) =>
+            val denseGrad = grad.toDense
+            val l = localGradient.compute(features, label, bcW.value, denseGrad)
+            (denseGrad, loss + l)
+        }
+
+      val combOp = (c1: (Vector, Double), c2: (Vector, Double)) =>
+        (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
+          val denseGrad1 = grad1.toDense
--- End diff --

I'm sorry, I have no clue how to generate a non-empty RDD with an empty 
partition. Can you please point me at some entry points so that I can contribute 
the unit test you request?
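
For reference, one minimal way to build such an RDD in a test (a sketch, assuming the suite's `sc`): parallelizing fewer elements than requested partitions leaves some partitions empty.

```scala
import org.apache.spark.mllib.linalg.Vectors

// 2 elements spread over 4 partitions => at least 2 empty partitions.
val data = sc.parallelize(
  Seq((1.0, Vectors.dense(1.0, 0.0)), (0.0, Vectors.dense(0.0, 1.0))),
  numSlices = 4)

// Sanity check: count the elements per partition.
val sizes = data.mapPartitions(it => Iterator(it.size)).collect()
assert(sizes.count(_ == 0) >= 2)
```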





[GitHub] spark pull request #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-12-12 Thread AnthonyTruchet
Github user AnthonyTruchet commented on a diff in the pull request:

https://github.com/apache/spark/pull/16037#discussion_r91949738
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +241,24 @@ object LBFGS extends Logging {
       val bcW = data.context.broadcast(w)
       val localGradient = gradient
 
-      val (gradientSum, lossSum) = data.treeAggregate((Vectors.zeros(n), 0.0))(
-          seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
-            val l = localGradient.compute(
-              features, label, bcW.value, grad)
-            (grad, loss + l)
-          },
-          combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
-            axpy(1.0, grad2, grad1)
-            (grad1, loss1 + loss2)
-          })
+      val seqOp = (c: (Vector, Double), v: (Double, Vector)) =>
+        (c, v) match {
+          case ((grad, loss), (label, features)) =>
+            val denseGrad = grad.toDense
+            val l = localGradient.compute(features, label, bcW.value, denseGrad)
+            (denseGrad, loss + l)
+        }
+
+      val combOp = (c1: (Vector, Double), c2: (Vector, Double)) =>
+        (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
+          val denseGrad1 = grad1.toDense
--- End diff --

I'll try to work on this in the next two days. I will not be able to 
relaunch the actual benchmark we did weeks ago; our internal codebase has 
changed too much.





[GitHub] spark pull request #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-12-13 Thread AnthonyTruchet
Github user AnthonyTruchet commented on a diff in the pull request:

https://github.com/apache/spark/pull/16037#discussion_r92150485
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +241,24 @@ object LBFGS extends Logging {
       val bcW = data.context.broadcast(w)
       val localGradient = gradient
 
-      val (gradientSum, lossSum) = data.treeAggregate((Vectors.zeros(n), 0.0))(
-          seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
-            val l = localGradient.compute(
-              features, label, bcW.value, grad)
-            (grad, loss + l)
-          },
-          combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
-            axpy(1.0, grad2, grad1)
-            (grad1, loss1 + loss2)
-          })
+      val seqOp = (c: (Vector, Double), v: (Double, Vector)) =>
+        (c, v) match {
+          case ((grad, loss), (label, features)) =>
+            val denseGrad = grad.toDense
+            val l = localGradient.compute(features, label, bcW.value, denseGrad)
+            (denseGrad, loss + l)
+        }
+
+      val combOp = (c1: (Vector, Double), c2: (Vector, Double)) =>
+        (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
+          val denseGrad1 = grad1.toDense
--- End diff --

I did, thanks, and I merged it. I was just coming back to this contrib when 
I saw it.





[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-12-14 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16037
  
Thanks for keeping up with this merge request. I've learned a lot about the 
contribution process and good practice, and the next contrib will hopefully be much 
more straightforward. Thanks to the Spark committer team for this great piece of 
software :-)





[GitHub] spark pull request #16279: [SPARK-18471][MLLIB][BACKPORT-2.0] In LBFGS, avoi...

2016-12-14 Thread AnthonyTruchet
GitHub user AnthonyTruchet opened a pull request:

https://github.com/apache/spark/pull/16279

[SPARK-18471][MLLIB][BACKPORT-2.0] In LBFGS, avoid sending huge vectors of 0

## What changes were proposed in this pull request?

CostFun used to send a dense vector of zeroes as a closure in a
treeAggregate call. To avoid that, we change the aggregation operations
to convert sparse vectors into dense vectors on the fly if needed, and we
pass a sparse zero vector as the seed, which is lightweight. A sketch of the idea follows.
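
A minimal sketch of the idea (simplified; `n`, `data`, `seqOp` and `combOp` as in LBFGS.scala):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// The zero element captured by the treeAggregate call: a sparse vector of
// n zeros serializes in O(1), whereas Vectors.zeros(n) would cost O(n).
val zeroVector: Vector = Vectors.sparse(n, Seq.empty)

// seqOp/combOp densify their accumulator on the fly, so the cheap sparse
// seed never reaches axpy in sparse form.
val (gradientSum, lossSum) = data.treeAggregate((zeroVector, 0.0))(seqOp, combOp)
```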

## How was this patch tested?

Unit tests for module mllib were run locally for correctness.

As for performance, we ran a heavy optimization on our production data (50 
iterations on 128 MB weight vectors) and have seen a significant decrease both 
in runtime and in containers being killed by lack of off-heap memory.

Author: Anthony Truchet 
Author: sethah 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/criteo-forks/spark SPARK-18471-branch-2.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16279.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16279


commit 27f5796831ccb7931303a337d6125714aaa66b5a
Author: Anthony Truchet 
Date:   2016-12-13T21:30:57Z

[SPARK-18471][MLLIB][BACKPORT-2.0] In LBFGS, avoid sending huge vectors of 0

CostFun used to send a dense vector of zeroes as a closure in a
treeAggregate call. To avoid that, we replace treeAggregate by
mapPartition + treeReduce, creating a zero vector inside the mapPartition
block in-place.

Unit test for module mllib run locally for correctness.

As for performance, we ran a heavy optimization on our production data (50 
iterations on 128 MB weight vectors) and have seen a significant decrease both 
in runtime and in containers being killed by lack of off-heap memory.

Author: Anthony Truchet 
Author: sethah 
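
For comparison, a rough sketch of the mapPartitions + treeReduce shape this earlier commit describes (assumed shape, not the final patch; `data`, `n`, `localGradient`, `bcW` and `axpy` as in LBFGS.scala):

```scala
val (gradientSum, lossSum) = data
  .mapPartitions { iter =>
    // The dense zero vector is created on the executor, inside the task,
    // so it is never serialized with the closure.
    val grad = Vectors.zeros(n)
    var loss = 0.0
    iter.foreach { case (label, features) =>
      loss += localGradient.compute(features, label, bcW.value, grad)
    }
    Iterator((grad, loss))
  }
  .treeReduce { (c1, c2) =>
    axpy(1.0, c2._1, c1._1)
    (c1._1, c1._2 + c2._2)
  }
```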







[GitHub] spark issue #16279: [SPARK-18471][MLLIB][BACKPORT-2.0] In LBFGS, avoid sendi...

2016-12-14 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16279
  
@srowen sorry to bother you, but this is just the backport (without any 
conflict) of #16037. Could you please validate it?
  





[GitHub] spark issue #16279: [SPARK-18471][MLLIB][BACKPORT-2.0] In LBFGS, avoid sendi...

2016-12-14 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16279
  
May I ask why? There were no conflicts, so no additional qualification work 
is required, and this looks like a performance bug fix to me, not a new feature. 
In order to adjust our contribution policy (fixing our internal version vs. 
pushing upstream) we would need to better understand the backporting policy: 
would you have a more detailed pointer to it?





[GitHub] spark issue #16279: [SPARK-18471][MLLIB][BACKPORT-2.0] In LBFGS, avoid sendi...

2016-12-15 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16279
  
OK, thanks for the pointer. I do agree this is a non-critical judgement 
call.





[GitHub] spark pull request #17076: [SPARK-19745][ML] SVCAggregator captures coeffici...

2017-02-27 Thread AnthonyTruchet
Github user AnthonyTruchet commented on a diff in the pull request:

https://github.com/apache/spark/pull/17076#discussion_r103249915
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -440,19 +440,9 @@ private class LinearSVCAggregator(
 
   private val numFeatures: Int = bcFeaturesStd.value.length
   private val numFeaturesPlusIntercept: Int = if (fitIntercept) numFeatures + 1 else numFeatures
-  private val coefficients: Vector = bcCoefficients.value
   private var weightSum: Double = 0.0
   private var lossSum: Double = 0.0
-  require(numFeaturesPlusIntercept == coefficients.size, s"Dimension mismatch. Coefficients " +
-    s"length ${coefficients.size}, FeaturesStd length ${numFeatures}, fitIntercept: $fitIntercept")
-
-  private val coefficientsArray = coefficients match {
-    case dv: DenseVector => dv.values
-    case _ =>
-      throw new IllegalArgumentException(
-        s"coefficients only supports dense vector but got type ${coefficients.getClass}.")
-  }
-  private val gradientSumArray = Array.fill[Double](coefficientsArray.length)(0)
+  private lazy val gradientSumArray = new Array[Double](numFeaturesPlusIntercept)
--- End diff --

I'm not very confident in my understanding of the @transient annotation when 
combined with lazy, but I would expect it here. Can someone more versed confirm or 
deny?

http://fdahms.com/2015/10/14/scala-and-the-transient-lazy-val-pattern/

http://stackoverflow.com/questions/34769220/difference-when-serializing-a-lazy-val-with-or-without-transient (unanswered)
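
In short, the pattern from those links looks like this (a hypothetical class, not the actual aggregator):

```scala
// The @transient lazy val pattern: the field is excluded from the
// serialized form and lazily recomputed on each deserialized copy, so a
// large scratch buffer never travels with the closure.
class Acc(numFeatures: Int) extends Serializable {
  @transient private lazy val gradientSumArray: Array[Double] =
    new Array[Double](numFeatures)

  def add(i: Int, v: Double): Unit = gradientSumArray(i) += v
}
```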





[GitHub] spark pull request #17076: [SPARK-19745][ML] SVCAggregator captures coeffici...

2017-02-27 Thread AnthonyTruchet
Github user AnthonyTruchet commented on a diff in the pull request:

https://github.com/apache/spark/pull/17076#discussion_r103254026
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -440,19 +440,9 @@ private class LinearSVCAggregator(
 
   private val numFeatures: Int = bcFeaturesStd.value.length
   private val numFeaturesPlusIntercept: Int = if (fitIntercept) numFeatures + 1 else numFeatures
-  private val coefficients: Vector = bcCoefficients.value
   private var weightSum: Double = 0.0
   private var lossSum: Double = 0.0
-  require(numFeaturesPlusIntercept == coefficients.size, s"Dimension mismatch. Coefficients " +
-    s"length ${coefficients.size}, FeaturesStd length ${numFeatures}, fitIntercept: $fitIntercept")
-
-  private val coefficientsArray = coefficients match {
--- End diff --

Race condition with my review below...
Thanks for anticipating it :-)





[GitHub] spark issue #14299: Ensure broadcasted variables are destroyed even in case ...

2017-02-27 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/14299
  
@thunterdb Copy that, working on it, and sorry for the delayed acknowledgement.





[GitHub] spark issue #14299: [SPARK-16440][MLlib] Ensure broadcasted variables are de...

2017-02-28 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/14299
  
@vanzin The ticket was filed a long time ago; I updated the PR to make it 
clearer. Is any manual linking in some other forge needed?

@thunterdb I have updated the review following your suggestion, without 
using ARM: ready for review.

I'll check with the team working on W2V how we could contribute to make 
this piece of code "follow better engineering practice" :-)





[GitHub] spark pull request #14299: [SPARK-16440][MLlib] Ensure broadcasted variables...

2017-02-28 Thread AnthonyTruchet
Github user AnthonyTruchet commented on a diff in the pull request:

https://github.com/apache/spark/pull/14299#discussion_r103421239
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -314,6 +315,20 @@ class Word2Vec extends Serializable with Logging {
     val expTable = sc.broadcast(createExpTable())
     val bcVocab = sc.broadcast(vocab)
     val bcVocabHash = sc.broadcast(vocabHash)
+      do_fit(dataset, sc, expTable, bcVocab, bcVocabHash)
--- End diff --

my bad, local rebase issue.





[GitHub] spark issue #16805: [SPARK-19353][CORE] Generalize PipedRDD to use I/O forma...

2017-02-28 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16805
  
@thunterdb @srowen could you please point us to the right reviewers / 
committers regarding this proposed improvement to PipedRDD, supporting 
binary streams?





[GitHub] spark pull request #14299: [SPARK-16440][MLlib] Ensure broadcasted variables...

2017-03-01 Thread AnthonyTruchet
Github user AnthonyTruchet commented on a diff in the pull request:

https://github.com/apache/spark/pull/14299#discussion_r103656554
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -314,6 +315,21 @@ class Word2Vec extends Serializable with Logging {
     val expTable = sc.broadcast(createExpTable())
     val bcVocab = sc.broadcast(vocab)
     val bcVocabHash = sc.broadcast(vocabHash)
+    try {
+      doFit(dataset, sc, expTable, bcVocab, bcVocabHash)
+    }
+    finally
--- End diff --

I already spotted and corrected a few instances of forgotten broadcast 
variables. I've already reported and patched all I was aware of.

I was suggesting introducing an ARM pattern and making broadcast variables 
ARM-aware (see comment above). This is in no way a general solution, but it 
might help reduce the issue when the use of a broadcast variable is 
compatible with a syntactic scope. @thunterdb legitimately raised concerns 
about introducing new dependencies (to scala-arm specifically). What is your 
insight on this?
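
To make the suggestion concrete, a minimal sketch of such an ARM/loan-pattern helper (hypothetical `withBroadcast`, not part of Spark):

```scala
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Hypothetical loan-pattern helper: broadcasts a value, runs the body, and
// guarantees the broadcast is destroyed even if the body throws.
def withBroadcast[T: ClassTag, R](sc: SparkContext, value: T)(body: Broadcast[T] => R): R = {
  val bc = sc.broadcast(value)
  try body(bc)
  finally bc.destroy()
}

// Usage: the broadcast cannot outlive this syntactic scope, e.g.
// withBroadcast(sc, vocabHash) { bcVocabHash => /* fit using bcVocabHash */ }
```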





[GitHub] spark pull request #14299: [SPARK-16440][MLlib] Ensure broadcasted variables...

2017-03-01 Thread AnthonyTruchet
Github user AnthonyTruchet commented on a diff in the pull request:

https://github.com/apache/spark/pull/14299#discussion_r103657660
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -314,6 +315,21 @@ class Word2Vec extends Serializable with Logging {
 val expTable = sc.broadcast(createExpTable())
 val bcVocab = sc.broadcast(vocab)
 val bcVocabHash = sc.broadcast(vocabHash)
+try {
+  doFit(dataset, sc, expTable, bcVocab, bcVocabHash)
+}
+finally
--- End diff --

Formatting and scala style check fixed.  





[GitHub] spark issue #14299: [SPARK-16440][MLlib] Ensure broadcasted variables are de...

2017-03-02 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/14299
  
I'm fixing this (issues compiling Spark locally delayed me). The whole point 
is that they are *not* destroyed within the method in case of exception.





[GitHub] spark issue #14299: [SPARK-16440][MLlib] Ensure broadcasted variables are de...

2017-03-02 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/14299
  
Actually we *did* observe a memory leak (severe enough to lead to application 
failure) due to this when doing thousands of Word2Vec learnings in one Spark 
application, a few of them failing.





[GitHub] spark issue #14299: [SPARK-16440][MLlib] Ensure broadcasted variables are de...

2017-03-02 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/14299
  
The blocking flag is not the issue.
The issue is that there was no try / finally, so the destruction was only 
done if the fitting succeeded and was skipped otherwise.

In this PR I move the destruction into a try / finally construct, all the 
while extracting the actual fitting into a dedicated method to keep nesting 
under control.

Whether the destruction should be blocking or not is pretty debatable and I 
do not have a strong opinion on the question; I chose to make it blocking 
"just to be sure" it is actually performed. If you'd rather keep the current 
non-blocking, best-effort behaviour, I'll make the change.





[GitHub] spark pull request #17076: [SPARK-19745][ML] SVCAggregator captures coeffici...

2017-03-03 Thread AnthonyTruchet
Github user AnthonyTruchet commented on a diff in the pull request:

https://github.com/apache/spark/pull/17076#discussion_r104115651
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -440,19 +440,9 @@ private class LinearSVCAggregator(
 
   private val numFeatures: Int = bcFeaturesStd.value.length
   private val numFeaturesPlusIntercept: Int = if (fitIntercept) 
numFeatures + 1 else numFeatures
-  private val coefficients: Vector = bcCoefficients.value
   private var weightSum: Double = 0.0
   private var lossSum: Double = 0.0
-  require(numFeaturesPlusIntercept == coefficients.size, s"Dimension 
mismatch. Coefficients " +
-s"length ${coefficients.size}, FeaturesStd length ${numFeatures}, 
fitIntercept: $fitIntercept")
-
-  private val coefficientsArray = coefficients match {
-case dv: DenseVector => dv.values
-case _ =>
-  throw new IllegalArgumentException(
-s"coefficients only supports dense vector but got type 
${coefficients.getClass}.")
-  }
-  private val gradientSumArray = 
Array.fill[Double](coefficientsArray.length)(0)
+  private lazy val gradientSumArray = new 
Array[Double](numFeaturesPlusIntercept)
--- End diff --

Thanks for taking the time to explain :-)





[GitHub] spark issue #14299: [SPARK-16440][MLlib] Ensure broadcasted variables are de...

2017-03-06 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/14299
  
For the "normal failure case", the one we encounter is that the embedding 
learning throws when the vocabulary is too small. And it is pretty 
legitimate in our use cases to have some "too small" vocabularies.





[GitHub] spark issue #14299: [SPARK-16440][MLlib] Ensure broadcasted variables are de...

2017-03-06 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/14299
  
OK, thanks. Do you see any missing task for it to be merged, then?





[GitHub] spark issue #16279: [SPARK-18471][MLLIB][BACKPORT-2.0] In LBFGS, avoid sendi...

2017-01-10 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16279
  
We have backported it to our internal version of Spark anyhow. As said 
above, feel free to close it if you consider this not worth officially 
backporting :-)

