[jira] [Commented] (SPARK-1655) In naive Bayes, store conditional probabilities distributively.

2015-08-12 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694571#comment-14694571
 ] 

Aaron Staple commented on SPARK-1655:
-

Totally fine with me, but I'm not sure who the main decision maker on this is.

 In naive Bayes, store conditional probabilities distributively.
 ---

 Key: SPARK-1655
 URL: https://issues.apache.org/jira/browse/SPARK-1655
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 In the current implementation, we collect all conditional probabilities to 
 the driver node. When there are many labels and many features, this puts 
 heavy load on the driver. For scalability, we should provide a way to store 
 conditional probabilities distributively.
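
A minimal sketch of the idea (illustrative only, not the MLlib implementation; it
assumes the per-label statistics have already been aggregated into an RDD keyed by
label, and the names and signature here are hypothetical):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Hypothetical sketch: keep the per-label conditional log-probabilities as a
    // distributed RDD keyed by label instead of collecting them to the driver.
    def distributedTheta(
        aggregated: RDD[(Double, (Long, Array[Double]))], // label -> (doc count, per-feature count sums)
        lambda: Double): RDD[(Double, Vector)] = {
      aggregated.map { case (label, (_, featureSums)) =>
        val logDenom = math.log(featureSums.sum + featureSums.length * lambda)
        // log P(feature j | label), with additive smoothing lambda
        (label, Vectors.dense(featureSums.map(c => math.log(c + lambda) - logDenom)))
      }
    }

Prediction could then operate against this keyed RDD (for example via joins or
per-partition scoring) rather than materializing the full label-by-feature matrix
on the driver.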






[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method

2015-06-23 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598709#comment-14598709
 ] 

Aaron Staple commented on SPARK-1503:
-

I believe this stopping criterion was added after the paper was written. It is 
documented on page 8 of the user guide 
(https://github.com/cvxr/TFOCS/raw/master/userguide.pdf), but unfortunately no 
explanation is provided. (The user guide also documents this as a <= test, while 
the current code uses <.) I also couldn’t find an explanation in the code or git 
history.

I think the switch to absolute tolerance may be because a relative difference 
measurement could be less useful when the weights are extremely small, and 1 is 
a convenient cutoff point. (Using 1, the equation is simple and the 
interpretation is clear.) I believe [~mengxr] alluded to switching to an 
absolute tolerance at 1 already 
(https://github.com/apache/spark/pull/3636#discussion_r22078041) so he might be 
able to provide more information.

With regard to using the new weight norms as the basis for measuring relative 
weight difference, I think that if the convergence test passes using either the 
old or new weight norms, then the old and new norms are going to be very 
similar. It may not make a significant difference which test is used. (It may 
also be worth pointing out that in cases where the tolerance tests with respect 
to different old/new weights return different results, if the tolerance wrt new 
weights is met (and wrt old weights is not) then the weight norm increased 
slightly; if the tolerance wrt the old weights is met (and wrt new weights not) 
then the weight norm decreased slightly.)

Finally, TFOCS adopts a policy of skipping the convergence test after the first 
iteration if the weights are unchanged. I believe this condition is based on 
implementation specific behavior and does not need to be adopted generally.
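
For reference, the two relative-difference variants discussed above can be written
out roughly as follows (illustrative helpers only, not code from the Spark patch):

    // The same change in the weights, normalized by either the old or the new norm.
    def norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

    def diffNorm(wOld: Array[Double], wNew: Array[Double]): Double =
      norm(wOld.zip(wNew).map { case (o, n) => n - o })

    // Relative tolerance measured against the old weights...
    def convergedWrtOld(wOld: Array[Double], wNew: Array[Double], tol: Double): Boolean =
      diffNorm(wOld, wNew) < tol * norm(wOld)

    // ...or against the new weights. If either test passes with a small tol, the
    // triangle inequality bounds the two norms within a factor of about (1 + tol)
    // of each other, so the two tests rarely disagree.
    def convergedWrtNew(wOld: Array[Double], wNew: Array[Double], tol: Double): Boolean =
      diffNorm(wOld, wNew) < tol * norm(wNew)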


 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple
 Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png


 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.






[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method

2015-06-23 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598709#comment-14598709
 ] 

Aaron Staple edited comment on SPARK-1503 at 6/24/15 1:46 AM:
--

I believe this stopping criterion was added after the paper was written. It is 
documented on page 8 of the user guide 
(https://github.com/cvxr/TFOCS/raw/master/userguide.pdf), but unfortunately no 
explanation is provided. (The user guide also documents this as a <= test, while 
the current code uses <.) I also couldn’t find an explanation in the code or git 
history.

I think the switch to absolute tolerance may be because a relative difference 
measurement could be less useful when the weights are extremely small, and 1 is 
a convenient cutoff point. (Using 1, the equation is simple and the 
interpretation is clear.) I believe [~mengxr] alluded to switching to an 
absolute tolerance at 1 already 
(https://github.com/apache/spark/pull/3636#discussion_r22078041) so he might be 
able to provide more information.

With regard to using the new weight norms as the basis for measuring relative 
weight difference, I think that if the convergence test passes using either the 
old or new weight norms, then the old and new norms are going to be very 
similar. It may not make a significant difference which test is used. (It may 
also be worth pointing out that in cases where the tolerance tests with respect 
to different old/new weights return different results, if the tolerance wrt new 
weights is met (and wrt old weights is not) then the weight norm increased 
slightly; if the tolerance wrt the old weights is met (and wrt new weights not) 
then the weight norm decreased slightly.)

Finally, TFOCS adopts a policy of skipping the convergence test on the first 
iteration if the weights are unchanged. I believe this condition is based on 
implementation specific behavior and does not need to be adopted generally.



was (Author: staple):
I believe this stopping criterion was added after the paper was written. It is 
documented on page 8 of the user guide 
(https://github.com/cvxr/TFOCS/raw/master/userguide.pdf), but unfortunately no 
explanation is provided. (The user guide also documents this as a <= test, while 
the current code uses <.) I also couldn’t find an explanation in the code or git 
history.

I think the switch to absolute tolerance may be because a relative difference 
measurement could be less useful when the weights are extremely small, and 1 is 
a convenient cutoff point. (Using 1, the equation is simple and the 
interpretation is clear.) I believe [~mengxr] alluded to switching to an 
absolute tolerance at 1 already 
(https://github.com/apache/spark/pull/3636#discussion_r22078041) so he might be 
able to provide more information.

With regard to using the new weight norms as the basis for measuring relative 
weight difference, I think that if the convergence test passes using either the 
old or new weight norms, then the old and new norms are going to be very 
similar. It may not make a significant difference which test is used. (It may 
also be worth pointing out that in cases where the tolerance tests with respect 
to different old/new weights return different results, if the tolerance wrt new 
weights is met (and wrt old weights is not) then the weight norm increased 
slightly; if the tolerance wrt the old weights is met (and wrt new weights not) 
then the weight norm decreased slightly.)

Finally, TFOCS adopts a policy of skipping the convergence test after the first 
iteration if the weights are unchanged. I believe this condition is based on 
implementation specific behavior and does not need to be adopted generally.


 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple
 Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png


 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.






[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method

2015-06-22 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596991#comment-14596991
 ] 

Aaron Staple commented on SPARK-1503:
-

[~josephkb] and [~lewuathe] Sure, happy to coordinate. So far I have just been 
duplicating the convergence tolerance check in TFOCS, the MATLAB package on 
which the accelerated gradient descent implementation is based. TFOCS also 
tests for convergence by checking if the relative change in the weight vector 
is below a specified threshold. But there are some differences from the 
SPARK-3382 implementation. For example, in TFOCS the relative difference between 
new and old weight vectors is measured with respect to the new weight vector 
instead of the old. And if the new weight vector is smaller than the unit 
vector, the convergence test is changed to an absolute rather than relative 
difference between successive weight vectors. I am just describing the 
implementation here, happy to discuss further and potentially look at making 
changes.

Here is the relevant code if you are interested (there is also a separate 
condition when the weight vector does not change between iterations):
https://github.com/cvxr/TFOCS/blob/e34c0daeb136935d23b8df506de8b7b191f6b0a3/private/tfocs_iterate.m#L19-L24
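
For readers who don't want to dig through the MATLAB, the check described above
can be sketched in Scala roughly as follows (a sketch only, not a line-for-line
port of tfocs_iterate.m):

    // Relative change measured against the new weights, falling back to an
    // absolute difference once the new weight norm drops below 1.
    def norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

    def converged(wOld: Array[Double], wNew: Array[Double], tol: Double): Boolean = {
      val change = norm(wOld.zip(wNew).map { case (o, n) => n - o })
      val newNorm = norm(wNew)
      if (newNorm < 1.0) change < tol      // absolute test for small weights
      else change < tol * newNorm          // relative test otherwise
    }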

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple
 Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png


 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.






[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method

2015-03-06 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14350812#comment-14350812
 ] 

Aaron Staple edited comment on SPARK-1503 at 3/6/15 11:25 PM:
--

I recently created a PR for an implementation of accelerated gradient descent 
without backtracking, as discussed above.

I also ran some simple, small scale (single server) benchmarks to compare 
different optimization methods. (The benchmark result graphs are provided as 
images attached to this ticket.) The optimization methods tested were:

- gra: existing gradient descent implementation (using full batch)
- acc: accelerated descent (as implemented in the PR), but without automatic 
restart
- acc_r: accelerated descent, with automatic restart
- acc_b: accelerated descent, with backtracking, without automatic restart
- acc_rb: accelerated descent, with backtracking, with automatic restart
- lbfgs: existing lbfgs implementation

(Note that backtracking is not part of the PR, and was just tested as a point 
of comparison.)

The x axis shows the number of outer loop iterations of the optimization 
algorithm. (Note that for backtracking implementations, the full cost of 
backtracking is not represented in this outer loop count. For non-backtracking 
implementations, the number of outer loop iterations is the same as the number 
of Spark map reduce jobs.) The y axis is the log of the difference from the 
best optimized value found.

The optimization test runs were:

- linear: A scaled-up version of the test data from TFOCS’s test_LASSO.m 
example, with 1 observations on 1024 features. 512 of the features are 
actually correlated with the result. Unregularized linear regression was used. 
(The Scala acceleration implementation was observed to be consistent with the 
TFOCS implementation on this dataset.)
- linear l1: The same as ‘linear’, but with l1 regularization
- logistic: Each feature of each observation is generated by summing a feature 
gaussian specific to the observation’s binary category with a noise gaussian. 
1 observations on 250 features. Unregularized logistic regression was used.
- logistic l2: Same as ‘logistic’, but using l2 regularization

Note that for each test run, all optimization methods were given the same 
initial step size.

Observations:
- Acceleration consistently converges more quickly than standard gradient 
descent, given the same initial step size.
- Automatic restart is helpful for acceleration convergence
- Backtracking can significantly boost convergence rates in some cases 
(measured in terms of outer loop iterations). But the full cost of backtracking 
was not measured in these runs.
- lbfgs generally outperformed accelerated gradient descent in these test runs. 
Accelerated gradient descent was competitive with lbfgs for linear l1 (lasso) 
regression. However, as noted in the documentation, the L1Updater “will not 
work” for the lbfgs implementation. It seems l1 regularization is currently a 
weak spot for the lbfgs implementation.


was (Author: staple):
I recently created a PR for an implementation of accelerated gradient descent 
without backtracking, as discussed above.

I also ran some simple, small scale (single server) benchmarks to compare 
different optimization methods. (The benchmark result graphs are provided as 
images attached to this ticket.) The optimization methods tested were:

- gra: existing gradient descent implementation (using full batch)
- acc: accelerated descent (as implemented in the PR), but without automatic 
restart
- acc_r: accelerated descent, with automatic restart
- acc_b: accelerated descent, with backtracking, without automatic restart
- acc_rb: accelerated descent, with backtracking, with automatic restart
- lbfgs: existing lbfgs implementation

(Note that backtracking is not part of the PR, and was just tested as a point 
of comparison.)

The x axis shows the number of outer loop iterations of the optimization 
algorithm. (Note that for backtracking implementations, the full cost of 
backtracking is not represented in this outer loop count. For non-backtracking 
implementations, the number of outer loop iterations is the same as the number 
of Spark map reduce jobs.) The y axis is the log of the difference from the 
best optimized value found.

The optimization test runs were:

- linear: A scaled-up version of the test data from TFOCS’s test_LASSO.m 
example, with 1 observations on 1024 features. 512 of the features are 
actually correlated with the result. Unregularized linear regression was used. 
(The Scala acceleration implementation was observed to be consistent with the 
TFOCS implementation on this dataset.)
- linear l1: The same as ‘linear’, but with l1 regularization
- logistic: Each feature of each observation is generated by summing a feature 
gaussian specific to the observation’s binary category with a noise gaussian. 

[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method

2015-03-06 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14350812#comment-14350812
 ] 

Aaron Staple commented on SPARK-1503:
-

I recently created a PR for an implementation of accelerated gradient descent 
without backtracking, as discussed above.

I also ran some simple, small scale (single server) benchmarks to compare 
different optimization methods. (The benchmark result graphs are provided as 
images attached to this ticket.) The optimization methods tested were:

- gra: existing gradient descent implementation (using full batch)
- acc: accelerated descent (as implemented in the PR), but without automatic 
restart
- acc_r: accelerated descent, with automatic restart
- acc_b: accelerated descent, with backtracking, without automatic restart
- acc_rb: accelerated descent, with backtracking, with automatic restart
- lbfgs: existing lbfgs implementation

(Note that backtracking is not part of the PR, and was just tested as a point 
of comparison.)

The x axis shows the number of outer loop iterations of the optimization 
algorithm. (Note that for backtracking implementations, the full cost of 
backtracking is not represented in this outer loop count. For non-backtracking 
implementations, the number of outer loop iterations is the same as the number 
of Spark map reduce jobs.) The y axis is the log of the difference from the 
best optimized value found.

The optimization test runs were:

- linear: A scaled-up version of the test data from TFOCS’s test_LASSO.m 
example, with 1 observations on 1024 features. 512 of the features are 
actually correlated with the result. Unregularized linear regression was used. 
(The Scala acceleration implementation was observed to be consistent with the 
TFOCS implementation on this dataset.)
- linear l1: The same as ‘linear’, but with l1 regularization
- logistic: Each feature of each observation is generated by summing a feature 
gaussian specific to the observation’s binary category with a noise gaussian. 
1 observations on 250 features. Unregularized logistic regression was used.
- logistic l2: Same as ‘logistic’, but using l2 regularization

Note that for each test run, all optimization methods were given the same 
initial step size.

Observations:
- Acceleration consistently converges more quickly than standard gradient 
descent, given the same initial step size.
- Automatic restart is helpful for acceleration convergence
- Backtracking can significantly boost convergence rates in some cases 
(measured in terms of outer loop iterations). But the full cost of backtracking 
was not measured in these runs.
- lbfgs generally outperformed accelerated gradient descent in these test runs. 
Accelerated gradient descent was competitive with lbfgs for linear l1 (lasso) 
regression. However, as noted in the documentation, the L1Updater “will not 
work” for the lbfgs implementation. It seems l1 regularization is currently a 
weak spot for the lbfgs implementation.

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.






[jira] [Updated] (SPARK-1503) Implement Nesterov's accelerated first-order method

2015-03-06 Thread Aaron Staple (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Staple updated SPARK-1503:

Attachment: logistic_l2.png
logistic.png
linear_l1.png
linear.png

Optimization benchmarks, uploaded 2015-03-06.

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple
 Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png


 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.






[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-26 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226958#comment-14226958
 ] 

Aaron Staple commented on SPARK-1503:
-

[~rezazadeh] Thanks for your feedback.

Your point about the communication cost of backtracking is well taken. Just to 
explain where I was coming from with the design proposal: As I was looking to 
come up to speed on accelerated gradient descent, I came across some scattered 
comments online suggesting that acceleration was difficult to implement well, 
was finicky, etc - especially when compared with standard gradient descent for 
machine learning. So, wary of these sorts of comments, I wrote up the proposal 
with the intention of duplicating TFOCS as closely as possible to start with, 
with the possibility of making changes from there based on performance. In 
addition, I’d interpreted an earlier comment in this ticket as suggesting that 
line search be implemented in the same manner as in TFOCS.

I am happy to implement a constant step size first, though. It may also be 
informative to run some performance tests in spark both with and without 
backtracking. One (basic, not conclusive) data point I have now is that, if I 
run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop 
and 106 iterations of the inner backtracking loop. For this one example, the 
backtracking iteration overhead is only about 10%. Though keep in mind that in 
spark if we removed backtracking entirely it would mean only one distributed 
aggregation per iteration rather than two - so a huge improvement in 
communication cost assuming there is still a good convergence rate.
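
To make the communication pattern concrete, here is a schematic sketch (the
gradient/loss helpers are hypothetical stand-ins, each representing one distributed
aggregation such as a treeAggregate over the training data; this is not the
proposed implementation):

    import org.apache.spark.rdd.RDD

    // Schematic only: each helper passed in stands for one distributed aggregation
    // (e.g. a treeAggregate) over the training data.
    def step(
        data: RDD[Array[Double]],
        w: Array[Double],
        initialStep: Double,
        backtrack: Boolean,
        gradientAndLoss: (RDD[Array[Double]], Array[Double]) => (Array[Double], Double),
        loss: (RDD[Array[Double]], Array[Double]) => Double): Array[Double] = {
      val (g, f0) = gradientAndLoss(data, w)          // aggregation #1: always needed
      def candidate(alpha: Double): Array[Double] =
        w.zip(g).map { case (wi, gi) => wi - alpha * gi }
      var alpha = initialStep
      var wNew = candidate(alpha)
      if (backtrack) {
        var trials = 0
        // aggregation #2 (and possibly more): objective at each candidate point
        while (loss(data, wNew) > f0 && trials < 20) {
          alpha /= 2.0
          wNew = candidate(alpha)
          trials += 1
        }
      }
      wNew // without backtracking, each outer iteration costs one aggregation
    }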

Incidentally, are there any specific learning benchmarks for spark that you 
would recommend?

I’ll do a bit of research to identify the best ways to manage the lipschitz 
estimate / step size in the absence of backtracking. I’ve also noticed some 
references online to distributed implementations of accelerated methods. It may 
be informative to learn more about them - if you’ve heard of any particularly 
good distributed optimizers using acceleration, please let me know.

Thanks,
Aaron

PS Yes, I’ll make sure to follow the lbfgs example so the accelerated 
implementation can be easily substituted.

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.






[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-26 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226958#comment-14226958
 ] 

Aaron Staple edited comment on SPARK-1503 at 11/26/14 11:09 PM:


[~rezazadeh] Thanks for your feedback.

Your point about the communication cost of backtracking is well taken. Just to 
explain where I was coming from with the design proposal: As I was looking to 
come up to speed on accelerated gradient descent, I came across some scattered 
comments online suggesting that acceleration was difficult to implement well, 
was finicky, etc - especially when compared with standard gradient descent for 
machine learning. So, wary of these sorts of comments, I wrote up the proposal 
with the intention of duplicating TFOCS as closely as possible to start with, 
with the possibility of making changes from there based on performance. In 
addition, I’d interpreted an earlier comment in this ticket as suggesting that 
line search be implemented in the same manner as in TFOCS.

I am happy to implement a constant step size first, though. It may also be 
informative to run some performance tests in spark both with and without 
backtracking. One (basic, not conclusive) data point I have now is that, if I 
run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop 
and 106 iterations of the inner backtracking loop. For this one example, the 
backtracking iteration overhead is only about 10%. Though keep in mind that in 
spark if we removed backtracking entirely it would mean only one distributed 
aggregation per iteration rather than two - so a huge improvement in 
communication cost assuming there is still a good convergence rate.

Incidentally, are there any specific learning benchmarks for spark that you 
would recommend?

I’ll do a bit of research to identify the best ways to manage the lipschitz 
estimate / step size in the absence of backtracking (for our objective 
functions in particular). I’ve also noticed some references online to 
distributed implementations of accelerated methods. It may be informative to 
learn more about them - if you’ve heard of any particularly good distributed 
optimizers using acceleration, please let me know.

Thanks,
Aaron

PS Yes, I’ll make sure to follow the lbfgs example so the accelerated 
implementation can be easily substituted.


was (Author: staple):
[~rezazadeh] Thanks for your feedback.

Your point about the communication cost of backtracking is well taken. Just to 
explain where I was coming from with the design proposal: As I was looking to 
come up to speed on accelerated gradient descent, I came across some scattered 
comments online suggesting that acceleration was difficult to implement well, 
was finicky, etc - especially when compared with standard gradient descent for 
machine learning. So, wary of these sorts of comments, I wrote up the proposal 
with the intention of duplicating TFOCS as closely as possible to start with, 
with the possibility of making changes from there based on performance. In 
addition, I’d interpreted an earlier comment in this ticket as suggesting that 
line search be implemented in the same manner as in TFOCS.

I am happy to implement a constant step size first, though. It may also be 
informative to run some performance tests in spark both with and without 
backtracking. One (basic, not conclusive) data point I have now is that, if I 
run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop 
and 106 iterations of the inner backtracking loop. For this one example, the 
backtracking iteration overhead is only about 10%. Though keep in mind that in 
spark if we removed backtracking entirely it would mean only one distributed 
aggregation per iteration rather than two - so a huge improvement in 
communication cost assuming there is still a good convergence rate.

Incidentally, are there any specific learning benchmarks for spark that you 
would recommend?

I’ll do a bit of research to identify the best ways to manage the lipschitz 
estimate / step size in the absence of backtracking. I’ve also noticed some 
references online to distributed implementations of accelerated methods. It may 
be informative to learn more about them - if you’ve heard of any particularly 
good distributed optimizers using acceleration, please let me know.

Thanks,
Aaron

PS Yes, I’ll make sure to follow the lbfgs example so the accelerated 
implementation can be easily substituted.

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 Nesterov's accelerated first-order method is a 

[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-26 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226958#comment-14226958
 ] 

Aaron Staple edited comment on SPARK-1503 at 11/26/14 11:10 PM:


[~rezazadeh] Thanks for your feedback.

Your point about the communication cost of backtracking is well taken. Just to 
explain where I was coming from with the design proposal: As I was looking to 
come up to speed on accelerated gradient descent, I came across some scattered 
comments online suggesting that acceleration was difficult to implement well, 
was finicky, etc - especially when compared with standard gradient descent for 
machine learning applications. So, wary of these sorts of comments, I wrote up 
the proposal with the intention of duplicating TFOCS as closely as possible to 
start with, with the possibility of making changes from there based on 
performance. In addition, I’d interpreted an earlier comment in this ticket as 
suggesting that line search be implemented in the same manner as in TFOCS.

I am happy to implement a constant step size first, though. It may also be 
informative to run some performance tests in spark both with and without 
backtracking. One (basic, not conclusive) data point I have now is that, if I 
run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop 
and 106 iterations of the inner backtracking loop. For this one example, the 
backtracking iteration overhead is only about 10%. Though keep in mind that in 
spark if we removed backtracking entirely it would mean only one distributed 
aggregation per iteration rather than two - so a huge improvement in 
communication cost assuming there is still a good convergence rate.

Incidentally, are there any specific learning benchmarks for spark that you 
would recommend?

I’ll do a bit of research to identify the best ways to manage the lipschitz 
estimate / step size in the absence of backtracking (for our objective 
functions in particular). I’ve also noticed some references online to 
distributed implementations of accelerated methods. It may be informative to 
learn more about them - if you’ve heard of any particularly good distributed 
optimizers using acceleration, please let me know.

Thanks,
Aaron

PS Yes, I’ll make sure to follow the lbfgs example so the accelerated 
implementation can be easily substituted.


was (Author: staple):
[~rezazadeh] Thanks for your feedback.

Your point about the communication cost of backtracking is well taken. Just to 
explain where I was coming from with the design proposal: As I was looking to 
come up to speed on accelerated gradient descent, I came across some scattered 
comments online suggesting that acceleration was difficult to implement well, 
was finicky, etc - especially when compared with standard gradient descent for 
machine learning. So, wary of these sorts of comments, I wrote up the proposal 
with the intention of duplicating TFOCS as closely as possible to start with, 
with the possibility of making changes from there based on performance. In 
addition, I’d interpreted an earlier comment in this ticket as suggesting that 
line search be implemented in the same manner as in TFOCS.

I am happy to implement a constant step size first, though. It may also be 
informative to run some performance tests in spark both with and without 
backtracking. One (basic, not conclusive) data point I have now is that, if I 
run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop 
and 106 iterations of the inner backtracking loop. For this one example, the 
backtracking iteration overhead is only about 10%. Though keep in mind that in 
spark if we removed backtracking entirely it would mean only one distributed 
aggregation per iteration rather than two - so a huge improvement in 
communication cost assuming there is still a good convergence rate.

Incidentally, are there any specific learning benchmarks for spark that you 
would recommend?

I’ll do a bit of research to identify the best ways to manage the lipschitz 
estimate / step size in the absence of backtracking (for our objective 
functions in particular). I’ve also noticed some references online to 
distributed implementations of accelerated methods. It may be informative to 
learn more about them - if you’ve heard of any particularly good distributed 
optimizers using acceleration, please let me know.

Thanks,
Aaron

PS Yes, I’ll make sure to follow the lbfgs example so the accelerated 
implementation can be easily substituted.

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron 

[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-26 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226958#comment-14226958
 ] 

Aaron Staple edited comment on SPARK-1503 at 11/26/14 11:14 PM:


[~rezazadeh] Thanks for your feedback.

Your point about the communication cost of backtracking is well taken. Just to 
explain where I was coming from with the design proposal: As I was looking to 
come up to speed on accelerated gradient descent, I came across some scattered 
comments online suggesting that acceleration was difficult to implement well, 
was finicky, etc - especially when compared with standard gradient descent for 
machine learning applications. So, wary of these sorts of comments, I wrote up 
the proposal with the intention of duplicating TFOCS as closely as possible to 
start with, with the possibility of making changes from there based on 
performance. In addition, I’d interpreted an earlier comment in this ticket as 
suggesting that line search be implemented in the same manner as in TFOCS.

I am happy to implement a constant step size first, though. It may also be 
informative to run some performance tests in spark both with and without 
backtracking. One (basic, not conclusive) data point I have now is that, if I 
run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop 
and 106 iterations of the inner backtracking loop. For this one example, the 
backtracking iteration overhead is only about 10%. Though keep in mind that in 
spark if we removed backtracking entirely it would mean only one distributed 
aggregation per iteration rather than two - so a huge improvement in 
communication cost assuming there is still a good convergence rate.

Incidentally, are there any specific learning benchmarks for spark that you 
would recommend?

I’ll do a bit of research to identify the best ways to manage the lipschitz 
estimate / step size in the absence of backtracking (for our objective 
functions in particular). I’ve also noticed some references online to 
distributed implementations of accelerated methods. It may be informative to 
learn more about them - if you happen to have heard of any particularly good 
distributed optimizers using acceleration, please let me know.

Thanks,
Aaron

PS Yes, I’ll make sure to follow the lbfgs example so the accelerated 
implementation can be easily substituted.


was (Author: staple):
[~rezazadeh] Thanks for your feedback.

Your point about the communication cost of backtracking is well taken. Just to 
explain where I was coming from with the design proposal: As I was looking to 
come up to speed on accelerated gradient descent, I came across some scattered 
comments online suggesting that acceleration was difficult to implement well, 
was finicky, etc - especially when compared with standard gradient descent for 
machine learning applications. So, wary of these sorts of comments, I wrote up 
the proposal with the intention of duplicating TFOCS as closely as possible to 
start with, with the possibility of making changes from there based on 
performance. In addition, I’d interpreted an earlier comment in this ticket as 
suggesting that line search be implemented in the same manner as in TFOCS.

I am happy to implement a constant step size first, though. It may also be 
informative to run some performance tests in spark both with and without 
backtracking. One (basic, not conclusive) data point I have now is that, if I 
run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop 
and 106 iterations of the inner backtracking loop. For this one example, the 
backtracking iteration overhead is only about 10%. Though keep in mind that in 
spark if we removed backtracking entirely it would mean only one distributed 
aggregation per iteration rather than two - so a huge improvement in 
communication cost assuming there is still a good convergence rate.

Incidentally, are there any specific learning benchmarks for spark that you 
would recommend?

I’ll do a bit of research to identify the best ways to manage the lipschitz 
estimate / step size in the absence of backtracking (for our objective 
functions in particular). I’ve also noticed some references online to 
distributed implementations of accelerated methods. It may be informative to 
learn more about them - if you’ve heard of any particularly good distributed 
optimizers using acceleration, please let me know.

Thanks,
Aaron

PS Yes, I’ll make sure to follow the lbfgs example so the accelerated 
implementation can be easily substituted.

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
   

[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-24 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224132#comment-14224132
 ] 

Aaron Staple commented on SPARK-1503:
-

[~mengxr] [~rezazadeh] Ok, thanks for the heads up. Let me know if there’s 
anything about the spec that should be handled differently. I covered most of 
the mathematics informally (the details are already covered formally in the 
references). In addition, the proposal describes a method of implementing 
TFOCS functionality distributively but does not investigate existing 
distributed optimization systems.

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.






[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-22 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221233#comment-14221233
 ] 

Aaron Staple edited comment on SPARK-1503 at 11/23/14 6:55 AM:
---

[~mengxr] Sorry for the delay. I wrote up a design proposal for the initial 
implementation. Let me know what you think, and if you'd like me to clarify 
anything.

UPDATE: Ok, here's the document:
https://docs.google.com/document/d/1L50O66LnBfVopFjptbet2ZTQRzriZTjKvlIILZwKsno/edit?usp=sharing


was (Author: staple):
[~mengxr] Sorry for the delay. I wrote up a design proposal for the initial 
implementation. Let me know what you think, and if you'd like me to clarify 
anything.

UPDATE: On second thought, I'd actually like to make a few changes to the 
proposal. I'll follow up tomorrow with the updated version. Sorry for the 
confusion.

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.






[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-21 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221233#comment-14221233
 ] 

Aaron Staple commented on SPARK-1503:
-

[~mengxr] Sorry for the delay. I wrote up a design proposal for the initial 
implementation. Let me know what you think, if you'd like me to clarify 
anything.

https://docs.google.com/document/d/1L50O66LnBfVopFjptbet2ZTQRzriZTjKvlIILZwKsno/edit?usp=sharing

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.






[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-21 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221233#comment-14221233
 ] 

Aaron Staple edited comment on SPARK-1503 at 11/21/14 7:03 PM:
---

[~mengxr] Sorry for the delay. I wrote up a design proposal for the initial 
implementation. Let me know what you think, and if you'd like me to clarify 
anything.

https://docs.google.com/document/d/1L50O66LnBfVopFjptbet2ZTQRzriZTjKvlIILZwKsno/edit?usp=sharing


was (Author: staple):
[~mengxr] Sorry for the delay. I wrote up a design proposal for the initial 
implementation. Let me know what you think, if you'd like me to clarify 
anything.

https://docs.google.com/document/d/1L50O66LnBfVopFjptbet2ZTQRzriZTjKvlIILZwKsno/edit?usp=sharing

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.






[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-21 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221233#comment-14221233
 ] 

Aaron Staple edited comment on SPARK-1503 at 11/21/14 8:10 PM:
---

[~mengxr] Sorry for the delay. I wrote up a design proposal for the initial 
implementation. Let me know what you think, and if you'd like me to clarify 
anything.

UPDATE: On second thought, I'd actually like to make a few changes to the 
proposal. I'll follow up tomorrow with the updated version. Sorry for the 
confusion.


was (Author: staple):
[~mengxr] Sorry for the delay. I wrote up a design proposal for the initial 
implementation. Let me know what you think, and if you'd like me to clarify 
anything.

UPDATE: On second thought, I'd actually like to make a few changes to the 
proposal. I'll follow up tomorrow with the updated version.

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.






[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-21 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221233#comment-14221233
 ] 

Aaron Staple edited comment on SPARK-1503 at 11/21/14 8:09 PM:
---

[~mengxr] Sorry for the delay. I wrote up a design proposal for the initial 
implementation. Let me know what you think, and if you'd like me to clarify 
anything.

UPDATE: On second thought, I'd actually like to make a few changes to the 
proposal. I'll follow up tomorrow with the updated version.


was (Author: staple):
[~mengxr] Sorry for the delay. I wrote up a design proposal for the initial 
implementation. Let me know what you think, and if you'd like me to clarify 
anything.

https://docs.google.com/document/d/1L50O66LnBfVopFjptbet2ZTQRzriZTjKvlIILZwKsno/edit?usp=sharing

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.






[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-10-10 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167868#comment-14167868
 ] 

Aaron Staple commented on SPARK-1503:
-

[~mengxr] Thanks for the heads up! I’ll definitely go through TFOCS and am 
happy to work carefully and collaboratively on design.

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.






[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-10-09 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166168#comment-14166168
 ] 

Aaron Staple commented on SPARK-1503:
-

Hi, I’d like to try working on this ticket. If you’d like to assign it to me, I 
can write a short spec and then work on a PR.

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng

 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.






[jira] [Commented] (SPARK-546) Support full outer join and multiple join in a single shuffle

2014-09-25 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148052#comment-14148052
 ] 

Aaron Staple commented on SPARK-546:


Hi, I think there are two features requested in this ticket:

1) full outer join
2) an RDD function to join more than 2 RDDs in a single shuffle (e.g. a multiJoin function)

I’ve implemented #1 in my recent PR, but not #2. I’m happy to implement #2 as 
well though.

Would it make sense to reopen this ticket? File a new ticket?
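
As an illustration of what #2 could look like (a hypothetical sketch; multiJoin is
not an existing Spark API, and this builds on the existing three-way cogroup, which
shuffles all of the inputs together once):

    import org.apache.spark.SparkContext._ // pair-RDD functions via implicit conversion
    import org.apache.spark.rdd.RDD

    // Hypothetical multiJoin3: an inner join of three RDDs on the same key using a
    // single three-way cogroup, i.e. one shuffle instead of two chained joins.
    def multiJoin3(
        a: RDD[(Int, String)],
        b: RDD[(Int, String)],
        c: RDD[(Int, String)]): RDD[(Int, (String, String, String))] = {
      a.cogroup(b, c).flatMap { case (k, (as, bs, cs)) =>
        for (x <- as; y <- bs; z <- cs) yield (k, (x, y, z))
      }
    }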

 Support full outer join and multiple join in a single shuffle
 -

 Key: SPARK-546
 URL: https://issues.apache.org/jira/browse/SPARK-546
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Streaming
Reporter: Reynold Xin
Assignee: Aaron Staple
 Fix For: 1.2.0


 RDD[(K,V)] now supports left/right outer join but not full outer join.
 Also it'd be nice to provide a way for users to join multiple RDDs on the 
 same key in a single shuffle.






[jira] [Resolved] (SPARK-3550) Disable automatic rdd caching in python api for relevant learners

2014-09-25 Thread Aaron Staple (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Staple resolved SPARK-3550.
-
Resolution: Fixed

 Disable automatic rdd caching in python api for relevant learners
 -

 Key: SPARK-3550
 URL: https://issues.apache.org/jira/browse/SPARK-3550
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Aaron Staple

 The python mllib api automatically caches training rdds. However, the 
 NaiveBayes, ALS, and DecisionTree learners do not require external caching to 
 prevent repeated RDD re-evaluation during learning. NaiveBayes only evaluates 
 its input RDD once, while ALS and DecisionTree internally persist 
 transformations of their input RDDs. For these learners, we should disable 
 the automatic caching in the python mllib api.
 See discussion here:
 https://github.com/apache/spark/pull/2362#issuecomment-55637953






[jira] [Commented] (SPARK-3550) Disable automatic rdd caching in python api for relevant learners

2014-09-25 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148347#comment-14148347
 ] 

Aaron Staple commented on SPARK-3550:
-

This has been addressed in another commit: 
https://github.com/apache/spark/commit/fce5e251d636c788cda91345867e0294280c074d

See comment here:
https://github.com/apache/spark/pull/2412#issuecomment-56865408

 Disable automatic rdd caching in python api for relevant learners
 -

 Key: SPARK-3550
 URL: https://issues.apache.org/jira/browse/SPARK-3550
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Aaron Staple

 The python mllib api automatically caches training rdds. However, the 
 NaiveBayes, ALS, and DecisionTree learners do not require external caching to 
 prevent repeated RDD re-evaluation during learning. NaiveBayes only evaluates 
 its input RDD once, while ALS and DecisionTree internally persist 
 transformations of their input RDDs. For these learners, we should disable 
 the automatic caching in the python mllib api.
 See discussion here:
 https://github.com/apache/spark/pull/2362#issuecomment-55637953






[jira] [Resolved] (SPARK-3488) cache deserialized python RDDs before iterative learning

2014-09-25 Thread Aaron Staple (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Staple resolved SPARK-3488.
-
Resolution: Won't Fix

 cache deserialized python RDDs before iterative learning
 

 Key: SPARK-3488
 URL: https://issues.apache.org/jira/browse/SPARK-3488
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Aaron Staple

 When running an iterative learning algorithm, it makes sense that the input 
 RDD be cached for improved performance. When learning is applied to a python 
 RDD, currently the python RDD is always cached, then in scala that cached RDD 
 is mapped to an uncached deserialized RDD, and the uncached RDD is passed to 
 the learning algorithm. Instead the deserialized RDD should be cached.
 This was originally discussed here:
 https://github.com/apache/spark/pull/2347#issuecomment-55181535
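
 A minimal sketch of the proposed ordering (names are illustrative, not the actual wrapper code): deserialize first, then cache the deserialized RDD, so each training iteration reuses deserialized records instead of re-deserializing a cached byte RDD:
{noformat}
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PythonTrainingInput {
  // `deserialize` stands in for whatever SerDe the python wrapper uses.
  def cachedPoints(pythonRDD: JavaRDD[Array[Byte]],
                   deserialize: Array[Byte] => LabeledPoint): RDD[LabeledPoint] = {
    pythonRDD.rdd
      .map(deserialize)                      // bytes -> LabeledPoint, once per record
      .persist(StorageLevel.MEMORY_AND_DISK) // cache after deserialization
  }
}
{noformat}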



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3550) Disable automatic rdd caching in python api for relevant learners

2014-09-16 Thread Aaron Staple (JIRA)
Aaron Staple created SPARK-3550:
---

 Summary: Disable automatic rdd caching in python api for relevant 
learners
 Key: SPARK-3550
 URL: https://issues.apache.org/jira/browse/SPARK-3550
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Aaron Staple


The python mllib api automatically caches training rdds. However, the 
NaiveBayes, ALS, and DecisionTree learners do not require external caching to 
prevent repeated RDD re-evaluation during learning. NaiveBayes only evaluates 
its input RDD once, while ALS and DecisionTree internally persist 
transformations of their input RDDs. For these learners, we should disable the 
automatic caching in the python mllib api.

See discussion here:
https://github.com/apache/spark/pull/2362#issuecomment-55637953



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3488) cache deserialized python RDDs before iterative learning

2014-09-16 Thread Aaron Staple (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Staple updated SPARK-3488:

Component/s: PySpark

 cache deserialized python RDDs before iterative learning
 

 Key: SPARK-3488
 URL: https://issues.apache.org/jira/browse/SPARK-3488
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Aaron Staple

 When running an iterative learning algorithm, it makes sense that the input 
 RDD be cached for improved performance. When learning is applied to a python 
 RDD, currently the python RDD is always cached, then in scala that cached RDD 
 is mapped to an uncached deserialized RDD, and the uncached RDD is passed to 
 the learning algorithm. Instead the deserialized RDD should be cached.
 This was originally discussed here:
 https://github.com/apache/spark/pull/2347#issuecomment-55181535



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3488) cache deserialized python RDDs before iterative learning

2014-09-16 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135613#comment-14135613
 ] 

Aaron Staple commented on SPARK-3488:
-

After further discussion it's been decided that, for now, the present 
implementation’s reduced memory footprint for cached rdds is worth the cpu cost 
of repeated deserialization during learning.

See discussion here: https://github.com/apache/spark/pull/2362#issuecomment-2191

 cache deserialized python RDDs before iterative learning
 

 Key: SPARK-3488
 URL: https://issues.apache.org/jira/browse/SPARK-3488
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Aaron Staple

 When running an iterative learning algorithm, it makes sense that the input 
 RDD be cached for improved performance. When learning is applied to a python 
 RDD, currently the python RDD is always cached, then in scala that cached RDD 
 is mapped to an uncached deserialized RDD, and the uncached RDD is passed to 
 the learning algorithm. Instead the deserialized RDD should be cached.
 This was originally discussed here:
 https://github.com/apache/spark/pull/2347#issuecomment-55181535



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-695) Exponential recursion in getPreferredLocations

2014-07-31 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081669#comment-14081669
 ] 

Aaron Staple commented on SPARK-695:


Progress has been made on a PR here:
https://github.com/apache/spark/pull/1362

 Exponential recursion in getPreferredLocations
 --

 Key: SPARK-695
 URL: https://issues.apache.org/jira/browse/SPARK-695
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia

 This was reported to happen in DAGScheduler for graphs with many paths from 
 the root up, though I haven't yet found a good test case for it.
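
 As a toy illustration of the issue (integers standing in for RDDs, not the DAGScheduler source): without a visited set, a diamond-heavy lineage makes the recursion revisit shared ancestors exponentially often, while tracking visited nodes keeps the traversal linear in the graph size:
{noformat}
import scala.collection.mutable

object PreferredLocsSketch {
  // parents(n) lists n's dependencies; locs(n) gives n's own preferred
  // locations, if any. The visited set prevents re-expanding a node that
  // is reachable along many different paths.
  def preferredLocs(node: Int,
                    parents: Int => Seq[Int],
                    locs: Int => Seq[String]): Seq[String] = {
    val visited = mutable.HashSet.empty[Int]
    def visit(n: Int): Seq[String] = {
      if (!visited.add(n)) Nil
      else {
        val own = locs(n)
        if (own.nonEmpty) own else parents(n).flatMap(visit)
      }
    }
    visit(node)
  }
}
{noformat}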



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2781) Analyzer should check resolution of LogicalPlans

2014-07-31 Thread Aaron Staple (JIRA)
Aaron Staple created SPARK-2781:
---

 Summary: Analyzer should check resolution of LogicalPlans
 Key: SPARK-2781
 URL: https://issues.apache.org/jira/browse/SPARK-2781
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Aaron Staple


Currently the Analyzer’s CheckResolution rule checks that all attributes are 
resolved by searching for unresolved Expressions.  But some LogicalPlans, 
including Union, contain custom implementations of the resolved attribute that 
validate other criteria in addition to checking for attribute resolution of 
their descendants.  These LogicalPlans are not currently validated by the 
CheckResolution implementation.

As a result, it is currently possible to execute a query generated from 
unresolved LogicalPlans.  One example is a UNION query that produces rows with 
different data types in the same column:

{noformat}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class T1(value:Seq[Int])
val t1 = sc.parallelize(Seq(T1(Seq(0,1))))
t1.registerAsTable("t1")
sqlContext.sql("SELECT value FROM t1 UNION SELECT 2 FROM t1").collect()
{noformat}

In this example, the type coercion implementation cannot unify array and 
integer types.  One row contains an array in the returned column and the other 
row contains an integer.  The result is:

{noformat}
res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2])
{noformat}

I believe fixing this is a first step toward improving validation for Union 
(and similar) plans.  (For instance, Union does not currently validate that its 
children contain the same number of columns.)




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2781) Analyzer should check resolution of LogicalPlans

2014-07-31 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081833#comment-14081833
 ] 

Aaron Staple commented on SPARK-2781:
-

No problem. I think the current validation checks Expressions, but there are 
some cases where LogicalPlans might not be resolved even though their 
Expressions are resolved.

 Analyzer should check resolution of LogicalPlans
 

 Key: SPARK-2781
 URL: https://issues.apache.org/jira/browse/SPARK-2781
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Aaron Staple
Assignee: Michael Armbrust
 Fix For: 1.0.1, 1.1.0


 Currently the Analyzer’s CheckResolution rule checks that all attributes are 
 resolved by searching for unresolved Expressions.  But some LogicalPlans, 
 including Union, contain custom implementations of the resolved attribute that 
 validate other criteria in addition to checking for attribute resolution of 
 their descendants.  These LogicalPlans are not currently validated by the 
 CheckResolution implementation.
 As a result, it is currently possible to execute a query generated from 
 unresolved LogicalPlans.  One example is a UNION query that produces rows 
 with different data types in the same column:
 {noformat}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext._
 case class T1(value:Seq[Int])
 val t1 = sc.parallelize(Seq(T1(Seq(0,1))))
 t1.registerAsTable("t1")
 sqlContext.sql("SELECT value FROM t1 UNION SELECT 2 FROM t1").collect()
 {noformat}
 In this example, the type coercion implementation cannot unify array and 
 integer types.  One row contains an array in the returned column and the 
 other row contains an integer.  The result is:
 {noformat}
 res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2])
 {noformat}
 I believe fixing this is a first step toward improving validation for Union 
 (and similar) plans.  (For instance, Union does not currently validate that 
 its children contain the same number of columns.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2314) RDD actions are only overridden in Scala, not java or python

2014-07-25 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14074836#comment-14074836
 ] 

Aaron Staple commented on SPARK-2314:
-

Hi, I added a PR that handles overriding collect and take in Python, and 
count in Java:

https://github.com/apache/spark/pull/1592

 RDD actions are only overridden in Scala, not java or python
 

 Key: SPARK-2314
 URL: https://issues.apache.org/jira/browse/SPARK-2314
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0, 1.0.1
Reporter: Michael Armbrust
Assignee: Aaron Staple
  Labels: starter
 Fix For: 1.1.0, 1.0.2


 For example, collect and take().  We should keep these two in sync, or move 
 this code to SchemaRDDLike if possible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2581) complete or withdraw visitedStages optimization in DAGScheduler’s stageDependsOn

2014-07-18 Thread Aaron Staple (JIRA)
Aaron Staple created SPARK-2581:
---

 Summary: complete or withdraw visitedStages optimization in 
DAGScheduler’s stageDependsOn
 Key: SPARK-2581
 URL: https://issues.apache.org/jira/browse/SPARK-2581
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Aaron Staple
Priority: Minor


Right now the visitedStages HashSet is populated with stages, but never queried 
to limit examination of previously visited stages.  It may make sense to check 
whether a mapStage has been visited previously before visiting it again, as in 
the nearby visitedRdds check.  Or it may be that the existing visitedRdds check 
sufficiently optimizes this function, and visitedStages can simply be removed.

See discussion here: 
https://github.com/apache/spark/pull/1362#discussion-diff-15018046L1107
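
For reference, a toy version of the traversal (integers standing in for stages, not the DAGScheduler source) showing where a visitedStages check would short-circuit repeat visits, mirroring the existing visitedRdds guard:

{noformat}
import scala.collection.mutable

object StageDependsOnSketch {
  // parents(s) lists the stages s depends on. Checking the visited set
  // before descending means each shared ancestor stage is examined once.
  def dependsOn(stage: Int, target: Int, parents: Int => Seq[Int]): Boolean = {
    val visitedStages = mutable.HashSet.empty[Int]
    def visit(s: Int): Boolean = {
      if (s == target) true
      else if (!visitedStages.add(s)) false   // already examined this stage
      else parents(s).exists(visit)
    }
    visit(stage)
  }
}
{noformat}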



--
This message was sent by Atlassian JIRA
(v6.2#6252)