[jira] [Commented] (SPARK-1655) In naive Bayes, store conditional probabilities distributively.
[ https://issues.apache.org/jira/browse/SPARK-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694571#comment-14694571 ] Aaron Staple commented on SPARK-1655: - Totally fine with me, but I'm not sure who the main decision maker on this is. In naive Bayes, store conditional probabilities distributively. --- Key: SPARK-1655 URL: https://issues.apache.org/jira/browse/SPARK-1655 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple In the current implementation, we collect all conditional probabilities to the driver node. When there are many labels and many features, this puts a heavy load on the driver. For scalability, we should provide a way to store conditional probabilities distributively. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
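[Editor's note] A minimal sketch of the idea, under assumptions: the class and field names below (DistributedNBModel, LabelParams, logPrior, logTheta) are invented for illustration and are not the MLlib API. Per-label log conditional probabilities stay in an RDD, and an example is scored by a distributed argmax over labels instead of collecting the full label-by-feature matrix to the driver.
{noformat}
// Hedged sketch only; names are hypothetical, not the MLlib API.
import org.apache.spark.rdd.RDD

case class LabelParams(label: Double, logPrior: Double, logTheta: Array[Double])

class DistributedNBModel(params: RDD[LabelParams]) {
  /** Score one dense example via a distributed argmax over labels. */
  def predict(x: Array[Double]): Double = {
    params
      .map { p =>
        // log P(label) + sum_i x_i * log P(feature_i | label)
        val score = p.logPrior + x.zip(p.logTheta).map { case (xi, lt) => xi * lt }.sum
        (p.label, score)
      }
      .reduce((a, b) => if (a._2 >= b._2) a else b)
      ._1
  }
}
{noformat}
Scoring this way trades a Spark job per prediction for driver memory, so it only pays off when the label-by-feature matrix is too large to collect.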
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598709#comment-14598709 ] Aaron Staple commented on SPARK-1503: - I believe this stopping criterion was added after the paper was written. It is documented on page 8 of the userguide (https://github.com/cvxr/TFOCS/raw/master/userguide.pdf) but unfortunately no explanation is provided. (The userguide also documents this as a <= test, while the current code uses <.) And unfortunately I couldn’t find an explanation in the code or git history. I think the switch to absolute tolerance may be because a relative difference measurement could be less useful when the weights are extremely small, and 1 is a convenient cutoff point. (Using 1, the equation is simple and the interpretation is clear.) I believe [~mengxr] alluded to switching to an absolute tolerance at 1 already (https://github.com/apache/spark/pull/3636#discussion_r22078041) so he might be able to provide more information. With regard to using the new weight norms as the basis for measuring relative weight difference, I think that if the convergence test passes using either the old or new weight norms, then the old and new norms are going to be very similar. It may not make a significant difference which test is used. (It may also be worth pointing out that in cases where the tolerance tests with respect to different old/new weights return different results, if the tolerance wrt new weights is met (and wrt old weights is not) then the weight norm increased slightly; if the tolerance wrt the old weights is met (and wrt new weights not) then the weight norm decreased slightly.) Finally, TFOCS adopts a policy of skipping the convergence test after the first iteration if the weights are unchanged. I believe this condition is based on implementation specific behavior and does not need to be adopted generally. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
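[Editor's note] For readers following the discussion, here is a small sketch of the stopping test as described in this thread (one reading of tfocs_iterate.m, not a verified port of TFOCS): the change in weights is measured relative to the new weight norm, and the test switches to an absolute difference once the new weights have norm below 1.
{noformat}
// Sketch of the described test, using Breeze vectors (as MLlib does internally).
// This is an interpretation of the comment above, not the actual TFOCS or MLlib code.
import breeze.linalg.{DenseVector, norm}

def converged(oldW: DenseVector[Double], newW: DenseVector[Double], tol: Double): Boolean = {
  val diff = norm(newW - oldW)
  val newNorm = norm(newW)
  // Relative test w.r.t. the new weights; absolute test when ||newW|| < 1,
  // so a tiny denominator cannot make the effective tolerance arbitrarily strict.
  if (newNorm > 1.0) diff < tol * newNorm else diff < tol
}
{noformat}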
[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598709#comment-14598709 ] Aaron Staple edited comment on SPARK-1503 at 6/24/15 1:46 AM: -- I believe this stopping criterion was added after the paper was written. It is documented on page 8 of the userguide (https://github.com/cvxr/TFOCS/raw/master/userguide.pdf) but unfortunately no explanation is provided. (The userguide also documents this as a <= test, while the current code uses <.) And unfortunately I couldn’t find an explanation in the code or git history. I think the switch to absolute tolerance may be because a relative difference measurement could be less useful when the weights are extremely small, and 1 is a convenient cutoff point. (Using 1, the equation is simple and the interpretation is clear.) I believe [~mengxr] alluded to switching to an absolute tolerance at 1 already (https://github.com/apache/spark/pull/3636#discussion_r22078041) so he might be able to provide more information. With regard to using the new weight norms as the basis for measuring relative weight difference, I think that if the convergence test passes using either the old or new weight norms, then the old and new norms are going to be very similar. It may not make a significant difference which test is used. (It may also be worth pointing out that in cases where the tolerance tests with respect to different old/new weights return different results, if the tolerance wrt new weights is met (and wrt old weights is not) then the weight norm increased slightly; if the tolerance wrt the old weights is met (and wrt new weights not) then the weight norm decreased slightly.) Finally, TFOCS adopts a policy of skipping the convergence test on the first iteration if the weights are unchanged. I believe this condition is based on implementation specific behavior and does not need to be adopted generally. was (Author: staple): I believe this stopping criterion was added after the paper was written. It is documented on page 8 of the userguide (https://github.com/cvxr/TFOCS/raw/master/userguide.pdf) but unfortunately no explanation is provided. (The userguide also documents this as a <= test, while the current code uses <.) And unfortunately I couldn’t find an explanation in the code or git history. I think the switch to absolute tolerance may be because a relative difference measurement could be less useful when the weights are extremely small, and 1 is a convenient cutoff point. (Using 1, the equation is simple and the interpretation is clear.) I believe [~mengxr] alluded to switching to an absolute tolerance at 1 already (https://github.com/apache/spark/pull/3636#discussion_r22078041) so he might be able to provide more information. With regard to using the new weight norms as the basis for measuring relative weight difference, I think that if the convergence test passes using either the old or new weight norms, then the old and new norms are going to be very similar. It may not make a significant difference which test is used. (It may also be worth pointing out that in cases where the tolerance tests with respect to different old/new weights return different results, if the tolerance wrt new weights is met (and wrt old weights is not) then the weight norm increased slightly; if the tolerance wrt the old weights is met (and wrt new weights not) then the weight norm decreased slightly.) 
Finally, TFOCS adopts a policy of skipping the convergence test after the first iteration if the weights are unchanged. I believe this condition is based on implementation specific behavior and does not need to be adopted generally. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596991#comment-14596991 ] Aaron Staple commented on SPARK-1503: - [~josephkb] and [~lewuathe] Sure, happy to coordinate. So far I have just been duplicating the convergence tolerance check in TFOCS, the MATLAB package on which the accelerated gradient descent implementation is based. TFOCS also tests for convergence by checking if the relative change in the weight vector is below a specified threshold. But there are some differences from the SPARK-3382 implementation. For example in TFOCS the relative difference between new and old weight vectors is measured with respect to the new weight vector instead of the old. And if the new weight vector has norm smaller than 1, the convergence test is changed to an absolute rather than relative difference between successive weight vectors. I am just describing the implementation here, happy to discuss further and potentially look at making changes. Here is the relevant code if you are interested (there is also a separate condition when the weight vector does not change between iterations): https://github.com/cvxr/TFOCS/blob/e34c0daeb136935d23b8df506de8b7b191f6b0a3/private/tfocs_iterate.m#L19-L24 Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14350812#comment-14350812 ] Aaron Staple edited comment on SPARK-1503 at 3/6/15 11:25 PM: -- I recently created a PR for an implementation of accelerated gradient descent without backtracking, as discussed above. I also ran some simple, small scale (single server) benchmarks to compare different optimization methods. (The benchmark result graphs are provided as images attached to this ticket.) The optimization methods tested were: - gra: existing gradient descent implementation (using full batch) - acc: accelerated descent (as implemented in the PR), but without automatic restart - acc_r: accelerated descent, with automatic restart - acc_b: accelerated descent, with backtracking, without automatic restart - acc_rb: accelerated descent, with backtracking, with automatic restart - lbfgs: existing lbfgs implementation (Note that backtracking is not part of the PR, and was just tested as a point of comparison.) The x axis shows the number of outer loop iterations of the optimization algorithm. (Note that for backtracking implementations, the full cost of backtracking is not represented in this outer loop count. For non backtracking implementations, the number of outer loop iterations is the same as the number of spark map reduce jobs). The y axis is the log of the difference from best determined optimized value. The optimization test runs were: - linear: A scaled up version of the test data from TFOCS’s test_LASSO.m example, with 1 observations on 1024 features. 512 of the features are actually correlated with result. Unregularized linear regression was used. (The scala acceleration implementation was observed to be consistent with the TFOCS implementation on this dataset.) - linear l1: The same as ‘linear’, but with l1 regularization - logistic: Each feature of each observation is generated by summing a feature gaussian specific to the observation’s binary category with a noise gaussian. 1 observations on 250 features. Unregularized logistic regression was used. - logistic l2: Same as ‘logistic’, but using l2 regularization Note that for each test run, all optimization methods were given the same initial step size. Observations: - Acceleration consistently converges more quickly than standard gradient descent, given the same initial step size. - Automatic restart is helpful for acceleration convergence - Backtracking can significantly boost convergence rates in some cases (measured in terms of outer loop iterations). But the full cost of backtracking was not measured in these runs. - lbfgs generally outperformed accelerated gradient descent in these test runs. Accelerated gradient descent was competitive with lbfgs for linear l1 (lasso) regression. However as noted in the documentation, the L1Updater “will not work” for the lbfgs implementation. It seems l1 regularization is currently a weak spot for the lbfgs implementation. was (Author: staple): I recently created a PR for an implementation of accelerated gradient descent without backtracking, as discussed above. I also ran some simple, small scale (single server) benchmarks to compare different optimization methods. (The benchmark result graphs are provided as images attached to this ticket.) 
The optimization methods tested were: - gra: existing gradient descent implementation (using full batch) - acc: accelerated descent (as implemented in the PR), but without automatic restart - acc_r: accelerated descent, with automatic restart - acc_b: accelerated descent, with backtracking, without automatic restart - acc_rb: accelerated descent, with backtracking, with automatic restart - lbfgs: existing lbfgs implementation (Note that backtracking is not part of the PR, and was just tested for a point of comparison.) The x axis shows the number of outer loop iterations of the optimization algorithm. (Note that for backtracking implementations, the full cost of backtracking is not represented in this outer loop count. For non backtracking implementations, the number of outer loop iterations is the same as the number of spark map reduce jobs). The y axis is the log of the difference from best determined optimized value. The optimization test runs were: - linear: A scaled up version of the test data from TFOCS’s test_LASSO.m example, with 1 observations on 1024 features. 512 of the features are actually correlated with result. Unregularized linear regression was used. (The scala acceleration implementation was observed to be consistent with the TFOCS implementation on this dataset.) - linear l1: The same as ‘linear’, but with l1 regularization - logistic: Each feature of each observation is generated by summing a feature gaussian specific to the observation’s binary category with a noise gaussian.
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14350812#comment-14350812 ] Aaron Staple commented on SPARK-1503: - I recently created a PR for an implementation of accelerated gradient descent without backtracking, as discussed above. I also ran some simple, small scale (single server) benchmarks to compare different optimization methods. (The benchmark result graphs are provided as images attached to this ticket.) The optimization methods tested were: - gra: existing gradient descent implementation (using full batch) - acc: accelerated descent (as implemented in the PR), but without automatic restart - acc_r: accelerated descent, with automatic restart - acc_b: accelerated descent, with backtracking, without automatic restart - acc_rb: accelerated descent, with backtracking, with automatic restart - lbfgs: existing lbfgs implementation (Note that backtracking is not part of the PR, and was just tested for a point of comparison.) The x axis shows the number of outer loop iterations of the optimization algorithm. (Note that for backtracking implementations, the full cost of backtracking is not represented in this outer loop count. For non backtracking implementations, the number of outer loop iterations is the same as the number of spark map reduce jobs). The y axis is the log of the difference from best determined optimized value. The optimization test runs were: - linear: A scaled up version of the test data from TFOCS’s test_LASSO.m example, with 1 observations on 1024 features. 512 of the features are actually correlated with result. Unregularized linear regression was used. (The scala acceleration implementation was observed to be consistent with the TFOCS implementation on this dataset.) - linear l1: The same as ‘linear’, but with l1 regularization - logistic: Each feature of each observation is generated by summing a feature gaussian specific to the observation’s binary category with a noise gaussian. 1 observations on 250 features. Unregularized logistic regression was used. - logistic l2: Same as ‘logistic’, but using l2 regularization Note that for each test run, all optimization methods were given the same initial step size. Observations: - Acceleration consistently converges more quickly than standard gradient descent, given the same initial step size. - Automatic restart is helpful for acceleration convergence - Backtracking can significantly boost convergence rates in some cases (measured in terms of outer loop iterations). But the full cost of backtracking was not measured in these runs. - lbfgs generally outperformed accelerated gradient descent in these test runs. Accelerated gradient descent was competitive with lbfgs for linear l1 (lasso) regression. However as noted in the documentation, the L1Updater “will not work” for the lbfgs implementation. It seems l1 regularization is currently a weak spot for the lbfgs implementation. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
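[Editor's note] As a concrete illustration of the "logistic" test data described above, the sketch below generates two-class data by summing a class-specific feature gaussian with a noise gaussian. The class means and noise scale are assumptions made for illustration; the benchmark's actual settings are not stated in the comment.
{noformat}
// Illustrative only: class means and noise scale are assumptions, not the benchmark settings.
import scala.util.Random

def makeLogisticData(n: Int, d: Int, seed: Long = 42L): Seq[(Double, Array[Double])] = {
  val rng = new Random(seed)
  val classMean = Array(-1.0, 1.0)
  Seq.fill(n) {
    val label = rng.nextInt(2)
    // feature gaussian centered on the class mean, plus an independent noise gaussian
    val x = Array.fill(d)(classMean(label) + rng.nextGaussian() + 0.5 * rng.nextGaussian())
    (label.toDouble, x)
  }
}
{noformat}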
[jira] [Updated] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Staple updated SPARK-1503: Attachment: logistic_l2.png logistic.png linear_l1.png linear.png Optimization benchmarks, uploaded 2015-03-06. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Attachments: linear.png, linear_l1.png, logistic.png, logistic_l2.png Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226958#comment-14226958 ] Aaron Staple commented on SPARK-1503: - [~rezazadeh] Thanks for your feedback. Your point about the communication cost of backtracking is well taken. Just to explain where I was coming from with the design proposal: As I was looking to come up to speed on accelerated gradient descent, I came across some scattered comments online suggesting that acceleration was difficult to implement well, was finicky, etc - especially when compared with standard gradient descent for machine learning. So, wary of these sorts of comments, I wrote up the proposal with the intention of duplicating TFOCS as closely as possible to start with, with the possibility of making changes from there based on performance. In addition, I’d interpreted an earlier comment in this ticket as suggesting that line search be implemented in the same manner as in TFOCS. I am happy to implement a constant step size first, though. It may also be informative to run some performance tests in spark both with and without backtracking. One (basic, not conclusive) data point I have now is that, if I run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop and 106 iterations of the inner backtracking loop. For this one example, the backtracking iteration overhead is only about 10%. Though keep in mind that in spark if we removed backtracking entirely it would mean only one distributed aggregation per iteration rather than two - so a huge improvement in communication cost assuming there is still a good convergence rate. Incidentally, are there any specific learning benchmarks for spark that you would recommend? I’ll do a bit of research to identify the best ways to manage the lipschitz estimate / step size in the absence of backtracking. I’ve also noticed some references online to distributed implementations of accelerated methods. It may be informative to learn more about them - if you’ve heard of any particularly good distributed optimizers using acceleration, please let me know. Thanks, Aaron PS Yes, I’ll make sure to follow the lbfgs example so the accelerated implementation can be easily substituted. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
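[Editor's note] For context on the cost trade-off discussed above, here is a generic Armijo-style backtracking sketch (step halving with a sufficient-decrease test), not the exact TFOCS backtracking rule: each trial step requires another evaluation of the objective, which for a distributed objective means another aggregation over the data per inner iteration.
{noformat}
// Generic backtracking line search by step halving (Armijo sufficient-decrease test).
// Each call to f is one full pass over the data when f is a distributed objective.
def backtrackStep(f: Array[Double] => Double,
                  grad: Array[Double],
                  w: Array[Double],
                  initialStep: Double,
                  maxTrials: Int = 20): Double = {
  val fw = f(w)
  val gradSq = grad.map(g => g * g).sum
  def trial(step: Double): Double = f(w.zip(grad).map { case (wi, gi) => wi - step * gi })
  var step = initialStep
  var trials = 0
  while (trials < maxTrials && trial(step) > fw - 0.5 * step * gradSq) {
    step /= 2.0 // each retry is another objective evaluation (another aggregation)
    trials += 1
  }
  step
}
{noformat}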
[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226958#comment-14226958 ] Aaron Staple edited comment on SPARK-1503 at 11/26/14 11:09 PM: [~rezazadeh] Thanks for your feedback. Your point about the communication cost of backtracking is well taken. Just to explain where I was coming from with the design proposal: As I was looking to come up to speed on accelerated gradient descent, I came across some scattered comments online suggesting that acceleration was difficult to implement well, was finicky, etc - especially when compared with standard gradient descent for machine learning. So, wary of these sorts of comments, I wrote up the proposal with the intention of duplicating TFOCS as closely as possible to start with, with the possibility of making changes from there based on performance. In addition, I’d interpreted an earlier comment in this ticket as suggesting that line search be implemented in the same manner as in TFOCS. I am happy to implement a constant step size first, though. It may also be informative to run some performance tests in spark both with and without backtracking. One (basic, not conclusive) data point I have now is that, if I run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop and 106 iterations of the inner backtracking loop. For this one example, the backtracking iteration overhead is only about 10%. Though keep in mind that in spark if we removed backtracking entirely it would mean only one distributed aggregation per iteration rather than two - so a huge improvement in communication cost assuming there is still a good convergence rate. Incidentally, are there any specific learning benchmarks for spark that you would recommend? I’ll do a bit of research to identify the best ways to manage the lipschitz estimate / step size in the absence of backtracking (for our objective functions in particular). I’ve also noticed some references online to distributed implementations of accelerated methods. It may be informative to learn more about them - if you’ve heard of any particularly good distributed optimizers using acceleration, please let me know. Thanks, Aaron PS Yes, I’ll make sure to follow the lbfgs example so the accelerated implementation can be easily substituted. was (Author: staple): [~rezazadeh] Thanks for your feedback. Your point about the communication cost of backtracking is well taken. Just to explain where I was coming from with the design proposal: As I was looking to come up to speed on accelerated gradient descent, I came across some scattered comments online suggesting that acceleration was difficult to implement well, was finicky, etc - especially when compared with standard gradient descent for machine learning. So, wary of these sorts of comments, I wrote up the proposal with the intention of duplicating TFOCS as closely as possible to start with, with the possibility of making changes from there based on performance. In addition, I’d interpreted an earlier comment in this ticket as suggesting that line search be implemented in the same manner as in TFOCS. I am happy to implement a constant step size first, though. It may also be informative to run some performance tests in spark both with and without backtracking. One (basic, not conclusive) data point I have now is that, if I run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop and 106 iterations of the inner backtracking loop. 
For this one example, the backtracking iteration overhead is only about 10%. Though keep in mind that in spark if we removed backtracking entirely it would mean only one distributed aggregation per iteration rather than two - so a huge improvement in communication cost assuming there is still a good convergence rate. Incidentally, are there any specific learning benchmarks for spark that you would recommend? I’ll do a bit of research to identify the best ways to manage the lipschitz estimate / step size in the absence of backtracking. I’ve also noticed some references online to distributed implementations of accelerated methods. It may be informative to learn more about them - if you’ve heard of any particularly good distributed optimizers using acceleration, please let me know. Thanks, Aaron PS Yes, I’ll make sure to follow the lbfgs example so the accelerated implementation can be easily substituted. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a
[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226958#comment-14226958 ] Aaron Staple edited comment on SPARK-1503 at 11/26/14 11:10 PM: [~rezazadeh] Thanks for your feedback. Your point about the communication cost of backtracking is well taken. Just to explain where I was coming from with the design proposal: As I was looking to come up to speed on accelerated gradient descent, I came across some scattered comments online suggesting that acceleration was difficult to implement well, was finicky, etc - especially when compared with standard gradient descent for machine learning applications. So, wary of these sorts of comments, I wrote up the proposal with the intention of duplicating TFOCS as closely as possible to start with, with the possibility of making changes from there based on performance. In addition, I’d interpreted an earlier comment in this ticket as suggesting that line search be implemented in the same manner as in TFOCS. I am happy to implement a constant step size first, though. It may also be informative to run some performance tests in spark both with and without backtracking. One (basic, not conclusive) data point I have now is that, if I run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop and 106 iterations of the inner backtracking loop. For this one example, the backtracking iteration overhead is only about 10%. Though keep in mind that in spark if we removed backtracking entirely it would mean only one distributed aggregation per iteration rather than two - so a huge improvement in communication cost assuming there is still a good convergence rate. Incidentally, are there any specific learning benchmarks for spark that you would recommend? I’ll do a bit of research to identify the best ways to manage the lipschitz estimate / step size in the absence of backtracking (for our objective functions in particular). I’ve also noticed some references online to distributed implementations of accelerated methods. It may be informative to learn more about them - if you’ve heard of any particularly good distributed optimizers using acceleration, please let me know. Thanks, Aaron PS Yes, I’ll make sure to follow the lbfgs example so the accelerated implementation can be easily substituted. was (Author: staple): [~rezazadeh] Thanks for your feedback. Your point about the communication cost of backtracking is well taken. Just to explain where I was coming from with the design proposal: As I was looking to come up to speed on accelerated gradient descent, I came across some scattered comments online suggesting that acceleration was difficult to implement well, was finicky, etc - especially when compared with standard gradient descent for machine learning. So, wary of these sorts of comments, I wrote up the proposal with the intention of duplicating TFOCS as closely as possible to start with, with the possibility of making changes from there based on performance. In addition, I’d interpreted an earlier comment in this ticket as suggesting that line search be implemented in the same manner as in TFOCS. I am happy to implement a constant step size first, though. It may also be informative to run some performance tests in spark both with and without backtracking. One (basic, not conclusive) data point I have now is that, if I run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop and 106 iterations of the inner backtracking loop. 
For this one example, the backtracking iteration overhead is only about 10%. Though keep in mind that in spark if we removed backtracking entirely it would mean only one distributed aggregation per iteration rather than two - so a huge improvement in communication cost assuming there is still a good convergence rate. Incidentally, are there any specific learning benchmarks for spark that you would recommend? I’ll do a bit of research to identify the best ways to manage the lipschitz estimate / step size in the absence of backtracking (for our objective functions in particular). I’ve also noticed some references online to distributed implementations of accelerated methods. It may be informative to learn more about them - if you’ve heard of any particularly good distributed optimizers using acceleration, please let me know. Thanks, Aaron PS Yes, I’ll make sure to follow the lbfgs example so the accelerated implementation can be easily substituted. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron
[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226958#comment-14226958 ] Aaron Staple edited comment on SPARK-1503 at 11/26/14 11:14 PM: [~rezazadeh] Thanks for your feedback. Your point about the communication cost of backtracking is well taken. Just to explain where I was coming from with the design proposal: As I was looking to come up to speed on accelerated gradient descent, I came across some scattered comments online suggesting that acceleration was difficult to implement well, was finicky, etc - especially when compared with standard gradient descent for machine learning applications. So, wary of these sorts of comments, I wrote up the proposal with the intention of duplicating TFOCS as closely as possible to start with, with the possibility of making changes from there based on performance. In addition, I’d interpreted an earlier comment in this ticket as suggesting that line search be implemented in the same manner as in TFOCS. I am happy to implement a constant step size first, though. It may also be informative to run some performance tests in spark both with and without backtracking. One (basic, not conclusive) data point I have now is that, if I run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop and 106 iterations of the inner backtracking loop. For this one example, the backtracking iteration overhead is only about 10%. Though keep in mind that in spark if we removed backtracking entirely it would mean only one distributed aggregation per iteration rather than two - so a huge improvement in communication cost assuming there is still a good convergence rate. Incidentally, are there any specific learning benchmarks for spark that you would recommend? I’ll do a bit of research to identify the best ways to manage the lipschitz estimate / step size in the absence of backtracking (for our objective functions in particular). I’ve also noticed some references online to distributed implementations of accelerated methods. It may be informative to learn more about them - if you happen to have heard of any particularly good distributed optimizers using acceleration, please let me know. Thanks, Aaron PS Yes, I’ll make sure to follow the lbfgs example so the accelerated implementation can be easily substituted. was (Author: staple): [~rezazadeh] Thanks for your feedback. Your point about the communication cost of backtracking is well taken. Just to explain where I was coming from with the design proposal: As I was looking to come up to speed on accelerated gradient descent, I came across some scattered comments online suggesting that acceleration was difficult to implement well, was finicky, etc - especially when compared with standard gradient descent for machine learning applications. So, wary of these sorts of comments, I wrote up the proposal with the intention of duplicating TFOCS as closely as possible to start with, with the possibility of making changes from there based on performance. In addition, I’d interpreted an earlier comment in this ticket as suggesting that line search be implemented in the same manner as in TFOCS. I am happy to implement a constant step size first, though. It may also be informative to run some performance tests in spark both with and without backtracking. 
One (basic, not conclusive) data point I have now is that, if I run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop and 106 iterations of the inner backtracking loop. For this one example, the backtracking iteration overhead is only about 10%. Though keep in mind that in spark if we removed backtracking entirely it would mean only one distributed aggregation per iteration rather than two - so a huge improvement in communication cost assuming there is still a good convergence rate. Incidentally, are there any specific learning benchmarks for spark that you would recommend? I’ll do a bit of research to identify the best ways to manage the lipschitz estimate / step size in the absence of backtracking (for our objective functions in particular). I’ve also noticed some references online to distributed implementations of accelerated methods. It may be informative to learn more about them - if you’ve heard of any particularly good distributed optimizers using acceleration, please let me know. Thanks, Aaron PS Yes, I’ll make sure to follow the lbfgs example so the accelerated implementation can be easily substituted. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224132#comment-14224132 ] Aaron Staple commented on SPARK-1503: - [~mengxr] [~rezazadeh] Ok, thanks for the heads up. Let me know if there’s anything about the spec that should be handled differently. I covered most of the mathematics informally (the details are already covered formally in the references). And in addition, the proposal describes a method of implementing TFOCS functionality distributively but does not investigate existing distributed optimization systems. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221233#comment-14221233 ] Aaron Staple edited comment on SPARK-1503 at 11/23/14 6:55 AM: --- [~mengxr] Sorry for the delay. I wrote up a design proposal for the initial implementation. Let me know what you think, and if you'd like me to clarify anything. UPDATE: Ok, here's the document: https://docs.google.com/document/d/1L50O66LnBfVopFjptbet2ZTQRzriZTjKvlIILZwKsno/edit?usp=sharing was (Author: staple): [~mengxr] Sorry for the delay. I wrote up a design proposal for the initial implementation. Let me know what you think, and if you'd like me to clarify anything. UPDATE: On second thought, I'd actually like to make a few changes to the proposal. I'll follow up tomorrow with the updated version. Sorry for the confusion. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221233#comment-14221233 ] Aaron Staple commented on SPARK-1503: - [~mengxr] Sorry for the delay. I wrote up a design proposal for the initial implementation. Let me know what you think, if you'd like me to clarify anything. https://docs.google.com/document/d/1L50O66LnBfVopFjptbet2ZTQRzriZTjKvlIILZwKsno/edit?usp=sharing Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221233#comment-14221233 ] Aaron Staple edited comment on SPARK-1503 at 11/21/14 7:03 PM: --- [~mengxr] Sorry for the delay. I wrote up a design proposal for the initial implementation. Let me know what you think, and if you'd like me to clarify anything. https://docs.google.com/document/d/1L50O66LnBfVopFjptbet2ZTQRzriZTjKvlIILZwKsno/edit?usp=sharing was (Author: staple): [~mengxr] Sorry for the delay. I wrote up a design proposal for the initial implementation. Let me know what you think, if you'd like me to clarify anything. https://docs.google.com/document/d/1L50O66LnBfVopFjptbet2ZTQRzriZTjKvlIILZwKsno/edit?usp=sharing Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221233#comment-14221233 ] Aaron Staple edited comment on SPARK-1503 at 11/21/14 8:10 PM: --- [~mengxr] Sorry for the delay. I wrote up a design proposal for the initial implementation. Let me know what you think, and if you'd like me to clarify anything. UPDATE: On second thought, I'd actually like to make a few changes to the proposal. I'll follow up tomorrow with the updated version. Sorry for the confusion. was (Author: staple): [~mengxr] Sorry for the delay. I wrote up a design proposal for the initial implementation. Let me know what you think, and if you'd like me to clarify anything. UPDATE: On second thought, I'd actually like to make a few changes to the proposal. I'll follow up tomorrow with the updated version. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221233#comment-14221233 ] Aaron Staple edited comment on SPARK-1503 at 11/21/14 8:09 PM: --- [~mengxr] Sorry for the delay. I wrote up a design proposal for the initial implementation. Let me know what you think, and if you'd like me to clarify anything. UPDATE: On second thought, I'd actually like to make a few changes to the proposal. I'll follow up tomorrow with the updated version. was (Author: staple): [~mengxr] Sorry for the delay. I wrote up a design proposal for the initial implementation. Let me know what you think, and if you'd like me to clarify anything. https://docs.google.com/document/d/1L50O66LnBfVopFjptbet2ZTQRzriZTjKvlIILZwKsno/edit?usp=sharing Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167868#comment-14167868 ] Aaron Staple commented on SPARK-1503: - [~mengxr] Thanks for the heads up! I’ll definitely go through TFOCS and am happy to work carefully and collaboratively on design. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166168#comment-14166168 ] Aaron Staple commented on SPARK-1503: - Hi, I’d like to try working on this ticket. If you’d like to assign it to me, I can write a short spec and then work on a PR. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-546) Support full outer join and multiple join in a single shuffle
[ https://issues.apache.org/jira/browse/SPARK-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148052#comment-14148052 ] Aaron Staple commented on SPARK-546: Hi, I think there are two features requested in this ticket: 1) full outer join 2) an RDD function to join >2 rdds in a single shuffle (e.g. multiJoin function) I’ve implemented #1 in my recent PR, but not #2. I’m happy to implement #2 as well though. Would it make sense to reopen this ticket? File a new ticket? Support full outer join and multiple join in a single shuffle - Key: SPARK-546 URL: https://issues.apache.org/jira/browse/SPARK-546 Project: Spark Issue Type: Improvement Components: Spark Core, Streaming Reporter: Reynold Xin Assignee: Aaron Staple Fix For: 1.2.0 RDD[(K,V)] now supports left/right outer join but not full outer join. Also it'd be nice to provide a way for users to join multiple RDDs on the same key in a single shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
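[Editor's note] A sketch of what a multiJoin-style helper for item #2 could look like, built on the existing cogroup (which already shuffles all inputs together once). The name multiJoin is hypothetical, not an existing Spark API, and this version is a three-way inner join only:
{noformat}
// Hypothetical helper, not an existing Spark API: inner-join three RDDs on the same key
// using one cogroup, i.e. a single shuffle.
import scala.reflect.ClassTag
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def multiJoin[K: ClassTag, A: ClassTag, B: ClassTag, C: ClassTag](
    a: RDD[(K, A)], b: RDD[(K, B)], c: RDD[(K, C)]): RDD[(K, (A, B, C))] =
  a.cogroup(b, c).flatMap { case (k, (as, bs, cs)) =>
    // emit the cross product of matching values, as a regular join would
    for (x <- as; y <- bs; z <- cs) yield (k, (x, y, z))
  }
{noformat}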
[jira] [Resolved] (SPARK-3550) Disable automatic rdd caching in python api for relevant learners
[ https://issues.apache.org/jira/browse/SPARK-3550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Staple resolved SPARK-3550. - Resolution: Fixed Disable automatic rdd caching in python api for relevant learners - Key: SPARK-3550 URL: https://issues.apache.org/jira/browse/SPARK-3550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Aaron Staple The python mllib api automatically caches training rdds. However, the NaiveBayes, ALS, and DecisionTree learners do not require external caching to prevent repeated RDD re-evaluation during learning. NaiveBayes only evaluates its input RDD once, while ALS and DecisionTree internally persist transformations of their input RDDs. For these learners, we should disable the automatic caching in the python mllib api. See discussion here: https://github.com/apache/spark/pull/2362#issuecomment-55637953 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3550) Disable automatic rdd caching in python api for relevant learners
[ https://issues.apache.org/jira/browse/SPARK-3550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148347#comment-14148347 ] Aaron Staple commented on SPARK-3550: - This has been addressed in another commit: https://github.com/apache/spark/commit/fce5e251d636c788cda91345867e0294280c074d See comment here: https://github.com/apache/spark/pull/2412#issuecomment-56865408 Disable automatic rdd caching in python api for relevant learners - Key: SPARK-3550 URL: https://issues.apache.org/jira/browse/SPARK-3550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Aaron Staple The python mllib api automatically caches training rdds. However, the NaiveBayes, ALS, and DecisionTree learners do not require external caching to prevent repeated RDD re-evaluation during learning. NaiveBayes only evaluates its input RDD once, while ALS and DecisionTree internally persist transformations of their input RDDs. For these learners, we should disable the automatic caching in the python mllib api. See discussion here: https://github.com/apache/spark/pull/2362#issuecomment-55637953 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3488) cache deserialized python RDDs before iterative learning
[ https://issues.apache.org/jira/browse/SPARK-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Staple resolved SPARK-3488. - Resolution: Won't Fix cache deserialized python RDDs before iterative learning Key: SPARK-3488 URL: https://issues.apache.org/jira/browse/SPARK-3488 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Aaron Staple When running an iterative learning algorithm, it makes sense that the input RDD be cached for improved performance. When learning is applied to a python RDD, currently the python RDD is always cached, then in scala that cached RDD is mapped to an uncached deserialized RDD, and the uncached RDD is passed to the learning algorithm. Instead the deserialized RDD should be cached. This was originally discussed here: https://github.com/apache/spark/pull/2347#issuecomment-55181535 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3550) Disable automatic rdd caching in python api for relevant learners
Aaron Staple created SPARK-3550: --- Summary: Disable automatic rdd caching in python api for relevant learners Key: SPARK-3550 URL: https://issues.apache.org/jira/browse/SPARK-3550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Aaron Staple The python mllib api automatically caches training rdds. However, the NaiveBayes, ALS, and DecisionTree learners do not require external caching to prevent repeated RDD re-evaluation during learning. NaiveBayes only evaluates its input RDD once, while ALS and DecisionTree internally persist transformations of their input RDDs. For these learners, we should disable the automatic caching in the python mllib api. See discussion here: https://github.com/apache/spark/pull/2362#issuecomment-55637953 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3488) cache deserialized python RDDs before iterative learning
[ https://issues.apache.org/jira/browse/SPARK-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Staple updated SPARK-3488: Component/s: PySpark cache deserialized python RDDs before iterative learning Key: SPARK-3488 URL: https://issues.apache.org/jira/browse/SPARK-3488 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Aaron Staple When running an iterative learning algorithm, it makes sense that the input RDD be cached for improved performance. When learning is applied to a python RDD, currently the python RDD is always cached, then in scala that cached RDD is mapped to an uncached deserialized RDD, and the uncached RDD is passed to the learning algorithm. Instead the deserialized RDD should be cached. This was originally discussed here: https://github.com/apache/spark/pull/2347#issuecomment-55181535 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3488) cache deserialized python RDDs before iterative learning
[ https://issues.apache.org/jira/browse/SPARK-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135613#comment-14135613 ] Aaron Staple commented on SPARK-3488: - After further discussion it's been decided that, for now, the present implementation’s reduced memory footprint for cached rdds is worth the cpu cost of repeated deserialization during learning. See discussion https://github.com/apache/spark/pull/2362#issuecomment-2191 cache deserialized python RDDs before iterative learning Key: SPARK-3488 URL: https://issues.apache.org/jira/browse/SPARK-3488 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Aaron Staple When running an iterative learning algorithm, it makes sense that the input RDD be cached for improved performance. When learning is applied to a python RDD, currently the python RDD is always cached, then in scala that cached RDD is mapped to an uncached deserialized RDD, and the uncached RDD is passed to the learning algorithm. Instead the deserialized RDD should be cached. This was originally discussed here: https://github.com/apache/spark/pull/2347#issuecomment-55181535 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-695) Exponential recursion in getPreferredLocations
[ https://issues.apache.org/jira/browse/SPARK-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081669#comment-14081669 ] Aaron Staple commented on SPARK-695: Progress has been made on a PR here: https://github.com/apache/spark/pull/1362 Exponential recursion in getPreferredLocations -- Key: SPARK-695 URL: https://issues.apache.org/jira/browse/SPARK-695 Project: Spark Issue Type: Bug Reporter: Matei Zaharia This was reported to happen in DAGScheduler for graphs with many paths from the root up, though I haven't yet found a good test case for it. -- This message was sent by Atlassian JIRA (v6.2#6252)
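For context, a hedged sketch of the general fix (the Node type below is a simplified stand-in, not the DAGScheduler's real structures): a DAG with many paths to the same ancestor makes a naive recursive preferred-location search revisit that ancestor once per path, which is exponential in the depth; tracking visited nodes bounds the work to one visit each.
{noformat}
import scala.collection.mutable

// Simplified stand-in for an RDD lineage node.
case class Node(id: Int, parents: Seq[Node], locations: Seq[String])

def preferredLocations(root: Node): Seq[String] = {
  val visited = mutable.HashSet.empty[Int]
  def visit(n: Node): Seq[String] = {
    if (!visited.add(n.id)) Nil                 // already explored via another path
    else if (n.locations.nonEmpty) n.locations  // found explicit placement
    else n.parents.flatMap(visit)               // otherwise fall back to ancestors
  }
  visit(root).distinct
}
{noformat}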
[jira] [Created] (SPARK-2781) Analyzer should check resolution of LogicalPlans
Aaron Staple created SPARK-2781: --- Summary: Analyzer should check resolution of LogicalPlans Key: SPARK-2781 URL: https://issues.apache.org/jira/browse/SPARK-2781 Project: Spark Issue Type: Bug Components: SQL Reporter: Aaron Staple Currently the Analyzer's CheckResolution rule checks that all attributes are resolved by searching for unresolved Expressions. But some LogicalPlans, including Union, contain custom implementations of the resolved attribute that validate other criteria in addition to checking for attribute resolution of their descendants. These LogicalPlans are not currently validated by the CheckResolution implementation. As a result, it is currently possible to execute a query generated from unresolved LogicalPlans. One example is a UNION query that produces rows with different data types in the same column:
{noformat}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class T1(value: Seq[Int])
val t1 = sc.parallelize(Seq(T1(Seq(0, 1))))
t1.registerAsTable("t1")
sqlContext.sql("SELECT value FROM t1 UNION SELECT 2 FROM t1").collect()
{noformat}
In this example, the type coercion implementation cannot unify array and integer types. One row contains an array in the returned column and the other row contains an integer. The result is: {noformat} res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2]) {noformat} I believe fixing this is a first step toward improving validation for Union (and similar) plans. (For instance, Union does not currently validate that its children contain the same number of columns.) -- This message was sent by Atlassian JIRA (v6.2#6252)
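A hedged sketch of the proposed check (simplified error handling; the real rule lives in Catalyst's Analyzer and would use its own error types): in addition to searching for unresolved Expressions, fail analysis whenever any LogicalPlan node reports resolved == false, which would catch a Union whose children disagree on column types.
{noformat}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// A plain RuntimeException is used here only for brevity.
def checkResolution(plan: LogicalPlan): LogicalPlan = {
  plan.foreach { node =>
    if (!node.resolved) {
      throw new RuntimeException(s"Unresolved plan node found: $node")
    }
  }
  plan
}
{noformat}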
[jira] [Commented] (SPARK-2781) Analyzer should check resolution of LogicalPlans
[ https://issues.apache.org/jira/browse/SPARK-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081833#comment-14081833 ] Aaron Staple commented on SPARK-2781: - No problem, I think the current validation checks Expressions, but there are some cases where LogicalPlans might not be resolved even though their Expressions are resolved. Analyzer should check resolution of LogicalPlans Key: SPARK-2781 URL: https://issues.apache.org/jira/browse/SPARK-2781 Project: Spark Issue Type: Bug Components: SQL Reporter: Aaron Staple Assignee: Michael Armbrust Fix For: 1.0.1, 1.1.0 Currently the Analyzer's CheckResolution rule checks that all attributes are resolved by searching for unresolved Expressions. But some LogicalPlans, including Union, contain custom implementations of the resolved attribute that validate other criteria in addition to checking for attribute resolution of their descendants. These LogicalPlans are not currently validated by the CheckResolution implementation. As a result, it is currently possible to execute a query generated from unresolved LogicalPlans. One example is a UNION query that produces rows with different data types in the same column:
{noformat}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class T1(value: Seq[Int])
val t1 = sc.parallelize(Seq(T1(Seq(0, 1))))
t1.registerAsTable("t1")
sqlContext.sql("SELECT value FROM t1 UNION SELECT 2 FROM t1").collect()
{noformat}
In this example, the type coercion implementation cannot unify array and integer types. One row contains an array in the returned column and the other row contains an integer. The result is: {noformat} res3: Array[org.apache.spark.sql.Row] = Array([List(0, 1)], [2]) {noformat} I believe fixing this is a first step toward improving validation for Union (and similar) plans. (For instance, Union does not currently validate that its children contain the same number of columns.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2314) RDD actions are only overridden in Scala, not java or python
[ https://issues.apache.org/jira/browse/SPARK-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14074836#comment-14074836 ] Aaron Staple commented on SPARK-2314: - Hi, I added a PR that handles overriding collect and take in Python, and count in Java: https://github.com/apache/spark/pull/1592 RDD actions are only overridden in Scala, not java or python Key: SPARK-2314 URL: https://issues.apache.org/jira/browse/SPARK-2314 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0, 1.0.1 Reporter: Michael Armbrust Assignee: Aaron Staple Labels: starter Fix For: 1.1.0, 1.0.2 For example, collect() and take(). We should keep these two in sync, or move this code to SchemaRDDLike if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
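One possible shape of the "move this code to SchemaRDDLike" option, sketched with simplified stand-in type names rather than Spark SQL's actual hierarchy: implement the actions once in a shared trait and have the language-specific wrappers delegate to it, so the Scala, Java, and Python paths cannot drift apart.
{noformat}
// Type names below are illustrative stand-ins, not Spark SQL's real classes.
trait SchemaRddActions[Row] {
  // Implemented once against the query plan; language-specific wrappers only delegate.
  protected def executeCollect(): Array[Row]
  protected def executeTake(n: Int): Array[Row]

  def collect(): Array[Row] = executeCollect()
  def take(n: Int): Array[Row] = executeTake(n)
}
{noformat}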
[jira] [Created] (SPARK-2581) complete or withdraw visitedStages optimization in DAGScheduler’s stageDependsOn
Aaron Staple created SPARK-2581: --- Summary: complete or withdraw visitedStages optimization in DAGScheduler’s stageDependsOn Key: SPARK-2581 URL: https://issues.apache.org/jira/browse/SPARK-2581 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Aaron Staple Priority: Minor Right now the visitedStages HashSet is populated with stages, but never queried to limit examination of previously visited stages. It may make sense to check whether a mapStage has been visited previously before visiting it again, as in the nearby visitedRdds check. Or it may be that the existing visitedRdds check sufficiently optimizes this function, and visitedStages can simply be removed. See discussion here: https://github.com/apache/spark/pull/1362#discussion-diff-15018046L1107 -- This message was sent by Atlassian JIRA (v6.2#6252)
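A minimal sketch of the "complete the optimization" option, using simplified stand-in types rather than the scheduler's real Stage/RDD structures: consult the visited set before expanding a stage, mirroring the existing visitedRdds check, so the same stage is not examined once per referencing path.
{noformat}
import scala.collection.mutable

// Simplified stand-in for a stage and its parent stages.
case class StageSketch(id: Int, parents: Seq[StageSketch])

def stageDependsOn(stage: StageSketch, target: StageSketch): Boolean = {
  val visitedStages = mutable.HashSet.empty[Int]
  def visit(s: StageSketch): Boolean = {
    if (s.id == target.id) true
    else if (!visitedStages.add(s.id)) false   // already examined via another path
    else s.parents.exists(visit)
  }
  visit(stage)
}
{noformat}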