[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user witgo closed the pull request at: https://github.com/apache/spark/pull/828 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-44742037 This solution is not perfect, so I am closing it temporarily. The new pull request is #929.
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-44122991

I am using [the code](https://github.com/witgo/spark/compare/cleanup_checkpoint_date_als) to test ALS. A brief description of the test:

| Item | Description |
| - | --- |
| cluster | `3 servers`, `36 core cpus`, `2.5T HDD`, `120G memory` |
| data | `700 million` |
| code | `val model = ALS.trainImplicit(ratings, 25, 30, 0.065, -1, 40.0)` |
| time | `18.7 h` |
| shuffle write | `4.72T` |
| largest local dir | `200G` |
| checkpoint dir | `16.6G` |

@mengxr if checkpoint is used, ALS seemed a lot slower.
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43840745 @tdas You're right. The code breaks the fault-tolerance properties of RDDs. The ideal solution is automatic cleanup and rebuilding of shuffle data.
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43825145

I don't think this cachePoint is a good idea at all. While it *can* give better performance, it fundamentally breaks the fault-tolerance properties of RDDs. If you cachePoint() an RDD with MEMORY_ONLY and the executor then dies, you have no way to recover the lost partitions, because there is no lineage information about how that RDD was created. All Spark operations maintain this guarantee of fault tolerance despite failed workers, and breaking that is a bad idea. So this is a fundamentally unsafe operation to expose to the end user. In fact, this is the same reason why checkpoint() has been implemented using HDFS: the fault-tolerance property is maintained (data is saved to fault-tolerant storage) even if executors die.

That said, there is a good middle ground here. We can do what cachePoint() does while ensuring that the data is replicated within the executors (so a better fault-tolerance guarantee), but not expose it to users (so that it does not break public API semantics). This would be an ALS-only solution.
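To make the trade-off in the comment above concrete, here is a minimal toy model in plain Python. It is not Spark code and uses no Spark API: `ToyRDD`, `cache_on`, `truncate_lineage`, `kill_executor`, `build`, and `LostPartitionError` are hypothetical names invented for this sketch, and "executors" are just integers. It assumes, as the comment describes, that a cachePoint-style operation drops lineage after caching.

```python
from typing import Callable, Dict, List, Optional, Tuple

NUM_EXECUTORS = 3  # arbitrary size for this toy cluster


class LostPartitionError(RuntimeError):
    """Raised when a partition has no cached copy and no lineage to rebuild it."""


class ToyRDD:
    """Toy stand-in for a cached dataset; NOT a Spark class."""

    def __init__(self, compute: Callable[[int], List[int]], num_partitions: int) -> None:
        # Lineage: how to rebuild partition p. None means the lineage was cut.
        self.compute: Optional[Callable[[int], List[int]]] = compute
        self.num_partitions = num_partitions
        # Cached blocks, keyed by (partition, executor).
        self.blocks: Dict[Tuple[int, int], List[int]] = {}

    def cache_on(self, placement: Callable[[int], List[int]]) -> None:
        """Compute every partition and store a copy on each executor chosen by placement."""
        assert self.compute is not None, "cannot cache after lineage was cut"
        for p in range(self.num_partitions):
            data = self.compute(p)
            for executor in placement(p):
                self.blocks[(p, executor)] = list(data)

    def truncate_lineage(self) -> None:
        """Forget how the data was produced (the assumed cachePoint-style behavior)."""
        self.compute = None

    def kill_executor(self, executor: int) -> None:
        """Drop every block that lived on the failed executor."""
        self.blocks = {k: v for k, v in self.blocks.items() if k[1] != executor}

    def read(self, p: int) -> List[int]:
        """Return partition p from any surviving copy, else recompute it from lineage."""
        for (part, _executor), data in self.blocks.items():
            if part == p:
                return data
        if self.compute is None:
            raise LostPartitionError(f"partition {p} is lost and has no lineage")
        return self.compute(p)


def build(replicas: int, keep_lineage: bool) -> ToyRDD:
    """Cache 4 partitions with the given replication factor, optionally cutting lineage."""
    rdd = ToyRDD(lambda p: [p * 10 + i for i in range(3)], num_partitions=4)
    rdd.cache_on(lambda p: [(p + r) % NUM_EXECUTORS for r in range(replicas)])
    if not keep_lineage:
        rdd.truncate_lineage()
    return rdd
```

In this model, `build(1, True)` followed by `kill_executor(0)` can still serve `read(0)` by recomputing from lineage, while `build(1, False)` raises `LostPartitionError` for the same read. `build(2, False)` survives the loss of executor 0 because a second copy exists, but fails once both holders of a partition are gone, which is consistent with the comment calling replication a better guarantee rather than a complete one.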
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43790944

@mateiz, @mengxr I am using [the code](https://github.com/witgo/spark/compare/cachePoint) to test ALS. A brief description of the test:

| Item | Description |
| - | --- |
| cluster | `3 servers`, `36 core cpus`, `2.5T HDD`, `120G memory` |
| data | `700 million` |
| code | `val model = ALS.trainImplicit(ratings, 25, 30, 0.065, -1, 40.0)` |
| time | `12.5 h` |
| shuffle write | `4.72T` |
| largest local dir | `200G` |
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43656940 Another [solution](https://github.com/witgo/spark/compare/cachePoint).
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43608181 @mateiz @mengxr I added a new RDD operation, `cachePoint`.
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43589674 [The code](https://github.com/witgo/spark/commit/6d7f2408a40bf4bb2889bf66fa61bced782cdefc#diff-2b593e0b4bd6eddab37f04968baa826c) will make the checkpoint directory larger and is not clear.
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43583620 @mateiz It is not necessary to write it to the file system. After all, no other RDD is reading it. I think the checkpoint data should be put into the blockManager, so performance will be much higher.
Re: [GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
That's more or less the definition of a checkpoint.

Sent from my iPhone

> On May 19, 2014, at 7:58 PM, witgo wrote:
>
> Github user witgo commented on the pull request:
>
> https://github.com/apache/spark/pull/828#issuecomment-43581755
>
> @mateiz Why the checkpoint data must be written to the file system?
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43581755 @mateiz Why must the checkpoint data be written to the file system?
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43581291 @tdas CheckpointRDD is not properly cleaned.
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43579554 @mengxr I was testing the changes. The environment is as follows: `700 million data`, `3 servers`, `36 core cpus`, `2.5T HDD`, `96G memory`.
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43575644 @witgo Could you check how long a simple `model.predict(user, product)` call takes when checkpoint is used, compared to when the data is cached in memory?
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43569842 Merged build finished. All automated tests passed.
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43569848 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15084/
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43564707 Merged build triggered.
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43564725 Merged build started.
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43564458 Jenkins, test this please
[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/828#issuecomment-43524803 Can one of the admins verify this patch?