[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-31 Thread witgo
Github user witgo closed the pull request at:

https://github.com/apache/spark/pull/828


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-31 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-44742037
  
This solution is not perfect, so I am closing it temporarily. The new pull request is #929.




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-25 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-44122991
  
I am using [the 
code](https://github.com/witgo/spark/compare/cleanup_checkpoint_date_als) to 
test ALS.
A brief description of the test:

| Item | Description |
| - | --- |
|cluster |`3 servers`,`36 core cpus`,`2.5T HDD`,`120G memory`|
|data| `700 million`|
|code|`val model = ALS.trainImplicit(ratings, 25, 30, 0.065, -1, 40.0)`|
|time|`18.7 h`|
|shuffle write| `4.72T`|
|largest local dir|`200G`|
|checkpoint  dir|`16.6G`|

@mengxr
When checkpoint is used, ALS appears to be a lot slower.
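
For scale, the two timings reported in this thread can be compared directly. The sketch below is only arithmetic on the figures quoted in the tables (18.7 h for this run, 12.5 h for the cachePoint run reported on 2014-05-21). It assumes the two runs are otherwise comparable, as the identical cluster and data rows suggest; the object and value names are made up for illustration.

```scala
// Arithmetic on the figures quoted in this thread; all names are illustrative only.
object AlsRunComparison {
  val checkpointHours = 18.7 // run using checkpoint (2014-05-25 comment)
  val cachePointHours = 12.5 // run using cachePoint (2014-05-21 comment)

  // Ratio of wall-clock times: a value above 1.0 means the checkpoint run was slower.
  val slowdown: Double = checkpointHours / cachePointHours

  // Extra wall-clock hours spent by the checkpoint run.
  val extraHours: Double = checkpointHours - cachePointHours

  def main(args: Array[String]): Unit = {
    println(f"slowdown: $slowdown%.2fx, extra time: $extraHours%.1f h")
  }
}
```

On these numbers the checkpoint run took roughly 1.5 times as long as the cachePoint run.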




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-21 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43790944
  
@mateiz, @mengxr 
I am using [the code](https://github.com/witgo/spark/compare/cachePoint) to 
test ALS.
A brief description of the test:

| Item | Description |
| - | --- |
|cluster |`3 servers`,`36 core cpus`,`2.5T HDD`,`120G memory`|
|data| `700 million`|
|code|`val model = ALS.trainImplicit(ratings, 25, 30, 0.065, -1, 40.0)`|
|time|`12.5 h`|
|shuffle write| `4.72T`|
|largest local dir|`200G`|
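
For readers unfamiliar with the positional arguments in the `code` row, here is a minimal sketch that names them. It assumes the MLlib overload `ALS.trainImplicit(ratings, rank, iterations, lambda, blocks, alpha)`; the object name `AlsRunConfig` and the `train` wrapper are hypothetical and not part of the pull request.

```scala
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper that only names the arguments used in the test above.
object AlsRunConfig {
  val rank = 25       // number of latent factors
  val iterations = 30 // number of ALS iterations
  val lambda = 0.065  // regularization parameter
  val blocks = -1     // -1 lets MLlib pick the number of blocks automatically
  val alpha = 40.0    // confidence weight applied to implicit feedback

  // Same call as in the table, with the positional arguments spelled out.
  def train(ratings: RDD[Rating]): MatrixFactorizationModel =
    ALS.trainImplicit(ratings, rank, iterations, lambda, blocks, alpha)
}
```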




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-21 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43825145
  
I don't think this cachePoint is a good idea at all. While it *can* give 
better performance, it fundamentally breaks the fault-tolerance properties of 
RDDs. If you cachePoint() an RDD with MEMORY_ONLY and the executor then dies, 
you have no way to recover the lost partitions, as there is no lineage 
information on how that RDD was created. All Spark operations maintain this 
guarantee of fault tolerance despite failed workers, and breaking it is a bad 
idea. So this is a fundamentally unsafe operation to expose to the end user.

In fact, this is the same reason why checkpoint() has been implemented using 
HDFS: the fault-tolerance property is maintained (the data is saved to 
fault-tolerant storage) even if executors die. 

That said, there is a good middle ground here. We can do what 
cachePoint() does while ensuring that the data is replicated across the 
executors (for a better fault-tolerance guarantee), but not expose it to users 
(so that it does not break public API semantics). This would be an ALS-only solution.
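
To make the trade-off concrete, here is a minimal sketch using only existing public Spark API. It is not the cachePoint proposal, and it is not the ALS-internal change suggested above. It contrasts `checkpoint()`, which writes to the checkpoint directory and truncates lineage, with a replicated `persist`, which keeps lineage and stores two copies of each partition. The object name, the local master, and the temporary checkpoint directory are assumptions made for illustration; a real cluster would point the checkpoint directory at HDFS.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Illustrative sketch only; names and settings are assumptions, not part of the pull request.
object CheckpointVsReplicatedCache {
  // Returns (whether the first RDD was checkpointed, replication factor of the second RDD).
  def run(): (Boolean, Int) = {
    val conf = new SparkConf()
      .setAppName("checkpoint-vs-replicated-cache")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Temporary local directory for this sketch; on a cluster this would be an HDFS path.
      sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)

      val base = sc.parallelize(1 to 1000, 4)

      // Option 1: checkpoint(). The data is written to the checkpoint directory and
      // the lineage is truncated, so recovery reads the saved files.
      val checkpointed = base.map(_ * 2)
      checkpointed.cache()      // avoids computing the RDD twice when the checkpoint is written
      checkpointed.checkpoint() // must be called before the first action on this RDD
      checkpointed.count()      // the action triggers the checkpoint write

      // Option 2: replicated persist. Lineage is kept, and each partition is stored
      // on two block managers when more than one executor is available.
      val replicated = base.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK_2)
      replicated.count()

      (checkpointed.isCheckpointed, replicated.getStorageLevel.replication)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

In local mode there is only one block manager, so the second copy cannot actually be placed; the sketch shows the API shape rather than real replication.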





[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-21 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43840745
  
@tdas 
You're right. The code breaks the fault-tolerance properties of RDDs.
The ideal solution is automatic cleanup and rebuilding of shuffle data.




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-20 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43589674
  
[The code](https://github.com/witgo/spark/commit/6d7f2408a40bf4bb2889bf66fa61bced782cdefc#diff-2b593e0b4bd6eddab37f04968baa826c)
will make the checkpoint directory larger and is not clear.




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-20 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43608181
  
@mateiz @mengxr  
I added a new RDD operation, `cachePoint`.




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-20 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43656940
  
Another [solution](https://github.com/witgo/spark/compare/cachePoint).




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43524803
  
Can one of the admins verify this patch?




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-19 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43564458
  
Jenkins, test this please




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43564725
  
Merged build started. 




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43564707
  
 Merged build triggered. 




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43569848
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15084/




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43569842
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-19 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43575644
  
@witgo Could you check how long a simple `model.predict(user, product)` call 
takes when checkpoint is used, compared to when the data is cached in memory?
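
One way to take such a measurement is a small wall-clock timing helper. The sketch below is plain Scala with no Spark dependency; the `PredictTiming` object and the `timed` helper are hypothetical names, and the workload in `main` is only a stand-in for a `model.predict(user, product)` call on a trained model.

```scala
// Illustrative timing helper; names are hypothetical and not part of the pull request.
object PredictTiming {
  // Runs `body` once and returns its result with the elapsed wall-clock time in milliseconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload; with a trained model this would be model.predict(user, product).
    val (value, ms) = timed { (1 to 1000000).map(_.toLong).sum }
    println(f"result=$value, elapsed=$ms%.3f ms")
  }
}
```

A single call is noisy, so repeating the measurement for both the checkpointed and the in-memory case and comparing typical values would give a fairer picture.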




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-19 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43581291
  
@tdas CheckpointRDD is not properly cleaned up.




[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-19 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43581755
  
@mateiz Why must the checkpoint data be written to the file system?





[GitHub] spark pull request: [WIP]Improve ALS algorithm resource usage

2014-05-19 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/828#issuecomment-43583620
  
@mateiz It is not necessary to write it to the file system. After all, 
no other RDD reads it. I think the checkpoint data should be put 
into the blockManager, so that performance will be much higher.

