GitHub user mpjlu opened a pull request:

    https://github.com/apache/spark/pull/17742

    [Spark-20446][ML][MLLIB]Optimize MLLIB ALS recommendForAll

    ## What changes were proposed in this pull request?
    
    The recommendForAll method of MLlib ALS is very slow, and GC is a key
    problem of the current implementation. Each task uses the following code
    to hold its temporary result:
    val output = new Array[(Int, (Int, Double))](m * n)
    where m = n = 4096 (the default value; there is no way to configure it),
    so output takes about 4k * 4k * (4 + 4 + 8) bytes = 256 MB. This large
    allocation causes serious GC pressure and frequently leads to OOM.
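
    As a rough sketch of that arithmetic (not the actual patch code; the
    4 + 4 + 8 bytes per entry is only the payload estimate from the
    description above, and real JVM object overhead makes the buffer even
    larger):

        // Rough size of the per-task temporary buffer in the old code path:
        // one (Int, (Int, Double)) entry per (user, item) pair in a block pair.
        val m = 4096                      // user block size (default, not configurable)
        val n = 4096                      // item block size (default, not configurable)
        val bytesPerEntry = 4 + 4 + 8     // Int + Int + Double payload
        val approxBytes = m.toLong * n * bytesPerEntry
        println(f"~${approxBytes / (1024.0 * 1024.0)}%.0f MB")  // prints "~256 MB"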
    
    Actually, we don't need to keep all of the temporary results. Suppose we
    recommend the top K (K is about 10 or 20) products for each user; then we
    only need about 4k * K * (4 + 4 + 8) bytes to hold the temporary result,
    as sketched below.
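
    A minimal sketch of the top-K idea, assuming the candidate scores arrive
    as (userId, (itemId, score)) pairs; this is not the patch itself (the
    patch works on factor blocks), just an illustration of keeping a size-K
    queue per user instead of the full m * n buffer:

        import scala.collection.mutable

        // Keep only the K best (itemId, score) pairs per user.
        def topKPerUser(
            scores: Iterator[(Int, (Int, Double))],
            k: Int): Map[Int, Array[(Int, Double)]] = {
          // Reversed ordering so the queue head is the smallest kept score.
          val minFirst = Ordering.by[(Int, Double), Double](_._2).reverse
          val queues = mutable.Map.empty[Int, mutable.PriorityQueue[(Int, Double)]]

          scores.foreach { case (user, (item, score)) =>
            val q = queues.getOrElseUpdate(
              user, mutable.PriorityQueue.empty[(Int, Double)](minFirst))
            if (q.size < k) {
              q.enqueue((item, score))
            } else if (q.head._2 < score) {   // beats the current minimum: replace it
              q.dequeue()
              q.enqueue((item, score))
            }
          }
          // Each user's items, sorted by descending score.
          queues.map { case (user, q) =>
            user -> q.toArray.sortBy { case (_, s) => -s }
          }.toMap
        }

    With K around 10 or 20, the per-user state stays tiny, which is what
    drives the memory savings described above.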
    
    The test environment:
    3 workers, each with 10 cores, 30 GB of memory, and 1 executor.
    The data: 480,000 users and 17,000 items.
    
    BlockSize:       1024   2048   4096   8192
    Old method:      245s   332s   488s   OOM
    This solution:   121s   118s   117s   120s
    
    
    
    ## How was this patch tested?
    Existing unit tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mpjlu/spark OptimizeAls

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17742.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17742
    
----
commit 14cdbf63e79ebcf2d1207c79b0b4ba73e15729b2
Author: Peng <peng.m...@intel.com>
Date:   2017-04-24T08:32:16Z

    Optimize ALS recommendForAll

----


