GitHub user nkronenfeld opened a pull request:

    https://github.com/apache/spark/pull/3570

    Clear local copies of accumulators as soon as we're done with them

    I suspect this PR is not really quite ready for merging, but this seemed 
the best way of getting a few comments on it.
    
    A little explanation:
    I'm working with some rather large accumulators.  I've been asked why not 
just use a reducer, and the answer is, that when this works, it works about 10x 
as fast.
    
    And so far, when it doesn't work, it seems improvable.  This is my first 
attempt at such improvement.
    
    As far as I can tell, on each worker, when each time a task is run, the 
following happens:
    1. The global accumulator object is cleared on that thread
    2. The current individual accumulators are registered on that thread (as 
copies)
    3. When the task completes, the accumulator values are collected and 
returned.
    
    Note that when this is done, nothing is cleared.  This means, if one is 
using large accumulators, those accumulators just use up memory uselessly until 
the next task is run on that thread.
    
    And if a thread somehow dies, that memory is used up and can't be retrieved.
    
    And, as far as I can tell, that local thread value will never be used again.
    
    All I've done so far is to clear local values for that thread as it returns 
them.
    
    I would like to do 2 more things:
    1. Reduce this to a two-step process: register, and return, both clearing 
values
    2. Put in something to make sure values are cleared if a task dies (or a 
thread too, if that is possible)
    
    This seems so simple so far, though, I was hoping that I could get some 
confirmation that I understood things correctly before going on.
    
    I don't have a PR or JIRA issue on this so far - the issues I see are a bit 
hard to pin down.  When I run my application without this change, it works, but 
is very fragile - some tasks take 6 seconds, some 4 minutes.  When I implement 
this, everything seems to run smoothly, and I stop getting random, large task 
times.
    
    This is somewhat related to JIRA issue SPARK-664 in that they both deal 
with accumulator performance, but it doesn't really address that directly.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nkronenfeld/spark-1 Accumulator-Improvements

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3570.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3570
    
----
commit 39a82f2b3ccf68a9a4d1abb0561b5680278a2610
Author: Nathan Kronenfeld <nkronenf...@oculusinfo.com>
Date:   2014-12-03T04:11:11Z

    Clear local copies of accumulators as soon as we're done with them

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to