I have promoted https://issues.apache.org/jira/browse/SPARK-9202 to a blocker to ensure that we get a fix for it before 1.5.0. I'm pretty swamped with other tasks for the next few days, but I'd be happy to shepherd a bugfix patch for this (it should be pretty straightforward, and the JIRA ticket contains a sketch of how I'd do it).
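As context for the kind of fix under discussion, here is a minimal, hypothetical sketch of what capping retained finished executors in the Worker could look like. The FinishedExecutorLog class, the retainedExecutors parameter, and the trimming logic are illustrative assumptions, not the actual patch or the ticket's sketch:

    import scala.collection.mutable

    // Simplified stand-in for the Worker's bookkeeping of finished
    // executors; the real map lives in
    // org.apache.spark.deploy.worker.Worker.
    case class ExecutorSummary(appId: String, execId: Int, state: String)

    class FinishedExecutorLog(retainedExecutors: Int) {
      // LinkedHashMap preserves insertion order, so the oldest entries
      // are the first ones seen when trimming.
      private val finished =
        mutable.LinkedHashMap.empty[String, ExecutorSummary]

      def add(fullId: String, summary: ExecutorSummary): Unit = {
        finished(fullId) = summary
        // Drop the oldest entries once the cap is exceeded, so the map
        // can no longer grow without bound on a long-lived Worker.
        if (finished.size > retainedExecutors) {
          val excess = finished.size - retainedExecutors
          finished.keys.take(excess).toList.foreach(finished.remove)
        }
      }

      // Read path, analogous to what the Worker web UI and REST API use.
      def snapshot: Seq[ExecutorSummary] = finished.values.toSeq
    }

In a real patch the cap would presumably come from SparkConf with a sensible default (something like a spark.worker.ui.retainedExecutors setting, to speculate on a name) rather than being hard-wired by the caller.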
On Mon, Jul 20, 2015 at 12:37 PM, Richard Marscher <rmarsc...@localytics.com> wrote:

> Hi,
>
> Thanks for the follow-up. You are right regarding the invalidation of
> observation #2. I later realized the Worker UI page directly displays
> the entries in the executors map, and I can see in our production UI
> that it's in a proper state.
>
> As for the Killed vs. Exited question, it's less relevant now that the
> theory about the executors map has been invalidated. However, to answer
> your question: the current setup is that the SparkContext lifecycle
> encapsulates exactly one application. That is, we create a single
> context per application submitted and close it upon success/failure
> completion of the application.
>
> Thanks,
>
> On Mon, Jul 20, 2015 at 3:20 PM, Josh Rosen <joshro...@databricks.com>
> wrote:
>
>> Hi Richard,
>>
>> Thanks for your detailed investigation of this issue. I agree with
>> your observation that the finishedExecutors hashmap is a source of
>> memory leaks for very-long-lived clusters. It looks like the
>> finishedExecutors map is only read when rendering the Worker web UI
>> and when constructing REST API responses. I think that we could
>> address this leak by adding a configuration to cap the maximum number
>> of retained executors, applications, etc. We already have similar
>> caps in the driver UI. If we add this configuration, I think that we
>> should pick some sensible default value rather than an unlimited one.
>> This is technically a user-facing behavior change, but I think it's
>> okay since the current behavior is to crash / OOM.
>>
>> Regarding `KillExecutor`, I think that there might be some asynchrony
>> and indirection masking the cleanup here. Based on a quick glance
>> through the code, it looks like ExecutorRunner's thread will send an
>> ExecutorStateChanged RPC back to the Worker after the executor is
>> killed, so I think that the cleanup will be triggered by that RPC.
>> Since this isn't clear from reading the code, though, it would be
>> great to add some comments to the code to explain this, plus a unit
>> test to make sure that this indirect cleanup mechanism isn't broken
>> in the future.
>>
>> I'm not sure what's causing the Killed vs. Exited issue, but I have
>> one theory: does the behavior vary based on whether your application
>> cleanly shuts down the SparkContext via SparkContext.stop()? It's
>> possible that omitting the stop() could lead to a "Killed" exit
>> status, but I don't know for sure. (This could probably also be
>> clarified with a unit test.)
>>
>> To my knowledge, the spark-perf suite does not contain the sort of
>> scale-testing workload that would expose these types of memory leaks;
>> we have some tests for very long-lived individual applications, but
>> not tests for long-lived clusters that run thousands of applications
>> between restarts. I'm going to create some tickets to add such tests.
>>
>> I've filed https://issues.apache.org/jira/browse/SPARK-9202 to follow
>> up on the finishedExecutors leak.
>>
>> - Josh
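As a concrete reference for the stop() question above, a context-per-application lifecycle like the one described in this thread typically looks something like the sketch below. The app name, master URL, and job body are placeholders, and whether omitting stop() actually produces the "Killed" status is, per the thread, unverified:

    import org.apache.spark.{SparkConf, SparkContext}

    object SingleAppLifecycle {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("example-app")              // placeholder
          .setMaster("spark://master-host:7077")  // placeholder standalone URL
        val sc = new SparkContext(conf)
        try {
          // One SparkContext per submitted application; the job body
          // here is a trivial stand-in.
          sc.parallelize(1 to 100).map(_ * 2).count()
        } finally {
          // Stopping the context on both success and failure is the
          // "clean shutdown" case asked about above; skipping it is the
          // case that might yield a "Killed" final state.
          sc.stop()
        }
      }
    }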
>> On Mon, Jul 20, 2015 at 9:56 AM, Richard Marscher
>> <rmarsc...@localytics.com> wrote:
>>
>>> Hi,
>>>
>>> We have been experiencing issues in production over the past couple
>>> of weeks with Spark Standalone Worker JVMs that appear to have
>>> memory leaks. They accumulate Old Gen until it reaches the max, then
>>> enter a failed state that starts critically failing some
>>> applications running against the cluster.
>>>
>>> I've done some exploration of the Spark code base related to the
>>> Worker in search of potential sources of this problem, and I'm
>>> looking for commentary on a couple of theories I have:
>>>
>>> Observation 1: The `finishedExecutors` HashMap seems to only
>>> accumulate new entries over time, unbounded. It only seems to be
>>> appended to and is never periodically purged or cleaned of older
>>> executors, in line with something like the worker cleanup scheduler.
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L473
>>>
>>> I feel somewhat confident that over time this will exhibit a "leak".
>>> I quote the word because it may be intentional to hold these
>>> references to support functionality, versus a true leak where you
>>> accidentally hold onto memory.
>>>
>>> Observation 2: I feel much less certain about this, but it seems
>>> like if the Worker is messaged with `KillExecutor` then it only
>>> kills the `ExecutorRunner` but does not clean it up from the
>>> executors map.
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L492
>>>
>>> I haven't been able to sort out whether I'm missing something
>>> indirect that cleans that executor from the map before or after the
>>> kill. However, if nothing does, then this map may be leaking
>>> references.
>>>
>>> One final observation, related to our production metrics rather than
>>> the codebase itself: we used to periodically see that our completed
>>> applications had the status of "Killed" instead of "Exited" for all
>>> executors. Now, however, we see every completed application finish
>>> with a state of "Killed" for all executors. I might speculatively
>>> correlate this with Observation 2 as a potential reason we have
>>> started seeing this issue more recently.
>>>
>>> We also have a larger and increasing workload over the past few
>>> weeks, and possibly code changes to the application description,
>>> that could be exacerbating these potential underlying issues. We run
>>> a lot of smaller applications per day, on the order of hundreds to
>>> maybe 1,000 applications per day with 16 executors per application.
>>>
>>> Thanks,
>>>
>>> --
>>> Richard Marscher
>>> Software Engineer
>>> Localytics
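To make the indirect cleanup path discussed in this thread concrete, here is a heavily simplified, self-contained model of the flow Josh describes; none of these types match the real Worker internals, and the real message passing is asynchronous RPC rather than a direct callback:

    import scala.collection.mutable

    object WorkerCleanupModel {
      sealed trait ExecutorState
      case object Running extends ExecutorState
      case object Killed extends ExecutorState

      // Toy stand-in for ExecutorRunner: killing it reports a state
      // change back to the Worker, as the real runner's thread does via
      // an ExecutorStateChanged RPC.
      final case class Runner(fullId: String) {
        def kill(onStateChanged: (String, ExecutorState) => Unit): Unit =
          onStateChanged(fullId, Killed)
      }

      val executors = mutable.HashMap.empty[String, Runner]
      val finishedExecutors = mutable.HashMap.empty[String, ExecutorState]

      // Analogue of handling a KillExecutor message: note the map entry
      // is *not* removed here...
      def handleKillExecutor(fullId: String): Unit =
        executors.get(fullId).foreach(_.kill(handleExecutorStateChanged))

      // ...it is removed here, when the state-change message arrives.
      def handleExecutorStateChanged(fullId: String,
                                     state: ExecutorState): Unit =
        executors.remove(fullId).foreach { _ =>
          finishedExecutors(fullId) = state
        }

      def main(args: Array[String]): Unit = {
        executors("app-1/0") = Runner("app-1/0")
        handleKillExecutor("app-1/0")
        assert(executors.isEmpty)
        assert(finishedExecutors("app-1/0") == Killed)
      }
    }

This also illustrates why the unit test suggested above is useful: nothing in handleKillExecutor itself makes the cleanup obvious from local reading.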