[ https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17727190#comment-17727190 ]
Sean R. Owen commented on SPARK-43523:
--------------------------------------

OK, I don't think you've established that, but it doesn't really matter - do you have a change to propose?

> Memory leak in Spark UI
> -----------------------
>
>                 Key: SPARK-43523
>                 URL: https://issues.apache.org/jira/browse/SPARK-43523
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.4.4, 3.4.0
>            Reporter: Amine Bagdouri
>            Priority: Major
>         Attachments: spark_shell_oom.log, spark_ui_memory_leak.zip
>
>
> We have a distributed Spark application running on Azure HDInsight using Spark version 2.4.4.
>
> After a few days of active processing, we noticed that the driver's GC CPU time ratio was close to 100%. We suspected a memory leak, so we produced a heap dump and analyzed it with Eclipse Memory Analyzer.
>
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
> * The estimated retained heap size of String objects (~5M instances) is 3.3 GB. Most of these instances appear to correspond to Spark events.
> * The estimated retained size of Spark UI's AppStatusListener instance is 1.1 GB.
> * The number of LiveJob objects with status "RUNNING" is 18K, even though there should never be more than 16 running jobs, since we use a fixed-size thread pool of 16 threads to run Spark queries.
> * The number of LiveTask objects is 485K.
> * The AsyncEventQueue instance associated with the AppStatusListener reports a dropped-events count of 854 and a total-events count of 10001; note that the dropped-events counter is reset every minute and that the queue's default capacity is 10000.
>
> We think that there is a memory leak in Spark UI. Here is our analysis of the root cause of this leak:
> * AppStatusListener is notified of Spark events through a bounded queue in AsyncEventQueue.
> * AppStatusListener updates its state (kvstore, liveTasks, liveStages, liveJobs, ...) based on the received events. For example, onTaskStart adds a task to the liveTasks map and onTaskEnd removes it.
> * When the event rate is very high, the bounded queue in AsyncEventQueue fills up, and some events are dropped and never reach AppStatusListener.
> * Dropped events that signal the end of a processing unit prevent AppStatusListener's state from being cleaned up. For example, a dropped onTaskEnd event prevents the task from being removed from the liveTasks map, so the task remains in the heap until the driver's JVM is stopped.
>
> We were able to confirm our analysis by reducing the capacity of the AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After launching many Spark queries with this config, we observed that the number of active jobs in the Spark UI increased rapidly and remained high even though all submitted queries had completed. We also noticed that some executor task counters in the Spark UI were negative, which confirms that AppStatusListener's state does not accurately reflect reality and that it can be a victim of event drops.
>
> Suggested fix:
> There are already limits on the number of "dead" objects in AppStatusListener's maps (for example, spark.ui.retainedJobs). We suggest enforcing an additional configurable limit on the total number of objects in AppStatusListener's maps and kvstore. This would bound the leak under high event rates, although AppStatusListener's stats would remain inaccurate.
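For illustration, here is a minimal reproduction sketch following the setup described above: a local SparkSession with the listener-bus queue capacity lowered to 10 and many short jobs submitted from a fixed-size pool of 16 threads. The object name, job counts, and timeout are illustrative assumptions and are not taken from the ticket; the attached logs remain the authoritative reproduction.

{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
import org.apache.spark.sql.SparkSession

object ListenerQueueOverflowRepro {
  def main(args: Array[String]): Unit = {
    // Shrink the listener-bus event queue so listener events are dropped
    // under load (the default capacity is 10000; the ticket uses 10).
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("listener-queue-overflow-repro")
      .config("spark.scheduler.listenerbus.eventqueue.capacity", "10")
      .getOrCreate()
    val sc = spark.sparkContext

    // Submit many short jobs from a fixed-size pool of 16 threads,
    // mirroring the reporter's setup.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(16))

    val jobs = (1 to 2000).map { _ =>
      Future { sc.parallelize(1 to 1000, 8).map(_ * 2).count() }
    }
    Await.ready(Future.sequence(jobs), 30.minutes)

    // All jobs have finished at this point, yet AppStatusListener (which
    // also backs the status tracker and the Jobs page) may still list many
    // of them as running if their SparkListenerJobEnd events were dropped.
    println(s"Jobs still reported active: ${sc.statusTracker.getActiveJobIds().length}")

    spark.stop()
  }
}
{code}

With such a configuration, the Jobs page and sc.statusTracker can keep reporting jobs as active long after they have completed, consistent with the behaviour described in the ticket.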