[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-29 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17727197#comment-17727197
 ] 

Sean R. Owen edited comment on SPARK-43523 at 5/29/23 8:16 PM:
---

I've got one idea. The issue is indeed memory pressure because lots of tasks are 
queued up. We want the listeners to go faster if possible, but at least the 
allocation site that actually fails here could be made smarter. In 
AppStatusListener:

{code}
  def activeStages(): Seq[v1.StageData] = {
    liveStages.values.asScala
      .filter(s => Option(s.info).exists(_.submissionTime.isDefined))
      .map(_.toApi())
      .toList
      .sortBy(_.stageId)
  }
{code}

Change .toList to .toArray. This should avoid a slow sort and a copy or two. 
I'm not sure whether that makes a difference on its own, but anything that 
reduces memory pressure and speeds up event processing should help avoid the 
problem, even in extreme setups like this.
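
For reference, a sketch of the method with that one change applied (assuming the surrounding members stay as they are; the actual patch may end up looking different):

{code}
  def activeStages(): Seq[v1.StageData] = {
    liveStages.values.asScala
      .filter(s => Option(s.info).exists(_.submissionTime.isDefined))
      .map(_.toApi())
      .toArray           // was .toList; avoids building an immutable List just to sort it
      .sortBy(_.stageId) // sorting the Array yields a new sorted Array, returned as a Seq
  }
{code}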



was (Author: srowen):
I've got one idea. The issue is indeed memory pressure b/c lots of tasks are 
queued up. We want the listeners to go faster if possible, but at least the 
allocation site that actually fails here could be made smarter.  In 
AppStatusListener, 

  def activeStages(): Seq[v1.StageData] = {
liveStages.values.asScala
  .filter(s => Option(s.info).exists(_.submissionTime.isDefined))
  .map(_.toApi())
  .toList
  .sortBy(_.stageId)
  }

Change .toList to .toArray. This should avoid a slow sort and a copy or two. 
I'm not sure if that makes a difference but anything to reduce mem pressure and 
speed up event processing should contribute to avoiding the problem even in 
extreme setups like this.


> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4, 3.4.0
>Reporter: Amine Bagdouri
>Priority: Major
> Attachments: spark_shell_oom.log, spark_ui_memory_leak.zip
>
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we have noticed 
> that the GC CPU time ratio of the driver is close to 100%. We suspected a 
> memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse 
> Memory Analyzer.
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
>  * The estimated retained heap size of String objects (~5M instances) is 3.3 
> GB. It seems that most of these instances correspond to spark events.
>  * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
>  * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
> there shouldn't be more than 16 live running jobs since we use a fixed size 
> thread pool of 16 threads to run spark queries.
>  * The number of LiveTask objects is 485K.
>  * The AsyncEventQueue instance associated to the AppStatusListener has a 
> value of 854 for dropped events count and a value of 10001 for total events 
> count, knowing that the dropped events counter is reset every minute and that 
> the queue's default capacity is 10,000.
> We think that there is a memory leak in Spark UI. Here is our analysis of the 
> root cause of this leak:
>  * AppStatusListener is notified of Spark events using a bounded queue in 
> AsyncEventQueue.
>  * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
> liveJobs, ...) based on the received events. For example, onTaskStart adds a 
> task to liveTasks map and onTaskEnd removes the task from liveTasks map.
>  * When the rate of events is very high, the bounded queue in AsyncEventQueue 
> is full, some events are dropped and don't make it to AppStatusListener.
>  * Dropped events that signal the end of a processing unit prevent the state 
> of AppStatusListener from being cleaned. For example, a dropped onTaskEnd 
> event will prevent the task from being removed from the liveTasks map, and the 
> task will remain in the heap until the driver's JVM is stopped.
> We were able to confirm our analysis by reducing the capacity of the 
> AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After 
> having launched many spark queries using this config, we observed that the 
> number of active jobs in Spark UI increased rapidly and remained high even 
> though all submitted queries have completed. We have also noticed that some 
> executor task counters in Spark UI were negative, which confirms that 
> AppStatusListener state does not accurately reflect the reality and that it 
> can be a victim of event drops.
> Suggested fix:
> There are some limits today on the number of "dead" objects in 
> AppStatusListener's maps (for example: spark.ui.ret

[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-20 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724568#comment-17724568
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/20/23 11:16 PM:
--

I have managed to induce a java heap space OutOfMemoryError within 3 hours of 
active processing using the same setup and code while increasing the number of 
iterations from 100 to 1.

A heap dump of the driver (Xmx=1g), generated after the memory error, contains 
128k LiveTask objects with an estimated retained size of 563 MB.

In my view, this is further strong evidence of the memory leak.


was (Author: JIRAUSER300423):
I have managed to induce a java heap space OutOfMemoryError within 3 hours of 
active processing using the same setup and code while increasing the number of 
iterations from 100 to 1.

A heap dump of the driver (Xmx=1g) generated after the memory error contains 
128k LiveTask objects with an estimated retained size of 563 MB.

In my view, this is another strong evidence for the presence of the memory leak.

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4, 3.4.0
>Reporter: Amine Bagdouri
>Priority: Major
> Attachments: spark_ui_memory_leak.zip
>
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we have noticed 
> that the GC CPU time ratio of the driver is close to 100%. We suspected a 
> memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse 
> Memory Analyzer.
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
>  * The estimated retained heap size of String objects (~5M instances) is 3.3 
> GB. It seems that most of these instances correspond to spark events.
>  * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
>  * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
> there shouldn't be more than 16 live running jobs since we use a fixed size 
> thread pool of 16 threads to run spark queries.
>  * The number of LiveTask objects is 485K.
>  * The AsyncEventQueue instance associated to the AppStatusListener has a 
> value of 854 for dropped events count and a value of 10001 for total events 
> count, knowing that the dropped events counter is reset every minute and that 
> the queue's default capacity is 10,000.
> We think that there is a memory leak in Spark UI. Here is our analysis of the 
> root cause of this leak:
>  * AppStatusListener is notified of Spark events using a bounded queue in 
> AsyncEventQueue.
>  * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
> liveJobs, ...) based on the received events. For example, onTaskStart adds a 
> task to liveTasks map and onTaskEnd removes the task from liveTasks map.
>  * When the rate of events is very high, the bounded queue in AsyncEventQueue 
> is full, some events are dropped and don't make it to AppStatusListener.
>  * Dropped events that signal the end of a processing unit prevent the state 
> of AppStatusListener from being cleaned. For example, a dropped onTaskEnd 
> event will prevent the task from being removed from the liveTasks map, and the 
> task will remain in the heap until the driver's JVM is stopped.
> We were able to confirm our analysis by reducing the capacity of the 
> AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After 
> having launched many spark queries using this config, we observed that the 
> number of active jobs in Spark UI increased rapidly and remained high even 
> though all submitted queries have completed. We have also noticed that some 
> executor task counters in Spark UI were negative, which confirms that 
> AppStatusListener state does not accurately reflect the reality and that it 
> can be a victim of event drops.
> Suggested fix:
> There are some limits today on the number of "dead" objects in 
> AppStatusListener's maps (for example: spark.ui.retainedJobs). We suggest 
> enforcing another configurable limit on the number of total objects in 
> AppStatusListener's maps and kvstore. This should limit the leak in the case 
> of high events rate, but AppStatusListener stats will remain inaccurate.






[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-20 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724545#comment-17724545
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/20/23 3:35 PM:
-

I have managed to reproduce the memory leak with Spark version 3.4.0 in 
standalone mode {*}within just 5 minutes of activity{*}.

{color:#0747a6}*Setup :*{color}
 * Cluster :

{code:java}
SPARK_WORKER_MEMORY=1g
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=32 {code}
 * Application :

{code:java}
spark.scheduler.listenerbus.eventqueue.capacity=10
spark.executor.memory=512m{code}
 * Code submitted using spark-shell :

{code:java}
import Array._
import spark.implicits._
val uuid = udf(() => java.util.UUID.randomUUID().toString)
(1 to 100).foreach(x => sc.parallelize(range(x, 1)).toDF("id")
 .repartition(1000)
 .withColumn("uuid", uuid())
 .withColumn("key", substring(col("uuid"), 0, 2))
 .groupBy("key")
 .agg(count("id").alias("c"))
 .sort(col("c").desc)
 .filter(x => x.getAs[Long](1) % 3 == 0)
 .count()) {code}
{color:#0747a6}*Results :*{color}
 * Logs :

{code:java}
23/05/20 16:44:04 ERROR AsyncEventQueue: Dropping event from queue appStatus. 
This likely means one of the listeners is too slow and cannot keep up with the 
rate at which tasks are being started by the scheduler.
23/05/20 16:45:04 WARN AsyncEventQueue: Dropped 2560 events from appStatus 
since Sat May 20 16:44:04 CEST 2023.
23/05/20 16:46:04 WARN AsyncEventQueue: Dropped 8797 events from appStatus 
since Sat May 20 16:45:04 CEST 2023.
23/05/20 16:47:04 WARN AsyncEventQueue: Dropped 15909 events from appStatus 
since Sat May 20 16:46:04 CEST 2023.
23/05/20 16:48:04 WARN AsyncEventQueue: Dropped 20031 events from appStatus 
since Sat May 20 16:47:04 CEST 2023.{code}
 * Stats in the Spark UI at the end of processing (nothing is running anymore 
on the application) :
 ** 14 active jobs
 ** 15 active stages
 ** 4 pending stages
 ** -5303 active tasks
 * Heap dump of the driver :
 ** AppStatusListener estimated retained heap size is 95 MB.
 ** LiveTask objects count is 19k.
 ** LiveJob objects with status "RUNNING" is 14.

More details can be found in the file attached.
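
A quick way to confirm the leaked "active" jobs and stages without a heap dump is to query the status tracker once everything is idle (a sketch that relies only on the public SparkStatusTracker API, run in the same spark-shell session as the code above):

{code:java}
// Run after the foreach loop above has finished and the application is idle.
// Both arrays should normally be empty; jobs and stages whose end events were
// dropped still show up here as "active".
val activeJobIds = sc.statusTracker.getActiveJobIds()
val activeStageIds = sc.statusTracker.getActiveStageIds()
println(s"active jobs: ${activeJobIds.length}, active stages: ${activeStageIds.length}")
{code}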


was (Author: JIRAUSER300423):
I have managed to reproduce the memory leak with Spark version 3.4.0 in 
standalone mode {*}within just 5 minutes of activity{*}.

{color:#0747a6}*Setup :*{color}
 * Cluster :

{code:java}
SPARK_WORKER_MEMORY=1g
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=32 {code}
 * Application :

{code:java}
spark.scheduler.listenerbus.eventqueue.capacity=10
spark.executor.memory=512m{code}
 * Code :

{code:java}
import Array._
import spark.implicits._
val uuid = udf(() => java.util.UUID.randomUUID().toString)
(1 to 100).foreach(x => sc.parallelize(range(x, 1)).toDF("id")
 .repartition(1000)
 .withColumn("uuid", uuid())
 .withColumn("key", substring(col("uuid"), 0, 2))
 .groupBy("key")
 .agg(count("id").alias("c"))
 .sort(col("c").desc)
 .filter(x => x.getAs[Long](1) % 3 == 0)
 .count()) {code}
{color:#0747a6}*Results :*{color}
 * Logs :

{code:java}
23/05/20 16:44:04 ERROR AsyncEventQueue: Dropping event from queue appStatus. 
This likely means one of the listeners is too slow and cannot keep up with the 
rate at which tasks are being started by the scheduler.
23/05/20 16:45:04 WARN AsyncEventQueue: Dropped 2560 events from appStatus 
since Sat May 20 16:44:04 CEST 2023.
23/05/20 16:46:04 WARN AsyncEventQueue: Dropped 8797 events from appStatus 
since Sat May 20 16:45:04 CEST 2023.
23/05/20 16:47:04 WARN AsyncEventQueue: Dropped 15909 events from appStatus 
since Sat May 20 16:46:04 CEST 2023.
23/05/20 16:48:04 WARN AsyncEventQueue: Dropped 20031 events from appStatus 
since Sat May 20 16:47:04 CEST 2023.{code}
 * Stats in the Spark UI at the end of processing (nothing is running anymore 
on the application) :
 ** 14 active jobs
 ** 15 active stages
 ** 4 pending stages
 ** -5303 active tasks
 * Heap dump of the driver :
 ** AppStatusListener estimated retained heap size is 95 MB.
 ** LiveTask objects count is 19k.
 ** LiveJob objects with status "RUNNING" is 14.

More details can be found in the file attached.

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4, 3.4.0
>Reporter: Amine Bagdouri
>Priority: Major
> Attachments: spark_ui_memory_leak.zip
>
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we have noticed 
> that the GC CPU time ratio of the driver is close to 100%. We suspected a 
> memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse 
> Memory Analyzer.
>

[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-20 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724545#comment-17724545
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/20/23 3:31 PM:
-

I have managed to reproduce the memory leak with Spark version 3.4.0 in 
standalone mode {*}within just 5 minutes of activity{*}.

{color:#0747a6}*Setup :*{color}
 * Cluster :

{code:java}
SPARK_WORKER_MEMORY=1g
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=32 {code}
 * Application :

{code:java}
spark.scheduler.listenerbus.eventqueue.capacity=10
spark.executor.memory=512m{code}
 * Code :

{code:java}
import Array._
import spark.implicits._
val uuid = udf(() => java.util.UUID.randomUUID().toString)
(1 to 100).foreach(x => sc.parallelize(range(x, 1)).toDF("id")
 .repartition(1000)
 .withColumn("uuid", uuid())
 .withColumn("key", substring(col("uuid"), 0, 2))
 .groupBy("key")
 .agg(count("id").alias("c"))
 .sort(col("c").desc)
 .filter(x => x.getAs[Long](1) % 3 == 0)
 .count()) {code}
{color:#0747a6}*Results :*{color}
 * Logs :

{code:java}
23/05/20 16:44:04 ERROR AsyncEventQueue: Dropping event from queue appStatus. 
This likely means one of the listeners is too slow and cannot keep up with the 
rate at which tasks are being started by the scheduler.
23/05/20 16:45:04 WARN AsyncEventQueue: Dropped 2560 events from appStatus 
since Sat May 20 16:44:04 CEST 2023.
23/05/20 16:46:04 WARN AsyncEventQueue: Dropped 8797 events from appStatus 
since Sat May 20 16:45:04 CEST 2023.
23/05/20 16:47:04 WARN AsyncEventQueue: Dropped 15909 events from appStatus 
since Sat May 20 16:46:04 CEST 2023.
23/05/20 16:48:04 WARN AsyncEventQueue: Dropped 20031 events from appStatus 
since Sat May 20 16:47:04 CEST 2023.{code}
 * Stats in the Spark UI at the end of processing (nothing is running anymore 
on the application) :
 ** 14 active jobs
 ** 15 active stages
 ** 4 pending stages
 ** -5303 active tasks
 * Heap dump of the driver :
 ** AppStatusListener estimated retained heap size is 95 MB.
 ** LiveTask objects count is 19k.
 ** LiveJob objects with status "RUNNING" is 14.

More details can be found in the file attached.


was (Author: JIRAUSER300423):
I have managed to reproduce the memory leak with Spark version 3.4.0 in 
standalone mode {*}within just 5 minutes of activity{*}.

{color:#0747a6}*Setup :*{color}
 * Cluster :

 
{code:java}
SPARK_WORKER_MEMORY=1g
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=32 {code}
 * Application :

 

 
{code:java}
spark.scheduler.listenerbus.eventqueue.capacity=10
spark.executor.memory=512m{code}
 * Code :

 

 
{code:java}
import Array._
import spark.implicits._
val uuid = udf(() => java.util.UUID.randomUUID().toString)
(1 to 100).foreach(x => sc.parallelize(range(x, 1)).toDF("id")
 .repartition(1000)
 .withColumn("uuid", uuid())
 .withColumn("key", substring(col("uuid"), 0, 2))
 .groupBy("key")
 .agg(count("id").alias("c"))
 .sort(col("c").desc)
 .filter(x => x.getAs[Long](1) % 3 == 0)
 .count()) {code}
{color:#0747a6}*Results :*{color}

 
 * Logs :

 
{code:java}
23/05/20 16:44:04 ERROR AsyncEventQueue: Dropping event from queue appStatus. 
This likely means one of the listeners is too slow and cannot keep up with the 
rate at which tasks are being started by the scheduler.
23/05/20 16:45:04 WARN AsyncEventQueue: Dropped 2560 events from appStatus 
since Sat May 20 16:44:04 CEST 2023.
23/05/20 16:46:04 WARN AsyncEventQueue: Dropped 8797 events from appStatus 
since Sat May 20 16:45:04 CEST 2023.
23/05/20 16:47:04 WARN AsyncEventQueue: Dropped 15909 events from appStatus 
since Sat May 20 16:46:04 CEST 2023.
23/05/20 16:48:04 WARN AsyncEventQueue: Dropped 20031 events from appStatus 
since Sat May 20 16:47:04 CEST 2023.{code}
 * Stats in the Spark UI at the end of processing (nothing is running anymore 
on the application) :
 ** 14 active jobs
 ** 15 active stages
 ** 4 pending stages
 ** -5303 active tasks
 * Heap dump of the driver :
 ** AppStatusListener estimated retained heap size is 95 MB.
 ** LiveTask objects count is 19k.
 ** LiveJob objects with status "RUNNING" is 14.

More details can be found in the file attached.

 

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
> Attachments: spark_ui_memory_leak.zip
>
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we have noticed 
> that the GC CPU time ratio of the driver is close to 100%. We suspected a 
> memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse 
> Memory Analyzer.
> Here is some i

[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724436#comment-17724436
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/19/23 10:37 PM:
--

These units are only considered running from the point of view of the Spark UI, 
while in reality they finished a long time ago and the application is totally 
idle.

Spark UI is not aware that these leaked units are already finished because the 
queue (AsyncEventQueue) it is using to listen to events (onJobEnd, onTaskEnd, 
onStageCompleted, ...) in order to update its state is full, and new events are 
dropped.

As shown below, we try to add the event to the bounded queue using 
LinkedBlockingQueue::offer. If the queue is full, the event is not inserted and 
the droppedEvents counter is incremented.

[AsyncEventQueue.scala post (line 
152)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala#LL152C1-L164C1]
 :
{code:java}
  def post(event: SparkListenerEvent): Unit = {
    if (stopped.get()) {
      return
    }
    eventCount.incrementAndGet()
    if (eventQueue.offer(event)) {
      return
    }
    eventCount.decrementAndGet()
    droppedEvents.inc()
    droppedEventsCounter.incrementAndGet() {code}
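
For illustration, the offer/drop behaviour can be reproduced in isolation with a plain bounded LinkedBlockingQueue (a standalone sketch, not Spark's classes):

{code:java}
import java.util.concurrent.LinkedBlockingQueue

// A bounded queue of capacity 2, standing in for AsyncEventQueue's eventQueue.
val queue = new LinkedBlockingQueue[String](2)

println(queue.offer("event-1")) // true: inserted
println(queue.offer("event-2")) // true: inserted
println(queue.offer("event-3")) // false: the queue is full, so this event is "dropped"
{code}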
 I will try to reproduce the leak on a more recent version of Spark and provide 
you with the results.


was (Author: JIRAUSER300423):
Theses units are only considered running from the point of view of Spark UI, 
while in reality they were already finished a long time ago and the application 
is totally idle.

Spark UI is not aware of the update of the state of these leaked units because 
the queue (AsyncEventQueue) it is using to listen to events (onJobEnd, 
onTaskEnd, onStageCompleted, ...) in order to update its state is full and new 
events are dropped.

As shown below, we try to add the event to the bounded queue using 
LinkedBlockingQueue::offer. If the queue is full, the event is not inserted and 
the droppedEvents counter is incremented.

[AsyncEventQueue.scala post (line 
152)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala#LL152C1-L164C1]
 :
{code:java}
  def post(event: SparkListenerEvent): Unit = {
    if (stopped.get()) {
      return
    }    
eventCount.incrementAndGet()
    if (eventQueue.offer(event)) {
      return
    }
eventCount.decrementAndGet()
    droppedEvents.inc()
    droppedEventsCounter.incrementAndGet() {code}
 I will try to reproduce the leak on a more recent version of Spark and provide 
you with the results.

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we have noticed 
> that the GC CPU time ratio of the driver is close to 100%. We suspected a 
> memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse 
> Memory Analyzer.
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
>  * The estimated retained heap size of String objects (~5M instances) is 3.3 
> GB. It seems that most of these instances correspond to spark events.
>  * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
>  * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
> there shouldn't be more than 16 live running jobs since we use a fixed size 
> thread pool of 16 threads to run spark queries.
>  * The number of LiveTask objects is 485K.
>  * The AsyncEventQueue instance associated to the AppStatusListener has a 
> value of 854 for dropped events count and a value of 10001 for total events 
> count, knowing that the dropped events counter is reset every minute and that 
> the queue's default capacity is 10,000.
> We think that there is a memory leak in Spark UI. Here is our analysis of the 
> root cause of this leak:
>  * AppStatusListener is notified of Spark events using a bounded queue in 
> AsyncEventQueue.
>  * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
> liveJobs, ...) based on the received events. For example, onTaskStart adds a 
> task to liveTasks map and onTaskEnd removes the task from liveTasks map.
>  * When the rate of events is very high, the bounded queue in AsyncEventQueue 
> is full, some events are dropped and don't make it to AppStatusListener.
>  * Dropped events that signal the end of a processing unit prevent the state 
> of AppStatusListener from being cleaned. For ex

[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724436#comment-17724436
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/19/23 10:35 PM:
--

These units are only considered running from the point of view of Spark UI, 
while in reality they were already finished a long time ago and the application 
is totally idle.

Spark UI is not aware of the update of the state of these leaked units because 
the queue (AsyncEventQueue) it is using to listen to events (onJobEnd, 
onTaskEnd, onStageCompleted, ...) in order to update its state is full and new 
events are dropped.

As shown below, we try to add the event to the bounded queue using 
LinkedBlockingQueue::offer. If the queue is full, the event is not inserted and 
the droppedEvents counter is incremented.

[AsyncEventQueue.scala post (line 
152)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala#LL152C1-L164C1]
 :
{code:java}
  def post(event: SparkListenerEvent): Unit = {
    if (stopped.get()) {
      return
    }    
eventCount.incrementAndGet()
    if (eventQueue.offer(event)) {
      return
    }
eventCount.decrementAndGet()
    droppedEvents.inc()
    droppedEventsCounter.incrementAndGet() {code}
 I will try to reproduce the leak on a more recent version of Spark and provide 
you with the results.


was (Author: JIRAUSER300423):
Theses units are only considered running from the point of view of Spark UI, 
while in reality they were already finished a long time ago and the application 
is totally idle.

Spark UI is not aware of the update of the state of these leaked units because 
the queue (AsyncEventQueue) it is using to listen to events (onJobEnd, 
onTaskEnd, onStageCompleted, ...) in order to update its state is full and new 
events are dropped.

As shown below, we try to add the event to the bounded queue using 
LinkedBlockingQueue::offer. If the queue is full, the event is not inserted and 
the droppedEvents counter is incremented.

[AsyncEventQueue.scala post (line 
152)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala#LL152C1-L164C1]
 :
{code:java}
  def post(event: SparkListenerEvent): Unit = {
    if (stopped.get()) {
      return
    }    
eventCount.incrementAndGet()
    if (eventQueue.offer(event)) { //
      return
    }
eventCount.decrementAndGet()
    droppedEvents.inc()
    droppedEventsCounter.incrementAndGet() {code}
 I will try to reproduce the leak on a more recent version of Spark and provide 
you with the results.

 

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we have noticed 
> that the GC CPU time ratio of the driver is close to 100%. We suspected a 
> memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse 
> Memory Analyzer.
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
>  * The estimated retained heap size of String objects (~5M instances) is 3.3 
> GB. It seems that most of these instances correspond to spark events.
>  * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
>  * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
> there shouldn't be more than 16 live running jobs since we use a fixed size 
> thread pool of 16 threads to run spark queries.
>  * The number of LiveTask objects is 485K.
>  * The AsyncEventQueue instance associated to the AppStatusListener has a 
> value of 854 for dropped events count and a value of 10001 for total events 
> count, knowing that the dropped events counter is reset every minute and that 
> the queue's default capacity is 10,000.
> We think that there is a memory leak in Spark UI. Here is our analysis of the 
> root cause of this leak:
>  * AppStatusListener is notified of Spark events using a bounded queue in 
> AsyncEventQueue.
>  * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
> liveJobs, ...) based on the received events. For example, onTaskStart adds a 
> task to liveTasks map and onTaskEnd removes the task from liveTasks map.
>  * When the rate of events is very high, the bounded queue in AsyncEventQueue 
> is full, some events are dropped and don't make it to AppStatusListener.
>  * Dropped events that signal the end of a processing unit prevent the state 
> of AppStatusListener from being cleane

[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724436#comment-17724436
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/19/23 10:35 PM:
--

These units are only considered running from the point of view of Spark UI, 
while in reality they were already finished a long time ago and the application 
is totally idle.

Spark UI is not aware of the update of the state of these leaked units because 
the queue (AsyncEventQueue) it is using to listen to events (onJobEnd, 
onTaskEnd, onStageCompleted, ...) in order to update its state is full and new 
events are dropped.

As shown below, we try to add the event to the bounded queue using 
LinkedBlockingQueue::offer. If the queue is full, the event is not inserted and 
the droppedEvents counter is incremented.

[AsyncEventQueue.scala post (line 
152)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala#LL152C1-L164C1]
 :
{code:java}
  def post(event: SparkListenerEvent): Unit = {
    if (stopped.get()) {
      return
    }    
eventCount.incrementAndGet()
    if (eventQueue.offer(event)) { //
      return
    }
eventCount.decrementAndGet()
    droppedEvents.inc()
    droppedEventsCounter.incrementAndGet() {code}
 I will try to reproduce the leak on a more recent version of Spark and provide 
you with the results.

 


was (Author: JIRAUSER300423):
Theses units are only considered running from the point of view of Spark UI, 
while in reality they were already finished a long time ago and the application 
is totally idle.

Spark UI is not aware of the update of the state of these leaked units because 
the queue (AsyncEventQueue) it is using to listen to events (onJobEnd, 
onTaskEnd, onStageCompleted, ...) in order to update its state is full and new 
events are dropped.

As shown below, we try to add the event to the bounded queue using 
LinkedBlockingQueue::offer. If the queue is full, the event is not inserted and 
the droppedEvents counter is incremented.

[AsyncEventQueue.scala post (line 
152)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala#LL152C1-L164C1]
 :

 
{code:java}
  def post(event: SparkListenerEvent): Unit = {
    if (stopped.get()) {
      return
    }    
eventCount.incrementAndGet()
    if (eventQueue.offer(event)) { //
      return
    }
eventCount.decrementAndGet()
    droppedEvents.inc()
    droppedEventsCounter.incrementAndGet() {code}
 I will try to reproduce the leak on a more recent version of Spark and provide 
you with the results.

 

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we have noticed 
> that the GC CPU time ratio of the driver is close to 100%. We suspected a 
> memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse 
> Memory Analyzer.
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
>  * The estimated retained heap size of String objects (~5M instances) is 3.3 
> GB. It seems that most of these instances correspond to spark events.
>  * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
>  * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
> there shouldn't be more than 16 live running jobs since we use a fixed size 
> thread pool of 16 threads to run spark queries.
>  * The number of LiveTask objects is 485K.
>  * The AsyncEventQueue instance associated to the AppStatusListener has a 
> value of 854 for dropped events count and a value of 10001 for total events 
> count, knowing that the dropped events counter is reset every minute and that 
> the queue's default capacity is 10,000.
> We think that there is a memory leak in Spark UI. Here is our analysis of the 
> root cause of this leak:
>  * AppStatusListener is notified of Spark events using a bounded queue in 
> AsyncEventQueue.
>  * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
> liveJobs, ...) based on the received events. For example, onTaskStart adds a 
> task to liveTasks map and onTaskEnd removes the task from liveTasks map.
>  * When the rate of events is very high, the bounded queue in AsyncEventQueue 
> is full, some events are dropped and don't make it to AppStatusListener.
>  * Dropped events that signal the end of a processing unit prevent the state 
> of AppStatusListener from bei

[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724424#comment-17724424
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/19/23 10:09 PM:
--

Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks accumulate in memory and are never released.

For example, all jobs that are in the "RUNNING" state and that missed the 
"onJobEnd" event (due to an event drop from the queue) will remain in the heap 
forever and keep adding up indefinitely.

The same goes for tasks and stages that missed the "onTaskEnd" and 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, they only 
apply to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending (a rough 
sketch of such a cap follows the snippets below).
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING && j.info.status != 
JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE && s.info.status != 
v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1377)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
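
To make the suggestion concrete, here is a rough sketch of what such a cap could look like (the limit value, the map and the method are placeholders for illustration, not the actual AppStatusListener code or an existing Spark setting):

{code:java}
import scala.collection.mutable

// Hypothetical, configurable cap on live (running) tasks tracked by the listener.
val maxLiveTasks = 100000

// Placeholder for AppStatusListener's liveTasks map: taskId -> status.
val liveTasks = new mutable.HashMap[Long, String]()

def onTaskStart(taskId: Long): Unit = {
  // Enforce the cap even for RUNNING tasks, so the map cannot grow without
  // bound when onTaskEnd events are dropped; here the oldest entry is evicted.
  if (liveTasks.size >= maxLiveTasks) {
    liveTasks.keys.headOption.foreach(liveTasks.remove)
  }
  liveTasks(taskId) = "RUNNING"
}
{code}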
 

As for the version, I have only used Spark version 2.4.4, but I don't think 
that upgrading to a more recent version will fix the memory leak, since the 
code causing the leak is still there.

 

Regards,


was (Author: JIRAUSER300423):
Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks get cumulated in memory and are never released.

For example, all jobs that have a "RUNNING" state and that missed the 
"onJobEnd" event (due to event drop from queue) will remain in the heap 
forever, and keep adding up indefinitely.

The same goes for tasks and stages and that missed the "onTaskEnd" and the 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, they only 
apply to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending.
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING && j.info.status != 
JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1309)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE && s.info.status != 
v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
 

As for the version, I have only used the Spark version 2.4.4. But I don't think 
that upgrading to a more recent version will fix the memory leak since the code 
causing the leak is still there.

 

Regards,

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> S

[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724424#comment-17724424
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/19/23 9:59 PM:
-

Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks get cumulated in memory and are never released.

For example, all jobs that have a "RUNNING" state and that missed the 
"onJobEnd" event (due to event drop from queue) will remain in the heap 
forever, and keep adding up indefinitely.

The same goes for tasks and stages and that missed the "onTaskEnd" and the 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, they only 
apply to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending.
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING && j.info.status != 
JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1309)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE && s.info.status != 
v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
 

As for the version, I have only used the Spark version 2.4.4. But I don't think 
that upgrading to a more recent version will fix the memory leak since the code 
causing the leak is still there.

 

Regards,


was (Author: JIRAUSER300423):
Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks get cumulated in memory and are never released.

For example, all jobs that have a "RUNNING" state and that missed the 
"onJobEnd" event (due to event drop from queue) will remain in the heap 
forever, and keep adding up indefinitely.

The same goes for tasks and stages and that missed the "onTaskEnd" and the 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, it only 
applies to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending.
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING && j.info.status != 
JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1309)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE && s.info.status != 
v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
 

As for the version, I have only used the Spark version 2.4.4. But I don't think 
that upgrading to a more recent version will fix the memory leak since the code 
causing the leak is still there.

 

Regards,

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spa

[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724424#comment-17724424
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/19/23 9:58 PM:
-

Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks get cumulated in memory and are never released.

For example, all jobs that have a "RUNNING" state and that missed the 
"onJobEnd" event (due to event drop from queue) will remain in the heap 
forever, and keep adding up indefinitely.

The same goes for tasks and stages and that missed the "onTaskEnd" and the 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, it only 
applies to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending.
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING && j.info.status != 
JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1309)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE && s.info.status != 
v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
 

As for the version, I have only used the Spark version 2.4.4. But I don't think 
that upgrading to a more recent version will fix the memory leak since the code 
causing the leak is still there.

 

Regards,


was (Author: JIRAUSER300423):
Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks get cumulated in memory and are never released.

For example, all jobs that have a "RUNNING" state and that missed the 
"onJobEnd" event (due to event drop from queue) will remain in the heap forever.

The same goes for tasks and stages and that missed the "onTaskEnd" and the 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, it only 
applies to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending.
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING && j.info.status != 
JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1309)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE && s.info.status != 
v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
 

As for the version, I have only used the Spark version 2.4.4. But I don't think 
that upgrading to a more recent version will fix the memory leak since the code 
causing the leak is still there.

 

Regards,

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few da

[jira] [Comment Edited] (SPARK-43523) Memory leak in Spark UI

2023-05-19 Thread Amine Bagdouri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724424#comment-17724424
 ] 

Amine Bagdouri edited comment on SPARK-43523 at 5/19/23 9:55 PM:
-

Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks get cumulated in memory and are never released.

For example, all jobs that have a "RUNNING" state and that missed the 
"onJobEnd" event (due to event drop from queue) will remain in the heap forever.

The same goes for tasks and stages and that missed the "onTaskEnd" and the 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, it only 
applies to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending.
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING && j.info.status != 
JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1309)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE && s.info.status != 
v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
 

As for the version, I have only used the Spark version 2.4.4. But I don't think 
that upgrading to a more recent version will fix the memory leak since the code 
causing the leak is still there.

 

Regards,


was (Author: JIRAUSER300423):
Hi [~srowen], thanks for your reply.

 

However, I still think that it is a memory leak, since many jobs, stages and 
tasks get cumulated in memory and are never released.

For example, all jobs that have a "RUNNING" state and that missed the 
"onJobEnd" event (due to event drop from queue) will remain in the heap forever.

The same goes for tasks and stages and that missed the "onTaskEnd" and tje 
"onStageCompleted" events.

 

As for the limits on the number of retained jobs, tasks and stages, it only 
applies to units that are considered finished (as shown below). Thus, these 
limits do not prevent the memory leak. That is why I suggest adding other 
limits that apply even to units that are still running or pending.
 * [AppStatusListener.scala cleanupJobs (line 
1263)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1263-L1265]
 :

{code:java}
    val toDelete = KVUtils.viewToSeq(view, countToDelete.toInt) { j =>
      j.info.status != JobExecutionStatus.RUNNING && j.info.status != 
JobExecutionStatus.UNKNOWN
    }{code}
 * [AppStatusListener.scala cleanupStagesInKVStore (line 
1309)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1314C1-L1316C1]
 :

{code:java}
    val stages = KVUtils.viewToSeq(view, countToDelete.toInt) { s =>
      s.info.status != v1.StageStatus.ACTIVE && s.info.status != 
v1.StageStatus.PENDING
    } {code}
 * [AppStatusListener.scala cleanupTasks (line 
1314)|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#LL1377C1-L1380C1]
 :

{code:java}
      // Try to delete finished tasks only.
      val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
        !live || t.status != TaskState.RUNNING.toString()
      } {code}
 

As for the version, I have only used the Spark version 2.4.4. But I don't think 
that upgrading to a more recent version will fix the memory leak since the code 
causing the leak is still there.

 

Regards,

> Memory leak in Spark UI
> ---
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Amine Bagdouri
>Priority: Major
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our app