[jira] [Commented] (SPARK-35262) Memory leak when dataset is being persisted

2024-04-15 Thread Leo Timofeyev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837418#comment-17837418
 ] 

Leo Timofeyev commented on SPARK-35262:
---

Hello [~dnskrv] [~iamelin]
I can confirm that the issue still exists in 3.5.0.

> Memory leak when dataset is being persisted
> ---
>
> Key: SPARK-35262
> URL: https://issues.apache.org/jira/browse/SPARK-35262
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Igor Amelin
> Priority: Major
>
> If a Java or Scala application with a SparkSession runs for a long time and 
> persists many datasets, it can crash because of a memory leak.
> I've noticed the following. When a dataset is persisted, the SparkSession 
> used to load it is cloned in CacheManager, and this clone is added as a 
> listener to `listenersPlusTimers` in `ListenerBus`. The clone is never 
> removed from the list of listeners afterwards, e.g. when the dataset is 
> unpersisted. If we persist many datasets, the SparkSession is cloned and 
> added to `ListenerBus` many times. This leads to a memory leak because the 
> `listenersPlusTimers` list becomes very large.
> I've found that the SparkSession is cloned in CacheManager when the 
> parameters `spark.sql.sources.bucketing.autoBucketedScan.enabled` and 
> `spark.sql.adaptive.enabled` are true. The first one is true by default, and 
> this default behavior leads to the problem. When auto bucketed scan is 
> disabled, the SparkSession isn't cloned, and there are no duplicates in 
> ListenerBus, so the memory leak doesn't occur.
> Here is a small Java application to reproduce the memory leak: 
> [https://github.com/iamelin/spark-memory-leak]
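
The mechanism described in the report can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyListenerBus`, `ToySession`, and the `cloneOnPersist` flag are hypothetical names invented for this illustration, not Spark classes or settings, and the code does not reproduce Spark's real CacheManager logic. It only shows why a listener list that is appended to on every session clone, and never pruned, grows with the number of persisted datasets, and why it stays flat when the original session is reused.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Toy model of the reported leak. None of these types are Spark classes. */
public class ListenerLeakSketch {

    /** Hypothetical stand-in for a listener bus that holds strong references. */
    static final class ToyListenerBus {
        private final List<Object> listeners = new CopyOnWriteArrayList<>();

        void addListener(Object listener) {
            listeners.add(listener);
        }

        int size() {
            return listeners.size();
        }
    }

    /** Hypothetical stand-in for a session that registers itself on creation. */
    static final class ToySession {
        private final ToyListenerBus bus;

        ToySession(ToyListenerBus bus) {
            this.bus = bus;
            bus.addListener(this); // registered here and never removed
        }

        ToySession cloneSession() {
            return new ToySession(bus); // every clone adds one more listener
        }
    }

    /**
     * Simulates persisting and unpersisting {@code datasets} datasets and
     * returns how many listeners remain registered on the bus afterwards.
     * {@code cloneOnPersist} is an assumption of this sketch: true models the
     * case where the session is cloned, false the case where it is reused.
     */
    public static int listenersAfter(int datasets, boolean cloneOnPersist) {
        ToyListenerBus bus = new ToyListenerBus();
        ToySession session = new ToySession(bus);
        for (int i = 0; i < datasets; i++) {
            ToySession used = cloneOnPersist ? session.cloneSession() : session;
            // "persist" and "unpersist" would use `used` here; in this model
            // neither step removes anything from the bus.
        }
        return bus.size();
    }

    public static void main(String[] args) {
        System.out.println("with cloning:    " + listenersAfter(1000, true));
        System.out.println("without cloning: " + listenersAfter(1000, false));
    }
}
```

In this model the listener count grows linearly with the number of persisted datasets when the session is cloned, and stays at one when it is not, which matches the behavior the report attributes to disabling auto bucketed scan.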



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35262) Memory leak when dataset is being persisted

2022-01-10 Thread Denis Krivenko (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472248#comment-17472248
 ] 

Denis Krivenko commented on SPARK-35262:


[~iamelin] Could you please check/confirm the issue still exists in 3.2.0?







[jira] [Commented] (SPARK-35262) Memory leak when dataset is being persisted

2021-04-29 Thread Fu Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335343#comment-17335343
 ] 

Fu Chen commented on SPARK-35262:
-

[~iamelin] This should be a duplicate of SPARK-34087, which has been 
fixed by PR-31919. Spark 3.1.1 has a memory leak when the SparkSession is cloned.

When you disable `spark.sql.sources.bucketing.autoBucketedScan.enabled` and 
`spark.sql.adaptive.enabled`, the CacheManager caches the query using the original 
SparkSession (that is, Spark does not clone the session).

> Memory leak when dataset is being persisted
> ---
>
> Key: SPARK-35262
> URL: https://issues.apache.org/jira/browse/SPARK-35262
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Igor Amelin
> Priority: Critical


