[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-12-10 Thread wang fanming (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795061#comment-17795061
 ] 

wang fanming commented on SPARK-44900:
--

Sure. The following is from a run in a Spark 3.5.1 environment; I extracted only 
the log lines related to the creation, storage, and reuse of partition 1 of 
rdd_1538, which is shown under the Storage tab of the Web UI.
{quote}
23/12/10 18:05:23 INFO MemoryStore: Block rdd_1538_1 stored as values in memory (estimated size 176.7 MiB, free 310.2 MiB)
[rdd_1538_1]
23/12/10 18:05:24 INFO BlockManager: Found block rdd_1538_1 locally
23/12/10 18:05:27 INFO BlockManager: Dropping block rdd_1538_1 from memory
23/12/10 18:05:27 INFO BlockManager: Writing block rdd_1538_1 to disk
23/12/10 18:05:34 INFO MemoryStore: Block rdd_1538_1 stored as values in memory (estimated size 176.7 MiB, free 133.5 MiB)
23/12/10 18:05:34 INFO BlockManager: Found block rdd_1538_1 locally
23/12/10 18:05:40 INFO BlockManager: Found block rdd_1538_1 locally
23/12/10 18:05:42 INFO BlockManager: Dropping block rdd_1538_1 from memory
23/12/10 18:05:46 INFO MemoryStore: Block rdd_1538_1 stored as values in memory (estimated size 176.7 MiB, free 133.5 MiB)
{quote}
Analyzing these logs together with what the UI shows, the "Size on Disk" value 
displayed on the Storage page makes no sense.

If the partitions of this RDD are cached and dropped normally, shouldn't the 
value under the "Size on Disk" label change accordingly?

> Cached DataFrame keeps growing
> --
>
> Key: SPARK-44900
> URL: https://issues.apache.org/jira/browse/SPARK-44900
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Varun Nalla
>Priority: Major
>
> Scenario :
> We have a kafka streaming application where the data lookups are happening by 
> joining  another DF which is cached, and the caching strategy is 
> MEMORY_AND_DISK.
> However the size of the cached DataFrame keeps on growing for every micro 
> batch the streaming application process and that's being visible under 
> storage tab.
> A similar stack overflow thread was already raised.
> https://stackoverflow.com/questions/55601779/spark-dataframe-cache-keeps-growing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-12-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794817#comment-17794817
 ] 

Dongjoon Hyun commented on SPARK-44900:
---

Could you try this with Apache Spark 3.5.0, please?




[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-12-04 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792872#comment-17792872
 ] 

王范明 commented on SPARK-44900:
-

I have analyzed the program execution details in the logs, and there appears to 
be an issue with the {{org.apache.spark.status.AppStatusListener#updateRDDBlock}} 
method. The method directly accumulates {{rdd.memoryUsed}} and {{rdd.diskUsed}}, 
but it does not properly take the {{storageLevel}} into account.
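The suspected accounting gap can be illustrated with a toy model (plain Python, not Spark source code; the update rule below is a hypothesis based on this comment, not the actual listener logic): if per-RDD disk usage is only ever incremented when a block is written to disk, and never decremented when that block is later re-cached in memory, the reported "Size on Disk" grows on every drop/re-cache cycle even though the block exists at most once on disk.

```python
# Toy model of the suspected bug (NOT Spark code; the update rule is a
# hypothesis). The listener aggregates per-block deltas into per-RDD
# memoryUsed/diskUsed counters; if the disk delta for a dropped block is
# never reversed when the block is re-cached, disk usage only ever grows.
def simulate(drop_recache_cycles, block_size_mib=176):
    memory_used = 0
    disk_used = 0
    for _ in range(drop_recache_cycles):
        memory_used += block_size_mib   # block stored as values in memory
        memory_used -= block_size_mib   # block dropped from memory ...
        disk_used += block_size_mib     # ... and written to disk
        # Hypothesis: on re-cache, the stale disk contribution is not removed.
    return memory_used, disk_used

# One 176 MiB block, dropped and re-cached three times, would be reported
# as 528 MiB on disk even though it exists at most once on disk.
print(simulate(3))
```

This matches the symptom in the posted logs, where the same 176.7 MiB block is repeatedly dropped and re-stored while "Size on Disk" keeps climbing.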




[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-30 Thread Varun Nalla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760700#comment-17760700
 ] 

Varun Nalla commented on SPARK-44900:
-

[~yxzhang] / [~yao] any update for us?




[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-28 Thread Yauheni Audzeichyk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759821#comment-17759821
 ] 

Yauheni Audzeichyk commented on SPARK-44900:


[~yxzhang] it looks like it is just a disk-usage tracking issue, since that much 
disk space is not actually used.

However, it still hurts the effectiveness of the cached data: Spark keeps 
spilling it to disk because it believes it no longer fits in memory, so 
eventually the cache becomes 100% stored on disk.




[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-28 Thread Yuexin Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759818#comment-17759818
 ] 

Yuexin Zhang commented on SPARK-44900:
--

Hi [~varun2807] [~yaud], did you check the actual cached file size on disk, on 
the YARN NodeManager local filesystem? Is it really ever-growing?
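One way to check this (a sketch; the local-dir layout below is typical for YARN but varies by configuration, so treat the path as an assumption) is to sum the on-disk size of the block-manager directories on a NodeManager host and compare it against the UI's "Size on Disk":

```shell
# Path is an assumption: adjust to your yarn.nodemanager.local-dirs setting.
# Block-manager data for a running application lives under blockmgr-* dirs.
du -sh /path/to/yarn/local/usercache/*/appcache/application_*/blockmgr-*
```

If `du` reports far less than the Storage tab, that points at an accounting bug rather than real disk growth.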




[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-28 Thread Varun Nalla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759703#comment-17759703
 ] 

Varun Nalla commented on SPARK-44900:
-

[~yao] hope you got a chance to look into what [~yaud] mentioned.




[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-25 Thread Yauheni Audzeichyk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759130#comment-17759130
 ] 

Yauheni Audzeichyk commented on SPARK-44900:


[~yao] we tried spark.cleaner.periodicGC.interval=1min but it didn't help. 

Here are my observations:
 * this happens even in a very simple scenario (see the example to reproduce 
below)
 * it happens after a join
 * uncontrollable growth of disk usage starts only once any portion of the RDD 
has been spilled to disk
 * if the cached RDD remains 100% in memory, this issue doesn't happen
 * when an executor dies, "Size on Disk" on the Storage tab is reduced by the 
amount of storage blocks held by that dead executor (which makes sense)

It looks like some storage blocks (shuffle blocks?) are being tracked under that 
cached RDD and never released (or at least not within a reasonable time) until 
the executor dies.

Our worry is whether this is just a disk-usage tracking bug, or whether those 
blocks are actually kept on disk, because our production job's disk usage (per 
the Spark UI) grew by 6 TB in a span of 10 hours.

Here's the code to reproduce:
{code:java}
import scala.collection.mutable
import scala.util.Random

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{md5, rand}
import org.apache.spark.sql.types.StringType
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().set("spark.master", "yarn")
val spark = SparkSession.builder().config(conf).getOrCreate()

import spark.implicits._

val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(10))
// create a pseudo stream
val rddQueue = new mutable.Queue[RDD[Long]]()
val stream = ssc.queueStream(rddQueue, oneAtATime = true)
// create a simple lookup table
val lookup = sc.range(start = 0, end = 5000, numSlices = 10)
  .toDF("id")
  .withColumn("value", md5(rand().cast(StringType)))
  .cache()
// for every micro-batch, perform the value lookup via join
stream.foreachRDD { rdd =>
  val df = rdd.toDF("id")
  df.join(lookup, Seq("id"), "leftouter").count()
}
// run the streaming context
ssc.start()
for (_ <- 1 to 100) {
  rddQueue.synchronized {
    val firstId = Random.nextInt(5000)
    rddQueue += sc.range(start = firstId, end = firstId + 1, numSlices = 4)
  }
  Thread.sleep(10)
}
ssc.stop()
{code}
Submit parameters (chosen to create a storage-memory deficit and force the cache 
to spill):
{code:java}
--executor-cores 2 --num-executors 5 --executor-memory 1250m --driver-memory 1g \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.sql.shuffle.partitions=10
{code}
When executed, disk usage of that cached lookup DF grows really fast.




[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-25 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758868#comment-17758868
 ] 

Kent Yao commented on SPARK-44900:
--

Setting spark.cleaner.periodicGC.interval to a smaller value (e.g. 3min) might 
help.
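For reference, the setting could be passed at submit time like this (a sketch; the class name and jar are placeholders, not from the reported application). `spark.cleaner.periodicGC.interval` controls how often the driver's ContextCleaner triggers a JVM GC to discover unreferenced RDDs and blocks to clean up; its default is 30min:

```shell
# Sketch: lower the ContextCleaner GC interval from its 30min default.
# The class name and jar below are placeholders.
spark-submit \
  --conf spark.cleaner.periodicGC.interval=3min \
  --class com.example.StreamingApp \
  streaming-app.jar
```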




[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-24 Thread Varun Nalla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758674#comment-17758674
 ] 

Varun Nalla commented on SPARK-44900:
-

We do not keep adding new ones; it is the existing cached RDDs that keep 
growing. We tested with Spark 3 as well and saw the same behavior, and we also 
changed the persistence strategy to memory-only instead of MEMORY_AND_DISK, but 
with no luck.

 




[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-23 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758376#comment-17758376
 ] 

Kent Yao commented on SPARK-44900:
--

Please note that there is a storage limit in place. Adding entries without 
removing any may lead to cache evictions and recomputations. In such cases, 
caching may not be as effective as direct computation, since extra write paths 
are introduced. You should probably optimize your program.

 

 




[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-23 Thread Varun Nalla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758316#comment-17758316
 ] 

Varun Nalla commented on SPARK-44900:
-

[~yao] Thanks for the comment. However, we can't release the cache: the cached 
RDDs are used in every micro-batch, which is why we cannot unpersist them.




[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-23 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758315#comment-17758315
 ] 

Kent Yao commented on SPARK-44900:
--

How about releasing the cached RDDs if you never touch them again?
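In PySpark terms, the suggestion could look like the following (a hedged sketch, not the reporter's actual code; `load_lookup` and the refresh cadence are illustrative, and this requires a running Spark environment): instead of holding one cached DataFrame forever, unpersist and re-cache it periodically so old blocks are actually released.

```python
# Sketch (PySpark; names and the refresh cadence are illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lookup-refresh-sketch").getOrCreate()

def load_lookup():
    # Hypothetical loader; replace with the real lookup source.
    return spark.range(0, 5000).withColumnRenamed("id", "key").cache()

lookup = load_lookup()
for batch in range(100):
    # ... join each micro-batch against `lookup` here ...
    if batch % 50 == 49:                  # refresh every 50 micro-batches
        lookup.unpersist(blocking=True)   # release the old cached blocks
        lookup = load_lookup()            # re-cache a fresh copy if needed
```

With `blocking=True` the unpersist waits until the blocks are removed, which makes it easier to observe the effect on the Storage tab.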




[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing

2023-08-23 Thread Varun Nalla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758161#comment-17758161
 ] 

Varun Nalla commented on SPARK-44900:
-

[~yao], is there a way to get this issue prioritized? It's causing production 
impact for us.
