[
https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795061#comment-17795061
]
wang fanming commented on SPARK-44900:
--------------------------------------
Of course. Below is a run from my Spark 3.5.1 environment; I extracted only the
log lines related to the production, storage, and reuse of rdd_1538 partition 1,
which corresponds to what the Storage tab of the Web UI shows.
{quote}23/12/10 18:05:23 INFO MemoryStore: Block rdd_1538_1 stored as values in
memory (estimated size 176.7 MiB, free 310.2 MiB)
23/12/10 18:05:24 INFO BlockManager: Found block rdd_1538_1 locally
23/12/10 18:05:27 INFO BlockManager: Dropping block rdd_1538_1 from memory
23/12/10 18:05:27 INFO BlockManager: Writing block rdd_1538_1 to disk
23/12/10 18:05:34 INFO MemoryStore: Block rdd_1538_1 stored as values in memory
(estimated size 176.7 MiB, free 133.5 MiB)
23/12/10 18:05:34 INFO BlockManager: Found block rdd_1538_1 locally
23/12/10 18:05:40 INFO BlockManager: Found block rdd_1538_1 locally
23/12/10 18:05:42 INFO BlockManager: Dropping block rdd_1538_1 from memory
23/12/10 18:05:46 INFO MemoryStore: Block rdd_1538_1 stored as values in memory
(estimated size 176.7 MiB, free 133.5 MiB)
{quote}
Analyzing the logs above together with what the UI displays, the "Size on Disk"
value shown on the Storage page is hard to explain. If the partitions of this
RDD are being cached normally, shouldn't the value under the "Size on Disk"
label change?
> Cached DataFrame keeps growing
> ------------------------------
>
> Key: SPARK-44900
> URL: https://issues.apache.org/jira/browse/SPARK-44900
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.0
> Reporter: Varun Nalla
> Priority: Major
>
> Scenario :
> We have a kafka streaming application where the data lookups are happening by
> joining another DF which is cached, and the caching strategy is
> MEMORY_AND_DISK.
> However, the size of the cached DataFrame keeps growing with every micro
> batch the streaming application processes, and that is visible under the
> Storage tab.
> A similar Stack Overflow thread was raised earlier:
> https://stackoverflow.com/questions/55601779/spark-dataframe-cache-keeps-growing
--
This message was sent by Atlassian Jira
(v8.20.10#820010)