[jira] [Updated] (SPARK-43408) Spark caching in the context of a single job

Faiz Halde (Jira) Mon, 08 May 2023 05:17:56 -0700


     [ https://issues.apache.org/jira/browse/SPARK-43408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Faiz Halde updated SPARK-43408:
-------------------------------
    Description: 
Does caching benefit a Spark job that contains only a single action? IIRC, 
Spark already optimizes shuffles by persisting their output to disk.

I have been unable to find a counter-example in which caching benefits a job 
with a single action. In every case I can think of, the persisted shuffle 
output already acts as a good enough cache by itself.

FWIW, I am asking specifically about the DataFrame API. The only StorageLevel 
allowed in my case is DISK_ONLY, i.e. I am not looking to speed things up by 
caching data in memory.

  was:
Does caching benefit a Spark job that contains only a single action? IIRC, 
Spark already optimizes shuffles by persisting their output to disk.

I have been unable to find a counter-example in which caching benefits a job 
with a single action. In every case I can think of, the persisted shuffle 
output already acts as a good enough cache by itself.

FWIW, I am asking specifically about the DataFrame API.


> Spark caching in the context of a single job
> --------------------------------------------
>
>                 Key: SPARK-43408
>                 URL: https://issues.apache.org/jira/browse/SPARK-43408
>             Project: Spark
>          Issue Type: Question
>          Components: Shuffle
>    Affects Versions: 3.3.1
>            Reporter: Faiz Halde
>            Priority: Trivial
>
> Does caching benefit a Spark job that contains only a single action? IIRC, 
> Spark already optimizes shuffles by persisting their output to disk.
>
> I have been unable to find a counter-example in which caching benefits a job 
> with a single action. In every case I can think of, the persisted shuffle 
> output already acts as a good enough cache by itself.
>
> FWIW, I am asking specifically about the DataFrame API. The only StorageLevel 
> allowed in my case is DISK_ONLY, i.e. I am not looking to speed things up by 
> caching data in memory.
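
One shape that may be worth testing as a counter-example is a plan in which the
same intermediate result feeds two branches of a single action (a union of two
filters, say) with no shuffle between the expensive step and the branch point.
The sketch below is a toy model in plain Python, not Spark code: Node,
build_job and run_single_action are hypothetical names, and the persist flag
only imitates what a DISK_ONLY persist would do. Whether Spark itself
recomputes the shared step depends on the physical plan, so this illustrates
the reasoning only and is not a measurement.

```python
class Node:
    """One step in a lazily evaluated lineage (toy stand-in for a DataFrame)."""

    def __init__(self, name, fn, parents=(), persist=False):
        self.name = name
        self.fn = fn                    # function applied to the parents' results
        self.parents = list(parents)
        self.persist = persist          # imitates df.persist(DISK_ONLY); assumption
        self._stored = None
        self.evaluations = 0            # how many times fn actually ran

    def evaluate(self):
        # A persisted node is computed once and then served from storage.
        if self.persist and self._stored is not None:
            return self._stored
        inputs = [parent.evaluate() for parent in self.parents]
        self.evaluations += 1
        result = self.fn(*inputs)
        if self.persist:
            self._stored = result
        return result


def build_job(persist_expensive):
    """Build a lineage in which two branches share one expensive step."""
    source = Node("source", lambda: list(range(5)))
    expensive = Node(
        "expensive",
        lambda rows: [r * r for r in rows],
        [source],
        persist=persist_expensive,
    )
    evens = Node("evens", lambda rows: [r for r in rows if r % 2 == 0], [expensive])
    odds = Node("odds", lambda rows: [r for r in rows if r % 2 == 1], [expensive])
    union = Node("union", lambda a, b: a + b, [evens, odds])
    return expensive, union


def run_single_action(persist_expensive):
    """Run one 'action' (a count) and report how often the shared step ran."""
    expensive, union = build_job(persist_expensive)
    row_count = len(union.evaluate())
    return row_count, expensive.evaluations


if __name__ == "__main__":
    print("without persist:", run_single_action(False))
    print("with persist:   ", run_single_action(True))
```

In this model the shared step runs twice without the persist flag and once
with it, even though there is only one action. Checking whether the same holds
on a real cluster would mean comparing the explain() output or the Spark UI
for the two variants.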


