Paul Staab created SPARK-40155:
----------------------------------

             Summary: Optionally use a serialized storage level for 
DataFrame.localCheckpoint()
                 Key: SPARK-40155
                 URL: https://issues.apache.org/jira/browse/SPARK-40155
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 3.3.0
            Reporter: Paul Staab


In PySpark 3.3.0 `DataFrame.localCheckpoint()` stores the checkpointed RDD using the "Disk Memory *Deserialized* 1x Replicated" storage level. Looking through the Python code and the documentation, I haven't found any way to change this.
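
For illustration, a minimal sketch of the current behaviour (assuming `spark` is an active SparkSession; the storage level mentioned above is what shows up in the Storage tab of the Spark UI):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(10_000_000)

# localCheckpoint() only takes the `eager` flag; there is no argument for
# choosing a storage level, so the checkpointed RDD is always stored as
# "Disk Memory Deserialized 1x Replicated".
checkpointed = df.localCheckpoint(eager=True)
```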

As serialized RDDs are often much smaller than deserialized ones (I have seen a 40 GB deserialized RDD shrink to 200 MB when serialized), I would usually like to create local checkpoints that are stored in serialized rather than deserialized format.

To make this possible, we could, for example, add an optional `storage_level` argument to `DataFrame.localCheckpoint()`, similar to the one on `DataFrame.persist()`, or add a global configuration option similar to `spark.checkpoint.compress`. A sketch of the first option follows below.
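
This is only a sketch of how the first option could look from the caller's side: the `storage_level` parameter is hypothetical and does not exist in 3.3.0, while `StorageLevel.MEMORY_AND_DISK` is an existing PySpark constant whose `deserialized` flag is False, i.e. it requests serialized storage:

```python
from pyspark import StorageLevel

# Hypothetical signature: today localCheckpoint() only accepts `eager`.
# The proposal is to also accept a storage level, mirroring DataFrame.persist().
checkpointed = df.localCheckpoint(
    eager=True,
    storage_level=StorageLevel.MEMORY_AND_DISK,  # serialized variant
)
```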


