Paul Staab created SPARK-40155:
----------------------------------

             Summary: Optionally use a serialized storage level for DataFrame.localCheckpoint()
                 Key: SPARK-40155
                 URL: https://issues.apache.org/jira/browse/SPARK-40155
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 3.3.0
            Reporter: Paul Staab
In PySpark 3.3.0, `DataFrame.localCheckpoint()` stores the RDD checkpoints using the "Disk Memory *Deserialized* 1x Replicated" storage level. Looking through the Python code and the documentation, I have not found any way to change this.

As serialized RDDs are often much smaller than deserialized ones (I have seen a 40 GB deserialized RDD shrink to 200 MB when serialized), I would usually like to create local checkpoints that are stored in serialized rather than deserialized format.

To make this possible, we could e.g. add an optional `storage_level` argument to `DataFrame.localCheckpoint()`, similar to `DataFrame.persist()`, or add a global configuration option similar to `spark.checkpoint.compress`.