[ https://issues.apache.org/jira/browse/SPARK-31448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Abhishek Dixit updated SPARK-31448:
-----------------------------------
    Labels: pys  (was: )

> Difference in Storage Levels used in cache() and persist() for pyspark dataframes
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-31448
>                 URL: https://issues.apache.org/jira/browse/SPARK-31448
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>            Reporter: Abhishek Dixit
>            Priority: Major
>              Labels: pys
>
> There is a difference in the default storage level *MEMORY_AND_DISK* between pyspark and scala.
> *Scala:* StorageLevel(true, true, false, true)
> *Pyspark:* StorageLevel(True, True, False, False)
>
> *Problem Description:*
> Calling *df.cache()* on a pyspark dataframe directly invokes the Scala method cache(), so the storage level used is StorageLevel(true, true, false, true).
> But calling *df.persist()* on a pyspark dataframe sets newStorageLevel = StorageLevel(true, true, false, false) inside pyspark and then invokes the Scala function persist(newStorageLevel).
>
> *Possible Fix:*
> Invoke the pyspark function persist inside the pyspark function cache instead of calling the Scala function directly.
> I can raise a PR for this fix if someone can confirm that this is a bug and the possible fix is the correct approach.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
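The mismatch described above can be illustrated with a small pure-Python sketch. The namedtuple here is only a stand-in for Spark's real StorageLevel class (it is not the pyspark API); the flag values (useDisk, useMemory, useOffHeap, deserialized) are the ones quoted in the report:

```python
from collections import namedtuple

# Illustrative model of Spark's StorageLevel flags, in constructor order:
# (useDisk, useMemory, useOffHeap, deserialized)
StorageLevel = namedtuple(
    "StorageLevel", ["useDisk", "useMemory", "useOffHeap", "deserialized"]
)

# Scala's MEMORY_AND_DISK keeps in-memory data deserialized.
scala_memory_and_disk = StorageLevel(True, True, False, True)

# PySpark's MEMORY_AND_DISK (the default passed by DataFrame.persist())
# stores data serialized instead.
pyspark_memory_and_disk = StorageLevel(True, True, False, False)

# df.cache() delegates straight to the JVM and therefore ends up with the
# Scala default, while df.persist() applies the Python-side default, so the
# two calls yield different storage levels:
assert scala_memory_and_disk != pyspark_memory_and_disk
assert scala_memory_and_disk.deserialized
assert not pyspark_memory_and_disk.deserialized
```

The proposed fix makes cache() route through the Python persist() wrapper, so both calls would apply the same Python-side default.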