[ https://issues.apache.org/jira/browse/SPARK-31448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Abhishek Dixit updated SPARK-31448:
-----------------------------------
    Labels: pys  (was: )

> Difference in Storage Levels used in cache() and persist() for pyspark dataframes
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-31448
>                 URL: https://issues.apache.org/jira/browse/SPARK-31448
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>            Reporter: Abhishek Dixit
>            Priority: Major
>              Labels: pys
>
> There is a difference in the default storage level *MEMORY_AND_DISK* between pyspark and scala.
> *Scala:* StorageLevel(true, true, false, true)
> *Pyspark:* StorageLevel(True, True, False, False)
>
> *Problem Description:*
> Calling *df.cache()* on a pyspark dataframe directly invokes the Scala method cache(), so the storage level used is StorageLevel(true, true, false, true).
> But calling *df.persist()* on a pyspark dataframe sets newStorageLevel = StorageLevel(true, true, false, false) inside pyspark and then invokes the Scala function persist(newStorageLevel).
>
> *Possible Fix:*
> Invoke the pyspark function persist inside the pyspark function cache instead of calling the Scala function directly.
> I can raise a PR for this fix if someone can confirm that this is a bug and the possible fix is the correct approach.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
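The mismatch described above can be illustrated with a small pure-Python sketch. The namedtuple here is only a stand-in for Spark's real StorageLevel class (it is not the pyspark API); the flag values (useDisk, useMemory, useOffHeap, deserialized) are the ones quoted in the report:

```python
from collections import namedtuple

# Illustrative model of Spark's StorageLevel flags, in constructor order:
# (useDisk, useMemory, useOffHeap, deserialized)
StorageLevel = namedtuple(
    "StorageLevel", ["useDisk", "useMemory", "useOffHeap", "deserialized"]
)

# Scala's MEMORY_AND_DISK keeps in-memory data deserialized.
scala_memory_and_disk = StorageLevel(True, True, False, True)

# PySpark's MEMORY_AND_DISK (the default passed by DataFrame.persist())
# stores data serialized instead.
pyspark_memory_and_disk = StorageLevel(True, True, False, False)

# df.cache() delegates straight to the JVM and therefore ends up with the
# Scala default, while df.persist() applies the Python-side default, so the
# two calls yield different storage levels:
assert scala_memory_and_disk != pyspark_memory_and_disk
assert scala_memory_and_disk.deserialized
assert not pyspark_memory_and_disk.deserialized
```

The proposed fix makes cache() route through the Python persist() wrapper, so both calls would apply the same Python-side default.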