[ 
https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yihong He updated SPARK-38353:
------------------------------
    Description: 
This ticket was created because instrumenting the __enter__ and __exit__ magic 
methods for the pandas module can help improve the accuracy of the usage data. 
In addition, we are interested in extending the pandas usage logger to other 
PySpark modules in the future, so this change will also help improve the 
accuracy of usage data for those modules.

For example, for the following code:

 
{code:python}
pdf = pd.DataFrame(
    [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
    self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
    self.assert_eq(
        repr(cached_df.spark.storage_level),
        repr(StorageLevel(True, True, False, True)),
    )
{code}
 

the pandas usage logger records 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
 because the __enter__ and __exit__ methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
 are not instrumented.
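
The effect described above can be reproduced with a small, self-contained sketch. Everything in it is hypothetical and only illustrates the idea: `instrument`, `calls`, and the toy `Cached` class are stand-ins, not the actual pyspark.pandas usage logger or CachedDataFrame. It assumes a logger that records only the outermost instrumented call, which is why the internal unpersist() call shows up when the magic methods are left unwrapped.

```python
import functools
import threading
from collections import Counter

calls = Counter()           # hypothetical usage record: "Class.method" -> count
_state = threading.local()  # marks whether we are already inside a recorded call


def _wrap(func, label):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if getattr(_state, "active", False):
            return func(*args, **kwargs)  # nested call: run it, but do not record it
        _state.active = True
        try:
            calls[label] += 1  # outermost call: record it
            return func(*args, **kwargs)
        finally:
            _state.active = False
    return wrapper


def instrument(cls, names):
    """Wrap the named methods on the class itself.

    Magic methods are looked up on the type, so they must be replaced
    on the class rather than on an instance.
    """
    for name in names:
        setattr(cls, name, _wrap(cls.__dict__[name], f"{cls.__name__}.{name}"))
    return cls


class Cached:
    """Toy stand-in for a cached frame that unpersists itself on exit."""

    def unpersist(self):
        pass

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()
        return False


# Case 1: only the public method is instrumented.
instrument(Cached, ["unpersist"])
with Cached():
    pass
before = dict(calls)

# Case 2: the magic methods are instrumented as well.
calls.clear()
instrument(Cached, ["__enter__", "__exit__"])
with Cached():
    pass
after = dict(calls)
```

In this sketch, `before` contains only the internal `Cached.unpersist` call, which the user never wrote, while `after` contains `Cached.__enter__` and `Cached.__exit__`, which correspond to the `with` statement the user actually used.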

  was:
For example, for the following code:

 
{code:python}
pdf = pd.DataFrame(
    [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
    self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
    self.assert_eq(
        repr(cached_df.spark.storage_level),
        repr(StorageLevel(True, True, False, True)),
    )
{code}
 

the pandas usage logger records 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
 because the __enter__ and __exit__ methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
 are not instrumented.

So instrumenting the __enter__ and __exit__ magic methods for the pandas module 
can help improve the accuracy of the usage data.


> Instrument __enter__ and __exit__ magic methods for Pandas module
> -----------------------------------------------------------------
>
>                 Key: SPARK-38353
>                 URL: https://issues.apache.org/jira/browse/SPARK-38353
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Yihong He
>            Priority: Minor


