[ 
https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yihong He updated SPARK-38353:
------------------------------
    Description: 
This ticket tracks instrumenting the __enter__ and __exit__ magic methods in 
the pandas module, which can improve the accuracy of the usage data. We are 
also interested in extending the pandas usage logger to other PySpark modules 
in the future, so this work will help improve the accuracy of their usage data 
as well.

For example, for the following code:

 
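{code:python}
import pandas as pd
import pyspark.pandas as ps
from pyspark import StorageLevel
from pyspark.pandas.frame import CachedDataFrame

# Excerpt from a unit test method; `self` is the test case.
pdf = pd.DataFrame(
    [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
    self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
    self.assert_eq(
        repr(cached_df.spark.storage_level),
        repr(StorageLevel(True, True, False, True)),
    )
{code}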
 

The pandas usage logger records the internal call 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] 
as if it were a user call, because the __enter__ and __exit__ methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] 
are not instrumented.
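
The effect can be reproduced without Spark. The following is a minimal sketch, 
not the PySpark implementation: it assumes a logger that uses a thread-local 
flag to skip calls made while another instrumented call is running. The names 
instrumented, calls, UninstrumentedFrame, InstrumentedFrame and record are all 
hypothetical and exist only for illustration.

```python
import functools
import threading

_local = threading.local()
calls = []  # hypothetical stand-in for a real usage logger


def instrumented(func):
    """Record only outermost calls (assumed logger behaviour).

    Calls made while another instrumented call is running are treated
    as internal and are not recorded.
    """

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if getattr(_local, "logging", False):
            # Nested call: run it without recording anything.
            return func(*args, **kwargs)
        _local.logging = True
        try:
            return func(*args, **kwargs)
        finally:
            calls.append(func.__qualname__)
            _local.logging = False

    return wrapper


class UninstrumentedFrame:
    """Context manager whose magic methods are NOT instrumented."""

    @instrumented
    def unpersist(self):
        pass

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Looks like a user call to the logger, although it is internal.
        self.unpersist()


class InstrumentedFrame:
    """Same context manager with instrumented magic methods."""

    @instrumented
    def unpersist(self):
        pass

    @instrumented
    def __enter__(self):
        return self

    @instrumented
    def __exit__(self, exc_type, exc, tb):
        # Now nested inside an instrumented call, so it is skipped.
        self.unpersist()


def record(frame_cls):
    """Run an empty `with` block and return the calls that were recorded."""
    calls.clear()
    with frame_cls():
        pass
    return list(calls)
```

In this sketch, record(UninstrumentedFrame) reports only the internal 
unpersist call, while record(InstrumentedFrame) reports __enter__ and __exit__, 
which is what the user actually invoked through the with statement.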

  was:
Create the ticket since instrumenting __enter__ and __exit__ magic methods 
for Pandas module can help improve accuracy of the usage data. Besides, we are 
interested in extending the Pandas usage logger to other PySpark modules in 
the future so it will help improve accuracy of usage data of other PySpark 
modules.

For example, for the following code:

 
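{code:python}
import pandas as pd
import pyspark.pandas as ps
from pyspark import StorageLevel
from pyspark.pandas.frame import CachedDataFrame

# Excerpt from a unit test method; `self` is the test case.
pdf = pd.DataFrame(
    [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
    self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
    self.assert_eq(
        repr(cached_df.spark.storage_level),
        repr(StorageLevel(True, True, False, True)),
    )
{code}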
 

pandas usage logger records 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] 
since __enter__ and __exit__ methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] 
are not instrumented.


> Instrument __enter__ and __exit__ magic methods for Pandas module
> -----------------------------------------------------------------
>
>                 Key: SPARK-38353
>                 URL: https://issues.apache.org/jira/browse/SPARK-38353
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Yihong He
>            Priority: Minor


