[ https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yihong He updated SPARK-38353:
------------------------------
    Description: 
This ticket was created because instrumenting the {{__enter__}} and {{__exit__}} magic methods in the pandas module can improve the accuracy of the usage data. We are also interested in extending the pandas usage logger to other PySpark modules in the future, so this change would improve the accuracy of their usage data as well.

For example, for the following code:
{code:python}
pdf = pd.DataFrame(
    [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)
with psdf.spark.cache() as cached_df:
    self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
    self.assert_eq(
        repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True))
    )
{code}
the pandas usage logger records the internal [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] call, because the {{__enter__}} and {{__exit__}} methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented.

  was:
Create the ticket since instrumenting {{__enter__}} and {{__exit__}} magic methods for Pandas module can help improve accuracy of the usage data. Besides, we are interested in extend the Pandas usage logger to other PySpark modules in the future so it will help improve accuracy of usage data of other PySpark modules.

For example, for the following code:
{code:python}
pdf = pd.DataFrame(
    [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)
with psdf.spark.cache() as cached_df:
    self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
    self.assert_eq(
        repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True))
    )
{code}
pandas usage logger records [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] since {{__enter__}} and {{__exit__}} methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented.


> Instrument __enter__ and __exit__ magic methods for Pandas module
> -----------------------------------------------------------------
>
>                 Key: SPARK-38353
>                 URL: https://issues.apache.org/jira/browse/SPARK-38353
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Yihong He
>            Priority: Minor
>
> This ticket was created because instrumenting the {{__enter__}} and
> {{__exit__}} magic methods in the pandas module can improve the accuracy
> of the usage data. We are also interested in extending the pandas usage
> logger to other PySpark modules in the future, so this change would
> improve the accuracy of their usage data as well.
> For example, for the following code:
>
> {code:python}
> pdf = pd.DataFrame(
>     [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
> )
> psdf = ps.from_pandas(pdf)
> with psdf.spark.cache() as cached_df:
>     self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
>     self.assert_eq(
>         repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True))
>     )
> {code}
>
> the pandas usage logger records the internal
> [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
> call, because the {{__enter__}} and {{__exit__}} methods of
> [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
> are not instrumented.
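
The misattribution described in the ticket can be reproduced in isolation. The following is a minimal sketch, not the PySpark usage logger itself: it assumes a logger that records only the outermost instrumented call, tracked with a thread-local flag. The names `instrument`, `ToyCachedFrame`, `run`, and `calls` are hypothetical and exist only for this illustration.

```python
import functools
import threading

_state = threading.local()
calls = []  # recorded outermost calls, as (class name, method name)


def instrument(cls, name):
    """Wrap cls.<name> so that only the outermost call is recorded."""
    original = getattr(cls, name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        if getattr(_state, "active", False):
            # Already inside an instrumented call: treat this one as internal.
            return original(self, *args, **kwargs)
        _state.active = True
        try:
            calls.append((cls.__name__, name))
            return original(self, *args, **kwargs)
        finally:
            _state.active = False

    setattr(cls, name, wrapper)


class ToyCachedFrame:
    """Hypothetical stand-in for a cached frame used as a context manager."""

    def unpersist(self):
        return None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()  # internal call made on behalf of the `with` block
        return False


def run(methods):
    """Instrument the given methods on a fresh subclass, then use it in a `with` block."""
    calls.clear()
    cls = type("ToyCachedFrame", (ToyCachedFrame,), {})
    for name in methods:
        instrument(cls, name)
    with cls():
        pass
    return list(calls)


# Magic methods not instrumented: the internal unpersist() looks like a user call.
print(run(["unpersist"]))
# [('ToyCachedFrame', 'unpersist')]

# Magic methods instrumented: the `with` block itself is what gets recorded.
print(run(["unpersist", "__enter__", "__exit__"]))
# [('ToyCachedFrame', '__enter__'), ('ToyCachedFrame', '__exit__')]
```

In this sketch the wrappers are attached to the class rather than to an instance, because the `with` statement looks up `__enter__` and `__exit__` on the object's type.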