[ https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Haejoon Lee updated SPARK-38353:
--------------------------------
Description: Creating this ticket because instrumenting the __enter__ and __exit__ magic methods for the pandas API on Spark would improve the accuracy of the usage data. We are also interested in extending the pandas-on-Spark usage logger to other PySpark modules in the future, where it would likewise improve the accuracy of their usage data.

For example, for the following code:
{code:java}
pdf = pd.DataFrame(
    [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)
with psdf.spark.cache() as cached_df:
    self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
    self.assert_eq(
        repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, False, True))
    )
{code}
the pandas-on-Spark usage logger records the internal call [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518] because the __enter__ and __exit__ methods of [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492] are not instrumented.

> Instrument __enter__ and __exit__ magic methods for pandas API on Spark
> -----------------------------------------------------------------------
>
>                 Key: SPARK-38353
>                 URL: https://issues.apache.org/jira/browse/SPARK-38353
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Yihong He
>            Priority: Minor
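For context, instrumenting a context manager's __enter__ and __exit__ amounts to wrapping those methods so that entering and exiting the `with` block is recorded before the original behavior runs. The sketch below is purely illustrative: the `instrument_magic_methods` helper and the `CachedThing` stand-in are hypothetical names, not the actual pandas-on-Spark usage-logger code.

```python
import functools

def instrument_magic_methods(cls, log):
    """Wrap __enter__ and __exit__ on cls so context-manager usage is
    appended to `log`. (Hypothetical helper; the real usage logger
    attaches its wrappers differently, but the idea is the same.)"""
    for name in ("__enter__", "__exit__"):
        original = getattr(cls, name)

        def make_wrapper(original, name):
            @functools.wraps(original)
            def wrapper(self, *args, **kwargs):
                # Record the call, then delegate to the real method.
                log.append(f"{cls.__name__}.{name}")
                return original(self, *args, **kwargs)
            return wrapper

        setattr(cls, name, make_wrapper(original, name))
    return cls

class CachedThing:
    """Toy context manager standing in for CachedDataFrame."""
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc, tb):
        self.unpersist()  # internal call, like self.spark.unpersist()
        return False
    def unpersist(self):
        pass

calls = []
instrument_magic_methods(CachedThing, calls)
with CachedThing():
    pass
print(calls)  # ['CachedThing.__enter__', 'CachedThing.__exit__']
```

With the magic methods instrumented, the logger attributes the `with` block to the public context-manager API instead of (or in addition to) the internal `unpersist` call it triggers.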
--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org