[ 
https://issues.apache.org/jira/browse/SPARK-37765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466252#comment-17466252
 ] 

Apache Spark commented on SPARK-37765:
--------------------------------------

User 'pabloalcain' has created a pull request for this issue:
https://github.com/apache/spark/pull/35045

> PySpark Dynamic DataFrame for easier inheritance
> ------------------------------------------------
>
>                 Key: SPARK-37765
>                 URL: https://issues.apache.org/jira/browse/SPARK-37765
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Pablo Alcain
>            Priority: Major
>
> In typical development settings, multiple tables with very different concepts 
> are mapped to the same `DataFrame` class. Inheriting from the pyspark 
> `DataFrame` class is cumbersome because of its chainable methods, and this 
> also makes it difficult to abstract frequently used queries. The proposal is 
> to add a `DynamicDataFrame` class that allows easy inheritance, retaining 
> `DataFrame` methods without losing chainability, either for the newly 
> defined queries or for the usual dataframe ones.
> In our experience, this allowed us to iterate *much* faster, generating 
> business-centric classes in a couple of lines of code. Here's an example of 
> what the application code would look like. Appended at the end is a summary 
> of the different strategies that are usually pursued when trying to abstract 
> queries.
> {code:python}
> import pyspark
> from pyspark.sql import DynamicDataFrame
> from pyspark.sql import functions as F
> 
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> 
> 
> class Inventory(DynamicDataFrame):
>     def update_prices(self, factor: float = 2.0):
>         return self.withColumn("price", F.col("price") * factor)
> 
> 
> base_dataframe = spark.createDataFrame(
>     data=[["product_1", 2.0], ["product_2", 4.0]],
>     schema=["name", "price"],
> )
> 
> print("Doing an inheritance mediated by DynamicDataFrame")
> inventory = Inventory(base_dataframe)
> inventory_updated = inventory.update_prices(2.0).update_prices(5.0)
> print("inventory_updated.show():")
> inventory_updated.show()
> 
> print("After multiple uses of the query we still have the desired type")
> print(f"type(inventory_updated): {type(inventory_updated)}")
> 
> print("We can still use the usual dataframe methods")
> expensive_inventory = inventory_updated.filter(F.col("price") > 25)
> print("expensive_inventory.show():")
> expensive_inventory.show()
> 
> print("And retain the desired type")
> print(f"type(expensive_inventory): {type(expensive_inventory)}")
> {code}
> The PR linked to this ticket is an implementation of the `DynamicDataFrame` 
> class used in this snippet.
>  
> Other strategies found for handling the query abstraction:
> 1. Functions: using functions that take dataframes and return them 
> transformed. This had a couple of pitfalls: we had to manage the namespaces 
> carefully, there was no clear new object, and the "chainability" didn't 
> feel very pyspark-y.
> 2. Monkeypatching DataFrame: we monkeypatched 
> ([https://stackoverflow.com/questions/5626193/what-is-monkey-patching]) 
> the regularly used queries as methods onto the DataFrame class. This one 
> kept it pyspark-y, but there was no easy way to handle segregated namespaces.
> 3. Inheritance: create a class `MyBusinessDataFrame` that inherits from 
> `DataFrame` and implement the methods there. This one solves all the issues, 
> but with a caveat: the chainable methods cast their result explicitly to 
> `DataFrame` (see 
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910]
>  e.g.). Therefore, every time you use one of the parent's methods you'd have 
> to re-cast to `MyBusinessDataFrame`, making the code cumbersome.
>  
> (see 
> [https://mail-archives.apache.org/mod_mbox/spark-dev/202111.mbox/browser] for 
> the link to the original mail in which we proposed this feature)
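For readers following along without the PR, the re-wrapping idea behind the proposal can be illustrated with a stdlib-only sketch. The `Table` and `DynamicTable` classes below are hypothetical stand-ins (none of these names are pyspark's API, and this is not the PR's implementation); the point is only the technique: intercept method calls and cast any base-class result back to the subclass, so chaining never degrades the type.

```python
class Table:
    """Stand-in for DataFrame: chainable methods return the *base* class."""
    def __init__(self, rows):
        self.rows = rows

    def filter(self, predicate):
        # Like DataFrame.filter, this returns a plain Table, losing any
        # subclass type the caller started with.
        return Table([r for r in self.rows if predicate(r)])


class DynamicTable(Table):
    """Re-wraps chainable results so subclass methods survive chaining."""
    def __init__(self, table):
        super().__init__(table.rows)

    def __getattribute__(self, name):
        attr = super().__getattribute__(name)
        if callable(attr):
            def wrapper(*args, **kwargs):
                result = attr(*args, **kwargs)
                # A base-class method returned a plain Table: cast it back
                # to whatever subclass we are.
                if type(result) is Table:
                    return type(self)(result)
                return result
            return wrapper
        return attr


class Inventory(DynamicTable):
    def expensive(self, threshold):
        return self.filter(lambda r: r["price"] > threshold)


inv = Inventory(Table([{"name": "a", "price": 2.0},
                       {"name": "b", "price": 30.0}]))
# Chain an inherited method and a subclass method; the type is preserved.
chained = inv.filter(lambda r: r["price"] > 1).expensive(25)
print(type(chained).__name__, len(chained.rows))  # -> Inventory 1
```

A real implementation for pyspark would face extra concerns (methods returning non-DataFrame values, `Column` expressions, performance of attribute interception), which is what the linked PR addresses.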



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
