tl;dr: a proposal for a PySpark "DynamicDataFrame" class that would make it easier to subclass DataFrame while keeping the chainable dataframe methods.
Hello everyone,

We have been working for a long time with PySpark, and more specifically with DataFrames. Our pipelines have several tables, each with a specific purpose, that we usually load as DataFrames. As you might expect, a handful of queries and transformations are applied to each dataframe many times over, so we looked for ways to abstract them:

1. Functions: free functions that take a dataframe and return it transformed. This had a couple of pitfalls: we had to manage the namespaces carefully, and the "chainability" didn't feel very pyspark-y.

2. Monkeypatching DataFrame: we monkeypatched (https://stackoverflow.com/questions/5626193/what-is-monkey-patching) the DataFrame class with methods for the frequently run queries. This kept things pyspark-y, but there was no easy way to keep the namespaces segregated.

3. Inheritance: create a class `MyBusinessDataFrame` that inherits from `DataFrame` and implement the methods there. This solves all of the above issues, but with a caveat: the chainable methods cast their result explicitly to `DataFrame` (see, e.g., https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910). Therefore, every time you use one of the parent's methods you have to re-cast the result back to `MyBusinessDataFrame`, which makes the code cumbersome.

In view of these pitfalls we went for a slightly different approach, inspired by #3: we created a class called `DynamicDataFrame` that overrides the explicit construction of `DataFrame` done in PySpark and instead casts dynamically to `self.__class__` (see, e.g., https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e#file-dynamic_dataframe_minimal-py-L21). The fluent methods then always preserve the caller's class, so chaining is as smooth as with plain pyspark dataframes. As an example implementation, here is a gist (https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e) that dynamically implements the `withColumn` and `select` methods, along with the expected output; a condensed sketch is also appended after the sign-off.

I'm sharing this here in case this approach is useful for anyone else. In our case it greatly sped up the development of abstraction layers and allowed us to write cleaner code. One of the advantages is that it would simply be a "plugin" on top of pyspark, one that does not modify any existing code or application interfaces. If you think this could be helpful, I can write a PR as a more refined proof of concept.

Thanks!
Pablo
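P.S. For readers who would rather not click through to the gist, below is a condensed, untested sketch of the core idea. It assumes classic (non-Connect) PySpark, where a DataFrame can be rebuilt from its underlying JVM handle and SQL context; on very recent versions the constructor details may differ.

    from pyspark.sql import DataFrame


    class DynamicDataFrame(DataFrame):
        def __init__(self, df: DataFrame):
            # Rebuild this object from the wrapped DataFrame's JVM handle and
            # context (newer PySpark also accepts df.sparkSession here).
            super().__init__(df._jdf, df.sql_ctx)

        def withColumn(self, colName, col) -> "DynamicDataFrame":
            # Cast to self.__class__ rather than the hardcoded DataFrame, so
            # subclasses keep their own type through chained calls.
            return self.__class__(super().withColumn(colName, col))

        def select(self, *cols) -> "DynamicDataFrame":
            return self.__class__(super().select(*cols))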
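A business-specific subclass can then chain freely without any re-casting. Here `SalesDataFrame` and its columns are made-up names for illustration, and `spark` is assumed to be an active SparkSession:

    import pyspark.sql.functions as F


    class SalesDataFrame(DynamicDataFrame):
        def with_revenue(self) -> "SalesDataFrame":
            return self.withColumn("revenue", F.col("price") * F.col("quantity"))


    sales = SalesDataFrame(spark.createDataFrame([(2.0, 3)], ["price", "quantity"]))
    result = sales.with_revenue().select("revenue")
    print(type(result))  # SalesDataFrame, not plain DataFrame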