tl;dr: a proposal for a PySpark "DynamicDataFrame" class that would make it
easier to subclass DataFrame while keeping its chainable methods.

Hello everyone. We have been working for a long time with PySpark and, more
specifically, with DataFrames. In our pipelines we have several tables, each
with a specific purpose, that we usually load as DataFrames. As you might
expect, there is a handful of queries and transformations per dataframe that
are run many times, so we looked into ways to abstract them:

1. Functions: free functions that take dataframes and return them
transformed. This had a couple of pitfalls: we had to manage the namespaces
carefully, and the "chainability" didn't feel very pyspark-y.
2. Monkeypatching DataFrame: we monkeypatched (
https://stackoverflow.com/questions/5626193/what-is-monkey-patching)
the frequently used queries as methods onto the DataFrame class. This
kept things pyspark-y, but there was no easy way to keep the namespaces
segregated.
3. Inheritance: create a class `MyBusinessDataFrame` that inherits from
`DataFrame` and implement the methods there. This solves all the issues,
but with a caveat: the chainable methods cast their results explicitly
to `DataFrame` (see
https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1910
e.g.). Therefore, every time you use one of the parent's methods you'd have
to re-cast the result to `MyBusinessDataFrame`, making the code cumbersome
(see the sketch right after this list).
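
To make the caveat in #3 concrete, here is a hedged sketch (the
`add_metadata` method and column name are made up for illustration): since
the stock `select` returns a plain `DataFrame`, the subclass's methods are
gone after the first chained call.

```python
from pyspark.sql import DataFrame
import pyspark.sql.functions as F


class MyBusinessDataFrame(DataFrame):
    def add_metadata(self):
        # hypothetical business-specific transformation
        return self.withColumn("loaded_at", F.current_timestamp())


# `select` is inherited from DataFrame and returns a plain DataFrame,
# so the subclass type (and `add_metadata`) is lost after the first call:
#     business_df.select("id").add_metadata()  # AttributeError
# and you end up re-casting at every step to keep the chain going.
```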

In view of these pitfalls we decided to go for a slightly different
approach, inspired by #3: we created a class called `DynamicDataFrame` that
overrides the explicit casts to `DataFrame` done inside PySpark and instead
casts dynamically to `self.__class__` (see
https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e#file-dynamic_dataframe_minimal-py-L21
e.g.). This lets the fluent methods always keep the caller's class, making
chaining as smooth as it is with plain PySpark dataframes.
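
For reference, a minimal sketch along the lines of the gist (the `_wrap`
helper is mine, and the internal `(jdf, sql_ctx)` constructor arguments are
an assumption that may differ across PySpark versions):

```python
from pyspark.sql import DataFrame


class DynamicDataFrame(DataFrame):
    """DataFrame whose fluent methods return instances of the caller's class."""

    def _wrap(self, df):
        # re-wrap the plain DataFrame returned by the parent method into
        # whatever subclass `self` actually is, using PySpark's internal
        # DataFrame(jdf, sql_ctx) constructor
        return self.__class__(df._jdf, df.sql_ctx)

    def withColumn(self, colName, col):
        return self._wrap(super().withColumn(colName, col))

    def select(self, *cols):
        return self._wrap(super().select(*cols))
```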

As an example implementation, here's a link to a gist (
https://gist.github.com/pabloalcain/de79938507ad2d823a866238b3c8a66e) that
dynamically implements the `withColumn` and `select` methods and shows the
expected output.
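
Continuing the sketch above, usage would look something like this (the same
hypothetical `MyBusinessDataFrame` example, now inheriting from
`DynamicDataFrame`), with every fluent call staying in the business class:

```python
import pyspark.sql.functions as F


class MyBusinessDataFrame(DynamicDataFrame):
    def add_metadata(self):
        # hypothetical business-specific transformation; withColumn now
        # returns MyBusinessDataFrame, so the chain never drops the class
        return self.withColumn("loaded_at", F.current_timestamp())


# e.g. business_df.select("id", "revenue").add_metadata().select("loaded_at")
# keeps returning MyBusinessDataFrame at every step, no re-casting needed.
```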

I'm sharing this here in case this approach is useful to anyone else. In
our case it greatly sped up the development of abstraction layers and let
us write cleaner code. One of the advantages is that it would simply be a
"plugin" on top of PySpark that does not modify any existing code or
application interfaces.

If you think this could be helpful, I can write a PR as a more refined
proof of concept.

Thanks!

Pablo
