Hanan Shteingart created SPARK-26449:
----------------------------------------

             Summary: DataFrame.transform
                 Key: SPARK-26449
                 URL: https://issues.apache.org/jira/browse/SPARK-26449
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Hanan Shteingart


I would like to chain custom transformations, as suggested in this [blog 
post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55].

This would allow writing something like the following:
{code:java}
from pyspark.sql.functions import lit

def with_greeting(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(df, something):
    return df.withColumn("something", lit(something))

data = [("jose", 1), ("li", 2), ("liz", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

# transform() takes a function DataFrame -> DataFrame, so custom
# transformations chain left to right instead of nesting.
actual_df = (source_df
    .transform(with_greeting)
    .transform(lambda df: with_something(df, "crazy")))
actual_df.show()
+----+---+--------+---------+
|name|age|greeting|something|
+----+---+--------+---------+
|jose|  1|      hi|    crazy|
|  li|  2|      hi|    crazy|
| liz|  3|      hi|    crazy|
+----+---+--------+---------+

{code}
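For comparison, without such a method the same pipeline has to be written as nested calls, which read inside-out (a sketch reusing the helpers defined above):
{code:java}
# Same pipeline without transform(): the first transformation
# applied (with_greeting) ends up as the innermost call.
actual_df = with_something(with_greeting(source_df), "crazy")
{code}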
All that is needed to accomplish this is the following simple method on 
DataFrame:
{code:java}
from pyspark.sql.dataframe import DataFrame

def transform(self, f):
    """Apply f to this DataFrame and return the result."""
    return f(self)

# Attach the method to DataFrame.
DataFrame.transform = transform
{code}
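For transformations that take extra parameters, functools.partial is an alternative to a lambda (a minimal sketch, reusing with_something and source_df from the example above):
{code:java}
from functools import partial

# Bind the extra argument up front so the result is a one-argument
# function, which is all transform() expects.
actual_df = (source_df
    .transform(with_greeting)
    .transform(partial(with_something, something="crazy")))
{code}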
I volunteer to do the pull request if approved (at least the Python part).