[ https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen reassigned SPARK-26449:
---------------------------------

Assignee: Hanan Shteingart

> Missing Dataframe.transform API in Python API
> ---------------------------------------------
>
>                 Key: SPARK-26449
>                 URL: https://issues.apache.org/jira/browse/SPARK-26449
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 2.4.0
>            Reporter: Hanan Shteingart
>            Assignee: Hanan Shteingart
>            Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I would like to chain custom transformations, as suggested in this [blog post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55].
> This would allow writing something like the following:
>
> {code:python}
> from pyspark.sql.functions import lit
>
> def with_greeting(df):
>     return df.withColumn("greeting", lit("hi"))
>
> def with_something(df, something):
>     return df.withColumn("something", lit(something))
>
> data = [("jose", 1), ("li", 2), ("liz", 3)]
> source_df = spark.createDataFrame(data, ["name", "age"])
>
> actual_df = (source_df
>     .transform(with_greeting)
>     .transform(lambda df: with_something(df, "crazy")))
> actual_df.show()
> +----+---+--------+---------+
> |name|age|greeting|something|
> +----+---+--------+---------+
> |jose|  1|      hi|    crazy|
> |  li|  2|      hi|    crazy|
> | liz|  3|      hi|    crazy|
> +----+---+--------+---------+
> {code}
>
> The only thing needed to accomplish this is the following simple method on DataFrame:
>
> {code:python}
> from pyspark.sql.dataframe import DataFrame
>
> def transform(self, f):
>     return f(self)
>
> DataFrame.transform = transform
> {code}
>
> I volunteer to do the pull request if approved (at least the Python part).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
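For readers without a Spark session at hand, the chaining mechanics of the proposed one-line `transform` method can be demonstrated with a plain-Python stand-in. `ToyFrame`, `with_column`, and the row dictionaries below are hypothetical illustrations, not Spark APIs; only the `transform` body mirrors the patch proposed in the issue.

```python
# Toy stand-in for pyspark.sql.DataFrame; illustrates how
# transform(self, f) -> f(self) enables method-style chaining
# of free functions without any changes to those functions.
class ToyFrame:
    def __init__(self, rows):
        self.rows = rows

    def with_column(self, name, value):
        # Return a new frame with an extra constant column,
        # loosely mimicking DataFrame.withColumn(name, lit(value)).
        return ToyFrame([{**row, name: value} for row in self.rows])

    def transform(self, f):
        # The entire proposed method: apply f to self.
        return f(self)


def with_greeting(df):
    return df.with_column("greeting", "hi")


def with_something(df, something):
    return df.with_column("something", something)


source = ToyFrame([{"name": "jose"}, {"name": "li"}, {"name": "liz"}])
actual = (source
          .transform(with_greeting)
          .transform(lambda df: with_something(df, "crazy")))
# Each row now carries both added columns:
# {'name': 'jose', 'greeting': 'hi', 'something': 'crazy'}
```

Because `transform` simply applies its argument, any transformation that takes extra parameters can be chained via a `lambda` (as above) or `functools.partial`, which is what makes the one-line method sufficient.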