Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18732#discussion_r141883671

    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2206,6 +2207,10 @@ def pandas_udf(f=None, returnType=StringType()):
         |         8|      JOHN DOE|          22|
         +----------+--------------+------------+
         """
    +    import pandas as pd
    +    if isinstance(returnType, pd.Series):
    +        returnType = from_pandas_dtypes(returnType)
    --- End diff --

    I agree that having a consistent way to express the return type is good. The reason I added this is to enable the following usage:

    ```python
    sample_df = df.filter(df.id == 1).toPandas()

    def foo(df):
        ret = ...  # some transformation on the input pd.DataFrame
        return ret

    foo_udf = pandas_udf(foo, foo(sample_df).dtypes)

    df.groupBy('id').apply(foo_udf)
    ```

    This pattern is quite useful in interactive use: the user no longer needs to specify the return schema of `foo` manually, and if the return columns of `foo` change, they don't need to update the return type passed to `pandas_udf`. I am leaning towards keeping this, but I am willing to be convinced.
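    For context, a minimal sketch of what a `from_pandas_dtypes`-style helper could look like (this is an illustration only; the helper name, the dtype mapping, and the DDL-string output format are assumptions, not the actual PR implementation, which returns Spark `DataType` objects):

    ```python
    import pandas as pd

    # Hypothetical mapping from common NumPy/pandas dtype names to
    # Spark SQL type names; the real PR code may cover more types.
    _DTYPE_TO_SPARK = {
        "int64": "bigint",
        "int32": "int",
        "float64": "double",
        "float32": "float",
        "bool": "boolean",
        "object": "string",
        "datetime64[ns]": "timestamp",
    }

    def dtypes_to_spark_schema(dtypes):
        """Convert a pandas dtypes Series (column name -> dtype) into a
        Spark-style DDL schema string, e.g. 'id bigint, name string'."""
        return ", ".join(
            "%s %s" % (col, _DTYPE_TO_SPARK.get(str(dtype), "string"))
            for col, dtype in dtypes.items()
        )

    sample = pd.DataFrame({"id": [1], "name": ["a"], "score": [0.5]})
    print(dtypes_to_spark_schema(sample.dtypes))
    # id bigint, name string, score double
    ```

    The point is that `foo(sample_df).dtypes` already carries both column names and types, so the schema can be derived mechanically instead of being written out by hand.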