Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18732#discussion_r141883671
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2206,6 +2207,10 @@ def pandas_udf(f=None, returnType=StringType()):
         |         8|      JOHN DOE|          22|
         +----------+--------------+------------+
         """
    +    import pandas as pd
    +    if isinstance(returnType, pd.Series):
    +        returnType = from_pandas_dtypes(returnType)
    --- End diff ---
    
    I agree having a consistent way to express return type is good.
    
    The reason I added this is to enable this usage:
    
    ```python
    sample_df = df.filter(df.id == 1).toPandas()

    def foo(df):
        ret = ...  # Some transformation on the input pd.DataFrame
        return ret

    foo_udf = pandas_udf(foo, foo(sample_df).dtypes)

    df.groupBy('id').apply(foo_udf)
    ```
    
    The pattern is quite useful in interactive usage. Here the user no longer needs to specify the return schema of `foo` manually, and if the user later changes the return columns of `foo`, they don't need to update the return type passed to `pandas_udf`.
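    
    For illustration, here is a minimal sketch of how a pandas `.dtypes` Series could be mapped to a Spark DDL-style schema string. The function name and the dtype mapping below are assumptions for the sketch, not the actual `from_pandas_dtypes` implementation from the diff:

```python
import pandas as pd

# Hypothetical dtype-name -> Spark SQL type-name mapping (an assumption,
# not the mapping used by the PR).
_DTYPE_TO_SPARK = {
    "int64": "long",
    "int32": "int",
    "float64": "double",
    "float32": "float",
    "bool": "boolean",
    "object": "string",
    "datetime64[ns]": "timestamp",
}

def schema_from_dtypes(dtypes):
    """Build a DDL-style schema string from a pandas .dtypes Series."""
    fields = []
    for name, dtype in dtypes.items():
        spark_type = _DTYPE_TO_SPARK.get(str(dtype), "string")
        fields.append("%s %s" % (name, spark_type))
    return ", ".join(fields)

sample = pd.DataFrame({"id": [1], "name": ["a"], "score": [0.5]})
print(schema_from_dtypes(sample.dtypes))  # id long, name string, score double
```

    With something like this, `foo(sample_df).dtypes` is enough to derive the UDF's return schema from a sample of the data.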
    
    I am leaning towards keeping this but I am willing to be convinced.
    