Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/19575#discussion_r163957412

--- Diff: docs/sql-programming-guide.md ---
@@ -1693,70 +1693,70 @@ Using the above optimizations with Arrow will produce the same results as when A
 enabled. Not all Spark data types are currently supported and an error will be raised if a column
 has an unsupported type, see [Supported Types](#supported-types).

-## How to Write Vectorized UDFs
+## Pandas UDFs (a.k.a. Vectorized UDFs)

-A vectorized UDF is similar to a standard UDF in Spark except the inputs and output will be
-Pandas Series, which allow the function to be composed with vectorized operations. This function
-can then be run very efficiently in Spark where data is sent in batches to Python and then
-is executed using Pandas Series as the inputs. The exected output of the function is also a Pandas
-Series of the same length as the inputs. A vectorized UDF is declared using the `pandas_udf`
-keyword, no additional configuration is required.
+With Arrow, we introduce a new type of UDF - the pandas UDF. A pandas UDF is defined with a new function
+`pyspark.sql.functions.pandas_udf` and allows users to use functions that operate on `pandas.Series`
+and `pandas.DataFrame` with Spark. Currently, there are two types of pandas UDF: Scalar and Group Map.

-The following example shows how to create a vectorized UDF that computes the product of 2 columns.
+### Scalar
+
+Scalar pandas UDFs are used for vectorizing scalar operations. They can be used with functions such as `select`
+and `withColumn`. To define a scalar pandas UDF, use `pandas_udf` to annotate a Python function. The Python
+function should take `pandas.Series` and return a `pandas.Series` of the same size. Internally, Spark will
+split a column into multiple `pandas.Series`, invoke the Python function with each `pandas.Series`, and
+concatenate the results together as a new column.
+
+The following example shows how to create a scalar pandas UDF that computes the product of 2 columns.

 <div class="codetabs">
 <div data-lang="python" markdown="1">
 {% highlight python %}
 import pandas as pd

-from pyspark.sql.functions import col, pandas_udf
-from pyspark.sql.types import LongType
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+df = spark.createDataFrame(
+    [(1,), (2,), (3,)],
+    ['v'])

 # Declare the function and create the UDF
-def multiply_func(a, b):
+@pandas_udf('long', PandasUDFType.SCALAR)
+def multiply_udf(a, b):
+    # a and b are both pandas.Series
     return a * b

-multiply = pandas_udf(multiply_func, returnType=LongType())
-
-# The function for a pandas_udf should be able to execute with local Pandas data
-x = pd.Series([1, 2, 3])
-print(multiply_func(x, x))
-# 0    1
-# 1    4
-# 2    9
-# dtype: int64
-
-# Create a Spark DataFrame, 'spark' is an existing SparkSession
-df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
-
-# Execute function as a Spark vectorized UDF
-df.select(multiply(col("x"), col("x"))).show()
-# +-------------------+
-# |multiply_func(x, x)|
-# +-------------------+
-# |                  1|
-# |                  4|
-# |                  9|
-# +-------------------+
+df.select(multiply_udf(df.v, df.v)).show()
+# +------------------+
+# |multiply_udf(v, v)|
+# +------------------+
+# |                 1|
+# |                 4|
+# |                 9|
+# +------------------+
 {% endhighlight %}
 </div>
 </div>

-## GroupBy-Apply
-
-GroupBy-Apply implements the "split-apply-combine" pattern. Split-apply-combine consists of three steps:
+Note that there are two important requirements when using scalar pandas UDFs:
+* The input and output series must have the same size.
+* How a column is splitted into multiple `pandas.Series` is internal to Spark, and therefore the result

--- End diff --

splitted -> split
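The new section mentions a second pandas UDF type, Group Map, but the quoted hunk ends before showing it. Below is a minimal sketch of what a group map pandas UDF looks like, assuming the `PandasUDFType.GROUPED_MAP` variant used with `groupby(...).apply(...)`; the `subtract_mean` function and the sample DataFrame are illustrative and not taken from the PR.

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Sample data; 'spark' is an existing SparkSession
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ['id', 'v'])

# Each group of rows sharing an 'id' is passed to the function as a
# pandas.DataFrame, and the returned pandas.DataFrame must match the
# declared schema ('id long, v double').
@pandas_udf('id long, v double', PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame holding one group
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby('id').apply(subtract_mean).show()
```

Unlike the scalar form, which receives and returns `pandas.Series` per column, the group map form receives and returns a whole `pandas.DataFrame` per group.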