Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19575#discussion_r163956524

    --- Diff: docs/sql-programming-guide.md ---
    @@ -1693,70 +1693,70 @@ Using the above optimizations with Arrow will produce the same results as when A
     enabled. Not all Spark data types are currently supported and an error will be raised if a column has an unsupported type, see [Supported Types](#supported-types).

    -## How to Write Vectorized UDFs
    +## Pandas UDFs (a.k.a Vectorized UDFs)

    -A vectorized UDF is similar to a standard UDF in Spark except the inputs and output will be
    -Pandas Series, which allow the function to be composed with vectorized operations. This function
    -can then be run very efficiently in Spark where data is sent in batches to Python and then
    -is executed using Pandas Series as the inputs. The exected output of the function is also a Pandas
    -Series of the same length as the inputs. A vectorized UDF is declared using the `pandas_udf`
    -keyword, no additional configuration is required.
    +With Arrow, we introduce a new type of UDF - pandas UDF. Pandas UDF is defined with a new function
    +`pyspark.sql.functions.pandas_udf` and allows user to use functions that operate on `pandas.Series`
    +and `pandas.DataFrame` with Spark. Currently, there are two types of pandas UDF: Scalar and Group Map.

    -The following example shows how to create a vectorized UDF that computes the product of 2 columns.
    +### Scalar
    +
    +Scalar pandas UDFs are used for vectorizing scalar operations. They can used with functions such as `select`
    +and `withColumn`. To define a scalar pandas UDF, use `pandas_udf` to annotate a Python function. The Python
    +should takes `pandas.Series` and returns a `pandas.Series` of the same size. Internally, Spark will
    +split a column into multiple `pandas.Series` and invoke the Python function with each `pandas.Series`, and
    +concat the results together to be a new column.
    +
    +The following example shows how to create a scalar pandas UDF that computes the product of 2 columns.

     <div class="codetabs">
     <div data-lang="python" markdown="1">
     {% highlight python %}

     import pandas as pd

    -from pyspark.sql.functions import col, pandas_udf
    -from pyspark.sql.types import LongType
    +from pyspark.sql.functions import pandas_udf, PandasUDFTypr
    --- End diff --

    @icexelloss I think there is a typo: `PandasUDFTypr` should be `PandasUDFType`. Please make sure the example runs.
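
For reference, here is a minimal sketch of the behaviour the new "Scalar" paragraph describes, written in plain pandas so it runs without a Spark session. It is not the example from the PR: `multiply_func`, `apply_in_batches`, and `batch_size` are hypothetical names chosen for illustration, and the batching loop only imitates the split / invoke / concat steps that the doc text attributes to Spark.

```python
import pandas as pd

# With Spark installed, the corrected import from the diff would read:
#   from pyspark.sql.functions import pandas_udf, PandasUDFType
# and the function below could be wrapped as a scalar pandas UDF.
# That step is left out here so the sketch runs with pandas alone.


def multiply_func(a: pd.Series, b: pd.Series) -> pd.Series:
    """Return the element-wise product; output has the same length as the inputs."""
    return a * b


def apply_in_batches(func, a: pd.Series, b: pd.Series, batch_size: int) -> pd.Series:
    """Hypothetical helper that imitates the described execution model:
    split the columns into batches, call func on each batch, concat the results."""
    results = []
    for start in range(0, len(a), batch_size):
        stop = start + batch_size
        results.append(func(a.iloc[start:stop], b.iloc[start:stop]))
    return pd.concat(results)


x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([10, 20, 30, 40, 50])

# Three batches of sizes 2, 2, and 1 are processed and stitched back together.
product = apply_in_batches(multiply_func, x, y, batch_size=2)
print(product.tolist())
```

Because the function only ever sees `pandas.Series` inputs and returns a `pandas.Series` of the same size, it can be checked locally like this before it is registered with Spark.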