Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/19575#discussion_r163987675

--- Diff: docs/sql-programming-guide.md ---
@@ -1640,6 +1640,250 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a

You may run `./bin/spark-sql --help` for a complete list of all available options.

# PySpark Usage Guide for Pandas with Arrow

## Arrow in Spark

Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer
data between JVM and Python processes. It is currently most beneficial to Python users who
work with Pandas/NumPy data. Its usage is not automatic and might require some minor
changes to configuration or code to take full advantage and ensure compatibility. This guide
gives a high-level description of how to use Arrow in Spark and highlights any differences when
working with Arrow-enabled data.

### Ensure PyArrow Installed

If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
is installed and available on all cluster nodes. The currently supported version is 0.8.0.
You can install it using pip or conda from the conda-forge channel. See the PyArrow
[installation](https://arrow.apache.org/docs/python/install.html) page for details.

## Enabling for Conversion to/from Pandas

Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call
`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`.
To use Arrow when executing these calls, it first must be enabled by setting the Spark
configuration `spark.sql.execution.arrow.enabled` to `true`; this is disabled by default.

<div class="codetabs">
<div data-lang="python" markdown="1">
{% highlight python %}

import numpy as np
import pandas as pd

# Enable Arrow, 'spark' is an existing SparkSession
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Generate sample data
pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from Pandas data using Arrow
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a local Pandas DataFrame
selpdf = df.select("*").toPandas()

{% endhighlight %}
</div>
</div>

Using the above optimizations with Arrow will produce the same results as when Arrow is not
enabled. Not all Spark data types are currently supported, and an error will be raised if a
column has an unsupported type; see [Supported Types](#supported-types).

## Pandas UDFs (a.k.a. Vectorized UDFs)

With Arrow, we introduce a new type of UDF: the pandas UDF. A pandas UDF is defined with the new
function `pyspark.sql.functions.pandas_udf` and allows users to run functions that operate on
`pandas.Series` and `pandas.DataFrame` with Spark. Currently, there are two types of pandas UDF:
Scalar and Group Map.

### Scalar

Scalar pandas UDFs are used for vectorizing scalar operations. They can be used with functions
such as `select` and `withColumn`. To define a scalar pandas UDF, use `pandas_udf` to annotate a
Python function. The Python function should take `pandas.Series` as input and return a
`pandas.Series` of the same size. Internally, Spark will split a column into multiple
`pandas.Series`, invoke the Python function on each `pandas.Series`, and concatenate the results
together as a new column.
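Conceptually, the way Spark evaluates a scalar pandas UDF resembles the plain-pandas sketch below. This is only a simplified illustration of the split / apply / concatenate behavior described above, not Spark's actual implementation; the `plus_one` function and the batch size of 4 are made up for the example.

<div class="codetabs">
<div data-lang="python" markdown="1">
{% highlight python %}

import pandas as pd

# A hypothetical element-wise function, standing in for a scalar pandas UDF
def plus_one(s):
    return s + 1

column = pd.Series(range(10))

# Spark splits the column into batches; the batch size used here is arbitrary
chunks = [column[i:i + 4] for i in range(0, len(column), 4)]

# The function is invoked once per batch, and the results are concatenated
# back together to form the new column
result = pd.concat([plus_one(chunk) for chunk in chunks])

# Because plus_one is element-wise, the result does not depend on the split
assert result.equals(plus_one(column))

{% endhighlight %}
</div>
</div>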
The following example shows how to create a scalar pandas UDF that computes the product of two columns.

<div class="codetabs">
<div data-lang="python" markdown="1">
{% highlight python %}

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1,), (2,), (3,)],
    ['v'])

# Declare the function and create the UDF
@pandas_udf('long', PandasUDFType.SCALAR)
def multiply_udf(a, b):
    # a and b are both pandas.Series
    return a * b

df.select(multiply_udf(df.v, df.v)).show()
# +------------------+
# |multiply_udf(v, v)|
# +------------------+
# |                 1|
# |                 4|
# |                 9|
# +------------------+

{% endhighlight %}
</div>
</div>

Note that there are two important requirements when using scalar pandas UDFs:
* The input and output series must have the same size.
* How a column is split into multiple `pandas.Series` is internal to Spark, and therefore the
  result of the user-defined function must be independent of the splitting.

### Group Map
Group map pandas UDFs are used with `groupBy().apply()`, which implements the
"split-apply-combine" pattern. Split-apply-combine consists of three steps:
* Split the data into groups by using `DataFrame.groupBy`.
* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`.
  The input data contains all the rows and columns for each group.
* Combine the results into a new `DataFrame`.

To use `groupBy().apply()`, the user needs to define the following:
* A Python function that defines the computation for each group.
* A `StructType` object or a string that defines the schema of the output `DataFrame`.

Here we show two examples of using group map pandas UDFs.

The first example shows a simple use case: subtracting the mean from each value in the group.

<div class="codetabs">
<div data-lang="python" markdown="1">
{% highlight python %}

from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUP_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    v = pdf.v
    return pdf.assign(v=v - v.mean())

df.groupby("id").apply(subtract_mean).show()
# +---+----+
# | id|   v|
# +---+----+
# |  1|-0.5|
# |  1| 0.5|
# |  2|-3.0|
# |  2|-1.0|
# |  2| 4.0|
# +---+----+

{% endhighlight %}
</div>
</div>

The second example is more complicated. It shows how to run an OLS linear regression
--- End diff --

I don't have a strong opinion. I think a separate file is hard for users to discover. I actually think we should include the example in the Python doc; I often find having multiple examples in the Python doc useful. I am fine with removing this example here.