Repository: spark
Updated Branches:
  refs/heads/branch-2.3 9b0f6f530 -> 8bb6c2285

[SPARK-24392][PYTHON] Label pandas_udf as Experimental

The pandas_udf functionality was introduced in 2.3.0, but is not completely stable and still evolving. This adds a label to indicate it is still an experimental API.

NA

Author: Bryan Cutler <cutl...@gmail.com>

Closes #21435 from BryanCutler/arrow-pandas_udf-experimental-SPARK-24392.

(cherry picked from commit fa2ae9d2019f839647d17932d8fea769e7622777)
Signed-off-by: hyukjinkwon <gurwls...@apache.org>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8bb6c228
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8bb6c228
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8bb6c228

Branch: refs/heads/branch-2.3
Commit: 8bb6c2285c6017f28d8c94f4030df518f6d3048d
Parents: 9b0f6f5
Author: Bryan Cutler <cutl...@gmail.com>
Authored: Mon May 28 12:56:05 2018 +0800
Committer: hyukjinkwon <gurwls...@apache.org>
Committed: Mon May 28 12:57:18 2018 +0800

----------------------------------------------------------------------
 docs/sql-programming-guide.md   | 4 ++++
 python/pyspark/sql/dataframe.py | 2 ++
 python/pyspark/sql/functions.py | 2 ++
 python/pyspark/sql/group.py     | 2 ++
 python/pyspark/sql/session.py   | 2 ++
 5 files changed, 12 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/8bb6c228/docs/sql-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 651e440..14bc5e6 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1797,6 +1797,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
 
 # Migration Guide
 
+## Upgrading From Spark SQL 2.3.0 to 2.3.1 and above
+
+  - As of version 2.3.1 Arrow functionality, including `pandas_udf` and `toPandas()`/`createDataFrame()` with `spark.sql.execution.arrow.enabled` set to `True`, has been marked as experimental. These are still evolving and not currently recommended for use in production.
+
 ## Upgrading From Spark SQL 2.2 to 2.3
 
   - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.

http://git-wip-us.apache.org/repos/asf/spark/blob/8bb6c228/python/pyspark/sql/dataframe.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 9bb0dca..d416b3b 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1924,6 +1924,8 @@ class DataFrame(object):
         .. note:: This method should only be used if the resulting Pandas's DataFrame is expected
             to be small, as all the data is loaded into the driver's memory.
 
+        .. note:: Usage with spark.sql.execution.arrow.enabled=True is experimental.
+
         >>> df.toPandas()  # doctest: +SKIP
            age   name
         0    2  Alice

http://git-wip-us.apache.org/repos/asf/spark/blob/8bb6c228/python/pyspark/sql/functions.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 365be7b..cf26523 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2172,6 +2172,8 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     :param functionType: an enum value in :class:`pyspark.sql.functions.PandasUDFType`.
                          Default: SCALAR.
 
+    .. note:: Experimental
+
     The function type of the UDF can be one of the following:
 
     1. SCALAR

http://git-wip-us.apache.org/repos/asf/spark/blob/8bb6c228/python/pyspark/sql/group.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/group.py b/python/pyspark/sql/group.py
index 330faf2..bc6c094 100644
--- a/python/pyspark/sql/group.py
+++ b/python/pyspark/sql/group.py
@@ -212,6 +212,8 @@ class GroupedData(object):
         This function does not support partial aggregation, and requires shuffling all the data in
         the :class:`DataFrame`.
 
+        .. note:: Experimental
+
         :param udf: a grouped map user-defined function returned by
             :func:`pyspark.sql.functions.pandas_udf`.
 

http://git-wip-us.apache.org/repos/asf/spark/blob/8bb6c228/python/pyspark/sql/session.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index 2ac2ec2..a459cb5 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -578,6 +578,8 @@ class SparkSession(object):
         .. versionchanged:: 2.1
            Added verifySchema.
 
+        .. note:: Usage with spark.sql.execution.arrow.enabled=True is experimental.
+
         >>> l = [('Alice', 1)]
         >>> spark.createDataFrame(l).collect()
         [Row(_1=u'Alice', _2=1)]

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org