spark git commit: [SPARK-24392][PYTHON] Label pandas_udf as Experimental

gurwls223 Sun, 27 May 2018 21:57:01 -0700

Repository: spark
Updated Branches:
  refs/heads/master de01a8d50 -> fa2ae9d20



[SPARK-24392][PYTHON] Label pandas_udf as Experimental

## What changes were proposed in this pull request?

The `pandas_udf` functionality was introduced in Spark 2.3.0, but it is not yet 
completely stable and is still evolving. This change adds a label to indicate 
that it is still an experimental API.

## How was this patch tested?

NA

Author: Bryan Cutler <cutl...@gmail.com>

Closes #21435 from BryanCutler/arrow-pandas_udf-experimental-SPARK-24392.
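
The labels this commit adds live in docstrings as `.. note::` lines, so a caller could in principle detect them at runtime. The following is a minimal sketch of that idea, not part of the patch: `is_marked_experimental` and the three stand-in functions are hypothetical names, and the check assumes the note text stays on a single line containing the word "experimental".

```python
import inspect


def is_marked_experimental(obj):
    """Return True if obj's docstring has a '.. note::' line mentioning 'experimental'.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    doc = inspect.getdoc(obj) or ""
    return any(
        line.strip().startswith(".. note::") and "experimental" in line.lower()
        for line in doc.splitlines()
    )


def stand_in_udf():
    """Hypothetical stand-in carrying the short label form.

    .. note:: Experimental
    """


def stand_in_to_pandas():
    """Hypothetical stand-in carrying the longer label form.

    .. note:: Usage with spark.sql.execution.arrow.enabled=True is experimental.
    """


def stand_in_stable():
    """Hypothetical stand-in with no label."""
```

With PySpark installed, the same helper could be pointed at `pyspark.sql.functions.pandas_udf`; the result depends on the installed version's docstring, so none is claimed here.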


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fa2ae9d2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fa2ae9d2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fa2ae9d2

Branch: refs/heads/master
Commit: fa2ae9d2019f839647d17932d8fea769e7622777
Parents: de01a8d
Author: Bryan Cutler <cutl...@gmail.com>
Authored: Mon May 28 12:56:05 2018 +0800
Committer: hyukjinkwon <gurwls...@apache.org>
Committed: Mon May 28 12:56:05 2018 +0800

----------------------------------------------------------------------
 docs/sql-programming-guide.md   | 4 ++++
 python/pyspark/sql/dataframe.py | 2 ++
 python/pyspark/sql/functions.py | 2 ++
 python/pyspark/sql/group.py     | 2 ++
 python/pyspark/sql/session.py   | 2 ++
 5 files changed, 12 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/fa2ae9d2/docs/sql-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index fc26562..5060086 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1827,6 +1827,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
   - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
   - In version 2.3 and earlier, CSV rows are considered as malformed if at least one column value in the row is malformed. CSV parser dropped such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, CSV row is considered as malformed only when it contains malformed column values requested from CSV datasource, other values can be ignored. As an example, CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selection of the id column consists of a row with one column value 1234 but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.
 
+## Upgrading From Spark SQL 2.3.0 to 2.3.1 and above
+
+  - As of version 2.3.1 Arrow functionality, including `pandas_udf` and `toPandas()`/`createDataFrame()` with `spark.sql.execution.arrow.enabled` set to `True`, has been marked as experimental. These are still evolving and not currently recommended for use in production.
+
 ## Upgrading From Spark SQL 2.2 to 2.3
 
   - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.

http://git-wip-us.apache.org/repos/asf/spark/blob/fa2ae9d2/python/pyspark/sql/dataframe.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 213dc15..808235a 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1975,6 +1975,8 @@ class DataFrame(object):
         .. note:: This method should only be used if the resulting Pandas's DataFrame is expected
             to be small, as all the data is loaded into the driver's memory.
 
+        .. note:: Usage with spark.sql.execution.arrow.enabled=True is experimental.
+
         >>> df.toPandas()  # doctest: +SKIP
            age   name
         0    2  Alice

http://git-wip-us.apache.org/repos/asf/spark/blob/fa2ae9d2/python/pyspark/sql/functions.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index fbc8a2d..efcce25 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2456,6 +2456,8 @@ def pandas_udf(f=None, returnType=None, functionType=None):
     :param functionType: an enum value in :class:`pyspark.sql.functions.PandasUDFType`.
                          Default: SCALAR.
 
+    .. note:: Experimental
+
     The function type of the UDF can be one of the following:
 
     1. SCALAR

http://git-wip-us.apache.org/repos/asf/spark/blob/fa2ae9d2/python/pyspark/sql/group.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/group.py b/python/pyspark/sql/group.py
index 3505065..0906c9c 100644
--- a/python/pyspark/sql/group.py
+++ b/python/pyspark/sql/group.py
@@ -236,6 +236,8 @@ class GroupedData(object):
             into memory, so the user should be aware of the potential OOM risk if data is skewed
             and certain groups are too large to fit in memory.
 
+        .. note:: Experimental
+
         :param udf: a grouped map user-defined function returned by
             :func:`pyspark.sql.functions.pandas_udf`.
 

http://git-wip-us.apache.org/repos/asf/spark/blob/fa2ae9d2/python/pyspark/sql/session.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index 13d6e2e..d675a24 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -584,6 +584,8 @@ class SparkSession(object):
         .. versionchanged:: 2.1
            Added verifySchema.
 
+        .. note:: Usage with spark.sql.execution.arrow.enabled=True is experimental.
+
         >>> l = [('Alice', 1)]
         >>> spark.createDataFrame(l).collect()
         [Row(_1=u'Alice', _2=1)]


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
