This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new e298a136881 [SPARK-39425][PYTHON][PS] Add migration guide for pandas 
1.4 behavior changes
e298a136881 is described below

commit e298a136881f734db24f339b80341d8ff888bacc
Author: Yikun Jiang <yikunk...@gmail.com>
AuthorDate: Fri Jun 10 17:22:51 2022 +0900

    [SPARK-39425][PYTHON][PS] Add migration guide for pandas 1.4 behavior 
changes
    
    ### What changes were proposed in this pull request?
    Add migration guide for pandas 1.4 behavior changes:
    * SPARK-39054 https://github.com/apache/spark/pull/36581: In Spark 3.4,
    if the return type of the Pandas on Spark API `Groupby.apply`'s `func`
    parameter is not specified and `compute.shortcut_limit` is set to `0`,
    the number of sampled rows will be set to 2 (ensuring the sample always
    has >= 2 rows) so that the inferred schema is accurate.
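
    A minimal sketch of the change above; the data and option value are
    illustrative, not taken from the patch:

        import pyspark.pandas as ps

        ps.set_option("compute.shortcut_limit", 0)
        psdf = ps.DataFrame({"a": [1, 1, 2], "b": [1.0, 2.0, 3.0]})
        # `func` has no return-type annotation, so the result schema is
        # inferred from a sampled pandas DataFrame; in Spark 3.4 at least
        # 2 rows are sampled even when compute.shortcut_limit is 0.
        print(psdf.groupby("a").apply(lambda g: g.sum()))
        ps.reset_option("compute.shortcut_limit")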
    
    * SPARK-38822 https://github.com/apache/spark/pull/36168: In Spark 3.4,
    when the Pandas on Spark API `Index.insert` is given an out-of-bounds
    location, it raises an `IndexError` with `index {} is out of bounds for
    axis 0 with size {}` to follow pandas 1.4 behavior.
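
    A minimal sketch of the new error, with illustrative data:

        import pyspark.pandas as ps

        psidx = ps.Index([1, 2, 3])
        try:
            psidx.insert(10, 99)  # location 10 is out of bounds for size 3
        except IndexError as e:
            # e.g. "index 10 is out of bounds for axis 0 with size 3"
            print(e)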
    
    * SPARK-38857 https://github.com/apache/spark/pull/36159: In Spark 3.4,
    the series name is preserved in the Pandas on Spark API `Series.mode`
    to follow pandas 1.4 behavior.
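
    A minimal sketch, with illustrative data:

        import pyspark.pandas as ps

        psser = ps.Series([1, 2, 2], name="x")
        # In Spark 3.4 the result keeps the name "x"; previously the name
        # was dropped.
        print(psser.mode().name)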
    
    * SPARK-38859 https://github.com/apache/spark/pull/36142: In Spark 3.4,
    the Pandas on Spark API `Index.__setitem__` first checks whether `value`
    is a `Column` to avoid raising an unexpected error in `is_list_like`,
    such as `Cannot convert column into bool: please use '&' for 'and', '|'
    for 'or', '~' for 'not' when building DataFrame boolean expressions.`.
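
    A hedged sketch of the kind of guard described above (not the actual
    Spark source; `_is_sequence` is a hypothetical helper):

        from pandas.api.types import is_list_like
        from pyspark.sql import Column

        def _is_sequence(value):
            # Check for a Spark Column first so that is_list_like never
            # probes a Column, which can raise instead of returning a bool.
            if isinstance(value, Column):
                return False
            return is_list_like(value)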
    
    * SPARK-38820 https://github.com/apache/spark/pull/36357: In Spark 3.4,
    the Pandas on Spark API `astype('category')` also refreshes
    `categories.dtype` according to the original data's `dtype` to follow
    pandas 1.4 behavior.
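
    A minimal sketch, with illustrative data:

        import pyspark.pandas as ps

        psser = ps.Series([1, 2, 3])
        cat = psser.astype("category")
        # In Spark 3.4 the categories keep the original int64 dtype rather
        # than falling back to object.
        print(cat.dtype.categories.dtype)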
    
    * SPARK-38947 https://github.com/apache/spark/pull/36464: In Spark 3.4,
    the Pandas on Spark API supports groupby positional indexing in
    `GroupBy.head` and `GroupBy.tail` to follow pandas 1.4. Negative
    arguments now work correctly and result in ranges relative to the end
    and start of each group. Previously, negative arguments returned empty
    frames.
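
    A minimal sketch, with illustrative data:

        import pyspark.pandas as ps

        psdf = ps.DataFrame({"g": [1, 1, 1, 2, 2, 2],
                             "v": [1, 2, 3, 4, 5, 6]})
        # head(-1): every row of each group except the last one
        # (before Spark 3.4 this returned an empty frame).
        print(psdf.groupby("g").head(-1).sort_index())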
    
    * SPARK-39317 https://github.com/apache/spark/pull/36699: In Spark 3.4,
    the schema inference of `groupby.apply` in Pandas on Spark first infers
    the pandas type, to keep the pandas `dtype` as accurate as possible.
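
    A minimal sketch, with illustrative data:

        import pyspark.pandas as ps

        psdf = ps.DataFrame({"a": [1, 1, 2], "b": [1, 2, 3]})
        # With no return-type annotation, the inferred dtypes now track the
        # pandas dtypes of the sampled result.
        res = psdf.groupby("a").apply(lambda g: g / 2)
        print(res.dtypes)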
    
    * SPARK-39314 https://github.com/apache/spark/pull/36711: In Spark 3.4,
    the `Series.concat` `sort` parameter is respected to follow pandas 1.4
    behavior.
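
    A hedged sketch using `ps.concat` with illustrative data, assuming the
    `sort` semantics described above:

        import pyspark.pandas as ps

        psdf1 = ps.DataFrame({"b": [1], "a": [2]})
        psdf2 = ps.DataFrame({"a": [3], "b": [4]})
        # sort=True orders the non-concatenation axis (the columns here),
        # following pandas 1.4.
        print(ps.concat([psdf1, psdf2], sort=True))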
    
    For the other, test-only fixes, I don't add a migration doc entry:
    SPARK-38821, SPARK-39053, SPARK-38982.
    
    ### Why are the changes needed?
    
    ### Does this PR introduce _any_ user-facing change?
    
    ### How was this patch tested?
    
    Closes #36816 from Yikun/SPARK-39425.
    
    Authored-by: Yikun Jiang <yikunk...@gmail.com>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 .../docs/source/migration_guide/pyspark_3.3_to_3.4.rst | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst 
b/python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst
index 9f8cf545e28..dbe7b818b2a 100644
--- a/python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst
+++ b/python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst
@@ -20,4 +20,20 @@
 Upgrading from PySpark 3.3 to 3.4
 =================================
 
-* In Spark 3.4, the schema of an array column is inferred by merging the 
schemas of all elements in the array. To restore the previous behavior where 
the schema is only inferred from the first element, you can set 
``spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled`` to ``true``.
\ No newline at end of file
+* In Spark 3.4, the schema of an array column is inferred by merging the 
schemas of all elements in the array. To restore the previous behavior where 
the schema is only inferred from the first element, you can set 
``spark.sql.pyspark.legacy.inferArrayTypeFromFirstElement.enabled`` to ``true``.
+
+* In Spark 3.4, if the return type of the Pandas on Spark API 
``Groupby.apply``'s ``func`` parameter is not specified and 
``compute.shortcut_limit`` is set to 0, the number of sampled rows will be set 
to 2 (to ensure sampled rows are always >= 2) to make sure the inferred schema 
is accurate.
+
+* In Spark 3.4, if the Pandas on Spark API ``Index.insert`` is given an 
out-of-bounds location, it will raise an ``IndexError`` with ``index {} is out 
of bounds for axis 0 with size {}`` to follow pandas 1.4 behavior.
+
+* In Spark 3.4, the series name will be preserved in the Pandas on Spark API 
``Series.mode`` to follow pandas 1.4 behavior.
+
+* In Spark 3.4, the Pandas on Spark API ``Index.__setitem__`` will first 
check whether ``value`` is a ``Column`` to avoid raising an unexpected 
``ValueError`` in ``is_list_like``, such as ``Cannot convert column into bool: 
please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame 
boolean expressions.``.
+
+* In Spark 3.4, the Pandas on Spark API ``astype('category')`` will also 
refresh ``categories.dtype`` according to the original data's ``dtype`` to 
follow pandas 1.4 behavior.
+
+* In Spark 3.4, the Pandas on Spark API supports groupby positional indexing 
in ``GroupBy.head`` and ``GroupBy.tail`` to follow pandas 1.4. Negative 
arguments now work correctly and result in ranges relative to the end and 
start of each group. Previously, negative arguments returned empty frames.
+
+* In Spark 3.4, the schema inference of ``groupby.apply`` in Pandas on Spark 
will first infer the pandas type, to keep the pandas ``dtype`` as accurate as 
possible.
+
+* In Spark 3.4, the ``Series.concat`` ``sort`` parameter will be respected to 
follow pandas 1.4 behavior.

