GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/20534
[SPARK-23319][TESTS][BRANCH-2.3] Explicitly specify Pandas and PyArrow
versions in PySpark tests (to skip or test)
This PR backports https://github.com/apache/spark/pull/20487 to branch-2.3.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark PR_TOOL_PICK_PR_20487_BRANCH-2.3
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20534.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20534
commit ff9ba5eb840bcd843c5201e23589e8cbb5009c53
Author: hyukjinkwon
Date: 2018-02-07T14:28:10Z
[SPARK-23319][TESTS] Explicitly specify Pandas and PyArrow versions in
PySpark tests (to skip or test)
This PR proposes to explicitly specify Pandas and PyArrow versions in
PySpark tests to skip or test.
We declared the extra dependencies:
https://github.com/apache/spark/blob/b8bfce51abf28c66ba1fc67b0f25fe1617c81025/python/setup.py#L204
In case of PyArrow:
Currently we only check whether PyArrow is installed, without checking its
version, so the tests already fail to run when an older version is present.
For example, if PyArrow 0.7.0 is installed:
```
==
ERROR: test_vectorized_udf_wrong_return_type
(pyspark.sql.tests.ScalarPandasUDF)
--
Traceback (most recent call last):
File "/.../spark/python/pyspark/sql/tests.py", line 4019, in
test_vectorized_udf_wrong_return_type
f = pandas_udf(lambda x: x * 1.0, MapType(LongType(), LongType()))
File "/.../spark/python/pyspark/sql/functions.py", line 2309, in
pandas_udf
return _create_udf(f=f, returnType=return_type, evalType=eval_type)
File "/.../spark/python/pyspark/sql/udf.py", line 47, in _create_udf
require_minimum_pyarrow_version()
File "/.../spark/python/pyspark/sql/utils.py", line 132, in
require_minimum_pyarrow_version
"however, your version was %s." % pyarrow.__version__)
ImportError: pyarrow >= 0.8.0 must be installed on calling Python process;
however, your version was 0.7.0.
--
Ran 33 tests in 8.098s
FAILED (errors=33)
```
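The distinction between "not installed" and "installed but too old" can be sketched as below. This is a minimal illustration, not the code in this PR: `parse_version`, `is_older`, and `require_minimum_version` are hypothetical helper names, and the message wording is borrowed from the skip output in this description.

```python
import importlib


def parse_version(text):
    """Turn '0.19.2' into (0, 19, 2); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_older(found, minimum):
    """Compare two version strings, padding with zeros so '0.8' equals '0.8.0'."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def require_minimum_version(module_name, display_name, minimum):
    """Raise ImportError if the module is missing or older than `minimum`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        raise ImportError(
            "%s >= %s must be installed; however, it was not found."
            % (display_name, minimum))
    # Assumption: the module exposes its version as `__version__`.
    found = getattr(module, "__version__", "0")
    if is_older(found, minimum):
        raise ImportError(
            "%s >= %s must be installed; however, your version was %s."
            % (display_name, minimum, found))
```

A test module can call such a helper once at import time and record the resulting message, so that a too-old installation is treated the same way as a missing one.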
In case of Pandas:
There are a few tests for old Pandas that ran only when the installed Pandas
version was lower than required. I rewrote them to run both when the Pandas
version is lower and when Pandas is missing.
Manually tested by temporarily modifying the version conditions (hence the
1.19.2 and 1.8.0 minimums in some of the output below):
```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests)
... skipped 'Pandas >= 1.19.2 must be installed; however, your version was
0.19.2.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests)
... skipped 'Pandas >= 1.19.2 must be installed; however, your version was
0.19.2.'
test_createDataFrame_respect_session_timezone
(pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed;
however, your version was 0.19.2.'
```
```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests)
... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests)
... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone
(pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed;
however, it was not found.'
```
```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests)
... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was
0.8.0.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests)
... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was
0.8.0.'
test_createDataFrame_respect_session_timezone
(pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed;
however, your version was 0.8.0.'
```
```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests)
... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests)
... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone
(pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed;
however, it was not found.'
```
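Skip messages like the ones above can be produced by computing a reason once and passing it to `unittest.skipIf`. The sketch below is an illustration under stated assumptions, not the test code from this PR: `missing_requirement_message`, `ArrowLikeTests`, and the module name `hypothetical_arrow_lib` are made up so that the example runs without PyArrow installed.

```python
import importlib
import unittest


def missing_requirement_message(module_name, display_name, minimum):
    """Return a skip reason if the module cannot be imported, else None."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return ("%s >= %s must be installed; however, it was not found."
                % (display_name, minimum))
    return None


# Hypothetical module name, chosen so the import fails and the skip path runs.
_skip_reason = missing_requirement_message(
    "hypothetical_arrow_lib", "PyArrow", "0.8.0")


@unittest.skipIf(_skip_reason is not None, _skip_reason)
class ArrowLikeTests(unittest.TestCase):
    def test_roundtrip(self):
        # Placeholder body; it only runs when the requirement is satisfied.
        self.assertEqual(1 + 1, 2)


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ArrowLikeTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Decorating the whole class keeps every test in it reported as skipped with the same reason, rather than failing with an ImportError during the run.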
Author: hyukjinkwon
Closes #20487 from HyukjinKwon/pyarrow-pandas-skip.
(cherry picked from commit 71cfba04aeec5ae9b85a507b13996e80