[GitHub] spark pull request #20534: [SPARK-23319][TESTS][BRANCH-2.3] Explicitly speci...

2018-02-07 Thread HyukjinKwon
GitHub user HyukjinKwon closed the pull request at:

https://github.com/apache/spark/pull/20534





[GitHub] spark pull request #20534: [SPARK-23319][TESTS][BRANCH-2.3] Explicitly speci...

2018-02-07 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/20534

[SPARK-23319][TESTS][BRANCH-2.3] Explicitly specify Pandas and PyArrow 
versions in PySpark tests (to skip or test)

This PR backports https://github.com/apache/spark/pull/20487 to branch-2.3.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark PR_TOOL_PICK_PR_20487_BRANCH-2.3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20534.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20534


commit ff9ba5eb840bcd843c5201e23589e8cbb5009c53
Author: hyukjinkwon 
Date:   2018-02-07T14:28:10Z

[SPARK-23319][TESTS] Explicitly specify Pandas and PyArrow versions in 
PySpark tests (to skip or test)

This PR proposes to explicitly specify the required Pandas and PyArrow 
versions in PySpark tests, so that each test is either skipped or run 
depending on what is installed.

We declared the extra dependencies:


https://github.com/apache/spark/blob/b8bfce51abf28c66ba1fc67b0f25fe1617c81025/python/setup.py#L204

In case of PyArrow:

Currently we only check whether PyArrow is installed, without checking its 
version, so the tests already fail when an unsupported version is present. 
For example, if PyArrow 0.7.0 is installed:

```
======================================================================
ERROR: test_vectorized_udf_wrong_return_type (pyspark.sql.tests.ScalarPandasUDF)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/tests.py", line 4019, in test_vectorized_udf_wrong_return_type
    f = pandas_udf(lambda x: x * 1.0, MapType(LongType(), LongType()))
  File "/.../spark/python/pyspark/sql/functions.py", line 2309, in pandas_udf
    return _create_udf(f=f, returnType=return_type, evalType=eval_type)
  File "/.../spark/python/pyspark/sql/udf.py", line 47, in _create_udf
    require_minimum_pyarrow_version()
  File "/.../spark/python/pyspark/sql/utils.py", line 132, in require_minimum_pyarrow_version
    "however, your version was %s." % pyarrow.__version__)
ImportError: pyarrow >= 0.8.0 must be installed on calling Python process; however, your version was 0.7.0.

----------------------------------------------------------------------
Ran 33 tests in 8.098s

FAILED (errors=33)
```

In case of Pandas:

A few tests for old Pandas versions previously ran only when an older Pandas 
was installed. I rewrote them so that they run both when the installed Pandas 
version is too old and when Pandas is missing.

Manually tested by modifying the condition, for example by temporarily raising 
the required versions to Pandas 1.19.2 and PyArrow 1.8.0:

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
```

```
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
```
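
Skip messages like the ones above could be produced by a pattern along these lines. This is only a sketch under stated assumptions: `skip_reason`, `_version_tuple`, and `ExampleArrowTests` are hypothetical names, not the code in `pyspark/sql/tests.py`, and the version comparison is a rough numeric one.

```python
import importlib
import re
import unittest


def _version_tuple(version):
    # Hypothetical helper: rough numeric comparison key for dotted versions.
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])


def skip_reason(module_name, display_name, minimum):
    """Return a skip message, or None if the module is importable and new enough."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return "%s >= %s must be installed; however, it was not found." % (
            display_name, minimum)
    installed = getattr(module, "__version__", "0")
    if _version_tuple(installed) < _version_tuple(minimum):
        return "%s >= %s must be installed; however, your version was %s." % (
            display_name, minimum, installed)
    return None


# Evaluated once at import time, so every decorated test shares the same reason.
_pandas_reason = skip_reason("pandas", "Pandas", "0.19.2")


@unittest.skipIf(_pandas_reason is not None, _pandas_reason)
class ExampleArrowTests(unittest.TestCase):
    def test_placeholder(self):
        # Stand-in for a test body that needs a recent Pandas.
        self.assertTrue(True)
```

Covering both the missing case and the too-old case in one reason string means a single `skipIf` decorator handles either situation, and the verbose test output states which one applied.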

Author: hyukjinkwon 

Closes #20487 from HyukjinKwon/pyarrow-pandas-skip.

(cherry picked from commit 71cfba04aeec5ae9b85a507b13996e80