[jira] [Commented] (SPARK-37465) PySpark tests failing on Pandas 0.23

Willi Raschkowski (Jira) Fri, 26 Nov 2021 06:14:05 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-37465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17449584#comment-17449584
 ]


Willi Raschkowski commented on SPARK-37465:
-------------------------------------------

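I also noticed that {{CategoricalOpsTest}} fails on pandas 0.25.3 
(latest 0.x) but passes on 1.x. In the output below, the failing tests are 
the comparison tests ({{test_eq}}, {{test_ne}}, {{test_lt}}, {{test_le}}, 
{{test_gt}}, {{test_ge}}):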
{code:java}
$ conda list | grep pandas
pandas                    0.25.3           py36he6710b0_0
$ python/run-tests --testnames 
'pyspark.pandas.tests.data_type_ops.test_categorical_ops CategoricalOpsTest'
...
Running tests...
----------------------------------------------------------------------
/home/circleci/project/python/pyspark/context.py:238: FutureWarning: Python 3.6 
support is deprecated in Spark 3.2.
  FutureWarning
  test_abs 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (2.353s)
  test_add 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (1.382s)
  test_and 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.265s)
ok (6.569s)                                                                     
alOpsTest) ... 
  test_eq 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... FAIL (1.514s)
  test_floordiv 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.910s)
  test_from_to_pandas 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.143s)
  test_ge 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... FAIL (0.795s)
  test_gt 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... FAIL (0.891s)
  test_invert 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.044s)
  test_isnull 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.097s)
  test_le 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... FAIL (0.863s)
  test_lt 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... FAIL (0.844s)
  test_mod 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.897s)
  test_mul 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.860s)
  test_ne 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... FAIL (1.405s)
  test_neg 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.044s)
  test_or 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.160s)
  test_pow 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.821s)
  test_radd 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.081s)
  test_rand 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.100s)
  test_rfloordiv 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.083s)
  test_rmod 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.050s)
  test_rmul 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.079s)
  test_ror 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.095s)
  test_rpow 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.078s)
  test_rsub 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.078s)
  test_rtruediv 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.079s)
  test_sub 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.818s)
  test_truediv 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest) 
... ok (0.832s)

======================================================================
FAIL [1.611s]: test_eq 
(pyspark.pandas.tests.data_type_ops.test_categorical_ops.CategoricalOpsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/circleci/project/python/pyspark/testing/pandasutils.py", line 
122, in assertPandasEqual
    **kwargs
  File 
"/home/circleci/.pyenv/versions/our-miniconda/envs/python3/lib/python3.6/site-packages/pandas/util/testing.py",
 line 1248, in assert_series_equal
    assert_attr_equal('name', left, right, obj=obj)
  File 
"/home/circleci/.pyenv/versions/our-miniconda/envs/python3/lib/python3.6/site-packages/pandas/util/testing.py",
 line 941, in assert_attr_equal
    raise_assert_detail(obj, msg, left_attr, right_attr)
AssertionError: Series are different

Attribute "name" are different
[left]:  that_numeric_cat
[right]: None

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File 
"/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py",
 line 268, in test_eq
    psdf["this_numeric_cat"] == psdf["that_numeric_cat"],
  File "/home/circleci/project/python/pyspark/testing/pandasutils.py", line 
223, in assert_eq
    self.assertPandasEqual(lobj, robj, check_exact=check_exact)
  File "/home/circleci/project/python/pyspark/testing/pandasutils.py", line 
130, in assertPandasEqual
    raise AssertionError(msg) from e
AssertionError: Series are different

Attribute "name" are different
[left]:  that_numeric_cat
[right]: None

Left:
Name: that_numeric_cat, dtype: bool
bool

Right:
dtype: bool
bool

...
{code}

Upgrading pandas to 1.x fixes it:
{code}
$ conda list | grep pandas
pandas                    1.0.0            py36h0573a6f_0  
$ python/run-tests --testnames 
'pyspark.pandas.tests.data_type_ops.test_categorical_ops CategoricalOpsTest'
Running PySpark tests. Output is in /home/circleci/project/python/unit-tests.log
Will test against the following Python executables: ['python3.6']
Will test the following Python tests: 
['pyspark.pandas.tests.data_type_ops.test_categorical_ops CategoricalOpsTest']
python3.6 python_implementation is CPython
python3.6 version is: Python 3.6.12 :: Anaconda, Inc.
Starting test(python3.6): 
pyspark.pandas.tests.data_type_ops.test_categorical_ops CategoricalOpsTest
Finished test(python3.6): 
pyspark.pandas.tests.data_type_ops.test_categorical_ops CategoricalOpsTest (34s)
Tests passed in 34 seconds
{code}
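
For anyone reproducing this, here is a rough sketch of how a test could be gated on the installed pandas version. The helper names and the {{1.0}} threshold are my assumptions, taken only from the two runs above (0.25.3 fails, 1.0.0 passes); this is not how the PySpark test suite currently decides what to run.

```python
import re


def parse_version(version):
    """Return the leading numeric components of a version string as a tuple of ints."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def categorical_comparisons_expected_to_pass(pandas_version):
    # Hypothetical threshold: inferred from the runs above, where
    # pandas 0.25.3 fails and pandas 1.0.0 passes.
    return parse_version(pandas_version) >= (1, 0)


if __name__ == "__main__":
    for candidate in ("0.23.4", "0.25.3", "1.0.0"):
        print(candidate, categorical_comparisons_expected_to_pass(candidate))
```

In a test module, a check like this could feed {{unittest.skipIf}} so the comparison tests are skipped on older pandas instead of failing.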

> PySpark tests failing on Pandas 0.23
> ------------------------------------
>
>                 Key: SPARK-37465
>                 URL: https://issues.apache.org/jira/browse/SPARK-37465
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Willi Raschkowski
>            Priority: Major
>
> I was running Spark tests with Pandas {{0.23.4}} and got the error below. The 
> minimum Pandas version is currently {{0.23.2}} 
> [(Github)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. 
> Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix 
> (Github)|https://github.com/pandas-dev/pandas/pull/21160/files#diff-1b7183f5b3970e2a1d39a82d71686e39c765d18a34231b54c857b0c4c9bb8222]
>  in Pandas.
> {code:java}
> $ python/run-tests --testnames 
> 'pyspark.pandas.tests.data_type_ops.test_boolean_ops 
> BooleanOpsTest.test_floordiv'
> ...
> ======================================================================
> ERROR [5.785s]: test_floordiv 
> (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File 
> "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py",
>  line 128, in test_floordiv
>     self.assert_eq(b_pser // b_pser.astype(int), b_psser // 
> b_psser.astype(int))
>   File 
> "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py",
>  line 1069, in wrapper
>     result = safe_na_op(lvalues, rvalues)
>   File 
> "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py",
>  line 1033, in safe_na_op
>     return na_op(lvalues, rvalues)
>   File 
> "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py",
>  line 1027, in na_op
>     result = missing.fill_zeros(result, x, y, op_name, fill_zeros)
>   File 
> "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py",
>  line 641, in fill_zeros
>     signs = np.sign(y if name.startswith(('r', '__r')) else x)
> TypeError: ufunc 'sign' did not contain a loop with signature matching types 
> dtype('bool') dtype('bool')
> {code}
> These are my relevant package versions:
> {code:java}
> $ conda list | grep -e numpy -e pyarrow -e pandas -e python
> # packages in environment at /home/circleci/miniconda/envs/python3:
> numpy                     1.16.6           py36h0a8e133_3  
> numpy-base                1.16.6           py36h41b4c56_3  
> pandas                    0.23.4           py36h04863e7_0  
> pyarrow                   1.0.1           py36h6200943_36_cpu    conda-forge
> python                    3.6.12               hcff3b4d_2    anaconda
> python-dateutil           2.8.1                      py_0    anaconda
> python_abi                3.6                     1_cp36m    conda-forg
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

