Repository: spark
Updated Branches:
  refs/heads/master 70f9d7f71 -> 224e0e785


[SPARK-19701][SQL][PYTHON] Throws a correct exception for 'in' operator against column

## What changes were proposed in this pull request?

This PR proposes to remove the incorrect implementation of the `in` operator,
which has never actually been executed (at least since Spark 1.5.2), and to
throw a correct exception rather than a misleading one about converting the
column into a bool. I tested the code below in 1.5.2, 1.6.3, 2.1.0 and the
master branch:

**1.5.2**

```python
>>> df = sqlContext.createDataFrame([[1]])
>>> 1 in df._1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark-1.5.2-bin-hadoop2.6/python/pyspark/sql/column.py", line 418, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

**1.6.3**

```python
>>> 1 in sqlContext.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark-1.6.3-bin-hadoop2.6/python/pyspark/sql/column.py", line 447, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

**2.1.0**

```python
>>> 1 in spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.py", line 426, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

**Current Master**

```python
>>> 1 in spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 452, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

**After**

```python
>>> 1 in spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 184, in __contains__
    raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' "
ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.
```

In more detail:

It seems the implementation was intended to support this:

```python
1 in df.column
```

However, currently, it throws an exception as below:

```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 426, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

What happens here is as follows:

```python
class Column(object):
    def __contains__(self, item):
        print "I am contains"
        return Column()
    def __nonzero__(self):
        raise Exception("I am nonzero.")

>>> 1 in Column()
I am contains
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in __nonzero__
Exception: I am nonzero.
```

It seems `__contains__` is called first, and then `__nonzero__` or `__bool__`
is called on the returned `Column()` to convert it into a bool (or an int, to
be specific).

It seems that, unlike other operators, `in` forces the return value of
`__contains__` into a bool via `__nonzero__` (for Python 2) or `__bool__` (for
Python 3). There are a few references about this:

https://bugs.python.org/issue16011
http://stackoverflow.com/questions/12244074/python-source-code-for-built-in-in-operator/12244378#12244378
http://stackoverflow.com/questions/38542543/functionality-of-python-in-vs-contains/38542777
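
The difference from other operators can be seen without Spark. The following is a minimal sketch for Python 3 (so `__bool__` rather than `__nonzero__`); the `Expr` class is a hypothetical stand-in for a lazily evaluated column, not the real `Column`. `==` hands back whatever `__eq__` returns, while `in` passes the result of `__contains__` through a truth test:

```python
class Expr:
    """Hypothetical stand-in for a lazily evaluated column expression."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        # Comparison operators may return any object, so an expression survives.
        return Expr("({} = {!r})".format(self.text, other))

    def __contains__(self, item):
        # The returned Expr is immediately truth-tested by the `in` operator.
        return Expr("contains({}, {!r})".format(self.text, item))

    def __bool__(self):
        raise ValueError("Cannot convert expression into bool")


col = Expr("id")

eq_result = col == 1          # stays an Expr; no truth test is applied
print(eq_result.text)

try:
    1 in col                  # __contains__ runs, then __bool__ raises
    in_error = None
except ValueError as exc:
    in_error = str(exc)
print(in_error)
```

So the error a user sees for `in` comes from the truth test, not from `__contains__` itself.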

It seems we can't override `__nonzero__` or `__bool__` as a workaround to make
this work, because they are required to return a bool, as shown below:

```python
class Column(object):
    def __contains__(self, item):
        print "I am contains"
        return Column()
    def __nonzero__(self):
        return "a"

>>> 1 in Column()
I am contains
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __nonzero__ should return bool or int, returned str
```
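
The session above is Python 2. As a rough sketch of the same experiment under Python 3, where the hook is `__bool__`, the hypothetical `BadBool` class below returns a string from `__bool__`. The interpreter rejects that with a `TypeError` as well, so the workaround is not available there either:

```python
class BadBool:
    """Hypothetical class whose __bool__ returns a non-bool value."""

    def __contains__(self, item):
        return BadBool()

    def __bool__(self):
        return "a"  # not a bool, so the truth test fails


try:
    1 in BadBool()
    outcome = "no error"
except TypeError as exc:
    outcome = type(exc).__name__

print(outcome)
```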

## How was this patch tested?

Added unit tests in `tests.py`.
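
For readers without a Spark build at hand, the shape of that check can be sketched against a stand-in class. Everything here is an assumption for illustration: `StubColumn` is hypothetical and only copies the new error message, and the sketch uses the Python 3 name `assertRaisesRegex` rather than the `assertRaisesRegexp` alias used in the patch.

```python
import unittest


class StubColumn:
    """Hypothetical stand-in for pyspark.sql.Column; not the real class."""

    def __contains__(self, item):
        # Raise directly, before Python can truth-test any return value.
        raise ValueError("Cannot apply 'in' operator against a column: please use "
                         "'contains' in a string column or 'array_contains' "
                         "function for an array column.")


class InOperatorTest(unittest.TestCase):
    def test_in_raises_value_error(self):
        self.assertRaisesRegex(ValueError,
                               "Cannot apply 'in' operator against a column",
                               lambda: 1 in StubColumn())


suite = unittest.defaultTestLoader.loadTestsFromTestCase(InOperatorTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```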

Author: hyukjinkwon <gurwls...@gmail.com>

Closes #17160 from HyukjinKwon/SPARK-19701.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/224e0e78
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/224e0e78
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/224e0e78

Branch: refs/heads/master
Commit: 224e0e785b4b449ea638c2629263c798116a3011
Parents: 70f9d7f
Author: hyukjinkwon <gurwls...@gmail.com>
Authored: Sun Mar 5 18:04:52 2017 -0800
Committer: Wenchen Fan <wenc...@databricks.com>
Committed: Sun Mar 5 18:04:52 2017 -0800

----------------------------------------------------------------------
 python/pyspark/sql/column.py | 4 +++-
 python/pyspark/sql/tests.py  | 3 +++
 2 files changed, 6 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/224e0e78/python/pyspark/sql/column.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/column.py b/python/pyspark/sql/column.py
index c10ab96..ec05c18 100644
--- a/python/pyspark/sql/column.py
+++ b/python/pyspark/sql/column.py
@@ -180,7 +180,9 @@ class Column(object):
     __ror__ = _bin_op("or")
 
     # container operators
-    __contains__ = _bin_op("contains")
+    def __contains__(self, item):
+        raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' "
+                         "in a string column or 'array_contains' function for an array column.")
 
     # bitwise operators
     bitwiseOR = _bin_op("bitwiseOR")

http://git-wip-us.apache.org/repos/asf/spark/blob/224e0e78/python/pyspark/sql/tests.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index e943f8d..81f3d1d 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -967,6 +967,9 @@ class SQLTests(ReusedPySparkTestCase):
             cs.startswith('a'), cs.endswith('a')
         self.assertTrue(all(isinstance(c, Column) for c in css))
         self.assertTrue(isinstance(ci.cast(LongType()), Column))
+        self.assertRaisesRegexp(ValueError,
+                                "Cannot apply 'in' operator against a column",
+                                lambda: 1 in cs)
 
     def test_column_getitem(self):
         from pyspark.sql.functions import col

