[GitHub] spark pull request #19787: [SPARK-22541][SQL] Explicitly claim that Python u...

2017-11-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19787


---




[GitHub] spark pull request #19787: [SPARK-22541][SQL] Explicitly claim that Python u...

2017-11-20 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19787#discussion_r152198789
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2198,12 +2198,9 @@ def udf(f=None, returnType=StringType()):
 duplicate invocations may be eliminated or the function may even be invoked more times than
 it is present in the query.

-.. note:: The user-defined functions do not support conditional execution by using them with
-SQL conditional expressions such as `when` or `if`. The functions still apply on all rows no
-matter the conditions are met or not. So the output is correct if the functions can be
-correctly run on all rows without failure. If the functions can cause runtime failure on the
-rows that do not satisfy the conditions, the suggested workaround is to incorporate the
-condition logic into the functions.
+.. note:: The user-defined functions do not support conditional expressions or short curcuiting
+in boolean expressions and it ends up with being executed all internally. If the functions
+can fail on special rows, the workaround is to incorporate the condition into the functions.
--- End diff --

Maybe it is also worth adding a note to pandas_udf.


---




[GitHub] spark pull request #19787: [SPARK-22541][SQL] Explicitly claim that Python u...

2017-11-20 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19787#discussion_r152198691
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2205,6 +2205,10 @@ def udf(f=None, returnType=StringType()):
 rows that do not satisfy the conditions, the suggested workaround is to incorporate the
 condition logic into the functions.

+.. note:: Users can't rely on short-curcuit evaluation of boolean expressions to execute
+conditionally user-defined functions too. For example, the two functions in an expression
+like udf1(x) && udf2(y) will be both executed on all rows.
--- End diff --

Sorry, this is not correct. `pandas_udf` can be used in boolean expressions.
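
A minimal sketch of that usage, assuming a boolean column `_1` (essentially the same pattern as the example quoted further down in this thread; `spark` is the PySpark shell's SparkSession):

```python
from pyspark.sql.functions import pandas_udf

# A boolean-returning pandas_udf can participate in a boolean expression.
is_set = pandas_udf(lambda s: s, "boolean")

df = spark.createDataFrame([[True], [False]])
df.select(is_set("_1") | is_set("_1")).show()
```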


---




[GitHub] spark pull request #19787: [SPARK-22541][SQL] Explicitly claim that Python u...

2017-11-20 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19787#discussion_r152198181
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2198,12 +2198,9 @@ def udf(f=None, returnType=StringType()):
 duplicate invocations may be eliminated or the function may even be invoked more times than
 it is present in the query.

-.. note:: The user-defined functions do not support conditional execution by using them with
-SQL conditional expressions such as `when` or `if`. The functions still apply on all rows no
-matter the conditions are met or not. So the output is correct if the functions can be
-correctly run on all rows without failure. If the functions can cause runtime failure on the
-rows that do not satisfy the conditions, the suggested workaround is to incorporate the
-condition logic into the functions.
+.. note:: The user-defined functions do not support conditional expressions or short curcuiting
+in boolean expressions and it ends up with being executed all internally. If the functions
+can fail on special rows, the workaround is to incorporate the condition into the functions.
--- End diff --

IMHO, it is even less likely that users would expect `pandas_udf` to execute conditionally on rows, because it applies to a `pd.Series`.
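
For reference, a minimal sketch of a scalar `pandas_udf` receiving a whole batch as a `pd.Series` (the column name `v` and the data are made up for illustration; `spark` is the PySpark shell's SparkSession):

```python
from pyspark.sql.functions import pandas_udf

@pandas_udf("long")
def plus_one(s):
    # `s` is a pandas.Series holding a whole batch of rows, not a single
    # value, so there is no natural notion of skipping the function for
    # individual rows.
    return s + 1

df = spark.createDataFrame([(1,), (2,)], ["v"])
df.select(plus_one("v")).show()
```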


---




[GitHub] spark pull request #19787: [SPARK-22541][SQL] Explicitly claim that Python u...

2017-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19787#discussion_r152195436
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2198,12 +2198,9 @@ def udf(f=None, returnType=StringType()):
 duplicate invocations may be eliminated or the function may even be invoked more times than
 it is present in the query.

-.. note:: The user-defined functions do not support conditional execution by using them with
-SQL conditional expressions such as `when` or `if`. The functions still apply on all rows no
-matter the conditions are met or not. So the output is correct if the functions can be
-correctly run on all rows without failure. If the functions can cause runtime failure on the
-rows that do not satisfy the conditions, the suggested workaround is to incorporate the
-condition logic into the functions.
+.. note:: The user-defined functions do not support conditional expressions or short curcuiting
+in boolean expressions and it ends up with being executed all internally. If the functions
+can fail on special rows, the workaround is to incorporate the condition into the functions.
--- End diff --

Hm .. actually doesn't the same thing apply to `pandas_udf` too? I was just double checking:

```python
from pyspark.sql.functions import pandas_udf

def call1(b):
    print("I am call1")
    return b

def call2(b):
    print("I am call2")
    return b

bool1 = pandas_udf(call1, "boolean")
bool2 = pandas_udf(call2, "boolean")

# `spark` is the SparkSession provided by the PySpark shell.
spark.createDataFrame([[True]]).select(bool1("_1") | bool2("_1")).explain(True)
spark.createDataFrame([[True]]).select(bool1("_1") | bool2("_1")).show()
```


---




[GitHub] spark pull request #19787: [SPARK-22541][SQL] Explicitly claim that Python u...

2017-11-20 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19787#discussion_r152170467
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2205,6 +2205,10 @@ def udf(f=None, returnType=StringType()):
 rows that do not satisfy the conditions, the suggested workaround is to incorporate the
 condition logic into the functions.

+.. note:: Users can't rely on short-curcuit evaluation of boolean expressions to execute
--- End diff --

Ok.


---




[GitHub] spark pull request #19787: [SPARK-22541][SQL] Explicitly claim that Python u...

2017-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19787#discussion_r152014337
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2205,6 +2205,10 @@ def udf(f=None, returnType=StringType()):
 rows that do not satisfy the conditions, the suggested workaround is to incorporate the
 condition logic into the functions.

+.. note:: Users can't rely on short-curcuit evaluation of boolean expressions to execute
--- End diff --

Just a little bit worried about overwhelming users with maybe too much information, although it might be worth it.


---




[GitHub] spark pull request #19787: [SPARK-22541][SQL] Explicitly claim that Python u...

2017-11-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19787#discussion_r152013410
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2205,6 +2205,10 @@ def udf(f=None, returnType=StringType()):
 rows that do not satisfy the conditions, the suggested workaround is to incorporate the
 condition logic into the functions.

+.. note:: Users can't rely on short-curcuit evaluation of boolean expressions to execute
--- End diff --

Looks okay, but how about combining this comment with the one above and making it shorter if possible? Like: UDFs don't support conditional expressions or short-circuiting and end up being executed all internally. If it depends on ..., workaround blabla and blabla.


---




[GitHub] spark pull request #19787: [SPARK-22541][SQL] Explicitly claim that Python u...

2017-11-20 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19787#discussion_r151928352
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2205,6 +2205,10 @@ def udf(f=None, returnType=StringType()):
 rows that do not satisfy the conditions, the suggested workaround is to incorporate the
 condition logic into the functions.

+.. note:: Users can't rely on short-curcuit evaluation of boolean expressions to execute
+conditionally user-defined functions too. For example, the two functions in an expression
+like udf1(x) && udf2(y) will be both executed on all rows.
--- End diff --

I think `pandas_udf` can't be used in boolean expressions, as it returns a `pandas.Series`.


---




[GitHub] spark pull request #19787: [SPARK-22541][SQL] Explicitly claim that Python u...

2017-11-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19787#discussion_r151927818
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2205,6 +2205,10 @@ def udf(f=None, returnType=StringType()):
 rows that do not satisfy the conditions, the suggested workaround is to incorporate the
 condition logic into the functions.

+.. note:: Users can't rely on short-curcuit evaluation of boolean expressions to execute
+conditionally user-defined functions too. For example, the two functions in an expression
+like udf1(x) && udf2(y) will be both executed on all rows.
--- End diff --

does it apply to pandas_udf?


---




[GitHub] spark pull request #19787: [SPARK-22541][SQL] Explicitly claim that Python u...

2017-11-19 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/19787

[SPARK-22541][SQL] Explicitly claim that Python udfs can't be conditional executed with short-curcuit evaluation

## What changes were proposed in this pull request?

Besides conditional expressions such as `when` and `if`, users may want to conditionally execute Python UDFs via short-circuit evaluation of boolean expressions. We should explicitly note that Python UDFs don't support this kind of conditional execution either.
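
A minimal sketch of the behavior being documented and the suggested workaround (the names `div`/`safe_div` and the sample data are hypothetical, not part of this patch):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, when, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(4, 2), (1, 0)], ["a", "b"])

# Even when wrapped in `when`, the Python UDF is still evaluated on every
# row, so it can fail on rows that do not satisfy the condition:
div = udf(lambda a, b: float(a) / b, "double")
# df.select(when(col("b") != 0, div("a", "b"))).show()  # may raise ZeroDivisionError

# Workaround: incorporate the condition logic into the UDF itself.
safe_div = udf(lambda a, b: float(a) / b if b != 0 else None, "double")
df.select(safe_div("a", "b")).show()
```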

## How was this patch tested?

N/A, documentation change only.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 SPARK-22541

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19787.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19787


commit 3b69777924d0ac54bc4b6ec9c740cb20774bf033
Author: Liang-Chi Hsieh 
Date:   2017-11-20T07:13:32Z

Add document for udf.




---
