[GitHub] spark pull request #19787: [SPARK-22541][SQL] Explicitly claim that Python u...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19787

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19787#discussion_r152198789

--- Diff: python/pyspark/sql/functions.py ---
@@ -2198,12 +2198,9 @@ def udf(f=None, returnType=StringType()):
 duplicate invocations may be eliminated or the function may even be invoked more
 times than it is present in the query.
-.. note:: The user-defined functions do not support conditional execution by using them with
-    SQL conditional expressions such as `when` or `if`. The functions still apply on all rows no
-    matter the conditions are met or not. So the output is correct if the functions can be
-    correctly run on all rows without failure. If the functions can cause runtime failure on the
-    rows that do not satisfy the conditions, the suggested workaround is to incorporate the
-    condition logic into the functions.
+.. note:: The user-defined functions do not support conditional expressions or short circuiting
+    in boolean expressions and it ends up with being executed all internally. If the functions
+    can fail on special rows, the workaround is to incorporate the condition into the functions.
--- End diff --

Maybe it is also worth adding a note to `pandas_udf`.
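The workaround the note recommends (moving the condition logic into the UDF body) can be illustrated with a plain-Python sketch. The names `unsafe_parse`/`safe_parse` and the row values are hypothetical, and the list comprehension only stands in for Spark applying the UDF to every row regardless of a `when`/`if` condition; this is not pyspark code:

```python
# Plain-Python model (no Spark) of the workaround described in the note.
# Spark evaluates a Python UDF on every row, even rows a `when(...)`
# condition would exclude, so any guard must live inside the function.

def unsafe_parse(s):
    # Raises ValueError on rows that don't satisfy the intended condition.
    return int(s)

def safe_parse(s):
    # Workaround: incorporate the condition logic into the function itself.
    return int(s) if s.isdigit() else None

rows = ["1", "2", "oops", "3"]

# Stand-in for Spark calling the UDF on all rows regardless of any filter:
result = [safe_parse(s) for s in rows]
print(result)  # [1, 2, None, 3]
```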
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19787#discussion_r152198691

--- Diff: python/pyspark/sql/functions.py ---
@@ -2205,6 +2205,10 @@ def udf(f=None, returnType=StringType()):
 rows that do not satisfy the conditions, the suggested workaround is to incorporate the
 condition logic into the functions.
+.. note:: Users can't rely on short-circuit evaluation of boolean expressions to execute
+    conditionally user-defined functions too. For example, the two functions in an expression
+    like udf1(x) && udf2(y) will be both executed on all rows.
--- End diff --

Sorry, this is not correct. pandas_udf can be used in boolean expressions.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19787#discussion_r152198181

--- Diff: python/pyspark/sql/functions.py ---
@@ -2198,12 +2198,9 @@ def udf(f=None, returnType=StringType()):
 duplicate invocations may be eliminated or the function may even be invoked more
 times than it is present in the query.
-.. note:: The user-defined functions do not support conditional execution by using them with
-    SQL conditional expressions such as `when` or `if`. The functions still apply on all rows no
-    matter the conditions are met or not. So the output is correct if the functions can be
-    correctly run on all rows without failure. If the functions can cause runtime failure on the
-    rows that do not satisfy the conditions, the suggested workaround is to incorporate the
-    condition logic into the functions.
+.. note:: The user-defined functions do not support conditional expressions or short circuiting
+    in boolean expressions and it ends up with being executed all internally. If the functions
+    can fail on special rows, the workaround is to incorporate the condition into the functions.
--- End diff --

IMHO, it is even less likely that users would expect pandas_udf to execute conditionally on rows, because it applies to a pd.Series.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19787#discussion_r152195436

--- Diff: python/pyspark/sql/functions.py ---
@@ -2198,12 +2198,9 @@ def udf(f=None, returnType=StringType()):
 duplicate invocations may be eliminated or the function may even be invoked more
 times than it is present in the query.
-.. note:: The user-defined functions do not support conditional execution by using them with
-    SQL conditional expressions such as `when` or `if`. The functions still apply on all rows no
-    matter the conditions are met or not. So the output is correct if the functions can be
-    correctly run on all rows without failure. If the functions can cause runtime failure on the
-    rows that do not satisfy the conditions, the suggested workaround is to incorporate the
-    condition logic into the functions.
+.. note:: The user-defined functions do not support conditional expressions or short circuiting
+    in boolean expressions and it ends up with being executed all internally. If the functions
+    can fail on special rows, the workaround is to incorporate the condition into the functions.
--- End diff --

Hm .. actually doesn't the same thing apply to `pandas_udf` too? I was just double checking:

```python
from pyspark.sql.functions import pandas_udf

def call1(b):
    print("I am call1")
    return b

def call2(b):
    print("I am call2")
    return b

bool1 = pandas_udf(call1, "boolean")
bool2 = pandas_udf(call2, "boolean")

spark.createDataFrame([[True]]).select(bool1("_1") | bool2("_1")).explain(True)
spark.createDataFrame([[True]]).select(bool1("_1") | bool2("_1")).show()
```
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19787#discussion_r152170467

--- Diff: python/pyspark/sql/functions.py ---
@@ -2205,6 +2205,10 @@ def udf(f=None, returnType=StringType()):
 rows that do not satisfy the conditions, the suggested workaround is to incorporate the
 condition logic into the functions.
+.. note:: Users can't rely on short-circuit evaluation of boolean expressions to execute
--- End diff --

Ok.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19787#discussion_r152014337

--- Diff: python/pyspark/sql/functions.py ---
@@ -2205,6 +2205,10 @@ def udf(f=None, returnType=StringType()):
 rows that do not satisfy the conditions, the suggested workaround is to incorporate the
 condition logic into the functions.
+.. note:: Users can't rely on short-circuit evaluation of boolean expressions to execute
--- End diff --

Just a little worried about overwhelming users with maybe too much information, although it might be worth it.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19787#discussion_r152013410

--- Diff: python/pyspark/sql/functions.py ---
@@ -2205,6 +2205,10 @@ def udf(f=None, returnType=StringType()):
 rows that do not satisfy the conditions, the suggested workaround is to incorporate the
 condition logic into the functions.
+.. note:: Users can't rely on short-circuit evaluation of boolean expressions to execute
--- End diff --

Looks okay, but how about combining this comment with the one above and making it shorter if possible? Like: udfs don't support conditional expressions or short circuiting and it ends up with being executed all internally. If it depends on ..., workaround blabla and blabla.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19787#discussion_r151928352

--- Diff: python/pyspark/sql/functions.py ---
@@ -2205,6 +2205,10 @@ def udf(f=None, returnType=StringType()):
 rows that do not satisfy the conditions, the suggested workaround is to incorporate the
 condition logic into the functions.
+.. note:: Users can't rely on short-circuit evaluation of boolean expressions to execute
+    conditionally user-defined functions too. For example, the two functions in an expression
+    like udf1(x) && udf2(y) will be both executed on all rows.
--- End diff --

I think pandas_udf can't be used in boolean expressions, as it returns a pandas.Series.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19787#discussion_r151927818

--- Diff: python/pyspark/sql/functions.py ---
@@ -2205,6 +2205,10 @@ def udf(f=None, returnType=StringType()):
 rows that do not satisfy the conditions, the suggested workaround is to incorporate the
 condition logic into the functions.
+.. note:: Users can't rely on short-circuit evaluation of boolean expressions to execute
+    conditionally user-defined functions too. For example, the two functions in an expression
+    like udf1(x) && udf2(y) will be both executed on all rows.
--- End diff --

Does it apply to pandas_udf?
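The behavior under discussion (no short-circuiting in `udf1(x) && udf2(y)`) can be modeled without Spark. The functions `udf1`/`udf2` and the rows below are hypothetical, and the per-column loops only approximate how Spark evaluates each Python UDF as a separate batch before combining the results:

```python
# Model of why both UDFs in udf1(x) && udf2(y) run on every row:
# unlike Python's `and`, which short-circuits, the engine computes
# each UDF column in full and only then combines the booleans.

calls = []

def udf1(x):
    calls.append("udf1")
    return x > 0

def udf2(y):
    calls.append("udf2")
    return y < 10

rows = [(1, 5), (-1, 20)]

# Evaluate each UDF over all rows, then combine the results:
col1 = [udf1(x) for x, _ in rows]
col2 = [udf2(y) for _, y in rows]
combined = [a and b for a, b in zip(col1, col2)]

print(calls)     # ['udf1', 'udf1', 'udf2', 'udf2'] -- no short-circuit
print(combined)  # [True, False]
```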
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/19787

[SPARK-22541][SQL] Explicitly claim that Python UDFs can't be conditionally executed with short-circuit evaluation

## What changes were proposed in this pull request?

Besides conditional expressions such as `when` and `if`, users may want to conditionally execute Python UDFs via short-circuit evaluation. We should also explicitly note that Python UDFs don't support this kind of conditional execution.

## How was this patch tested?

N/A, documentation change only.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-22541

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19787.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19787

commit 3b69777924d0ac54bc4b6ec9c740cb20774bf033
Author: Liang-Chi Hsieh
Date: 2017-11-20T07:13:32Z

    Add document for udf.