[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/22620#discussion_r222698014 --- Diff: python/pyspark/sql/udf.py --- @@ -310,9 +319,11 @@ def register(self, name, f, returnType=None): "Invalid returnType: data type can not be specified when f is" "a user-defined function, but got %s." % returnType) if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF, - PythonEvalType.SQL_SCALAR_PANDAS_UDF]: + PythonEvalType.SQL_SCALAR_PANDAS_UDF, + PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF]: --- End diff -- I opened https://issues.apache.org/jira/browse/SPARK-25640 to track this. To be clear, this is transparent to end users, but I agree it can be confusing to developers. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22620 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22620#discussion_r222487117 --- Diff: python/pyspark/sql/udf.py --- @@ -310,9 +319,11 @@ def register(self, name, f, returnType=None): "Invalid returnType: data type can not be specified when f is" "a user-defined function, but got %s." % returnType) if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF, - PythonEvalType.SQL_SCALAR_PANDAS_UDF]: + PythonEvalType.SQL_SCALAR_PANDAS_UDF, + PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF]: --- End diff -- These need to be clearly defined in Apache Spark 3.0 release; otherwise, it might be confusing to both developers and end users. :-) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/22620#discussion_r222456993 --- Diff: python/pyspark/sql/udf.py --- @@ -310,9 +319,11 @@ def register(self, name, f, returnType=None): "Invalid returnType: data type can not be specified when f is" "a user-defined function, but got %s." % returnType) if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF, - PythonEvalType.SQL_SCALAR_PANDAS_UDF]: + PythonEvalType.SQL_SCALAR_PANDAS_UDF, + PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF]: --- End diff -- We don't need it here: Users specify GROUPED_AGG only. GROUPED_AGG is turned to WINDOW_AGG eval type in WindowInPandasExec. Admittedly, there is a bit confusion here we can improve. We just haven't got a user specified udf type that maps to multiple evalType before WINDOW_AGG. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22620#discussion_r222452544 --- Diff: python/pyspark/sql/udf.py --- @@ -310,9 +319,11 @@ def register(self, name, f, returnType=None): "Invalid returnType: data type can not be specified when f is" "a user-defined function, but got %s." % returnType) if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF, - PythonEvalType.SQL_SCALAR_PANDAS_UDF]: + PythonEvalType.SQL_SCALAR_PANDAS_UDF, + PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF]: --- End diff -- how about SQL_WINDOW_AGG_PANDAS_UDF? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/22620#discussion_r222421940 --- Diff: python/pyspark/sql/udf.py --- @@ -298,6 +298,15 @@ def register(self, name, f, returnType=None): >>> spark.sql("SELECT add_one(id) FROM range(3)").collect() # doctest: +SKIP [Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)] +>>> @pandas_udf("integer", PandasUDFType.GROUPED_AGG) # doctest: +SKIP +... def sum_udf(v): +... return v.sum() +... +>>> _ = spark.udf.register("sum_udf", sum_udf) # doctest: +SKIP --- End diff -- Ha. I see.. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22620#discussion_r222417513 --- Diff: python/pyspark/sql/udf.py --- @@ -298,6 +298,15 @@ def register(self, name, f, returnType=None): >>> spark.sql("SELECT add_one(id) FROM range(3)").collect() # doctest: +SKIP [Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)] +>>> @pandas_udf("integer", PandasUDFType.GROUPED_AGG) # doctest: +SKIP +... def sum_udf(v): +... return v.sum() +... +>>> _ = spark.udf.register("sum_udf", sum_udf) # doctest: +SKIP --- End diff -- Hide the output like ... ``` >>> spark.udf.register("sum_udf", sum_udf) ``` in the doctest. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/22620#discussion_r222411585 --- Diff: python/pyspark/sql/udf.py --- @@ -298,6 +298,15 @@ def register(self, name, f, returnType=None): >>> spark.sql("SELECT add_one(id) FROM range(3)").collect() # doctest: +SKIP [Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)] +>>> @pandas_udf("integer", PandasUDFType.GROUPED_AGG) # doctest: +SKIP +... def sum_udf(v): +... return v.sum() +... +>>> _ = spark.udf.register("sum_udf", sum_udf) # doctest: +SKIP --- End diff -- what is the "_ =" thing here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/22620 [SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement ## What changes were proposed in this pull request? This PR proposes to register Grouped aggregate UDF Vectorized UDFs for SQL Statement, for instance: ```python from pyspark.sql.functions import pandas_udf, PandasUDFType @pandas_udf("integer", PandasUDFType.GROUPED_AGG) # doctest: +SKIP def sum_udf(v): return v.sum() spark.udf.register("sum_udf", sum_udf) # doctest: +SKIP q = "SELECT sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2" spark.sql(q).show() +---+ |sum_udf(v1)| +---+ | 1| | 5| +---+ ``` ## How was this patch tested? Manual test and unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-25601 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22620.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22620 commit 06a7bd0c7daed0f3af5b42c1ea8a9b4b5e2e6216 Author: hyukjinkwon Date: 2018-10-03T11:13:32Z Register Grouped aggregate UDF Vectorized UDFs for SQL Statement --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org