[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user mstewart141 commented on the issue: https://github.com/apache/spark/pull/20900 @icexelloss as a daily user of `pandas_udf`, the inability to use keyword arguments, and the difficulties around default arguments (due in part to the magic that converts string arguments to `pd.series`, which doesn't apply to default args) , are much more annoying to me than the lack of support for partials and callables, which are more peripheral issues. (take as just one data point, certainly, others may have differing opinions.) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20900 Created https://issues.apache.org/jira/browse/SPARK-23800
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20900 @HyukjinKwon Thanks for the explanation. I will create a Jira for partial functions and callable objects in Pandas UDF. I am happy to take a look at it.
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20900 The issue itself here (SPARK-23645) describes kwargs support on the calling side in both UDF and Pandas UDF. It does not seem to work, but the fix looks like it would be quite invasive and big, so I suggested fixing the documentation for now. Maybe we should revisit this in the future. Let's monitor the mailing list and JIRAs. https://github.com/apache/spark/pull/20900#issuecomment-376356469 together with https://github.com/apache/spark/pull/20900#issuecomment-376357750 is a separate issue about partial functions and callable objects in Pandas UDF, which I found during review.
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20900

To be clear, I think both functions below

```python
class F(object):
    def __call__(...):
        ...

func = F()
```

```python
from functools import partial

def naive_func(a, b):
    ...

func = partial(naive_func, a=1)
```

should work with Pandas UDF, but they do not seem to work given my test: https://github.com/apache/spark/pull/20900#issuecomment-375949480
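As a rough illustration of why these two cases differ from an ordinary function, the plain-Python sketch below (no Spark involved) shows that a callable object and a partial can both be called, yet neither is a plain function as far as `inspect.isfunction` is concerned. The class `AddOne` is hypothetical and stands in for `F`; the partial binds `b` rather than `a` only so that it can be called positionally here.

```python
import inspect
from functools import partial


class AddOne(object):
    """Hypothetical callable object, standing in for `F` above."""

    def __call__(self, x):
        return x + 1


def naive_func(a, b):
    return a + b


callable_obj = AddOne()
partial_func = partial(naive_func, b=1)

# Both objects can be called just like a function.
callable_results = (callable_obj(1), partial_func(1))

# Neither of them is a plain function, though, so code that only
# inspects plain functions will not accept them.
is_plain_function = (
    inspect.isfunction(naive_func),
    inspect.isfunction(callable_obj),
    inspect.isfunction(partial_func),
)

print(callable_results, is_plain_function)
```

This only demonstrates the Python-level distinction; it does not show how PySpark itself handles either object.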
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20900 @icexelloss, yup ^ is correct. IIRC, we have some tests for normal udfs with callable objects and partial functions separately, but it seems the problem is in Pandas UDF. I think the fix itself will be relatively minimal (just my wild guess). Would you be interested in doing this?
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user mstewart141 commented on the issue: https://github.com/apache/spark/pull/20900 Partials (and callable objects) are supported in UDF but not `pandas_udf`; kw args are not supported by either.
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20900 Thank you @mstewart141 for looking into this. @HyukjinKwon should we open a Jira for supporting kw args and partial functions in Python UDFs? If I understand correctly, this is related to both regular Python UDFs and pandas UDFs, is that right?
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20900 Merged to master and branch-2.3 anyway.
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20900 I think we should generally make everything work in both Python 2 and Python 3, but I would also like to know if there are any special cases that I am missing.
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20900 Merged build finished. Test PASSed.
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20900 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88573/ Test PASSed.
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20900 **[Test build #88573 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88573/testReport)** for PR 20900 at commit [`a3da39c`](https://github.com/apache/spark/commit/a3da39ca62f69fd4e3a4c417ed28613edd15924f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20900

> One general question: how do we tend to think about the py2/3 split for api quirks/features? Must everything that is added for py3 also be functional in py2?

Ideally. Is there something you have in mind that would not work in py2?
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user mstewart141 commented on the issue: https://github.com/apache/spark/pull/20900

Many (though not all; I don't think `callable`s are impacted) of the limitations of `pandas_udf` relative to UDF in this domain are due to the fact that `pandas_udf` doesn't allow for keyword arguments at the call site. This obviously impacts plain old function-based `pandas_udf`s, but also partial fns, where one would typically need to specify the argument (that one was partially applying) by name.

In the incremental commits of this PR at: https://github.com/apache/spark/pull/20900/commits/9ea2595f0cecb0cd05e0e6b99baf538679332e8b You can see the kind of things I was investigating to try and fix that case. Indeed, my original PR was (ambitiously) titled something about enabling kw args for all pandas_udfs. This is actually very easy to do for *functions* on python3 (and maybe for plain functions in py2 also, but I suspect that is rather tricky, as `getargspec` is pretty unhelpful when it comes to some of the kw-arg metadata one would need). However, it is rather harder for the partial function case, as one quickly gets into stacktraces from places like `python/pyspark/worker.py`, where the semantics of the current strategy do not realize that a column from the arguments list may already be "accounted for", and one runs into duplicate arguments passed for `a`, e.g., as a result of this.

My summary is that the change to allow kw for functions is simple (at least in py3 -- indeed my incremental commit referenced above does this), but for partial fns maybe not so much. I'm pretty confident I'm most of the way to accomplishing the former, but not the latter. However, I have no substantial knowledge of the pyspark codebase, so you will likely have better luck there, should you go down that route :)

**TL;DR**: I could work on a PR to allow keyword arguments for python3 functions (not partials, and not py2), but that is likely too narrow a goal given the broader context.

One general question: how do we tend to think about the py2/3 split for api quirks/features? Must everything that is added for py3 also be functional in py2?
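The "duplicate arguments passed for `a`" problem described above can be reproduced in plain Python, without Spark. The sketch below assumes only that the remaining argument reaches the partial positionally, which is what a call site without keyword arguments implies; `naive_func` is the illustrative function from elsewhere in this thread.

```python
from functools import partial


def naive_func(a, b):
    return a + b


# `a` is bound by keyword, as in the partial discussed in the thread.
bound_a = partial(naive_func, a=1)

# Supplying the remaining argument by keyword works as expected.
by_keyword = bound_a(b=2)

# Supplying it positionally fills the first slot, which is `a` again,
# so Python sees two values for `a` and raises TypeError.
try:
    bound_a(2)
    positional_error = None
except TypeError as exc:
    positional_error = str(exc)

print(by_keyword)
print(positional_error)
```

Binding the trailing parameter (`b`) instead of the leading one avoids the clash in plain Python, since a single positional argument then lands on `a`.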
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20900 **[Test build #88573 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88573/testReport)** for PR 20900 at commit [`a3da39c`](https://github.com/apache/spark/commit/a3da39ca62f69fd4e3a4c417ed28613edd15924f).
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20900 LGTM except https://github.com/apache/spark/pull/20900#discussion_r176930776
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20900

From a very quick look at the case "Try to be sneaky and don't use keywords with partial:", it seems to be due to a type mismatch. This seems to work fine (in Python 3):

```
>>> spark.range(1).withColumn('ok', pandas_udf(f=partial(test_func, 2), returnType='bigint')('id')).show()
+---+---+
| id| ok|
+---+---+
|  0|  2|
+---+---+
```

I think we can take this example out of the description.
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20900

@mstewart141, just to be clear, the error:

```
ValueError: Function has keyword-only parameters or annotations, use getfullargspec() API which can support them
```

comes from the deprecated `getargspec` being used instead of `getfullargspec`, which you have fixed. The current error looks like this:

```
Traceback (most recent call last):
  File "", line 1, in
  File "/.../spark/python/pyspark/sql/functions.py", line 2380, in pandas_udf
    return _create_udf(f=f, returnType=return_type, evalType=eval_type)
  File "/.../spark/python/pyspark/sql/udf.py", line 51, in _create_udf
    argspec = _get_argspec(f)
  File "/.../spark/python/pyspark/util.py", line 60, in _get_argspec
    argspec = inspect.getargspec(f)
  File "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/inspect.py", line 818, in getargspec
    raise TypeError('{!r} is not a Python function'.format(func))
TypeError: is not a Python function
```

with the reproducer below:

```python
from functools import partial
from pyspark.sql.functions import pandas_udf

def test_func(a, b):
    return a + b

pandas_udf(partial(test_func, b='id'), "string")
```

I think this should work like a normal udf:

```python
from functools import partial
from pyspark.sql.functions import udf

def test_func(a, b):
    return a + b

normal_udf = udf(partial(test_func, b='id'), "string")
df = spark.createDataFrame([["a"]])
df.select(normal_udf("_1")).show()
```

So, I think we should add support for callable objects / partial functions in Pandas UDFs. Would you be interested in filing JIRA(s) and proceeding? If you are busy, I am willing to do it as well.
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20900 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88566/ Test PASSed.
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20900 Merged build finished. Test PASSed.
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20900 **[Test build #88566 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88566/testReport)** for PR 20900 at commit [`bc49c3c`](https://github.com/apache/spark/commit/bc49c3cc5ae2e23da5cc7b6d7e1a779e9d012c8c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20900 **[Test build #88566 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88566/testReport)** for PR 20900 at commit [`bc49c3c`](https://github.com/apache/spark/commit/bc49c3cc5ae2e23da5cc7b6d7e1a779e9d012c8c).
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20900 ok to test
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user mstewart141 commented on the issue: https://github.com/apache/spark/pull/20900 @HyukjinKwon the old PR, https://github.com/apache/spark/pull/20798, was a disaster from a git-cleanliness perspective, so I've updated here.
[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20900 Can one of the admins verify this patch?