[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-27 Thread mstewart141
Github user mstewart141 commented on the issue:

https://github.com/apache/spark/pull/20900
  
@icexelloss as a daily user of `pandas_udf`, the inability to use keyword 
arguments, and the difficulties around default arguments (due in part to the 
magic that converts string arguments to `pd.Series`, which doesn't apply to 
default args), are much more annoying to me than the lack of support for 
partials and callables, which are more peripheral issues. 

(Take this as just one data point, certainly; others may have differing 
opinions.)
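
To make the default-argument and keyword pain concrete, here is a minimal, 
Spark-free sketch (the `scale` function and its partials are purely 
illustrative, not Spark API): binding a non-leading argument only works by 
naming it, which is exactly the keyword-style call that `pandas_udf`'s 
positional-only dispatch rejects.

```python
import functools

# Illustrative plain-python function of the shape one might wrap in a UDF;
# `factor` has a default, like the default args discussed above.
def scale(x, factor=2):
    return x * factor

# Binding the non-leading `factor` argument requires naming it -- the
# keyword-style call that pandas_udf cannot express.
double = functools.partial(scale, factor=2)
triple = functools.partial(scale, factor=3)

print(double(10))  # 20
print(triple(10))  # 30
```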


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-26 Thread mstewart141
Github user mstewart141 commented on the issue:

https://github.com/apache/spark/pull/20900
  
Partials (and callable objects) are supported by `udf` but not by 
`pandas_udf`; keyword arguments are supported by neither.


---




[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-25 Thread mstewart141
Github user mstewart141 commented on the issue:

https://github.com/apache/spark/pull/20900
  
Many (though not all; I don't think `callable`s are affected) of the 
limitations of `pandas_udf` relative to `udf` in this domain are due to the fact 
that `pandas_udf` doesn't allow keyword arguments at the call site. This 
obviously impacts plain function-based `pandas_udf`s, but also partial 
functions, where one would typically need to specify the partially applied 
argument by name.

In the incremental commits of this PR as at:

https://github.com/apache/spark/pull/20900/commits/9ea2595f0cecb0cd05e0e6b99baf538679332e8b

you can see the kind of things I was investigating to try to fix that 
case. Indeed, my original PR was (ambitiously) titled something about enabling 
keyword arguments for all `pandas_udf`s. This is actually very easy to do for 
*functions* on python3 (and maybe for plain functions in py2 also, but I 
suspect that is rather tricky, as `getargspec` is pretty unhelpful when it 
comes to some of the keyword-argument metadata one would need). However, it is 
rather harder for the partial-function case, as one quickly gets into stack 
traces from places like `python/pyspark/worker.py`, where the current strategy 
does not realize that a column from the arguments list may already be 
"accounted for", and as a result one runs into duplicate arguments passed for 
`a`, for example.

My summary is that the change to allow keyword arguments for functions is 
simple (at least in py3; indeed, my incremental commit referenced above does 
this), but for partial functions maybe not so much. I'm pretty confident I'm 
most of the way to accomplishing the former, but not the latter.
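
The duplicate-argument failure described above can be reproduced in plain 
python, without Spark (the function `f` and partial `g` are hypothetical 
stand-ins for a UDF and its partial):

```python
import functools

def f(a, b, c):
    return (a, b, c)

# Bind `b` by keyword, as one typically does with partials.
g = functools.partial(f, b=2)

# A purely positional caller that doesn't know `b` is already bound
# re-supplies it -- the same "duplicate argument" error surfaced via worker.py.
try:
    g(1, 2, 3)  # expands to f(1, 2, 3, b=2)
except TypeError as e:
    print(e)  # multiple values for argument 'b'

# Passing the remaining arguments by keyword works as intended.
print(g(1, c=3))  # (1, 2, 3)
```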

However, I have no substantial knowledge of the pyspark codebase so you 
will likely have better luck there, should you go down that route :)

**TL;DR**: I could work on a PR to allow keyword arguments for python3 
functions (not partials, and not py2), but that is likely too narrow a goal 
given the broader context.

One general question: how do we tend to think about the py2/3 split for API 
quirks/features? Must everything that is added for py3 also be functional in 
py2?


---




[GitHub] spark pull request #20798: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `p...

2018-03-25 Thread mstewart141
Github user mstewart141 closed the pull request at:

https://github.com/apache/spark/pull/20798


---




[GitHub] spark issue #20798: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-25 Thread mstewart141
Github user mstewart141 commented on the issue:

https://github.com/apache/spark/pull/20798
  
see https://github.com/apache/spark/pull/20900


---




[GitHub] spark issue #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-24 Thread mstewart141
Github user mstewart141 commented on the issue:

https://github.com/apache/spark/pull/20900
  
@HyukjinKwon the old pr: https://github.com/apache/spark/pull/20798

was a disaster from a git-cleanliness perspective, so I've updated here.


---




[GitHub] spark pull request #20900: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `p...

2018-03-24 Thread mstewart141
GitHub user mstewart141 opened a pull request:

https://github.com/apache/spark/pull/20900

 [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_udf` with keyword 
args

## What changes were proposed in this pull request?

Add documentation about the limitations of `pandas_udf` with keyword 
arguments and related concepts, like `functools.partial` fn objects.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mstewart141/spark udfkw2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20900.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20900


commit 048570f7e5f421288b7c297e4d2e3873626a6adc
Author: Michael (Stu) Stewart <mstewart141@...>
Date:   2018-03-11T20:38:29Z

[SPARK-23645][PYTHON] Allow python udfs to be called with keyword arguments

commit 9ea2595f0cecb0cd05e0e6b99baf538679332e8b
Author: Michael (Stu) Stewart <mstewart141@...>
Date:   2018-03-18T18:04:21Z

Incomplete / Show issue with partial fn in pandas_udf

commit acd1cbe53dc7d1bf83b1022a7e36652cd9530b58
Author: Michael (Stu) Stewart <mstewart141@...>
Date:   2018-03-18T18:13:53Z

Add note RE no keyword args in python UDFs

commit bc49c3cc5ae2e23da5cc7b6d7e1a779e9d012c8c
Author: Michael (Stu) Stewart <mstewart141@...>
Date:   2018-03-24T17:30:15Z

Address comments




---




[GitHub] spark issue #20798: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-20 Thread mstewart141
Github user mstewart141 commented on the issue:

https://github.com/apache/spark/pull/20798
  
all that makes sense; i will update.


---




[GitHub] spark issue #20798: [SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_u...

2018-03-18 Thread mstewart141
Github user mstewart141 commented on the issue:

https://github.com/apache/spark/pull/20798
  
@HyukjinKwon thanks again. I've updated this PR to add documentation. I dug 
pretty deep into the bigger issue around kwargs/partial functions, and you can 
see what I did in the commit:

https://github.com/apache/spark/pull/20798/commits/969f9073ee06d2a5641f78247b75e30d9ad1679a

Basically, throughout the udf and arrow serialization code there is no 
notion of kwargs being supported, which makes wiring everything together more 
challenging than I anticipated. Definitely not impossible, but not a small 
undertaking either.


---




[GitHub] spark issue #20798: [SPARK-23645][PYTHON] Allow python udfs to be called wit...

2018-03-11 Thread mstewart141
Github user mstewart141 commented on the issue:

https://github.com/apache/spark/pull/20798
  
[WIP]
cc @HyukjinKwon
👍

I'd love to run tests here to make sure I haven't broken something. I will 
update the PR with new tests once I set up testing better on my local box.


---




[GitHub] spark pull request #20798: [SPARK-23645][PYTHON] Allow python udfs to be cal...

2018-03-11 Thread mstewart141
GitHub user mstewart141 opened a pull request:

https://github.com/apache/spark/pull/20798

[SPARK-23645][PYTHON] Allow python udfs to be called with keyword arguments

## [WIP]

## What changes were proposed in this pull request?

Currently one cannot pass keyword arguments to python UDFs. This patch 
allows keyword arguments to be mixed arbitrarily with positional arguments, as 
in normal python functions.

UDFs accepting an arbitrary (undefined) number of columns are a different 
matter, and not addressed here.
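
As a sketch of the desired semantics (this `normalize_call` helper is purely 
illustrative, not the patch itself or any pyspark API): a dispatcher would need 
to resolve a mixed positional/keyword call back into the declared positional 
argument order it expects, which `inspect.signature` makes straightforward.

```python
import inspect

def normalize_call(f, *args, **kwargs):
    """Map a mixed positional/keyword call onto f's declared argument order."""
    bound = inspect.signature(f).bind(*args, **kwargs)
    bound.apply_defaults()
    return tuple(bound.arguments.values())

def add(x, y, z=0):
    return x + y + z

# Keyword arguments in any order resolve to the declared positional order.
print(normalize_call(add, 1, z=3, y=2))  # (1, 2, 3)
print(normalize_call(add, 1, 2))         # (1, 2, 0)
```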

## How was this patch tested?

I will add unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mstewart141/spark udfkw

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20798.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20798


commit 5ec810a7c36691df1877ffc11e6f06392d438485
Author: Michael (Stu) Stewart <mstewart141@...>
Date:   2018-03-11T20:38:29Z

[SPARK-23645][PYTHON] Allow python udfs to be called with keyword arguments




---




[GitHub] spark issue #20728: [SPARK-23569][PYTHON] Allow pandas_udf to work with pyth...

2018-03-04 Thread mstewart141
Github user mstewart141 commented on the issue:

https://github.com/apache/spark/pull/20728
  
Your test definitely makes sense; yeah, the syntax-error-in-py2 part is why I 
wasn't sure how to go about testing this in the first place. This certainly 
gets the job done.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20728: [SPARK-23569][PYTHON] Allow pandas_udf to work wi...

2018-03-04 Thread mstewart141
Github user mstewart141 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20728#discussion_r172063118
  
--- Diff: python/pyspark/sql/udf.py ---
@@ -42,10 +42,15 @@ def _create_udf(f, returnType, evalType):
 PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF):
 
 import inspect
+import sys
 from pyspark.sql.utils import require_minimum_pyarrow_version
 
 require_minimum_pyarrow_version()
-argspec = inspect.getargspec(f)
+
+if sys.version_info[0] < 3:
+argspec = inspect.getargspec(f)
+else:
+argspec = inspect.getfullargspec(f)
--- End diff --

can do.


---




[GitHub] spark issue #20728: [SPARK-23569][PYTHON] Allow pandas_udf to work with pyth...

2018-03-03 Thread mstewart141
Github user mstewart141 commented on the issue:

https://github.com/apache/spark/pull/20728
  
what should next step be here?


---




[GitHub] spark issue #20728: [SPARK-23569][PYTHON] Allow pandas_udf to work with pyth...

2018-03-03 Thread mstewart141
Github user mstewart141 commented on the issue:

https://github.com/apache/spark/pull/20728
  
cc @HyukjinKwon 
👍 


---




[GitHub] spark pull request #20728: [SPARK-23569][PYTHON] Allow pandas_udf to work wi...

2018-03-03 Thread mstewart141
GitHub user mstewart141 opened a pull request:

https://github.com/apache/spark/pull/20728

[SPARK-23569][PYTHON] Allow pandas_udf to work with python3 style 
type-annotated functions

## What changes were proposed in this pull request?

Check the python version to determine whether to use `inspect.getargspec` or 
`inspect.getfullargspec` before applying `pandas_udf` core logic to a function. 
The former is python2.7 (deprecated in python3) and the latter is python3.x. 
The latter correctly accounts for type annotations, which are syntax errors in 
python2.x.
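
A minimal illustration of the difference (string annotations are used so the 
snippet needs nothing beyond the stdlib; `double` is a made-up example 
function, not Spark code):

```python
import inspect

# A python3 type-annotated function; under python2 the annotations
# themselves are a SyntaxError.
def double(v: "pd.Series") -> "pd.Series":
    return v * 2

# getfullargspec (python3) handles the annotated function and also
# exposes the annotation metadata.
spec = inspect.getfullargspec(double)
print(spec.args)         # ['v']
print(spec.annotations)  # {'v': 'pd.Series', 'return': 'pd.Series'}
```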

## How was this patch tested?

Locally, on python 2.7 and 3.6.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mstewart141/spark pandas_udf_fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20728.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20728


commit 3cd53f39f23ebd1b9b4134a9ac22348b301f8bd4
Author: Michael (Stu) Stewart <mstewart141@...>
Date:   2018-03-03T21:54:53Z

[SPARK-23569][PYTHON] Allow pandas_udf to work with python3 style 
type-annotated functions




---
