Github user mstewart141 commented on the issue:

    https://github.com/apache/spark/pull/20900
  
    Many of the limitations of `pandas_udf` relative to `udf` in this domain (though not all; I don't think `callable`s are affected) stem from the fact that `pandas_udf` doesn't allow keyword arguments at the call site. This obviously affects plain function-based `pandas_udf`s, but it also affects partial functions, where one would typically need to specify the partially-applied argument by name.
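    A minimal sketch of the call-site limitation (assuming an active SparkSession and a DataFrame `df` with double columns `v` and `w`; `multiply` is a made-up example, not from this PR):
    
    ```python
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # A plain two-argument function wrapped as a scalar pandas_udf.
    def multiply(a: pd.Series, b: pd.Series) -> pd.Series:
        return a * b

    mult = pandas_udf(multiply, returnType="double")

    df.select(mult(df.v, df.w))      # positional call: works
    df.select(mult(a=df.v, b=df.w))  # keyword call: rejected at the call site
    ```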
    
    In the incremental commits of this PR, as of:
    
https://github.com/apache/spark/pull/20900/commits/9ea2595f0cecb0cd05e0e6b99baf538679332e8b
    
    You can see the kind of thing I was investigating to try to fix that case. Indeed, my original PR was (ambitiously) titled something about enabling keyword arguments for all `pandas_udf`s. This is actually very easy to do for *functions* on python3 (and maybe for plain functions in py2 as well, though I suspect that is rather tricky there, since `getargspec` is pretty unhelpful when it comes to some of the keyword-argument metadata one would need). However, it is considerably harder for the partial-function case: one quickly hits stacktraces from places like `python/pyspark/worker.py`, where the current strategy doesn't realize that a column from the arguments list may already be "accounted for" by the partial, so a duplicate value gets passed for an argument such as `a`. The sketch below reproduces that failure mode.
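    The duplicate-argument problem is easy to reproduce in plain Python, independent of Spark: once a partial binds an argument by keyword, supplying that argument again positionally (which is effectively what the worker does when it passes every column in order) clashes:
    
    ```python
    from functools import partial

    def add(a, b):
        return a + b

    add_one = partial(add, a=1)

    add_one(b=2)  # fine: returns 3
    add_one(2)    # TypeError: add() got multiple values for argument 'a'
    ```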
    
    My summary is that the change to allow keyword arguments for functions is simple (at least in py3; indeed, the incremental commit referenced above does this), but for partial functions, perhaps not so much. I'm pretty confident I'm most of the way to accomplishing the former, but not the latter.
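    For what it's worth, a rough sketch of why py3 makes the function case tractable: `inspect.signature` exposes per-parameter kinds and defaults, including keyword-only parameters, which is exactly the metadata `getargspec` can't provide (illustrative only; not the actual implementation in this PR):
    
    ```python
    import inspect

    def f(a, b=2, *, c):
        return a + b + c

    for name, param in inspect.signature(f).parameters.items():
        # param.kind distinguishes positional-only, positional-or-keyword,
        # and keyword-only; param.default is inspect.Parameter.empty if unset.
        print(name, param.kind, param.default)
    ```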
    
    However, I have no substantial knowledge of the pyspark codebase, so you will likely have better luck there, should you go down that route :)
    
    **TL;DR**: I could work on a PR to allow keyword arguments for python3 
functions (not partials, and not py2), but that is likely too narrow a goal 
given the broader context.
    
    One general question: how do we tend to think about the py2/py3 split for API quirks/features? Must everything added for py3 also be functional in py2?

