Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Reynold Xin Fri, 01 Sep 2017 05:16:39 -0700

Ok, thanks.

+1 on the SPIP for scope etc



On API details (will deal with in code reviews as well but leaving a note
here in case I forget)

1. I would suggest having the API also accept data type specification in
string form. It is usually simpler to say "long" then "LongType()".

2. Think about what error message to show when the rows numbers don't match
at runtime.


On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <[email protected]>
wrote:

> Yes, the aggregation is out of scope for now.
> I think we should continue discussing the aggregation at JIRA and we will
> be adding those later separately.
>
> Thanks.
>
>
> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <[email protected]> wrote:
>
>> Is the idea aggregate is out of scope for the current effort and we will
>> be adding those later?
>>
>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> We've been discussing to support vectorized UDFs in Python and we almost
>>> got a consensus about the APIs, so I'd like to summarize and call for a
>>> vote.
>>>
>>> Note that this vote should focus on APIs for vectorized UDFs, not APIs
>>> for vectorized UDAFs or Window operations.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>
>>>
>>> *Proposed API*
>>>
>>> We introduce a @pandas_udf decorator (or annotation) to define
>>> vectorized UDFs which takes one or more pandas.Series or one integer
>>> value meaning the length of the input value for 0-parameter UDFs. The
>>> return value should be pandas.Series of the specified type and the
>>> length of the returned value should be the same as input value.
>>>
>>> We can define vectorized UDFs as:
>>>
>>>   @pandas_udf(DoubleType())
>>>   def plus(v1, v2):
>>>       return v1 + v2
>>>
>>> or we can define as:
>>>
>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>
>>> We can use it similar to row-by-row UDFs:
>>>
>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>
>>> As for 0-parameter UDFs, we can define and use as:
>>>
>>>   @pandas_udf(LongType())
>>>   def f0(size):
>>>       return pd.Series(1).repeat(size)
>>>
>>>   df.select(f0())
>>>
>>>
>>>
>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>
>>> +1: Yeah, let's go forward and implement the SPIP.
>>> +0: Don't really care.
>>> -1: I don't think this is a good idea because of the following technical
>>> reasons.
>>>
>>> Thanks!
>>>
>>> --
>>> Takuya UESHIN
>>> Tokyo, Japan
>>>
>>> http://twitter.com/ueshin
>>>
>>
>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Reply via email to