Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Takuya UESHIN Fri, 01 Sep 2017 03:30:07 -0700

Yes, the aggregation is out of scope for now.
I think we should continue discussing the aggregation at JIRA and we will
be adding those later separately.


Thanks.


On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <r...@databricks.com> wrote:

> Is the idea aggregate is out of scope for the current effort and we will
> be adding those later?
>
> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ues...@happy-camper.st>
> wrote:
>
>> Hi all,
>>
>> We've been discussing to support vectorized UDFs in Python and we almost
>> got a consensus about the APIs, so I'd like to summarize and call for a
>> vote.
>>
>> Note that this vote should focus on APIs for vectorized UDFs, not APIs
>> for vectorized UDAFs or Window operations.
>>
>> https://issues.apache.org/jira/browse/SPARK-21190
>>
>>
>> *Proposed API*
>>
>> We introduce a @pandas_udf decorator (or annotation) to define
>> vectorized UDFs which takes one or more pandas.Series or one integer
>> value meaning the length of the input value for 0-parameter UDFs. The
>> return value should be pandas.Series of the specified type and the
>> length of the returned value should be the same as input value.
>>
>> We can define vectorized UDFs as:
>>
>>   @pandas_udf(DoubleType())
>>   def plus(v1, v2):
>>       return v1 + v2
>>
>> or we can define as:
>>
>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>
>> We can use it similar to row-by-row UDFs:
>>
>>   df.withColumn('sum', plus(df.v1, df.v2))
>>
>> As for 0-parameter UDFs, we can define and use as:
>>
>>   @pandas_udf(LongType())
>>   def f0(size):
>>       return pd.Series(1).repeat(size)
>>
>>   df.select(f0())
>>
>>
>>
>> The vote will be up for the next 72 hours. Please reply with your vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following technical
>> reasons.
>>
>> Thanks!
>>
>> --
>> Takuya UESHIN
>> Tokyo, Japan
>>
>> http://twitter.com/ueshin
>>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Reply via email to