Is the idea aggregate is out of scope for the current effort and we will be adding those later?
On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ues...@happy-camper.st> wrote: > Hi all, > > We've been discussing to support vectorized UDFs in Python and we almost > got a consensus about the APIs, so I'd like to summarize and call for a > vote. > > Note that this vote should focus on APIs for vectorized UDFs, not APIs for > vectorized UDAFs or Window operations. > > https://issues.apache.org/jira/browse/SPARK-21190 > > > *Proposed API* > > We introduce a @pandas_udf decorator (or annotation) to define vectorized > UDFs which takes one or more pandas.Series or one integer value meaning > the length of the input value for 0-parameter UDFs. The return value should > be pandas.Series of the specified type and the length of the returned > value should be the same as input value. > > We can define vectorized UDFs as: > > @pandas_udf(DoubleType()) > def plus(v1, v2): > return v1 + v2 > > or we can define as: > > plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType()) > > We can use it similar to row-by-row UDFs: > > df.withColumn('sum', plus(df.v1, df.v2)) > > As for 0-parameter UDFs, we can define and use as: > > @pandas_udf(LongType()) > def f0(size): > return pd.Series(1).repeat(size) > > df.select(f0()) > > > > The vote will be up for the next 72 hours. Please reply with your vote: > > +1: Yeah, let's go forward and implement the SPIP. > +0: Don't really care. > -1: I don't think this is a good idea because of the following technical > reasons. > > Thanks! > > -- > Takuya UESHIN > Tokyo, Japan > > http://twitter.com/ueshin >