Yes, the aggregation is out of scope for now. I think we should continue discussing the aggregation at JIRA and we will be adding those later separately.
Thanks. On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <r...@databricks.com> wrote: > Is the idea aggregate is out of scope for the current effort and we will > be adding those later? > > On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ues...@happy-camper.st> > wrote: > >> Hi all, >> >> We've been discussing to support vectorized UDFs in Python and we almost >> got a consensus about the APIs, so I'd like to summarize and call for a >> vote. >> >> Note that this vote should focus on APIs for vectorized UDFs, not APIs >> for vectorized UDAFs or Window operations. >> >> https://issues.apache.org/jira/browse/SPARK-21190 >> >> >> *Proposed API* >> >> We introduce a @pandas_udf decorator (or annotation) to define >> vectorized UDFs which takes one or more pandas.Series or one integer >> value meaning the length of the input value for 0-parameter UDFs. The >> return value should be pandas.Series of the specified type and the >> length of the returned value should be the same as input value. >> >> We can define vectorized UDFs as: >> >> @pandas_udf(DoubleType()) >> def plus(v1, v2): >> return v1 + v2 >> >> or we can define as: >> >> plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType()) >> >> We can use it similar to row-by-row UDFs: >> >> df.withColumn('sum', plus(df.v1, df.v2)) >> >> As for 0-parameter UDFs, we can define and use as: >> >> @pandas_udf(LongType()) >> def f0(size): >> return pd.Series(1).repeat(size) >> >> df.select(f0()) >> >> >> >> The vote will be up for the next 72 hours. Please reply with your vote: >> >> +1: Yeah, let's go forward and implement the SPIP. >> +0: Don't really care. >> -1: I don't think this is a good idea because of the following technical >> reasons. >> >> Thanks! >> >> -- >> Takuya UESHIN >> Tokyo, Japan >> >> http://twitter.com/ueshin >> > -- Takuya UESHIN Tokyo, Japan http://twitter.com/ueshin