Hi all, We've been discussing to support vectorized UDFs in Python and we almost got a consensus about the APIs, so I'd like to summarize and call for a vote.
Note that this vote should focus on APIs for vectorized UDFs, not APIs for vectorized UDAFs or Window operations. https://issues.apache.org/jira/browse/SPARK-21190 *Proposed API* We introduce a @pandas_udf decorator (or annotation) to define vectorized UDFs which takes one or more pandas.Series or one integer value meaning the length of the input value for 0-parameter UDFs. The return value should be pandas.Series of the specified type and the length of the returned value should be the same as input value. We can define vectorized UDFs as: @pandas_udf(DoubleType()) def plus(v1, v2): return v1 + v2 or we can define as: plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType()) We can use it similar to row-by-row UDFs: df.withColumn('sum', plus(df.v1, df.v2)) As for 0-parameter UDFs, we can define and use as: @pandas_udf(LongType()) def f0(size): return pd.Series(1).repeat(size) df.select(f0()) The vote will be up for the next 72 hours. Please reply with your vote: +1: Yeah, let's go forward and implement the SPIP. +0: Don't really care. -1: I don't think this is a good idea because of the following technical reasons. Thanks! -- Takuya UESHIN Tokyo, Japan http://twitter.com/ueshin