Hi all,

We've been discussing to support vectorized UDFs in Python and we almost
got a consensus about the APIs, so I'd like to summarize and call for a
vote.

Note that this vote should focus on APIs for vectorized UDFs, not APIs for
vectorized UDAFs or Window operations.

https://issues.apache.org/jira/browse/SPARK-21190


*Proposed API*

We introduce a @pandas_udf decorator (or annotation) to define vectorized
UDFs which takes one or more pandas.Series or one integer value meaning the
length of the input value for 0-parameter UDFs. The return value should be
pandas.Series of the specified type and the length of the returned value
should be the same as input value.

We can define vectorized UDFs as:

  @pandas_udf(DoubleType())
  def plus(v1, v2):
      return v1 + v2

or we can define as:

  plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())

We can use it similar to row-by-row UDFs:

  df.withColumn('sum', plus(df.v1, df.v2))

As for 0-parameter UDFs, we can define and use as:

  @pandas_udf(LongType())
  def f0(size):
      return pd.Series(1).repeat(size)

  df.select(f0())



The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical
reasons.

Thanks!

-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin

Reply via email to