Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Wenchen Fan Tue, 05 Sep 2017 17:49:20 -0700

+1 on the design and proposed API.

One detail I'd like to discuss is the 0-parameter UDF, how we can specify
the size hint. This can be done in the PR review though.


On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> +1 on this and like the suggestion of type in string form.
>
> Would it be correct to assume there will be data type check, for example
> the returned pandas data frame column data types match what are specified.
> We have seen quite a bit of issues/confusions with that in R.
>
> Would it make sense to have a more generic decorator name so that it could
> also be useable for other efficient vectorized format in the future? Or do
> we anticipate the decorator to be format specific and will have more in the
> future?
>
> ------------------------------
> *From:* Reynold Xin <r...@databricks.com>
> *Sent:* Friday, September 1, 2017 5:16:11 AM
> *To:* Takuya UESHIN
> *Cc:* spark-dev
> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>
> Ok, thanks.
>
> +1 on the SPIP for scope etc
>
>
> On API details (will deal with in code reviews as well but leaving a note
> here in case I forget)
>
> 1. I would suggest having the API also accept data type specification in
> string form. It is usually simpler to say "long" then "LongType()".
>
> 2. Think about what error message to show when the rows numbers don't
> match at runtime.
>
>
> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ues...@happy-camper.st>
> wrote:
>
>> Yes, the aggregation is out of scope for now.
>> I think we should continue discussing the aggregation at JIRA and we will
>> be adding those later separately.
>>
>> Thanks.
>>
>>
>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Is the idea aggregate is out of scope for the current effort and we will
>>> be adding those later?
>>>
>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ues...@happy-camper.st>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We've been discussing to support vectorized UDFs in Python and we
>>>> almost got a consensus about the APIs, so I'd like to summarize and
>>>> call for a vote.
>>>>
>>>> Note that this vote should focus on APIs for vectorized UDFs, not APIs
>>>> for vectorized UDAFs or Window operations.
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>>
>>>>
>>>> *Proposed API*
>>>>
>>>> We introduce a @pandas_udf decorator (or annotation) to define
>>>> vectorized UDFs which takes one or more pandas.Series or one integer
>>>> value meaning the length of the input value for 0-parameter UDFs. The
>>>> return value should be pandas.Series of the specified type and the
>>>> length of the returned value should be the same as input value.
>>>>
>>>> We can define vectorized UDFs as:
>>>>
>>>>   @pandas_udf(DoubleType())
>>>>   def plus(v1, v2):
>>>>       return v1 + v2
>>>>
>>>> or we can define as:
>>>>
>>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>>
>>>> We can use it similar to row-by-row UDFs:
>>>>
>>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>>
>>>> As for 0-parameter UDFs, we can define and use as:
>>>>
>>>>   @pandas_udf(LongType())
>>>>   def f0(size):
>>>>       return pd.Series(1).repeat(size)
>>>>
>>>>   df.select(f0())
>>>>
>>>>
>>>>
>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>
>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>> +0: Don't really care.
>>>> -1: I don't think this is a good idea because of the following technical
>>>> reasons.
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> Takuya UESHIN
>>>> Tokyo, Japan
>>>>
>>>> http://twitter.com/ueshin
>>>>
>>>
>>
>>
>> --
>> Takuya UESHIN
>> Tokyo, Japan
>>
>> http://twitter.com/ueshin
>>
>

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Reply via email to