Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Bryan Cutler Thu, 07 Sep 2017 21:11:10 -0700

+1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
fine to work out the minor details of the API during review.


Bryan

On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <[email protected]>
wrote:

> Hi all,
>
> Thank you for the votes and suggestions.
>
> As Wenchen mentioned, and as we're also discussing on JIRA, we still need to
> settle the size hint for the 0-parameter UDF.
> But I believe we have consensus on the basic APIs apart from the size hint,
> so I'd like to submit a PR based on the current proposal and continue the
> discussion in its review.
>
>     https://github.com/apache/spark/pull/19147
>
> I'll keep this vote open to wait for more opinions.
>
> Thanks.
>
>
> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <[email protected]> wrote:
>
>> +1 on the design and proposed API.
>>
>> One detail I'd like to discuss is how we can specify the size hint for the
>> 0-parameter UDF. This can be done in the PR review though.
>>
>> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <[email protected]>
>> wrote:
>>
>>> +1 on this, and I like the suggestion of accepting the type in string form.
>>>
>>> Would it be correct to assume there will be a data type check, for example
>>> that the column data types of the returned pandas data frame match what is
>>> specified? We have seen quite a few issues and much confusion with that in R.
>>>
>>> Would it make sense to have a more generic decorator name so that it
>>> could also be usable for other efficient vectorized formats in the future?
>>> Or do we anticipate the decorator to be format-specific, with more of them
>>> to come in the future?
>>>
>>> ------------------------------
>>> *From:* Reynold Xin <[email protected]>
>>> *Sent:* Friday, September 1, 2017 5:16:11 AM
>>> *To:* Takuya UESHIN
>>> *Cc:* spark-dev
>>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>>>
>>> Ok, thanks.
>>>
>>> +1 on the SPIP for scope etc
>>>
>>>
>>> On API details (will deal with in code reviews as well but leaving a
>>> note here in case I forget)
>>>
>>> 1. I would suggest having the API also accept data type specification in
>>> string form. It is usually simpler to say "long" than "LongType()".
>>>
>>> 2. Think about what error message to show when the row counts don't
>>> match at runtime.
>>>
>>>
>>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <[email protected]>
>>> wrote:
>>>
>>>> Yes, aggregation is out of scope for now.
>>>> I think we should continue discussing aggregation on JIRA, and we
>>>> will add it separately later.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <[email protected]>
>>>> wrote:
>>>>
>>>>> Is the idea that aggregation is out of scope for the current effort and
>>>>> we will be adding it later?
>>>>>
>>>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We've been discussing support for vectorized UDFs in Python and have
>>>>>> almost reached consensus on the APIs, so I'd like to summarize and
>>>>>> call for a vote.
>>>>>>
>>>>>> Note that this vote should focus on APIs for vectorized UDFs, not
>>>>>> APIs for vectorized UDAFs or Window operations.
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>>>>
>>>>>>
>>>>>> *Proposed API*
>>>>>>
>>>>>> We introduce a @pandas_udf decorator (or annotation) to define
>>>>>> vectorized UDFs, which take one or more pandas.Series or, for
>>>>>> 0-parameter UDFs, one integer value giving the length of the input.
>>>>>> The return value should be a pandas.Series of the specified type, and
>>>>>> its length should be the same as that of the input.
>>>>>>
>>>>>> We can define vectorized UDFs as:
>>>>>>
>>>>>>   @pandas_udf(DoubleType())
>>>>>>   def plus(v1, v2):
>>>>>>       return v1 + v2
>>>>>>
>>>>>> or as:
>>>>>>
>>>>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>>>>
>>>>>> We can use it in the same way as row-by-row UDFs:
>>>>>>
>>>>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>>>>
>>>>>> 0-parameter UDFs can be defined and used as:
>>>>>>
>>>>>>   @pandas_udf(LongType())
>>>>>>   def f0(size):
>>>>>>       return pd.Series(1).repeat(size)
>>>>>>
>>>>>>   df.select(f0())
>>>>>>
>>>>>>
>>>>>>
>>>>>> The vote will be up for the next 72 hours. Please reply with your
>>>>>> vote:
>>>>>>
>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>> +0: Don't really care.
>>>>>> -1: I don't think this is a good idea because of the following technical
>>>>>> reasons.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --
>>>>>> Takuya UESHIN
>>>>>> Tokyo, Japan
>>>>>>
>>>>>> http://twitter.com/ueshin
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Takuya UESHIN
>>>> Tokyo, Japan
>>>>
>>>> http://twitter.com/ueshin
>>>>
>>>
>>
>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>
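
The length requirement in the proposal (the returned value must be as long as
the input) and Reynold's note about the error message for mismatched row
counts can be illustrated with a minimal sketch. It is plain Python, not Spark
or pandas code: lists stand in for pandas.Series, and checked_vectorized, plus,
and drops_rows are hypothetical names invented for this illustration. How the
actual implementation reports the mismatch is left to the PR review.

```python
from functools import wraps


def checked_vectorized(func):
    """Wrap a column-at-a-time function and check the length of its result.

    Hypothetical helper for illustration only; plain lists stand in for
    pandas.Series and this is not part of any Spark API.
    """
    @wraps(func)
    def wrapper(*columns):
        if not columns:
            raise ValueError("expected at least one input column")
        expected = len(columns[0])
        # All input columns in a batch must have the same number of rows.
        for position, column in enumerate(columns):
            if len(column) != expected:
                raise ValueError(
                    f"input column {position} has {len(column)} rows "
                    f"but column 0 has {expected} rows"
                )
        result = func(*columns)
        # The result must have exactly one value per input row.
        if len(result) != expected:
            raise ValueError(
                f"{func.__name__} returned {len(result)} rows "
                f"but the input batch has {expected} rows"
            )
        return result
    return wrapper


@checked_vectorized
def plus(v1, v2):
    # Element-wise sum, mirroring the 'plus' example in the proposal.
    return [a + b for a, b in zip(v1, v2)]


@checked_vectorized
def drops_rows(v):
    # Deliberately wrong: returns one row fewer than it was given.
    return v[:-1]
```

Calling plus([1.0, 2.0], [3.0, 4.0]) passes the check, while
drops_rows([1, 2, 3]) raises a ValueError that names the function and both
row counts, which is one possible shape for the runtime message.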
