+1 on the design and proposed API. One detail I'd like to discuss is the 0-parameter UDF, how we can specify the size hint. This can be done in the PR review though.
On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <felixcheun...@hotmail.com> wrote: > +1 on this and like the suggestion of type in string form. > > Would it be correct to assume there will be data type check, for example > the returned pandas data frame column data types match what are specified. > We have seen quite a bit of issues/confusions with that in R. > > Would it make sense to have a more generic decorator name so that it could > also be useable for other efficient vectorized format in the future? Or do > we anticipate the decorator to be format specific and will have more in the > future? > > ------------------------------ > *From:* Reynold Xin <r...@databricks.com> > *Sent:* Friday, September 1, 2017 5:16:11 AM > *To:* Takuya UESHIN > *Cc:* spark-dev > *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python > > Ok, thanks. > > +1 on the SPIP for scope etc > > > On API details (will deal with in code reviews as well but leaving a note > here in case I forget) > > 1. I would suggest having the API also accept data type specification in > string form. It is usually simpler to say "long" then "LongType()". > > 2. Think about what error message to show when the rows numbers don't > match at runtime. > > > On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ues...@happy-camper.st> > wrote: > >> Yes, the aggregation is out of scope for now. >> I think we should continue discussing the aggregation at JIRA and we will >> be adding those later separately. >> >> Thanks. >> >> >> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <r...@databricks.com> wrote: >> >>> Is the idea aggregate is out of scope for the current effort and we will >>> be adding those later? >>> >>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ues...@happy-camper.st> >>> wrote: >>> >>>> Hi all, >>>> >>>> We've been discussing to support vectorized UDFs in Python and we >>>> almost got a consensus about the APIs, so I'd like to summarize and >>>> call for a vote. >>>> >>>> Note that this vote should focus on APIs for vectorized UDFs, not APIs >>>> for vectorized UDAFs or Window operations. >>>> >>>> https://issues.apache.org/jira/browse/SPARK-21190 >>>> >>>> >>>> *Proposed API* >>>> >>>> We introduce a @pandas_udf decorator (or annotation) to define >>>> vectorized UDFs which takes one or more pandas.Series or one integer >>>> value meaning the length of the input value for 0-parameter UDFs. The >>>> return value should be pandas.Series of the specified type and the >>>> length of the returned value should be the same as input value. >>>> >>>> We can define vectorized UDFs as: >>>> >>>> @pandas_udf(DoubleType()) >>>> def plus(v1, v2): >>>> return v1 + v2 >>>> >>>> or we can define as: >>>> >>>> plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType()) >>>> >>>> We can use it similar to row-by-row UDFs: >>>> >>>> df.withColumn('sum', plus(df.v1, df.v2)) >>>> >>>> As for 0-parameter UDFs, we can define and use as: >>>> >>>> @pandas_udf(LongType()) >>>> def f0(size): >>>> return pd.Series(1).repeat(size) >>>> >>>> df.select(f0()) >>>> >>>> >>>> >>>> The vote will be up for the next 72 hours. Please reply with your vote: >>>> >>>> +1: Yeah, let's go forward and implement the SPIP. >>>> +0: Don't really care. >>>> -1: I don't think this is a good idea because of the following technical >>>> reasons. >>>> >>>> Thanks! >>>> >>>> -- >>>> Takuya UESHIN >>>> Tokyo, Japan >>>> >>>> http://twitter.com/ueshin >>>> >>> >> >> >> -- >> Takuya UESHIN >> Tokyo, Japan >> >> http://twitter.com/ueshin >> >