+1 (non-binding) for the goals and non-goals of this SPIP. I think it's fine to work out the minor details of the API during review.
Bryan On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <ues...@happy-camper.st> wrote: > Hi all, > > Thank you for voting and suggestions. > > As Wenchen mentioned and also we're discussing at JIRA, we need to discuss > the size hint for the 0-parameter UDF. > But I believe we got a consensus about the basic APIs except for the size > hint, I'd like to submit a pr based on the current proposal and continue > discussing in its review. > > https://github.com/apache/spark/pull/19147 > > I'd keep this vote open to wait for more opinions. > > Thanks. > > > On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cloud0...@gmail.com> wrote: > >> +1 on the design and proposed API. >> >> One detail I'd like to discuss is the 0-parameter UDF, how we can specify >> the size hint. This can be done in the PR review though. >> >> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <felixcheun...@hotmail.com> >> wrote: >> >>> +1 on this and like the suggestion of type in string form. >>> >>> Would it be correct to assume there will be data type check, for example >>> the returned pandas data frame column data types match what are specified. >>> We have seen quite a bit of issues/confusions with that in R. >>> >>> Would it make sense to have a more generic decorator name so that it >>> could also be useable for other efficient vectorized format in the future? >>> Or do we anticipate the decorator to be format specific and will have more >>> in the future? >>> >>> ------------------------------ >>> *From:* Reynold Xin <r...@databricks.com> >>> *Sent:* Friday, September 1, 2017 5:16:11 AM >>> *To:* Takuya UESHIN >>> *Cc:* spark-dev >>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python >>> >>> Ok, thanks. >>> >>> +1 on the SPIP for scope etc >>> >>> >>> On API details (will deal with in code reviews as well but leaving a >>> note here in case I forget) >>> >>> 1. I would suggest having the API also accept data type specification in >>> string form. It is usually simpler to say "long" then "LongType()". >>> >>> 2. Think about what error message to show when the rows numbers don't >>> match at runtime. >>> >>> >>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ues...@happy-camper.st> >>> wrote: >>> >>>> Yes, the aggregation is out of scope for now. >>>> I think we should continue discussing the aggregation at JIRA and we >>>> will be adding those later separately. >>>> >>>> Thanks. >>>> >>>> >>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <r...@databricks.com> >>>> wrote: >>>> >>>>> Is the idea aggregate is out of scope for the current effort and we >>>>> will be adding those later? >>>>> >>>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ues...@happy-camper.st> >>>>> wrote: >>>>> >>>>>> Hi all, >>>>>> >>>>>> We've been discussing to support vectorized UDFs in Python and we >>>>>> almost got a consensus about the APIs, so I'd like to summarize and >>>>>> call for a vote. >>>>>> >>>>>> Note that this vote should focus on APIs for vectorized UDFs, not >>>>>> APIs for vectorized UDAFs or Window operations. >>>>>> >>>>>> https://issues.apache.org/jira/browse/SPARK-21190 >>>>>> >>>>>> >>>>>> *Proposed API* >>>>>> >>>>>> We introduce a @pandas_udf decorator (or annotation) to define >>>>>> vectorized UDFs which takes one or more pandas.Series or one integer >>>>>> value meaning the length of the input value for 0-parameter UDFs. The >>>>>> return value should be pandas.Series of the specified type and the >>>>>> length of the returned value should be the same as input value. >>>>>> >>>>>> We can define vectorized UDFs as: >>>>>> >>>>>> @pandas_udf(DoubleType()) >>>>>> def plus(v1, v2): >>>>>> return v1 + v2 >>>>>> >>>>>> or we can define as: >>>>>> >>>>>> plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType()) >>>>>> >>>>>> We can use it similar to row-by-row UDFs: >>>>>> >>>>>> df.withColumn('sum', plus(df.v1, df.v2)) >>>>>> >>>>>> As for 0-parameter UDFs, we can define and use as: >>>>>> >>>>>> @pandas_udf(LongType()) >>>>>> def f0(size): >>>>>> return pd.Series(1).repeat(size) >>>>>> >>>>>> df.select(f0()) >>>>>> >>>>>> >>>>>> >>>>>> The vote will be up for the next 72 hours. Please reply with your >>>>>> vote: >>>>>> >>>>>> +1: Yeah, let's go forward and implement the SPIP. >>>>>> +0: Don't really care. >>>>>> -1: I don't think this is a good idea because of the following technical >>>>>> reasons. >>>>>> >>>>>> Thanks! >>>>>> >>>>>> -- >>>>>> Takuya UESHIN >>>>>> Tokyo, Japan >>>>>> >>>>>> http://twitter.com/ueshin >>>>>> >>>>> >>>> >>>> >>>> -- >>>> Takuya UESHIN >>>> Tokyo, Japan >>>> >>>> http://twitter.com/ueshin >>>> >>> >> > > > -- > Takuya UESHIN > Tokyo, Japan > > http://twitter.com/ueshin >