+1

Xiao

On Mon, 11 Sep 2017 at 6:44 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:
> +1 (binding)
>
> > On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> >
> > +1 (non-binding)
> >
> >
> > 2017-09-12 9:52 GMT+09:00 Yin Huai <yh...@databricks.com>:
> > +1
> >
> > On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal <sam...@databricks.com> wrote:
> > +1 (non-binding)
> >
> > On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <cutl...@gmail.com> wrote:
> > +1 (non-binding) for the goals and non-goals of this SPIP. I think it's fine to work out the minor details of the API during review.
> >
> > Bryan
> >
> > On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <ues...@happy-camper.st> wrote:
> > Hi all,
> >
> > Thank you for voting and for the suggestions.
> >
> > As Wenchen mentioned, and as we're also discussing on JIRA, we need to discuss the size hint for the 0-parameter UDF.
> > But I believe we have reached a consensus on the basic APIs except for the size hint, so I'd like to submit a PR based on the current proposal and continue the discussion in its review.
> >
> > https://github.com/apache/spark/pull/19147
> >
> > I'd keep this vote open to wait for more opinions.
> >
> > Thanks.
> >
> >
> > On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
> > +1 on the design and proposed API.
> >
> > One detail I'd like to discuss is the 0-parameter UDF and how we can specify the size hint. This can be done in the PR review though.
> >
> > On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
> > +1 on this, and I like the suggestion of accepting the type in string form.
> >
> > Would it be correct to assume there will be a data type check, for example that the returned pandas data frame column data types match what is specified? We have seen quite a few issues and a lot of confusion with that in R.
> >
> > Would it make sense to have a more generic decorator name so that it could also be usable for other efficient vectorized formats in the future?
> > Or do we anticipate the decorator to be format specific, with more of them to come in the future?
> >
> > From: Reynold Xin <r...@databricks.com>
> > Sent: Friday, September 1, 2017 5:16:11 AM
> > To: Takuya UESHIN
> > Cc: spark-dev
> > Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
> >
> > Ok, thanks.
> >
> > +1 on the SPIP for scope etc
> >
> >
> > On API details (will deal with in code reviews as well but leaving a note here in case I forget)
> >
> > 1. I would suggest having the API also accept data type specification in string form. It is usually simpler to say "long" than "LongType()".
> >
> > 2. Think about what error message to show when the row counts don't match at runtime.
> >
> >
> > On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ues...@happy-camper.st> wrote:
> > Yes, the aggregation is out of scope for now.
> > I think we should continue discussing the aggregation on JIRA, and we will add it later, separately.
> >
> > Thanks.
> >
> >
> > On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <r...@databricks.com> wrote:
> > Is the idea that aggregation is out of scope for the current effort and that we will be adding it later?
> >
> > On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ues...@happy-camper.st> wrote:
> > Hi all,
> >
> > We've been discussing support for vectorized UDFs in Python and we have almost reached a consensus on the APIs, so I'd like to summarize and call for a vote.
> >
> > Note that this vote should focus on APIs for vectorized UDFs, not APIs for vectorized UDAFs or Window operations.
> >
> > https://issues.apache.org/jira/browse/SPARK-21190
> >
> >
> > Proposed API
> >
> > We introduce a @pandas_udf decorator (or annotation) to define vectorized UDFs. A vectorized UDF takes one or more pandas.Series, or, for 0-parameter UDFs, one integer value giving the length of the input. The return value should be a pandas.Series of the specified type, and its length should be the same as that of the input.
> >
> > We can define vectorized UDFs as:
> >
> >   @pandas_udf(DoubleType())
> >   def plus(v1, v2):
> >       return v1 + v2
> >
> > or we can define as:
> >
> >   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
> >
> > We can use it similar to row-by-row UDFs:
> >
> >   df.withColumn('sum', plus(df.v1, df.v2))
> >
> > As for 0-parameter UDFs, we can define and use as:
> >
> >   @pandas_udf(LongType())
> >   def f0(size):
> >       return pd.Series(1).repeat(size)
> >
> >   df.select(f0())
> >
> >
> > The vote will be up for the next 72 hours. Please reply with your vote:
> >
> > +1: Yeah, let's go forward and implement the SPIP.
> > +0: Don't really care.
> > -1: I don't think this is a good idea because of the following technical reasons.
> >
> > Thanks!
> >
> > --
> > Takuya UESHIN
> > Tokyo, Japan
> >
> > http://twitter.com/ueshin
> >
> > --
> > Sameer Agarwal
> > Software Engineer | Databricks Inc.
> > http://cs.berkeley.edu/~sameerag
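
For readers following the thread, here is a rough sketch of the length contract described in the proposal (the returned batch must have as many rows as the input) and of the kind of runtime error message Reynold asks about in point 2. This is not how Spark implements it. The name checked_batch_udf and the size keyword are hypothetical, made up for illustration, and plain Python lists stand in for pandas.Series so that the snippet runs without Spark or pandas. The check relies only on len().

```python
import functools


def checked_batch_udf(func):
    """Wrap a batch function and verify its output length matches its input.

    Hypothetical helper for illustration only; not part of PySpark.
    Batches only need to support len(), so lists stand in for pandas.Series.
    """

    @functools.wraps(func)
    def wrapper(*batches, size=None):
        if batches:
            # All input batches must describe the same number of rows.
            expected = len(batches[0])
            for i, batch in enumerate(batches):
                if len(batch) != expected:
                    raise ValueError(
                        f"{func.__name__}: input {i} has {len(batch)} rows "
                        f"but input 0 has {expected} rows"
                    )
            result = func(*batches)
        else:
            # 0-parameter case: the caller supplies the batch size as a hint.
            if size is None:
                raise TypeError(
                    f"{func.__name__}: a 0-parameter UDF needs a size hint"
                )
            expected = size
            result = func(size)

        # The contract from the proposal: output length equals input length.
        if len(result) != expected:
            raise ValueError(
                f"{func.__name__} returned {len(result)} rows "
                f"but the input batch has {expected} rows"
            )
        return result

    return wrapper


@checked_batch_udf
def plus(v1, v2):
    # Element-wise addition over two equally sized batches.
    return [a + b for a, b in zip(v1, v2)]


@checked_batch_udf
def f0(size):
    # 0-parameter UDF: produce a constant column of the requested length.
    return [1] * size


@checked_batch_udf
def drops_a_row(v):
    # Deliberately broken: returns one row fewer than it received.
    return v[:-1]
```

Calling plus([1.0, 2.0], [3.0, 4.0]) and f0(size=3) passes the check, while drops_a_row([1, 2, 3]) raises a ValueError naming both row counts. How the size hint actually reaches a 0-parameter UDF is exactly the detail the thread leaves open for the PR review, so the size keyword here is only a placeholder.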