Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Xiao Li Mon, 11 Sep 2017 19:47:27 -0700

+1

Xiao
On Mon, 11 Sep 2017 at 6:44 PM Matei Zaharia <[email protected]>
wrote:


> +1 (binding)
>
> > On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon <[email protected]> wrote:
> >
> > +1 (non-binding)
> >
> >
> > 2017-09-12 9:52 GMT+09:00 Yin Huai <[email protected]>:
> > +1
> >
> > On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal <[email protected]>
> wrote:
> > +1 (non-binding)
> >
> > On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <[email protected]> wrote:
> > +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
> fine to work out the minor details of the API during review.
> >
> > Bryan
> >
> > On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <[email protected]>
> wrote:
> > Hi all,
> >
> > Thank you for the votes and suggestions.
> >
> > As Wenchen mentioned, and as we are also discussing on JIRA, we need to
> discuss the size hint for the 0-parameter UDF.
> > But I believe we have reached consensus on the basic APIs except for the
> size hint, so I'd like to submit a PR based on the current proposal and
> continue the discussion in its review.
> >
> >     https://github.com/apache/spark/pull/19147
> >
> > I'll keep this vote open to wait for more opinions.
> >
> > Thanks.
> >
> >
> > On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <[email protected]> wrote:
> > +1 on the design and proposed API.
> >
> > One detail I'd like to discuss is how we can specify the size hint for
> the 0-parameter UDF. This can be done in the PR review though.
> >
> > On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <[email protected]>
> wrote:
> > +1 on this, and I like the suggestion of accepting the type in string form.
> >
> > Would it be correct to assume there will be a data type check, for example
> that the column data types of the returned pandas DataFrame match what is
> specified? We have seen quite a few issues and a lot of confusion with that
> in R.
> >
> > Would it make sense to have a more generic decorator name so that it
> could also be usable for other efficient vectorized formats in the future?
> Or do we anticipate the decorator to be format-specific, with more of them
> added in the future?
> >
> > From: Reynold Xin <[email protected]>
> > Sent: Friday, September 1, 2017 5:16:11 AM
> > To: Takuya UESHIN
> > Cc: spark-dev
> > Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
> >
> > Ok, thanks.
> >
> > +1 on the SPIP for scope etc
> >
> >
> > On API details (will deal with in code reviews as well but leaving a
> note here in case I forget)
> >
> > 1. I would suggest having the API also accept data type specification in
> string form. It is usually simpler to say "long" than "LongType()".
> >
> > 2. Think about what error message to show when the row counts don't
> match at runtime.
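
A minimal sketch of the two notes above, under stated assumptions: the names `normalize_type`, `checked_vectorized`, and the alias table are hypothetical and are not part of Spark or of the proposal. Plain Python lists stand in for pandas.Series so the example runs without Spark; the length check relies only on `len()`, so the same idea would carry over to a Series. It covers UDFs with one or more arguments only.

```python
import functools

# Hypothetical alias table for illustration; not Spark's actual type parser.
_TYPE_ALIASES = {"long": "LongType", "double": "DoubleType", "string": "StringType"}


def normalize_type(spec):
    """Accept 'long' or 'LongType()' style specs and return a canonical name."""
    name = spec.strip()
    if name.endswith("()"):
        name = name[:-2]
    if name in _TYPE_ALIASES.values():
        return name
    try:
        return _TYPE_ALIASES[name.lower()]
    except KeyError:
        raise ValueError(f"Unsupported data type specification: {spec!r}") from None


def checked_vectorized(return_type):
    """Hypothetical stand-in for a vectorized-UDF decorator with a row-count check."""
    canonical = normalize_type(return_type)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*columns):
            result = func(*columns)
            # Compare the output length against the first input column.
            expected = len(columns[0])
            if len(result) != expected:
                raise ValueError(
                    f"Vectorized UDF {func.__name__!r} returned {len(result)} rows, "
                    f"but its input batch has {expected} rows"
                )
            return result

        wrapper.return_type = canonical
        return wrapper

    return decorator


@checked_vectorized("double")
def plus(v1, v2):
    return [a + b for a, b in zip(v1, v2)]


@checked_vectorized("long")
def drops_a_row(v):
    # Deliberately wrong: returns one row fewer than it received.
    return v[:-1]
```

Calling `drops_a_row([1, 2, 3])` raises a `ValueError` that names the function and both row counts, which is one possible shape for the error message; the actual wording would be settled in code review.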
> >
> >
> > On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <[email protected]>
> wrote:
> > Yes, aggregation is out of scope for now.
> > I think we should continue discussing aggregation on JIRA, and we
> will add it later separately.
> >
> > Thanks.
> >
> >
> > On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <[email protected]> wrote:
> > Is the idea that aggregation is out of scope for the current effort and
> we will be adding it later?
> >
> > On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <[email protected]>
> wrote:
> > Hi all,
> >
> > We've been discussing support for vectorized UDFs in Python and have
> almost reached consensus on the APIs, so I'd like to summarize and call for
> a vote.
> >
> > Note that this vote should focus on APIs for vectorized UDFs, not APIs
> for vectorized UDAFs or Window operations.
> >
> > https://issues.apache.org/jira/browse/SPARK-21190
> >
> >
> > Proposed API
> >
> > We introduce a @pandas_udf decorator (or annotation) to define
> vectorized UDFs, which take one or more pandas.Series or, for 0-parameter
> UDFs, one integer value giving the length of the input. The return
> value should be a pandas.Series of the specified type, and its length
> should be the same as that of the input.
> >
> > We can define vectorized UDFs as:
> >
> >   @pandas_udf(DoubleType())
> >   def plus(v1, v2):
> >       return v1 + v2
> >
> > or we can define it as:
> >
> >   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
> >
> > We can use it like a row-by-row UDF:
> >
> >   df.withColumn('sum', plus(df.v1, df.v2))
> >
> > As for 0-parameter UDFs, we can define and use them as:
> >
> >   @pandas_udf(LongType())
> >   def f0(size):
> >       return pd.Series(1).repeat(size)
> >
> >   df.select(f0())
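
To illustrate the 0-parameter convention described above, here is a rough sketch of how a caller might hand each batch to a vectorized function. The dispatcher `call_vectorized` is a hypothetical helper, not Spark's implementation, and the size hint was still under discussion in this thread. Plain lists stand in for pandas.Series so the sketch runs without pandas or Spark.

```python
def call_vectorized(func, columns, batch_size):
    """Invoke `func` on one batch (hypothetical dispatcher).

    `columns` is a list of equally sized column batches. When it is empty,
    the batch size is passed instead, as the proposal describes for
    0-parameter UDFs.
    """
    if columns:
        return func(*columns)
    return func(batch_size)


def f0(size):
    # Plain-list analogue of pd.Series(1).repeat(size).
    return [1] * size


def plus(v1, v2):
    # Element-wise sum of two equally sized columns.
    return [a + b for a, b in zip(v1, v2)]
```

With this helper, `call_vectorized(f0, [], 3)` yields three rows of the constant, while `call_vectorized(plus, [[1, 2], [10, 20]], 2)` applies the function to both columns of the batch.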
> >
> >
> >
> > The vote will be up for the next 72 hours. Please reply with your vote:
> >
> > +1: Yeah, let's go forward and implement the SPIP.
> > +0: Don't really care.
> > -1: I don't think this is a good idea because of the following technical
> reasons.
> >
> > Thanks!
> >
> > --
> > Takuya UESHIN
> > Tokyo, Japan
> >
> > http://twitter.com/ueshin
> >
> >
> >
> >
> > --
> > Sameer Agarwal
> > Software Engineer | Databricks Inc.
> > http://cs.berkeley.edu/~sameerag
> >
> >
>
>
