Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-11 Thread Liang-Chi Hsieh
+1


Xiao Li wrote
> +1
> 
> Xiao
> On Mon, 11 Sep 2017 at 6:44 PM Matei Zaharia  wrote:
> 
>> +1 (binding)
>>
>> > On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon  wrote:
>> >
>> > +1 (non-binding)
>> >
>> >
>> > 2017-09-12 9:52 GMT+09:00 Yin Huai :
>> > +1
>> >
>> > On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal  wrote:
>> > +1 (non-binding)
>> >
>> > On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler  wrote:
>> > +1 (non-binding) for the goals and non-goals of this SPIP.  I think
>> it's fine to work out the minor details of the API during review.
>> >
>> > Bryan
>> >
>> > On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN  wrote:
>> > Hi all,
>> >
>> > Thank you for voting and suggestions.
>> >
>> > As Wenchen mentioned, and as we're also discussing on JIRA, we need to
>> discuss the size hint for the 0-parameter UDF.
>> > But since I believe we have a consensus about the basic APIs except for
>> the size hint, I'd like to submit a PR based on the current proposal and
>> continue the discussion in its review.
>> >
>> > https://github.com/apache/spark/pull/19147
>> >
>> > I'll keep this vote open to wait for more opinions.
>> >
>> > Thanks.
>> >
>> >
>> > On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan  wrote:
>> > +1 on the design and proposed API.
>> >
>> > One detail I'd like to discuss is the 0-parameter UDF, how we can
>> specify the size hint. This can be done in the PR review though.
>> >
>> > On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung  wrote:
>> > +1 on this, and I like the suggestion of specifying the type in string
>> form.
>> >
>> > Would it be correct to assume there will be a data type check, for
>> example that the returned pandas data frame column data types match what
>> is specified? We have seen quite a few issues/confusions with that in R.
>> >
>> > Would it make sense to have a more generic decorator name so that it
>> could also be usable for other efficient vectorized formats in the
>> future? Or do we anticipate the decorator will be format specific and
>> that we will have more in the future?
>> >
>> > From: Reynold Xin
>> > Sent: Friday, September 1, 2017 5:16:11 AM
>> > To: Takuya UESHIN
>> > Cc: spark-dev
>> > Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>> >
>> > Ok, thanks.
>> >
>> > +1 on the SPIP for scope etc
>> >
>> >
>> > On API details (will deal with in code reviews as well but leaving a
>> note here in case I forget)
>> >
>> > 1. I would suggest having the API also accept the data type
>> specification in string form. It is usually simpler to say "long" than
>> "LongType()" (see the sketch below).
>> >
>> > 2. Think about what error message to show when the row counts don't
>> match at runtime.
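
A sketch of what point 1 might look like, assuming "double" were accepted as
the string name for DoubleType() (the string type names here are an
assumption, not part of the proposal):

  @pandas_udf("double")   # hypothetical: type given as a string
  def plus(v1, v2):
      return v1 + v2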
>> >
>> >
>> > On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN  wrote:
>> > Yes, the aggregation is out of scope for now.
>> > I think we should continue discussing the aggregation on JIRA, and we
>> will add it later separately.
>> >
>> > Thanks.
>> >
>> >
>> > On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin  wrote:
>> > Is the idea that aggregates are out of scope for the current effort and
>> we will be adding those later?
>> >
>> > On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN  wrote:
>> > Hi all,
>> >
>> > We've been discussing support for vectorized UDFs in Python, and we
>> have almost reached a consensus on the APIs, so I'd like to summarize and
>> call for a vote.
>> >
>> > Note that this vote should focus on APIs for vectorized UDFs, not APIs
>> for vectorized UDAFs or Window operations.
>> >
>> > https://issues.apache.org/jira/browse/SPARK-21190
>> >
>> >
>> > Proposed API
>> >
>> > We introduce a @pandas_udf decorator (or annotation) to define
>> vectorized UDFs, which take one or more pandas.Series, or, for
>> 0-parameter UDFs, one integer value giving the length of the input. The
>> return value should be a pandas.Series of the specified type, and its
>> length should be the same as that of the input.
>> >
>> > We can define vectorized UDFs as:
>> >
>> >   @pandas_udf(DoubleType())
>> >   def plus(v1, v2):
>> >   return v1 + v2
>> >
>> > or we can define it as:
>> >
>> >   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>> >
>> > We can use it similarly to row-by-row UDFs:
>> >
>> >   df.withColumn('sum', plus(df.v1, df.v2))
>> >
>> > As for 0-parameter UDFs, we can define and use one as:
>> >
>> >   @pandas_udf(LongType())
>> >   def f0(size):
>> >   return pd.Series(1).repeat(size)
>> >
>> >   df.select(f0())
>> >
>> >
>> >
>> > The vote will be up for the next 72 hours. Please reply with your vote:
>> >
>> > +1: Yeah, let's go forward and implement the SPIP.
>> > +0: Don't really care.
>> > -1: I don't think this is a good idea because of the following
>> technical
>> reasons.
>> >
>> > Thanks!
>> >
>> > --
>> > Takuya UESHIN
>> > Tokyo, Japan
>> >
>> > http://twitter.com/ueshin
>> >

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-11 Thread Xiao Li
+1

Xiao
On Mon, 11 Sep 2017 at 6:44 PM Matei Zaharia 
wrote:

> +1 (binding)
>
> > On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon  wrote:
> >
> > +1 (non-binding)
> >
> >
> > 2017-09-12 9:52 GMT+09:00 Yin Huai :
> > +1
> >
> > On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal 
> wrote:
> > +1 (non-binding)
> >
> > On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler  wrote:
> > +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
> fine to work out the minor details of the API during review.
> >
> > Bryan
> >
> > On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN 
> wrote:
> > Hi all,
> >
> > Thank you for voting and suggestions.
> >
> > As Wenchen mentioned, and as we're also discussing on JIRA, we need to
> discuss the size hint for the 0-parameter UDF.
> > But since I believe we have a consensus about the basic APIs except for
> the size hint, I'd like to submit a PR based on the current proposal and
> continue the discussion in its review.
> >
> > https://github.com/apache/spark/pull/19147
> >
> > I'll keep this vote open to wait for more opinions.
> >
> > Thanks.
> >
> >
> > On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan  wrote:
> > +1 on the design and proposed API.
> >
> > One detail I'd like to discuss is the 0-parameter UDF, how we can
> specify the size hint. This can be done in the PR review though.
> >
> > On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung 
> wrote:
> > +1 on this, and I like the suggestion of specifying the type in string
> form.
> >
> > Would it be correct to assume there will be a data type check, for example
> that the returned pandas data frame column data types match what is
> specified? We have seen quite a few issues/confusions with that in R.
> >
> > Would it make sense to have a more generic decorator name so that it
> could also be usable for other efficient vectorized formats in the future?
> Or do we anticipate the decorator will be format specific and that we will
> have more in the future?
> >
> > From: Reynold Xin 
> > Sent: Friday, September 1, 2017 5:16:11 AM
> > To: Takuya UESHIN
> > Cc: spark-dev
> > Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
> >
> > Ok, thanks.
> >
> > +1 on the SPIP for scope etc
> >
> >
> > On API details (will deal with in code reviews as well but leaving a
> note here in case I forget)
> >
> > 1. I would suggest having the API also accept the data type specification
> in string form. It is usually simpler to say "long" than "LongType()".
> >
> > 2. Think about what error message to show when the row counts don't
> match at runtime.
> >
> >
> > On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN 
> wrote:
> > Yes, the aggregation is out of scope for now.
> > I think we should continue discussing the aggregation on JIRA, and we
> will add it later separately.
> >
> > Thanks.
> >
> >
> > On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin  wrote:
> > Is the idea that aggregates are out of scope for the current effort and
> we will be adding those later?
> >
> > On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN 
> wrote:
> > Hi all,
> >
> > We've been discussing support for vectorized UDFs in Python, and we have
> almost reached a consensus on the APIs, so I'd like to summarize and call
> for a vote.
> >
> > Note that this vote should focus on APIs for vectorized UDFs, not APIs
> for vectorized UDAFs or Window operations.
> >
> > https://issues.apache.org/jira/browse/SPARK-21190
> >
> >
> > Proposed API
> >
> > We introduce a @pandas_udf decorator (or annotation) to define
> vectorized UDFs, which take one or more pandas.Series, or, for 0-parameter
> UDFs, one integer value giving the length of the input. The return value
> should be a pandas.Series of the specified type, and its length should be
> the same as that of the input.
> >
> > We can define vectorized UDFs as:
> >
> >   @pandas_udf(DoubleType())
> >   def plus(v1, v2):
> >   return v1 + v2
> >
> > or we can define it as:
> >
> >   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
> >
> > We can use it similarly to row-by-row UDFs:
> >
> >   df.withColumn('sum', plus(df.v1, df.v2))
> >
> > As for 0-parameter UDFs, we can define and use one as:
> >
> >   @pandas_udf(LongType())
> >   def f0(size):
> >   return pd.Series(1).repeat(size)
> >
> >   df.select(f0())
> >
> >
> >
> > The vote will be up for the next 72 hours. Please reply with your vote:
> >
> > +1: Yeah, let's go forward and implement the SPIP.
> > +0: Don't really care.
> > -1: I don't think this is a good idea because of the following technical
> reasons.
> >
> > Thanks!
> >
> > --
> > Takuya UESHIN
> > Tokyo, Japan
> >
> > http://twitter.com/ueshin
> >
> >
> >
> > --
> > Takuya UESHIN
> > Tokyo, Japan
> >
> > http://twitter.com/ueshin
> >
> >
> >
> >
> > --
> > Takuya UESHIN
> > Tokyo, 

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-11 Thread Matei Zaharia
+1 (binding)

> On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon  wrote:
> 
> +1 (non-binding)
> 
> 
> 2017-09-12 9:52 GMT+09:00 Yin Huai :
> +1
> 
> On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal  wrote:
> +1 (non-binding)
> 
> On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler  wrote:
> +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's fine 
> to work out the minor details of the API during review.
> 
> Bryan
> 
> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN  wrote:
> Hi all,
> 
> Thank you for voting and suggestions.
> 
> As Wenchen mentioned, and as we're also discussing on JIRA, we need to
> discuss the size hint for the 0-parameter UDF.
> But since I believe we have a consensus about the basic APIs except for the
> size hint, I'd like to submit a PR based on the current proposal and
> continue the discussion in its review.
> 
> https://github.com/apache/spark/pull/19147
> 
> I'll keep this vote open to wait for more opinions.
> 
> Thanks.
> 
> 
> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan  wrote:
> +1 on the design and proposed API.
> 
> One detail I'd like to discuss is the 0-parameter UDF, how we can specify the 
> size hint. This can be done in the PR review though.
> 
> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung  
> wrote:
> +1 on this, and I like the suggestion of specifying the type in string form.
> 
> Would it be correct to assume there will be a data type check, for example
> that the returned pandas data frame column data types match what is
> specified? We have seen quite a few issues/confusions with that in R.
> 
> Would it make sense to have a more generic decorator name so that it could
> also be usable for other efficient vectorized formats in the future? Or do
> we anticipate the decorator will be format specific and that we will have
> more in the future?
> 
> From: Reynold Xin 
> Sent: Friday, September 1, 2017 5:16:11 AM
> To: Takuya UESHIN
> Cc: spark-dev
> Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>  
> Ok, thanks.
> 
> +1 on the SPIP for scope etc
> 
> 
> On API details (will deal with in code reviews as well but leaving a note 
> here in case I forget)
> 
> 1. I would suggest having the API also accept the data type specification in
> string form. It is usually simpler to say "long" than "LongType()".
> 
> 2. Think about what error message to show when the row counts don't match
> at runtime.
> 
> 
> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN  wrote:
> Yes, the aggregation is out of scope for now.
> I think we should continue discussing the aggregation on JIRA, and we will
> add it later separately.
> 
> Thanks.
> 
> 
> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin  wrote:
> Is the idea that aggregates are out of scope for the current effort and we
> will be adding those later?
> 
> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN  wrote:
> Hi all,
> 
> We've been discussing support for vectorized UDFs in Python, and we have
> almost reached a consensus on the APIs, so I'd like to summarize and call for a vote.
> 
> Note that this vote should focus on APIs for vectorized UDFs, not APIs for 
> vectorized UDAFs or Window operations.
> 
> https://issues.apache.org/jira/browse/SPARK-21190
> 
> 
> Proposed API
> 
> We introduce a @pandas_udf decorator (or annotation) to define vectorized
> UDFs, which take one or more pandas.Series, or, for 0-parameter UDFs, one
> integer value giving the length of the input. The return value should be a
> pandas.Series of the specified type, and its length should be the same as
> that of the input.
> 
> We can define vectorized UDFs as:
> 
>   @pandas_udf(DoubleType())
>   def plus(v1, v2):
>   return v1 + v2
> 
> or we can define it as:
> 
>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
> 
> We can use it similarly to row-by-row UDFs:
> 
>   df.withColumn('sum', plus(df.v1, df.v2))
> 
> As for 0-parameter UDFs, we can define and use one as:
> 
>   @pandas_udf(LongType())
>   def f0(size):
>   return pd.Series(1).repeat(size)
> 
>   df.select(f0())
> 
> 
> 
> The vote will be up for the next 72 hours. Please reply with your vote:
> 
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical 
> reasons.
> 
> Thanks!
> 
> -- 
> Takuya UESHIN
> Tokyo, Japan
> 
> http://twitter.com/ueshin
> 
> 
> 
> -- 
> Takuya UESHIN
> Tokyo, Japan
> 
> http://twitter.com/ueshin
> 
> 
> 
> 
> -- 
> Takuya UESHIN
> Tokyo, Japan
> 
> http://twitter.com/ueshin
> 
> 
> 
> 
> -- 
> Sameer Agarwal
> Software Engineer | Databricks Inc.
> http://cs.berkeley.edu/~sameerag
> 
> 





Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-11 Thread Hyukjin Kwon
+1 (non-binding)


2017-09-12 9:52 GMT+09:00 Yin Huai :

> +1
>
> On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal 
> wrote:
>
>> +1 (non-binding)
>>
>> On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler  wrote:
>>
>>> +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
>>> fine to work out the minor details of the API during review.
>>>
>>> Bryan
>>>
>>> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN 
>>> wrote:
>>>
 Hi all,

 Thank you for voting and suggestions.

 As Wenchen mentioned, and as we're also discussing on JIRA, we need to
 discuss the size hint for the 0-parameter UDF.
 But since I believe we have a consensus about the basic APIs except for
 the size hint, I'd like to submit a PR based on the current proposal and
 continue the discussion in its review.

 https://github.com/apache/spark/pull/19147

 I'll keep this vote open to wait for more opinions.

 Thanks.


 On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan 
 wrote:

> +1 on the design and proposed API.
>
> One detail I'd like to discuss is the 0-parameter UDF, how we can
> specify the size hint. This can be done in the PR review though.
>
> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <
> felixcheun...@hotmail.com> wrote:
>
>> +1 on this, and I like the suggestion of specifying the type in string
>> form.
>>
>> Would it be correct to assume there will be a data type check, for
>> example that the returned pandas data frame column data types match what
>> is specified? We have seen quite a few issues/confusions with that in R.
>>
>> Would it make sense to have a more generic decorator name so that it
>> could also be usable for other efficient vectorized formats in the
>> future? Or do we anticipate the decorator will be format specific and
>> that we will have more in the future?
>>
>> --
>> *From:* Reynold Xin 
>> *Sent:* Friday, September 1, 2017 5:16:11 AM
>> *To:* Takuya UESHIN
>> *Cc:* spark-dev
>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>>
>> Ok, thanks.
>>
>> +1 on the SPIP for scope etc
>>
>>
>> On API details (will deal with in code reviews as well but leaving a
>> note here in case I forget)
>>
>> 1. I would suggest having the API also accept the data type specification
>> in string form. It is usually simpler to say "long" than "LongType()".
>>
>> 2. Think about what error message to show when the row counts don't
>> match at runtime.
>>
>>
>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN 
>> wrote:
>>
>>> Yes, the aggregation is out of scope for now.
>>> I think we should continue discussing the aggregation on JIRA, and we
>>> will add it later separately.
>>>
>>> Thanks.
>>>
>>>
>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin 
>>> wrote:
>>>
 Is the idea that aggregates are out of scope for the current effort and
 we will be adding those later?

 On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <
 ues...@happy-camper.st> wrote:

> Hi all,
>
> We've been discussing support for vectorized UDFs in Python, and we
> have almost reached a consensus on the APIs, so I'd like to summarize
> and call for a vote.
>
> Note that this vote should focus on APIs for vectorized UDFs, not
> APIs for vectorized UDAFs or Window operations.
>
> https://issues.apache.org/jira/browse/SPARK-21190
>
>
> *Proposed API*
>
> We introduce a @pandas_udf decorator (or annotation) to define
> vectorized UDFs, which take one or more pandas.Series, or, for
> 0-parameter UDFs, one integer value giving the length of the input.
> The return value should be a pandas.Series of the specified type, and
> its length should be the same as that of the input.
>
> We can define vectorized UDFs as:
>
>   @pandas_udf(DoubleType())
>   def plus(v1, v2):
>   return v1 + v2
>
> or we can define it as:
>
>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>
> We can use it similarly to row-by-row UDFs:
>
>   df.withColumn('sum', plus(df.v1, df.v2))
>
> As for 0-parameter UDFs, we can define and use one as:
>
>   @pandas_udf(LongType())
>   def f0(size):
>   return pd.Series(1).repeat(size)
>
>   df.select(f0())
>
>
>
> The vote will be up for the 

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-11 Thread Yin Huai
+1

On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal 
wrote:

> +1 (non-binding)
>
> On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler  wrote:
>
>> +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
>> fine to work out the minor details of the API during review.
>>
>> Bryan
>>
>> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN 
>> wrote:
>>
>>> Hi all,
>>>
>>> Thank you for voting and suggestions.
>>>
>>> As Wenchen mentioned, and as we're also discussing on JIRA, we need to
>>> discuss the size hint for the 0-parameter UDF.
>>> But since I believe we have a consensus about the basic APIs except for
>>> the size hint, I'd like to submit a PR based on the current proposal and
>>> continue the discussion in its review.
>>>
>>> https://github.com/apache/spark/pull/19147
>>>
>>> I'll keep this vote open to wait for more opinions.
>>>
>>> Thanks.
>>>
>>>
>>> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan  wrote:
>>>
 +1 on the design and proposed API.

 One detail I'd like to discuss is the 0-parameter UDF, how we can
 specify the size hint. This can be done in the PR review though.

 On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung  wrote:

> +1 on this, and I like the suggestion of specifying the type in string
> form.
>
> Would it be correct to assume there will be a data type check, for
> example that the returned pandas data frame column data types match what
> is specified? We have seen quite a few issues/confusions with that in R.
>
> Would it make sense to have a more generic decorator name so that it
> could also be usable for other efficient vectorized formats in the future?
> Or do we anticipate the decorator will be format specific and that we will
> have more in the future?
>
> --
> *From:* Reynold Xin 
> *Sent:* Friday, September 1, 2017 5:16:11 AM
> *To:* Takuya UESHIN
> *Cc:* spark-dev
> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>
> Ok, thanks.
>
> +1 on the SPIP for scope etc
>
>
> On API details (will deal with in code reviews as well but leaving a
> note here in case I forget)
>
> 1. I would suggest having the API also accept the data type specification
> in string form. It is usually simpler to say "long" than "LongType()".
>
> 2. Think about what error message to show when the row counts don't
> match at runtime.
>
>
> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN 
> wrote:
>
>> Yes, the aggregation is out of scope for now.
>> I think we should continue discussing the aggregation on JIRA, and we
>> will add it later separately.
>>
>> Thanks.
>>
>>
>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin 
>> wrote:
>>
>>> Is the idea that aggregates are out of scope for the current effort and
>>> we will be adding those later?
>>>
>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN 
>>> wrote:
>>>
 Hi all,

 We've been discussing support for vectorized UDFs in Python, and we
 have almost reached a consensus on the APIs, so I'd like to summarize
 and call for a vote.

 Note that this vote should focus on APIs for vectorized UDFs, not
 APIs for vectorized UDAFs or Window operations.

 https://issues.apache.org/jira/browse/SPARK-21190


 *Proposed API*

 We introduce a @pandas_udf decorator (or annotation) to define
 vectorized UDFs, which take one or more pandas.Series, or, for
 0-parameter UDFs, one integer value giving the length of the input.
 The return value should be a pandas.Series of the specified type, and
 its length should be the same as that of the input.

 We can define vectorized UDFs as:

   @pandas_udf(DoubleType())
   def plus(v1, v2):
   return v1 + v2

 or we can define it as:

   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())

 We can use it similarly to row-by-row UDFs:

   df.withColumn('sum', plus(df.v1, df.v2))

 As for 0-parameter UDFs, we can define and use one as:

   @pandas_udf(LongType())
   def f0(size):
   return pd.Series(1).repeat(size)

   df.select(f0())



 The vote will be up for the next 72 hours. Please reply with your
 vote:

 +1: Yeah, let's go forward and implement the SPIP.
 +0: Don't really care.
 -1: I don't think this is a good idea because of the following 
 technical
 

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-11 Thread Sameer Agarwal
+1 (non-binding)

On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler  wrote:

> +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
> fine to work out the minor details of the API during review.
>
> Bryan
>
> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN 
> wrote:
>
>> Hi all,
>>
>> Thank you for voting and suggestions.
>>
>> As Wenchen mentioned, and as we're also discussing on JIRA, we need to
>> discuss the size hint for the 0-parameter UDF.
>> But since I believe we have a consensus about the basic APIs except for the
>> size hint, I'd like to submit a PR based on the current proposal and
>> continue the discussion in its review.
>>
>> https://github.com/apache/spark/pull/19147
>>
>> I'll keep this vote open to wait for more opinions.
>>
>> Thanks.
>>
>>
>> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan  wrote:
>>
>>> +1 on the design and proposed API.
>>>
>>> One detail I'd like to discuss is the 0-parameter UDF, how we can
>>> specify the size hint. This can be done in the PR review though.
>>>
>>> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung 
>>> wrote:
>>>
 +1 on this, and I like the suggestion of specifying the type in string form.

 Would it be correct to assume there will be a data type check, for
 example that the returned pandas data frame column data types match what
 is specified? We have seen quite a few issues/confusions with that in R.

 Would it make sense to have a more generic decorator name so that it
 could also be usable for other efficient vectorized formats in the future?
 Or do we anticipate the decorator will be format specific and that we will
 have more in the future?

 --
 *From:* Reynold Xin 
 *Sent:* Friday, September 1, 2017 5:16:11 AM
 *To:* Takuya UESHIN
 *Cc:* spark-dev
 *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

 Ok, thanks.

 +1 on the SPIP for scope etc


 On API details (will deal with in code reviews as well but leaving a
 note here in case I forget)

 1. I would suggest having the API also accept the data type specification
 in string form. It is usually simpler to say "long" than "LongType()".

 2. Think about what error message to show when the row counts don't
 match at runtime.


 On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN 
 wrote:

> Yes, the aggregation is out of scope for now.
> I think we should continue discussing the aggregation on JIRA, and we
> will add it later separately.
>
> Thanks.
>
>
> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin 
> wrote:
>
>> Is the idea that aggregates are out of scope for the current effort and
>> we will be adding those later?
>>
>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN 
>> wrote:
>>
>>> Hi all,
>>>
>>> We've been discussing support for vectorized UDFs in Python, and we
>>> have almost reached a consensus on the APIs, so I'd like to summarize and
>>> call for a vote.
>>>
>>> Note that this vote should focus on APIs for vectorized UDFs, not
>>> APIs for vectorized UDAFs or Window operations.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>
>>>
>>> *Proposed API*
>>>
>>> We introduce a @pandas_udf decorator (or annotation) to define
>>> vectorized UDFs, which take one or more pandas.Series, or, for
>>> 0-parameter UDFs, one integer value giving the length of the input.
>>> The return value should be a pandas.Series of the specified type, and
>>> its length should be the same as that of the input.
>>>
>>> We can define vectorized UDFs as:
>>>
>>>   @pandas_udf(DoubleType())
>>>   def plus(v1, v2):
>>>   return v1 + v2
>>>
>>> or we can define it as:
>>>
>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>
>>> We can use it similarly to row-by-row UDFs:
>>>
>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>
>>> As for 0-parameter UDFs, we can define and use one as:
>>>
>>>   @pandas_udf(LongType())
>>>   def f0(size):
>>>   return pd.Series(1).repeat(size)
>>>
>>>   df.select(f0())
>>>
>>>
>>>
>>> The vote will be up for the next 72 hours. Please reply with your
>>> vote:
>>>
>>> +1: Yeah, let's go forward and implement the SPIP.
>>> +0: Don't really care.
>>> -1: I don't think this is a good idea because of the following technical
>>> reasons.
>>>
>>> Thanks!
>>>
>>> --
>>> Takuya UESHIN
>>> Tokyo, Japan
>>>
>>> http://twitter.com/ueshin
>>>
>>
>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> 

Re: Easy way to get offset metatada with Spark Streaming API

2017-09-11 Thread Cody Koeninger
https://issues.apache.org/jira/browse/SPARK-18258

On Mon, Sep 11, 2017 at 7:15 AM, Dmitry Naumenko  wrote:
> Hi all,
>
> It started as a discussion in
> https://stackoverflow.com/questions/46153105/how-to-get-kafka-offsets-with-spark-structured-streaming-api.
>
> So the problem is that there is no support in the public API for obtaining
> the Kafka (or Kinesis) offsets. For example, if you want to save offsets in
> external storage in a custom Sink, you should:
> 1) preserve the topic, partition, and offset across all transform
> operations of the Dataset (based on the hard-coded Kafka schema)
> 2) do a manual group-by on topic/partition with a max aggregate over the
> offset
>
> The Structured Streaming doc says "Every streaming source is assumed to
> have offsets", so why is it not part of the public API? What do you think
> about supporting it?
>
> Dmitry




Easy way to get offset metatada with Spark Streaming API

2017-09-11 Thread Dmitry Naumenko
Hi all,

It started as a discussion in
https://stackoverflow.com/questions/46153105/how-to-get-kafka-offsets-with-spark-structured-streaming-api
.

So the problem is that there is no support in the public API for obtaining
the Kafka (or Kinesis) offsets. For example, if you want to save offsets in
external storage in a custom Sink, you should:
1) preserve the topic, partition, and offset across all transform operations
of the Dataset (based on the hard-coded Kafka schema)
2) do a manual group-by on topic/partition with a max aggregate over the
offset (see the sketch below)
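
A minimal sketch of this workaround in PySpark (assuming "spark" is the
active SparkSession; the Kafka options are placeholders):

  from pyspark.sql import functions as F

  df = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:9092")
        .option("subscribe", "events")
        .load())

  # 1) carry the Kafka metadata columns through every transformation
  parsed = df.select("topic", "partition", "offset",
                     F.col("value").cast("string").alias("value"))

  # 2) manually recover the max offset per topic/partition before saving
  offsets = (parsed.groupBy("topic", "partition")
                   .agg(F.max("offset").alias("maxOffset")))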

The Structured Streaming doc says "Every streaming source is assumed to have
offsets", so why is it not part of the public API? What do you think about
supporting it?

Dmitry


[SPARK-20199][ML] : Provided featureSubsetStrategy to GBTClassifier and GBTRegressor

2017-09-11 Thread Pralabh Kumar
Hi Developers

Can somebody look into this pull request? It's being reviewed by MLnick,
sethah, and mpjlu.

Please review it.


Regards
Pralabh Kumar


Re: Putting Kafka 0.8 behind an (opt-in) profile

2017-09-11 Thread Sean Owen
Pull request is ready to go: https://github.com/apache/spark/pull/19134

I flag it one more time because it means Kafka 0.8 is deprecated in 2.3.0
and because it will require -Pkafka-0-8 to build in the support now.
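
For downstream builds, enabling it would then look something like this
(assuming the standard Maven build; the SBT build accepts the same profile
flag):

  ./build/mvn -Pkafka-0-8 -DskipTests package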

Pardon, I want to be sure: does this mean PySpark effectively has no
non-deprecated Kafka support now?

On Thu, Sep 7, 2017 at 10:32 AM Sean Owen  wrote:

> For those following along, see discussions at
> https://github.com/apache/spark/pull/19134
>
> It's now also clear that we'd need to remove Kafka 0.8 examples if Kafka
> 0.8 becomes optional. I think that's all reasonable but the change is
> growing beyond just putting it behind a profile.
>
> On Wed, Sep 6, 2017 at 3:00 PM Cody Koeninger  wrote:
>
>> I kind of doubt the kafka 0.10 integration is going to change much at
>> all before the upgrade to 0.11
>>
>> On Wed, Sep 6, 2017 at 8:57 AM, Sean Owen  wrote:
>> > Thanks, I can do that. We're then in the funny position of having one
>> > deprecated Kafka API, and one experimental one.
>> >
>> > Is the Kafka 0.10 integration as stable as it is going to be, and worth
>> > marking as such for 2.3.0?
>> >
>> >
>> >> On Tue, Sep 5, 2017 at 4:12 PM Cody Koeninger  wrote:
>> >>
>> >> +1 to going ahead and giving a deprecation warning now
>> >>
>> >> On Tue, Sep 5, 2017 at 6:39 AM, Sean Owen  wrote:
>> >> > On the road to Scala 2.12, we'll need to make Kafka 0.8 support
>> >> > optional in the build, because it is not available for Scala 2.12.
>> >> >
>> >> > https://github.com/apache/spark/pull/19134  adds that profile. I
>> >> > mention it because this means that Kafka 0.8 becomes "opt-in" and
>> >> > has to be explicitly enabled, and that may have implications for
>> >> > downstream builds.
>> >> >
>> >> > Yes, we can add <activeByDefault>true</activeByDefault>. It however
>> >> > only has effect when no other profiles are set, which makes it more
>> >> > deceptive than useful IMHO. (We don't use it otherwise.)
>> >> >
>> >> > Reviewers may want to check my work, especially as regards the
>> >> > Python test support and SBT build.
>> >> >
>> >> >
>> >> > Another related question is: when is 0.8 support deprecated,
>> >> > removed? It seems sudden to remove it in 2.3.0. Maybe deprecation is
>> >> > in order. The driver is that Kafka 0.11 and 1.0 will possibly
>> >> > require yet another variant of streaming support (not sure yet), and
>> >> > 3 versions is too many. Deprecating now opens more options sooner.
>>
>


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-11 Thread Wenchen Fan
This vote passes with 4 binding +1 votes, 10 non-binding +1 votes, one +0
vote, and no -1 votes.

Thanks all!

+1 votes (binding):
Wenchen Fan
Herman van Hövell tot Westerflier
Michael Armbrust
Reynold Xin


+1 votes (non-binding):
Xiao Li
Sameer Agarwal
Suresh Thalamati
Ryan Blue
Xingbo Jiang
Dongjoon Hyun
Zhenhua Wang
Noman Khan
vaquar khan
Hemant Bhanawat

+0 votes:
Andrew Ash

On Mon, Sep 11, 2017 at 4:03 PM, Wenchen Fan  wrote:

> yea, join push down (providing the other reader and join conditions) and
> aggregate push down (providing grouping keys and aggregate functions) can
> be added via the current framework in the future.
>
> On Mon, Sep 11, 2017 at 1:54 PM, Hemant Bhanawat 
> wrote:
>
>> +1 (non-binding)
>>
>> I have found the suggestion from Andrew Ash and James about plan push
>> down quite interesting. However, I am not clear about the join push-down
>> support at the data source level. Shouldn't it be the responsibility of the
>> join node to carry out a data source specific join? I mean, the join node
>> and the data source scans of the two sides can be coalesced into a single
>> node (theoretically). This can be done by providing a Strategy that replaces
>> the join node with a data source specific join node. We are doing it that
>> way for our data sources. I find this more intuitive.
>>
>> BTW, aggregate push-down support is desirable and should be considered as
>> an enhancement going forward.
>>
>> Hemant Bhanawat 
>> www.snappydata.io
>>
>> On Sun, Sep 10, 2017 at 8:45 PM, vaquar khan 
>> wrote:
>>
>>> +1
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Sep 10, 2017 5:18 AM, "Noman Khan"  wrote:
>>>
 +1
 --
 *From:* wangzhenhua (G) 
 *Sent:* Friday, September 8, 2017 2:20:07 AM
 *To:* Dongjoon Hyun; 蒋星博
 *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot
 Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
 *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path


 +1 (non-binding). Great to see the data source API is going to be improved!



 best regards,

 -Zhenhua(Xander)



 *From:* Dongjoon Hyun [mailto:dongjoon.h...@gmail.com]
 *Sent:* September 8, 2017, 4:07
 *To:* 蒋星博
 *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot
 Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
 *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path



 +1 (non-binding).



 On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博  wrote:

 +1





 Reynold Xin wrote on Thu, Sep 7, 2017 at 12:04 PM:

 +1 as well



 On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust <
 mich...@databricks.com> wrote:

 +1



 On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue 
 wrote:

 +1 (non-binding)

 Thanks for making the updates reflected in the current PR. It would be
 great to see the doc updated before it is finally published though.

 Right now it feels like this SPIP is focused more on getting the basics
 right for what many datasources are already doing in API V1 combined with
 other private APIs, vs pushing forward state of the art for performance.

 I think that’s the right approach for this SPIP. We can add the support
 you’re talking about later with a more specific plan that doesn’t block
 fixing the problems that this addresses.

 ​



 On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
 hvanhov...@databricks.com> wrote:

 +1 (binding)



 I personally believe that there is quite a big difference between
 having a generic data source interface with a low surface area and pushing
 down a significant part of query processing into a datasource. The latter
 has a much wider surface area and will require us to stabilize most of
 the internal catalyst APIs, which will be a significant burden on the
 community to maintain and has the potential to slow development velocity
 significantly. If you want to write such integrations then you should be
 prepared to work with catalyst internals and own up to the fact that
 things might change across minor versions (and in some cases even
 maintenance releases). If you are willing to go down that road, then your
 best bet is to use the already existing spark session extensions which
 will allow you to write such integrations and can be used as an `escape
 hatch`.





 On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash 
 wrote:

 +0 

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-11 Thread Wenchen Fan
yea, join push down (providing the other reader and join conditions) and
aggregate push down (providing grouping keys and aggregate functions) can
be added via the current framework in the future.

On Mon, Sep 11, 2017 at 1:54 PM, Hemant Bhanawat 
wrote:

> +1 (non-binding)
>
> I have found the suggestion from Andrew Ash and James about plan push down
> quite interesting. However, I am not clear about the join push-down support
> at the data source level. Shouldn't it be the responsibility of the join
> node to carry out a data source specific join? I mean, the join node and
> the data source scans of the two sides can be coalesced into a single node
> (theoretically). This can be done by providing a Strategy that replaces the
> join node with a data source specific join node. We are doing it that way
> for our data sources. I find this more intuitive.
>
> BTW, aggregate push-down support is desirable and should be considered as
> an enhancement going forward.
>
> Hemant Bhanawat 
> www.snappydata.io
>
> On Sun, Sep 10, 2017 at 8:45 PM, vaquar khan 
> wrote:
>
>> +1
>>
>> Regards,
>> Vaquar khan
>>
>> On Sep 10, 2017 5:18 AM, "Noman Khan"  wrote:
>>
>>> +1
>>> --
>>> *From:* wangzhenhua (G) 
>>> *Sent:* Friday, September 8, 2017 2:20:07 AM
>>> *To:* Dongjoon Hyun; 蒋星博
>>> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot
>>> Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
>>> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>>>
>>>
>>> +1 (non-binding). Great to see the data source API is going to be improved!
>>>
>>>
>>>
>>> best regards,
>>>
>>> -Zhenhua(Xander)
>>>
>>>
>>>
>>> *From:* Dongjoon Hyun [mailto:dongjoon.h...@gmail.com]
>>> *Sent:* September 8, 2017, 4:07
>>> *To:* 蒋星博
>>> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot
>>> Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
>>> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>>>
>>>
>>>
>>> +1 (non-binding).
>>>
>>>
>>>
>>> On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博  wrote:
>>>
>>> +1
>>>
>>>
>>>
>>>
>>>
>>> Reynold Xin wrote on Thu, Sep 7, 2017 at 12:04 PM:
>>>
>>> +1 as well
>>>
>>>
>>>
>>> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust 
>>> wrote:
>>>
>>> +1
>>>
>>>
>>>
>>> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue 
>>> wrote:
>>>
>>> +1 (non-binding)
>>>
>>> Thanks for making the updates reflected in the current PR. It would be
>>> great to see the doc updated before it is finally published though.
>>>
>>> Right now it feels like this SPIP is focused more on getting the basics
>>> right for what many datasources are already doing in API V1 combined with
>>> other private APIs, vs pushing forward state of the art for performance.
>>>
>>> I think that’s the right approach for this SPIP. We can add the support
>>> you’re talking about later with a more specific plan that doesn’t block
>>> fixing the problems that this addresses.
>>>
>>> ​
>>>
>>>
>>>
>>> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
>>> hvanhov...@databricks.com> wrote:
>>>
>>> +1 (binding)
>>>
>>>
>>>
>>> I personally believe that there is quite a big difference between having
>>> a generic data source interface with a low surface area and pushing down a
>>> significant part of query processing into a datasource. The latter has a
>>> much wider surface area and will require us to stabilize most of the
>>> internal catalyst APIs, which will be a significant burden on the community
>>> to maintain and has the potential to slow development velocity
>>> significantly. If you want to write such integrations then you should be
>>> prepared to work with catalyst internals and own up to the fact that things
>>> might change across minor versions (and in some cases even maintenance
>>> releases). If you are willing to go down that road, then your best bet is
>>> to use the already existing spark session extensions which will allow you
>>> to write such integrations and can be used as an `escape hatch`.
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash 
>>> wrote:
>>>
>>> +0 (non-binding)
>>>
>>>
>>>
>>> I think there are benefits to unifying all the Spark-internal
>>> datasources into a common public API for sure.  It will serve as a forcing
>>> function to ensure that those internal datasources aren't advantaged vs
>>> datasources developed externally as plugins to Spark, and that all Spark
>>> features are available to all datasources.
>>>
>>>
>>>
>>> But I also think this read-path proposal avoids the more difficult
>>> questions around how to continue pushing datasource performance forwards.
>>> James Baker (my colleague) had a number of questions about advanced
>>> 

Re: 2.1.2 maintenance release?

2017-09-11 Thread Holden Karau
So I think the consensus is that there is interest in having a few
maintenance releases. I'm happy to act as the RM. I think the next step is
seeing who the PMC wants as the RM for these (and if people are OK with me,
I'll start updating myself on the docs, open JIRAs, and relevant Jenkins
jobs for packaging).

On Sun, Sep 10, 2017 at 11:31 PM, Felix Cheung 
wrote:

> Hi - what are the next steps?
> Pending changes are pushed, and I've checked that there are no open JIRAs
> targeting 2.1.2 or 2.2.1.
>
> _
> From: Reynold Xin 
> Sent: Friday, September 8, 2017 9:27 AM
> Subject: Re: 2.1.2 maintenance release?
> To: Felix Cheung , Holden Karau <
> hol...@pigscanfly.ca>, Sean Owen , dev <
> dev@spark.apache.org>
>
>
>
> +1 as well. We should make a few maintenance releases.
>
> On Fri, Sep 8, 2017 at 6:46 PM Felix Cheung 
> wrote:
>
>> +1 on both 2.1.2 and 2.2.1
>>
>> And would try to help and/or wrangle the release if needed.
>>
>> (Note: trying to backport a few changes to branch-2.1 right now)
>>
>> --
>> *From:* Sean Owen 
>> *Sent:* Friday, September 8, 2017 12:05:28 AM
>> *To:* Holden Karau; dev
>> *Subject:* Re: 2.1.2 maintenance release?
>>
>> Let's look at the standard ASF guidance, which actually surprised me when
>> I first read it:
>>
>> https://www.apache.org/foundation/voting.html
>>
>> VOTES ON PACKAGE RELEASES
>> Votes on whether a package is ready to be released use majority approval
>> -- i.e. at least three PMC members must vote affirmatively for release, and
>> there must be more positive than negative votes. Releases may not be
>> vetoed. Generally the community will cancel the release vote if anyone
>> identifies serious problems, but in most cases the ultimate decision lies
>> with the individual serving as release manager. The specifics of the
>> process may vary from project to project, but the 'minimum quorum of three
>> +1 votes' rule is universal.
>>
>>
>> The PMC votes on it, but no vetoes are allowed, and the release manager
>> makes the final call. Not your usual vote! It doesn't say the release
>> manager has to be part of the PMC, though it's the role with the most
>> decision power. In practice I can't imagine it's a problem, but we could
>> also just have someone on the PMC technically be the release manager even
>> as someone else is really operating the release.
>>
>> The goal is, really, to be able to put out maintenance releases with
>> important fixes. Secondly, to ramp up one or more additional people to
>> perform the release steps. Maintenance releases ought to be the least
>> controversial releases to decide.
>>
>> Thoughts on kicking off a release for 2.1.2 to see how it goes?
>>
>> Although someone can just start following the steps, I think it will
>> certainly require some help from Michael, who's run the last release, to
>> clarify parts of the process or possibly provide an essential credential to
>> upload artifacts.
>>
>>
>> On Thu, Sep 7, 2017 at 11:59 PM Holden Karau 
>> wrote:
>>
>>> I'd be happy to manage the 2.1.2 maintenance release (and 2.2.1 after
>>> that) if people are ok with a committer / me running the release process
>>> rather than a full PMC member.
>>>
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: 2.1.2 maintenance release?

2017-09-11 Thread Felix Cheung
Hi - what are the next steps?
Pending changes are pushed, and I've checked that there are no open JIRAs
targeting 2.1.2 or 2.2.1.

_
From: Reynold Xin
Sent: Friday, September 8, 2017 9:27 AM
Subject: Re: 2.1.2 maintenance release?
To: Felix Cheung, Holden Karau, Sean Owen, dev <dev@spark.apache.org>


+1 as well. We should make a few maintenance releases.

On Fri, Sep 8, 2017 at 6:46 PM Felix Cheung wrote:
+1 on both 2.1.2 and 2.2.1

And would try to help and/or wrangle the release if needed.

(Note: trying to backport a few changes to branch-2.1 right now)


From: Sean Owen
Sent: Friday, September 8, 2017 12:05:28 AM
To: Holden Karau; dev
Subject: Re: 2.1.2 maintenance release?

Let's look at the standard ASF guidance, which actually surprised me when I 
first read it:

https://www.apache.org/foundation/voting.html

VOTES ON PACKAGE RELEASES
Votes on whether a package is ready to be released use majority approval -- 
i.e. at least three PMC members must vote affirmatively for release, and there 
must be more positive than negative votes. Releases may not be vetoed. 
Generally the community will cancel the release vote if anyone identifies 
serious problems, but in most cases the ultimate decision lies with the 
individual serving as release manager. The specifics of the process may vary 
from project to project, but the 'minimum quorum of three +1 votes' rule is 
universal.


The PMC votes on it, but no vetoes are allowed, and the release manager makes
the final call. Not your usual vote! It doesn't say the release manager has to
be part of the PMC, though it's the role with the most decision power. In
practice I can't imagine it's a problem, but we could also just have someone
on the PMC technically be the release manager even as someone else is really
operating the release.

The goal is, really, to be able to put out maintenance releases with important 
fixes. Secondly, to ramp up one or more additional people to perform the 
release steps. Maintenance releases ought to be the least controversial 
releases to decide.

Thoughts on kicking off a release for 2.1.2 to see how it goes?

Although someone can just start following the steps, I think it will certainly 
require some help from Michael, who's run the last release, to clarify parts of 
the process or possibly provide an essential credential to upload artifacts.


On Thu, Sep 7, 2017 at 11:59 PM Holden Karau wrote:
I'd be happy to manage the 2.1.2 maintenance release (and 2.2.1 after that) if 
people are ok with a committer / me running the release process rather than a 
full PMC member.




Re: Supporting Apache Aurora as a cluster manager

2017-09-11 Thread Mark Hamstra
While it may be worth creating the design doc and JIRA ticket so that we at
least have a better idea and a record of what you are talking about, I kind
of doubt that we are going to want to merge this into the Spark codebase.
That's not because of anything specific to this Aurora effort, but rather
because scheduler implementations in general are not going in the preferred
direction. There is already some regret that the YARN scheduler wasn't
implemented by means of a scheduler plug-in API, and there is likely to be
more regret if we continue to go forward with the spark-on-kubernetes SPIP
in its present form. I'd guess that we are likely to merge code associated
with that SPIP just because Kubernetes has become such an important
resource scheduler, but such a merge wouldn't be without some misgivings.
That is because we just can't get into the position of having more and more
scheduler implementations in the Spark code, and more and more maintenance
overhead to keep up with the idiosyncrasies of all the scheduler
implementations. We've really got to get to the kind of plug-in
architecture discussed in SPARK-19700 so that scheduler implementations can
be done outside of the Spark codebase, release schedule, etc.

My opinion on the subject isn't dispositive on its own, of course, but that
is how I'm seeing things right now.

On Sun, Sep 10, 2017 at 8:27 PM, karthik padmanabhan  wrote:

> Hi Spark Devs,
>
> We are using Aurora (http://aurora.apache.org/) as our Mesos framework
> for running stateless services. We would like to use Aurora to deploy big
> data and batch workloads as well. For this we have forked Spark and
> implemented the ExternalClusterManager trait.
>
> The reason for doing this, rather than running Spark on Mesos directly, is
> to leverage the existing roles and quotas provided by Aurora for admission
> control, and also to leverage Aurora features such as priority and
> preemption. Additionally, we would like Aurora to be the only
> deployment/orchestration system that our users interact with.
>
> We have a working POC where Spark launches jobs through Aurora as the
> cluster manager. Is this something that can be merged upstream? If so, I
> can create a design document and an associated JIRA ticket.
>
> Thanks
> Karthik
>