Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-29 Thread Mridul Muralidharan
Add a +1 from me as well.
Just managed to finish going over it.

Thanks, Bobby, for leading this effort!

Regards,
Mridul

On Wed, May 29, 2019 at 2:51 PM Tom Graves  wrote:
>
> Ok, I'm going to call this vote and send the result email. We had 9 +1's (4 
> binding) and 1 +0 and no -1's.
>
> Tom
>
> On Monday, May 27, 2019, 3:25:14 PM CDT, Felix Cheung 
>  wrote:
>
>
> +1
>
> I’d prefer to see more of the end goal and how that could be achieved (such 
> as ETL or SPARK-24579). However given the rounds and months of discussions we 
> have come down to just the public API.
>
> If the community thinks a new set of public API is maintainable, I don’t see 
> any problem with that.
>
> 
> From: Tom Graves 
> Sent: Sunday, May 26, 2019 8:22:59 AM
> To: hol...@pigscanfly.ca; Reynold Xin
> Cc: Bobby Evans; DB Tsai; Dongjoon Hyun; Imran Rashid; Jason Lowe; Matei 
> Zaharia; Thomas graves; Xiangrui Meng; Xiangrui Meng; dev
> Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
> Processing Support
>
> More feedback would be great, this has been open a long time though, let's 
> extend til Wednesday the 29th and see where we are at.
>
> Tom
>
>
>
> Sent from Yahoo Mail on Android
>
> On Sat, May 25, 2019 at 6:28 PM, Holden Karau
>  wrote:
> Same, I meant to catch up after KubeCon but had some unexpected travels.
>
> On Sat, May 25, 2019 at 10:56 PM Reynold Xin  wrote:
>
> Can we push this to June 1st? I have been meaning to read it but 
> unfortunately keep traveling...
>
> On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun  wrote:
>
> +1
>
> Thanks,
> Dongjoon.
>
> On Fri, May 24, 2019 at 17:03 DB Tsai  wrote:
>
> +1 on exposing the APIs for columnar processing support.
>
> I understand that the scope of this SPIP doesn't cover AI / ML
> use-cases. But I saw a good performance gain when I converted data
> from rows to columns to leverage SIMD architectures in a POC ML
> application.
>
> With the exposed columnar processing support, I can imagine that the
> heavy lifting parts of ML applications (such as computing the
> objective functions) can be written as columnar expressions that
> leverage SIMD architectures to get a good speedup.
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 42E5B25A8F7A82C1
>
> On Wed, May 15, 2019 at 2:59 PM Bobby Evans  wrote:
> >
> > It would allow for the columnar processing to be extended through the 
> > shuffle.  So if I were doing say an FPGA accelerated extension it could 
> replace the ShuffleExchangeExec with one that can take a ColumnarBatch as 
> > input instead of a Row. The extended version of the ShuffleExchangeExec 
> > could then do the partitioning on the incoming batch and instead of 
> > producing a ShuffleRowRDD for the exchange they could produce something 
> > like a ShuffleBatchRDD that would let the serializing and deserializing 
> > happen in a column based format for a faster exchange, assuming that 
> > columnar processing is also happening after the exchange. This is just like 
> > providing a columnar version of any other catalyst operator, except in this 
> > case it is a bit more complex of an operator.
> >
> > On Wed, May 15, 2019 at 12:15 PM Imran Rashid 
> >  wrote:
> >>
> >> sorry I am late to the discussion here -- the jira mentions using this 
> >> extension for dealing with shuffles, can you explain that part?  I don't 
> >> see how you would use this to change shuffle behavior at all.
> >>
> >> On Tue, May 14, 2019 at 10:59 AM Thomas graves  wrote:
> >>>
> >>> Thanks for replying, I'll extend the vote til May 26th to allow feedback
> >>> from you and other people who haven't had time to look at it.
> >>>
> >>> Tom
> >>>
> >>> On Mon, May 13, 2019 at 4:43 PM Holden Karau  wrote:
> >>> >
> >>> > I’d like to ask this vote period to be extended, I’m interested but I 
> >>> > don’t have the cycles to review it in detail and make an informed vote 
> >>> > until the 25th.
> >>> >
> >>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng  
> >>> > wrote:
> >>> >>
> >>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't 
> >>> >> feel strongly about it. I would still suggest doing the following:
> >>> >>
> >>> >> 1. Link the POC mentioned in Q4. So people can verify the POC result.

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-29 Thread Tom Graves
 Ok, I'm going to call this vote and send the result email. We had 9 +1's (4 
binding) and 1 +0 and no -1's.
Tom
On Monday, May 27, 2019, 3:25:14 PM CDT, Felix Cheung 
 wrote:  
 
+1
I’d prefer to see more of the end goal and how that could be achieved (such as 
ETL or SPARK-24579). However given the rounds and months of discussions we have 
come down to just the public API.
If the community thinks a new set of public API is maintainable, I don’t see 
any problem with that.
From: Tom Graves 
Sent: Sunday, May 26, 2019 8:22:59 AM
To: hol...@pigscanfly.ca; Reynold Xin
Cc: Bobby Evans; DB Tsai; Dongjoon Hyun; Imran Rashid; Jason Lowe; Matei 
Zaharia; Thomas graves; Xiangrui Meng; Xiangrui Meng; dev
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
Processing Support

More feedback would be great, this has been open a long time 
though, let's extend til Wednesday the 29th and see where we are at.
Tom



Sent from Yahoo Mail on Android

On Sat, May 25, 2019 at 6:28 PM, Holden Karau wrote:
Same, I meant to catch up after KubeCon but had some unexpected travels.
On Sat, May 25, 2019 at 10:56 PM Reynold Xin  wrote:

Can we push this to June 1st? I have been meaning to read it but unfortunately 
keep traveling...
On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun  wrote:

+1
Thanks,
Dongjoon.
On Fri, May 24, 2019 at 17:03 DB Tsai  wrote:

+1 on exposing the APIs for columnar processing support.

I understand that the scope of this SPIP doesn't cover AI / ML
use-cases. But I saw a good performance gain when I converted data
from rows to columns to leverage SIMD architectures in a POC ML
application.

With the exposed columnar processing support, I can imagine that the
heavy lifting parts of ML applications (such as computing the
objective functions) can be written as columnar expressions that
leverage SIMD architectures to get a good speedup.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Wed, May 15, 2019 at 2:59 PM Bobby Evans  wrote:
>
> It would allow for the columnar processing to be extended through the 
> shuffle.  So if I were doing say an FPGA accelerated extension it could 
> replace the ShuffleExchangeExec with one that can take a ColumnarBatch as 
> input instead of a Row. The extended version of the ShuffleExchangeExec could 
> then do the partitioning on the incoming batch and instead of producing a 
> ShuffleRowRDD for the exchange they could produce something like a 
> ShuffleBatchRDD that would let the serializing and deserializing happen in a 
> column based format for a faster exchange, assuming that columnar processing 
> is also happening after the exchange. This is just like providing a columnar 
> version of any other catalyst operator, except in this case it is a bit more 
> complex of an operator.
>
> On Wed, May 15, 2019 at 12:15 PM Imran Rashid  
> wrote:
>>
>> sorry I am late to the discussion here -- the jira mentions using this 
>> extension for dealing with shuffles, can you explain that part?  I don't 
>> see how you would use this to change shuffle behavior at all.
>>
>> On Tue, May 14, 2019 at 10:59 AM Thomas graves  wrote:
>>>
>>> Thanks for replying, I'll extend the vote til May 26th to allow feedback
>>> from you and other people who haven't had time to look at it.
>>>
>>> Tom
>>>
>>> On Mon, May 13, 2019 at 4:43 PM Holden Karau  wrote:
>>> >
>>> > I’d like to ask this vote period to be extended, I’m interested but I 
>>> > don’t have the cycles to review it in detail and make an informed vote 
>>> > until the 25th.
>>>

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-27 Thread Felix Cheung
+1

I’d prefer to see more of the end goal and how that could be achieved (such as 
ETL or SPARK-24579). However given the rounds and months of discussions we have 
come down to just the public API.

If the community thinks a new set of public API is maintainable, I don’t see 
any problem with that.


From: Tom Graves 
Sent: Sunday, May 26, 2019 8:22:59 AM
To: hol...@pigscanfly.ca; Reynold Xin
Cc: Bobby Evans; DB Tsai; Dongjoon Hyun; Imran Rashid; Jason Lowe; Matei 
Zaharia; Thomas graves; Xiangrui Meng; Xiangrui Meng; dev
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
Processing Support

More feedback would be great, this has been open a long time though, let's 
extend til Wednesday the 29th and see where we are at.

Tom



Sent from Yahoo Mail on Android

On Sat, May 25, 2019 at 6:28 PM, Holden Karau
 wrote:
Same, I meant to catch up after KubeCon but had some unexpected travels.

On Sat, May 25, 2019 at 10:56 PM Reynold Xin  wrote:
Can we push this to June 1st? I have been meaning to read it but unfortunately 
keep traveling...

On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun  wrote:
+1

Thanks,
Dongjoon.

On Fri, May 24, 2019 at 17:03 DB Tsai  wrote:
+1 on exposing the APIs for columnar processing support.

I understand that the scope of this SPIP doesn't cover AI / ML
use-cases. But I saw a good performance gain when I converted data
from rows to columns to leverage SIMD architectures in a POC ML
application.

With the exposed columnar processing support, I can imagine that the
heavy lifting parts of ML applications (such as computing the
objective functions) can be written as columnar expressions that
leverage SIMD architectures to get a good speedup.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Wed, May 15, 2019 at 2:59 PM Bobby Evans  wrote:
>
> It would allow for the columnar processing to be extended through the 
> shuffle.  So if I were doing say an FPGA accelerated extension it could 
> replace the ShuffleExchangeExec with one that can take a ColumnarBatch as 
> input instead of a Row. The extended version of the ShuffleExchangeExec could 
> then do the partitioning on the incoming batch and instead of producing a 
> ShuffleRowRDD for the exchange they could produce something like a 
> ShuffleBatchRDD that would let the serializing and deserializing happen in a 
> column based format for a faster exchange, assuming that columnar processing 
> is also happening after the exchange. This is just like providing a columnar 
> version of any other catalyst operator, except in this case it is a bit more 
> complex of an operator.
>
> On Wed, May 15, 2019 at 12:15 PM Imran Rashid  
> wrote:
>>
>> sorry I am late to the discussion here -- the jira mentions using this 
>> extension for dealing with shuffles, can you explain that part?  I don't 
>> see how you would use this to change shuffle behavior at all.
>>
>> On Tue, May 14, 2019 at 10:59 AM Thomas graves  wrote:
>>>
>>> Thanks for replying, I'll extend the vote til May 26th to allow feedback
>>> from you and other people who haven't had time to look at it.
>>>
>>> Tom
>>>
>>> On Mon, May 13, 2019 at 4:43 PM Holden Karau  wrote:
>>> >
>>> > I’d like to ask this vote period to be extended, I’m interested but I 
>>> > don’t have the cycles to review it in detail and make an informed vote 
>>> > until the 25th.
>>> >
>>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng  wrote:
>>> >>
>>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't 
>>> >> feel strongly about it. I would still suggest doing the following:
>>> >>
>>> >> 1. Link the POC mentioned in Q4. So people can verify the POC result.
>>> >> 2. List public APIs we plan to expose in Appendix A. I did a quick 
> >>> >> check. Besides ColumnarBatch and ColumnarVector, we also need to make the 
> >>> >> following public. People who are familiar with SQL internals should help 
> >>> >> assess the risk.
> >>> >> * ColumnarArray
> >>> >> * ColumnarMap
> >>> >> * unsafe.types.CalendarInterval
>>> >> * ColumnarRow
>

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-25 Thread Dongjoon Hyun
+1

Thanks,
Dongjoon.

On Fri, May 24, 2019 at 17:03 DB Tsai  wrote:

> +1 on exposing the APIs for columnar processing support.
>
> I understand that the scope of this SPIP doesn't cover AI / ML
> use-cases. But I saw a good performance gain when I converted data
> from rows to columns to leverage SIMD architectures in a POC ML
> application.
>
> With the exposed columnar processing support, I can imagine that the
> heavy lifting parts of ML applications (such as computing the
> objective functions) can be written as columnar expressions that
> leverage SIMD architectures to get a good speedup.
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 42E5B25A8F7A82C1
>
> On Wed, May 15, 2019 at 2:59 PM Bobby Evans  wrote:
> >
> > It would allow for the columnar processing to be extended through the
> shuffle.  So if I were doing say an FPGA accelerated extension it could
> replace the ShuffleExchangeExec with one that can take a ColumnarBatch as
> input instead of a Row. The extended version of the ShuffleExchangeExec
> could then do the partitioning on the incoming batch and instead of
> producing a ShuffleRowRDD for the exchange they could produce something
> like a ShuffleBatchRDD that would let the serializing and deserializing
> happen in a column based format for a faster exchange, assuming that
> columnar processing is also happening after the exchange. This is just like
> providing a columnar version of any other catalyst operator, except in this
> case it is a bit more complex of an operator.
> >
> > On Wed, May 15, 2019 at 12:15 PM Imran Rashid
>  wrote:
> >>
> >> sorry I am late to the discussion here -- the jira mentions using this
> extension for dealing with shuffles, can you explain that part?  I don't
> see how you would use this to change shuffle behavior at all.
> >>
> >> On Tue, May 14, 2019 at 10:59 AM Thomas graves 
> wrote:
> >>>
> >>> Thanks for replying, I'll extend the vote til May 26th to allow feedback
> >>> from you and other people who haven't had time to look at it.
> >>>
> >>> Tom
> >>>
> >>> On Mon, May 13, 2019 at 4:43 PM Holden Karau 
> wrote:
> >>> >
> >>> > I’d like to ask this vote period to be extended, I’m interested but
> I don’t have the cycles to review it in detail and make an informed vote
> until the 25th.
> >>> >
> >>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng 
> wrote:
> >>> >>
> >>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I
> don't feel strongly about it. I would still suggest doing the following:
> >>> >>
> >>> >> 1. Link the POC mentioned in Q4. So people can verify the POC
> result.
> >>> >> 2. List public APIs we plan to expose in Appendix A. I did a quick
> check. Besides ColumnarBatch and ColumnarVector, we also need to make the
> following public. People who are familiar with SQL internals should help
> assess the risk.
> >>> >> * ColumnarArray
> >>> >> * ColumnarMap
> >>> >> * unsafe.types.CalendarInterval
> >>> >> * ColumnarRow
> >>> >> * UTF8String
> >>> >> * ArrayData
> >>> >> * ...
> >>> >> 3. I still feel using Pandas UDF as the mid-term success doesn't
> match the purpose of this SPIP. It does make some code cleaner. But I guess
> for ETL use cases, it won't bring much value.
> >>> >>
> >>> > --
> >>> > Twitter: https://twitter.com/holdenkarau
> >>> > Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> >>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-24 Thread DB Tsai
+1 on exposing the APIs for columnar processing support.

I understand that the scope of this SPIP doesn't cover AI / ML
use-cases. But I saw a good performance gain when I converted data
from rows to columns to leverage SIMD architectures in a POC ML
application.

With the exposed columnar processing support, I can imagine that the
heavy lifting parts of ML applications (such as computing the
objective functions) can be written as columnar expressions that
leverage SIMD architectures to get a good speedup.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1
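[Editor's note: to make the row-vs-column point above concrete, here is a toy sketch in plain Python — not Spark code, and real SIMD speedups would come from the JVM JIT or native kernels operating on column buffers. The data, weight `w`, and function names are illustrative only.]

```python
from array import array

# Row layout: one heterogeneous tuple per record. A per-row objective
# function forces the hot loop to touch every field of every record.
rows = [(1.0, 2.5), (2.0, 4.9), (3.0, 7.6)]

def sse_rows(rows, w):
    """Sum of squared errors of the model y ~ w*x, iterating row by row."""
    return sum((w * x - y) ** 2 for x, y in rows)

# Columnar layout: a contiguous homogeneous buffer per field. The same
# objective becomes a tight loop over flat arrays -- the shape that a
# JIT or a native kernel can turn into SIMD code.
xs = array("d", [1.0, 2.0, 3.0])
ys = array("d", [2.5, 4.9, 7.6])

def sse_columns(xs, ys, w):
    """The same objective computed directly over the column buffers."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

# Both layouts compute the same objective; only the memory shape differs.
assert abs(sse_rows(rows, 2.5) - sse_columns(xs, ys, 2.5)) < 1e-12
```

The computation is identical either way; the columnar version just hands the heavy-lifting loop homogeneous, contiguous data, which is what vectorizing compilers need.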

On Wed, May 15, 2019 at 2:59 PM Bobby Evans  wrote:
>
> It would allow for the columnar processing to be extended through the 
> shuffle.  So if I were doing say an FPGA accelerated extension it could 
> replace the ShuffleExchangeExec with one that can take a ColumnarBatch as 
> input instead of a Row. The extended version of the ShuffleExchangeExec could 
> then do the partitioning on the incoming batch and instead of producing a 
> ShuffleRowRDD for the exchange they could produce something like a 
> ShuffleBatchRDD that would let the serializing and deserializing happen in a 
> column based format for a faster exchange, assuming that columnar processing 
> is also happening after the exchange. This is just like providing a columnar 
> version of any other catalyst operator, except in this case it is a bit more 
> complex of an operator.
>
> On Wed, May 15, 2019 at 12:15 PM Imran Rashid  
> wrote:
>>
>> sorry I am late to the discussion here -- the jira mentions using this 
>> extension for dealing with shuffles, can you explain that part?  I don't 
>> see how you would use this to change shuffle behavior at all.
>>
>> On Tue, May 14, 2019 at 10:59 AM Thomas graves  wrote:
>>>
>>> Thanks for replying, I'll extend the vote til May 26th to allow feedback
>>> from you and other people who haven't had time to look at it.
>>>
>>> Tom
>>>
>>> On Mon, May 13, 2019 at 4:43 PM Holden Karau  wrote:
>>> >
>>> > I’d like to ask this vote period to be extended, I’m interested but I 
>>> > don’t have the cycles to review it in detail and make an informed vote 
>>> > until the 25th.
>>> >
>>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng  wrote:
>>> >>
>>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't 
>>> >> feel strongly about it. I would still suggest doing the following:
>>> >>
>>> >> 1. Link the POC mentioned in Q4. So people can verify the POC result.
>>> >> 2. List public APIs we plan to expose in Appendix A. I did a quick 
>>> >> check. Besides ColumnarBatch and ColumnarVector, we also need to make the 
>>> >> following public. People who are familiar with SQL internals should help 
>>> >> assess the risk.
>>> >> * ColumnarArray
>>> >> * ColumnarMap
>>> >> * unsafe.types.CalendarInterval
>>> >> * ColumnarRow
>>> >> * UTF8String
>>> >> * ArrayData
>>> >> * ...
>>> >> 3. I still feel using Pandas UDF as the mid-term success doesn't match 
>>> >> the purpose of this SPIP. It does make some code cleaner. But I guess 
>>> >> for ETL use cases, it won't bring much value.
>>> >>
>>> > --
>>> > Twitter: https://twitter.com/holdenkarau
>>> > Books (Learning Spark, High Performance Spark, etc.): 
>>> > https://amzn.to/2MaRAG9
>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-15 Thread Bobby Evans
It would allow for the columnar processing to be extended through the
shuffle.  So if I were doing say an FPGA accelerated extension it could
replace the ShuffleExchangeExec with one that can take a ColumnarBatch as
input instead of a Row. The extended version of the ShuffleExchangeExec
could then do the partitioning on the incoming batch and instead of
producing a ShuffleRowRDD for the exchange they could produce something
like a ShuffleBatchRDD that would let the serializing and deserializing
happen in a column based format for a faster exchange, assuming that
columnar processing is also happening after the exchange. This is just like
providing a columnar version of any other catalyst operator, except in this
case it is a bit more complex of an operator.
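[Editor's note: a rough sketch of the batch-wise partitioning described above, in plain Python standing in for Spark's Scala internals. A "ColumnarBatch" here is just a dict of equal-length column lists; the real ColumnarBatch class and any ShuffleBatchRDD equivalent are out of scope.]

```python
def partition_columnar_batch(batch, key_column, num_partitions):
    """Split one columnar batch into per-partition sub-batches.

    Rows are never materialized: the target partition is computed from
    the key column alone, then every output column is gathered with the
    same index lists, keeping the data columnar end to end -- which is
    what lets serialization stay in a column-based format.
    """
    keys = batch[key_column]
    # Index lists per partition, driven only by the key column.
    indices = [[] for _ in range(num_partitions)]
    for i, k in enumerate(keys):
        indices[hash(k) % num_partitions].append(i)
    # Gather every column using the same index lists.
    return [
        {name: [col[i] for i in idx] for name, col in batch.items()}
        for idx in indices
    ]

batch = {"id": [1, 2, 3, 4], "value": [10.0, 20.0, 30.0, 40.0]}
parts = partition_columnar_batch(batch, key_column="id", num_partitions=2)
# Each element of `parts` is itself a columnar batch, ready to be
# serialized column by column for a batch-oriented exchange.
```

A columnar exchange operator would do essentially this per incoming batch, instead of hashing and shuffling one row at a time.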

On Wed, May 15, 2019 at 12:15 PM Imran Rashid 
wrote:

> sorry I am late to the discussion here -- the jira mentions using this
> extension for dealing with shuffles, can you explain that part?  I don't
> see how you would use this to change shuffle behavior at all.
>
> On Tue, May 14, 2019 at 10:59 AM Thomas graves  wrote:
>
>> Thanks for replying, I'll extend the vote til May 26th to allow feedback
>> from you and other people who haven't had time to look at it.
>>
>> Tom
>>
>> On Mon, May 13, 2019 at 4:43 PM Holden Karau 
>> wrote:
>> >
>> > I’d like to ask this vote period to be extended, I’m interested but I
>> don’t have the cycles to review it in detail and make an informed vote
>> until the 25th.
>> >
>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng 
>> wrote:
>> >>
>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't
>> feel strongly about it. I would still suggest doing the following:
>> >>
>> >> 1. Link the POC mentioned in Q4. So people can verify the POC result.
>> >> 2. List public APIs we plan to expose in Appendix A. I did a quick
>> check. Besides ColumnarBatch and ColumnarVector, we also need to make the
>> following public. People who are familiar with SQL internals should help
>> assess the risk.
>> >> * ColumnarArray
>> >> * ColumnarMap
>> >> * unsafe.types.CalendarInterval
>> >> * ColumnarRow
>> >> * UTF8String
>> >> * ArrayData
>> >> * ...
>> >> 3. I still feel using Pandas UDF as the mid-term success doesn't match
>> the purpose of this SPIP. It does make some code cleaner. But I guess for
>> ETL use cases, it won't bring much value.
>> >>
>> > --
>> > Twitter: https://twitter.com/holdenkarau
>> > Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-15 Thread Imran Rashid
sorry I am late to the discussion here -- the jira mentions using this
extension for dealing with shuffles, can you explain that part?  I don't
see how you would use this to change shuffle behavior at all.

On Tue, May 14, 2019 at 10:59 AM Thomas graves  wrote:

> Thanks for replying, I'll extend the vote til May 26th to allow feedback
> from you and other people who haven't had time to look at it.
>
> Tom
>
> On Mon, May 13, 2019 at 4:43 PM Holden Karau  wrote:
> >
> > I’d like to ask this vote period to be extended, I’m interested but I
> don’t have the cycles to review it in detail and make an informed vote
> until the 25th.
> >
> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng 
> wrote:
> >>
> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't
> feel strongly about it. I would still suggest doing the following:
> >>
> >> 1. Link the POC mentioned in Q4. So people can verify the POC result.
> >> 2. List public APIs we plan to expose in Appendix A. I did a quick
> check. Besides ColumnarBatch and ColumnarVector, we also need to make the
> following public. People who are familiar with SQL internals should help
> assess the risk.
> >> * ColumnarArray
> >> * ColumnarMap
> >> * unsafe.types.CalendarInterval
> >> * ColumnarRow
> >> * UTF8String
> >> * ArrayData
> >> * ...
> >> 3. I still feel using Pandas UDF as the mid-term success doesn't match
> the purpose of this SPIP. It does make some code cleaner. But I guess for
> ETL use cases, it won't bring much value.
> >>
> > --
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-14 Thread Thomas graves
Thanks for replying, I'll extend the vote til May 26th to allow feedback
from you and other people who haven't had time to look at it.

Tom

On Mon, May 13, 2019 at 4:43 PM Holden Karau  wrote:
>
> I’d like to ask this vote period to be extended, I’m interested but I don’t 
> have the cycles to review it in detail and make an informed vote until the 
> 25th.
>
> On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng  wrote:
>>
>> My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't feel 
>> strongly about it. I would still suggest doing the following:
>>
>> 1. Link the POC mentioned in Q4. So people can verify the POC result.
>> 2. List public APIs we plan to expose in Appendix A. I did a quick check. 
>> Besides ColumnarBatch and ColumnarVector, we also need to make the following 
>> public. People who are familiar with SQL internals should help assess the 
>> risk.
>> * ColumnarArray
>> * ColumnarMap
>> * unsafe.types.CalendarInterval
>> * ColumnarRow
>> * UTF8String
>> * ArrayData
>> * ...
>> 3. I still feel using Pandas UDF as the mid-term success doesn't match the 
>> purpose of this SPIP. It does make some code cleaner. But I guess for ETL 
>> use cases, it won't bring much value.
>>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-13 Thread Holden Karau
I’d like to ask this vote period to be extended, I’m interested but I don’t
have the cycles to review it in detail and make an informed vote until the
25th.

On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng  wrote:

> My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't
> feel strongly about it. I would still suggest doing the following:
>
> 1. Link the POC mentioned in Q4. So people can verify the POC result.
> 2. List public APIs we plan to expose in Appendix A. I did a quick check.
> Besides ColumnarBatch and ColumnarVector, we also need to make the following
> public. People who are familiar with SQL internals should help assess the
> risk.
> * ColumnarArray
> * ColumnarMap
> * unsafe.types.CalendarInterval
> * ColumnarRow
> * UTF8String
> * ArrayData
> * ...
> 3. I still feel using Pandas UDF as the mid-term success doesn't match the
> purpose of this SPIP. It does make some code cleaner. But I guess for ETL
> use cases, it won't bring much value.
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-13 Thread Xiangrui Meng
My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't feel
strongly about it. I would still suggest doing the following:

1. Link the POC mentioned in Q4. So people can verify the POC result.
2. List public APIs we plan to expose in Appendix A. I did a quick check.
Besides ColumnarBatch and ColumnarVector, we also need to make the following
public. People who are familiar with SQL internals should help assess the
risk.
* ColumnarArray
* ColumnarMap
* unsafe.types.CalendarInterval
* ColumnarRow
* UTF8String
* ArrayData
* ...
3. I still feel using Pandas UDF as the mid-term success doesn't match the
purpose of this SPIP. It does make some code cleaner. But I guess for ETL
use cases, it won't bring much value.


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-13 Thread Thomas graves
It would be nice to get feedback from people who responded on the
other vote thread - Reynold, Matei, Xiangrui, does the new version
look good?

Thanks,
Tom

On Mon, May 13, 2019 at 8:22 AM Jason Lowe  wrote:
>
> +1 (non-binding)
>
> Jason
>
> On Tue, May 7, 2019 at 1:37 PM Thomas graves  wrote:
>>
>> Hi everyone,
>>
>> I'd like to call for another vote on SPARK-27396 - SPIP: Public APIs
>> for extended Columnar Processing Support.  The proposal is to extend
>> the support to allow for more columnar processing.  We had previous
>> vote and discussion threads and have updated the SPIP based on the
>> comments to clarify a few things and reduce the scope.
>>
>> You can find the updated proposal in the jira at:
>> https://issues.apache.org/jira/browse/SPARK-27396.
>>
>> Please vote as early as you can, I will leave the vote open until next
>> Monday (May 13th), 2pm CST to give people plenty of time.
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Thanks!
>> Tom Graves
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-13 Thread Jason Lowe
+1 (non-binding)

Jason

On Tue, May 7, 2019 at 1:37 PM Thomas graves  wrote:

> Hi everyone,
>
> I'd like to call for another vote on SPARK-27396 - SPIP: Public APIs
> for extended Columnar Processing Support.  The proposal is to extend
> the support to allow for more columnar processing.  We had previous
> vote and discussion threads and have updated the SPIP based on the
> comments to clarify a few things and reduce the scope.
>
> You can find the updated proposal in the jira at:
> https://issues.apache.org/jira/browse/SPARK-27396.
>
> Please vote as early as you can, I will leave the vote open until next
> Monday (May 13th), 2pm CST to give people plenty of time.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
> Thanks!
> Tom Graves
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


RE: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-12 Thread tcondie
+1 (non-binding)

 

Tyson Condie

 

From: Kazuaki Ishizaki  
Sent: Thursday, May 9, 2019 9:17 AM
To: Bryan Cutler 
Cc: Bobby Evans ; Spark dev list ;
Thomas graves 
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar
Processing Support

 

+1 (non-binding)

Kazuaki Ishizaki



From:    Bryan Cutler <cutl...@gmail.com>
To:      Bobby Evans <bo...@apache.org>
Cc:      Thomas graves <tgra...@apache.org>,
Spark dev list <dev@spark.apache.org>
Date:    2019/05/09 03:20
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
Columnar Processing Support




+1 (non-binding)

On Tue, May 7, 2019 at 12:04 PM Bobby Evans <bo...@apache.org> wrote:
I am +1

On Tue, May 7, 2019 at 1:37 PM Thomas graves <tgra...@apache.org> wrote:
Hi everyone,

I'd like to call for another vote on SPARK-27396 - SPIP: Public APIs
for extended Columnar Processing Support.  The proposal is to extend
the support to allow for more columnar processing.  We had previous
vote and discussion threads and have updated the SPIP based on the
comments to clarify a few things and reduce the scope.

You can find the updated proposal in the jira at:
https://issues.apache.org/jira/browse/SPARK-27396.

Please vote as early as you can, I will leave the vote open until next
Monday (May 13th), 2pm CST to give people plenty of time.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thanks!
Tom Graves

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-09 Thread Kazuaki Ishizaki
+1 (non-binding)

Kazuaki Ishizaki



From:   Bryan Cutler 
To: Bobby Evans 
Cc: Thomas graves , Spark dev list 

Date:   2019/05/09 03:20
Subject:Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended 
Columnar Processing Support



+1 (non-binding)

On Tue, May 7, 2019 at 12:04 PM Bobby Evans  wrote:
I am +1

On Tue, May 7, 2019 at 1:37 PM Thomas graves  wrote:
Hi everyone,

I'd like to call for another vote on SPARK-27396 - SPIP: Public APIs
for extended Columnar Processing Support.  The proposal is to extend
the support to allow for more columnar processing.  We had previous
vote and discussion threads and have updated the SPIP based on the
comments to clarify a few things and reduce the scope.

You can find the updated proposal in the jira at:
https://issues.apache.org/jira/browse/SPARK-27396.

Please vote as early as you can, I will leave the vote open until next
Monday (May 13th), 2pm CST to give people plenty of time.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thanks!
Tom Graves

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-08 Thread Bryan Cutler
+1 (non-binding)

On Tue, May 7, 2019 at 12:04 PM Bobby Evans  wrote:

> I am +1
>
> On Tue, May 7, 2019 at 1:37 PM Thomas graves  wrote:
>
>> Hi everyone,
>>
>> I'd like to call for another vote on SPARK-27396 - SPIP: Public APIs
>> for extended Columnar Processing Support.  The proposal is to extend
>> the support to allow for more columnar processing.  We had previous
>> vote and discussion threads and have updated the SPIP based on the
>> comments to clarify a few things and reduce the scope.
>>
>> You can find the updated proposal in the jira at:
>> https://issues.apache.org/jira/browse/SPARK-27396.
>>
>> Please vote as early as you can, I will leave the vote open until next
>> Monday (May 13th), 2pm CST to give people plenty of time.
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Thanks!
>> Tom Graves
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-07 Thread Bobby Evans
I am +1

On Tue, May 7, 2019 at 1:37 PM Thomas graves  wrote:

> Hi everyone,
>
> I'd like to call for another vote on SPARK-27396 - SPIP: Public APIs
> for extended Columnar Processing Support.  The proposal is to extend
> the support to allow for more columnar processing.  We had previous
> vote and discussion threads and have updated the SPIP based on the
> comments to clarify a few things and reduce the scope.
>
> You can find the updated proposal in the jira at:
> https://issues.apache.org/jira/browse/SPARK-27396.
>
> Please vote as early as you can, I will leave the vote open until next
> Monday (May 13th), 2pm CST to give people plenty of time.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
> Thanks!
> Tom Graves
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-07 Thread Thomas graves
Hi everyone,

I'd like to call for another vote on SPARK-27396 - SPIP: Public APIs
for extended Columnar Processing Support.  The proposal is to extend
the support to allow for more columnar processing.  We had previous
vote and discussion threads and have updated the SPIP based on the
comments to clarify a few things and reduce the scope.

You can find the updated proposal in the jira at:
https://issues.apache.org/jira/browse/SPARK-27396.

Please vote as early as you can, I will leave the vote open until next
Monday (May 13th), 2pm CST to give people plenty of time.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thanks!
Tom Graves

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-02 Thread Bryan Cutler
On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <
> matei.zaha...@gmail.com>
> > wrote:
> > > > FYI, I’d also be concerned about exposing the Arrow API or format as
> a
> > public API if it’s not yet stable. Is stabilization of the API and format
> > coming soon on the roadmap there? Maybe someone can work with the Arrow
> > community to make that happen.
> > > >
> > > > We’ve been bitten lots of times by API changes forced by external
> > libraries even when those were widely popular. For example, we used
> Guava’s
> > Optional for a while, which changed at some point, and we also had issues
> > with Protobuf and Scala itself (especially how Scala’s APIs appear in
> > Java). API breakage might not be as serious in dynamic languages like
> > Python, where you can often keep compatibility with old behaviors, but it
> > really hurts in Java and Scala.
> > > >
> > > > The problem is especially bad for us because of two aspects of how
> > Spark is used:
> > > >
> > > > 1) Spark is used for production data transformation jobs that people
> > need to keep running for a long time. Nobody wants to make changes to a
> job
> > that’s been working fine and computing something correctly for years just
> > to get a bug fix from the latest Spark release or whatever. It’s much
> > better if they can upgrade Spark without editing every job.
> > > >
> > > > 2) Spark is often used as “glue” to combine data processing code in
> > other libraries, and these might start to require different versions of
> our
> > dependencies. For example, the Guava class exposed in Spark became a
> > problem when third-party libraries started requiring a new version of
> > Guava: those new libraries just couldn’t work with Spark. Protobuf was
> > especially bad because some users wanted to read data stored as Protobufs
> > (or in a format that uses Protobuf inside), so they needed a different
> > version of the library in their main data processing code.
> > > >
> > > > If there was some guarantee that this stuff would remain
> > backward-compatible, we’d be in a much better place. It’s not that hard
> to
> > keep a storage format backward-compatible: just document the format and
> > extend it only in ways that don’t break the meaning of old data (for
> > example, add new version numbers or field types that are read in a
> > different way). It’s a bit harder for a Java API, but maybe Spark could
> > just expose byte arrays directly and work on those if the API is not
> > guaranteed to stay stable (that is, we’d still use our own classes to
> > manipulate the data internally, and end users could use the Arrow library
> > if they want it).
> > > >
> > > > Matei
> > > >
> > > > > On Apr 20, 2019, at 8:38 AM, Bobby Evans 
> wrote:
> > > > >
> > > > > I think you misunderstood the point of this SPIP. I responded to
> > your comments in the SPIP JIRA.
> > > > >
> > > > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng 
> > wrote:
> > > > > I posted my comment in the JIRA. Main concerns here:
> > > > >
> > > > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might
> > have 1.0 release someday.
> > > > > 2. ML/DL systems that can benefit from columnar format are mostly
> > in Python.
> > > > > 3. Simple operations, though they benefit from vectorization, might not be
> > worth the data exchange overhead.
> > > > >
> > > > > So would an improved Pandas UDF API be good enough? For
> > example, SPARK-26412 (UDF that takes an iterator of Arrow batches).
> > > > >
> > > > > Sorry that I should join the discussion earlier! Hope it is not too
> > late:)
> > > > >
> > > > > On Fri, Apr 19, 2019 at 1:20 PM  wrote:
> > > > > +1 (non-binding) for better columnar data processing support.
> > > > >
> > > > >
> > > > >
> > > > > From: Jules Damji 
> > > > > Sent: Friday, April 19, 2019 12:21 PM
> > > > > To: Bryan Cutler 
> > > > > Cc: Dev 
> > > > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
> > Columnar Processing Support
> > > > >
> > > > >
> > > > >
> > > > > + (non-binding)
> > > > >
> > > > > Sent from my iPhone
> > > > >
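Matei’s point above about keeping a storage format backward-compatible — document the format and extend it only additively, with version numbers — can be sketched with a tiny hypothetical binary format. This is an illustration only, not Spark or Arrow code; all names here (`write_record`, `read_record`, the field layout) are invented for the example:

```python
import struct

# Hypothetical versioned record format (illustration only, not Spark/Arrow).
# Header: 1-byte version. v1 payload: int32 value. v2 appends an int32 flag.
# Old data stays readable because new fields are only ever appended.

def write_record(value, flag=None):
    if flag is None:
        return struct.pack("<Bi", 1, value)        # version 1 layout
    return struct.pack("<Bii", 2, value, flag)     # version 2, superset of v1

def read_record(buf):
    version = buf[0]
    value = struct.unpack_from("<i", buf, 1)[0]
    # A v2 reader still understands v1 data: the new field gets a default.
    flag = struct.unpack_from("<i", buf, 5)[0] if version >= 2 else 0
    return {"version": version, "value": value, "flag": flag}

old = write_record(42)          # written by a v1 writer
new = write_record(42, flag=7)  # written by a v2 writer
assert read_record(old) == {"version": 1, "value": 42, "flag": 0}
assert read_record(new) == {"version": 2, "value": 42, "flag": 7}
```

The same additive-only discipline is what makes it possible to upgrade the reader (here, Spark) without rewriting data produced by older writers.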

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-30 Thread Bobby Evans
but it
> really hurts in Java and Scala.
> > >
> > > The problem is especially bad for us because of two aspects of how
> Spark is used:
> > >
> > > 1) Spark is used for production data transformation jobs that people
> need to keep running for a long time. Nobody wants to make changes to a job
> that’s been working fine and computing something correctly for years just
> to get a bug fix from the latest Spark release or whatever. It’s much
> better if they can upgrade Spark without editing every job.
> > >
> > > 2) Spark is often used as “glue” to combine data processing code in
> other libraries, and these might start to require different versions of our
> dependencies. For example, the Guava class exposed in Spark became a
> problem when third-party libraries started requiring a new version of
> Guava: those new libraries just couldn’t work with Spark. Protobuf was
> especially bad because some users wanted to read data stored as Protobufs
> (or in a format that uses Protobuf inside), so they needed a different
> version of the library in their main data processing code.
> > >
> > > If there was some guarantee that this stuff would remain
> backward-compatible, we’d be in a much better place. It’s not that hard to
> keep a storage format backward-compatible: just document the format and
> extend it only in ways that don’t break the meaning of old data (for
> example, add new version numbers or field types that are read in a
> different way). It’s a bit harder for a Java API, but maybe Spark could
> just expose byte arrays directly and work on those if the API is not
> guaranteed to stay stable (that is, we’d still use our own classes to
> manipulate the data internally, and end users could use the Arrow library
> if they want it).
> > >
> > > Matei
> > >
> > > > On Apr 20, 2019, at 8:38 AM, Bobby Evans  wrote:
> > > >
> > > > I think you misunderstood the point of this SPIP. I responded to
> your comments in the SPIP JIRA.
> > > >
> > > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng 
> wrote:
> > > > I posted my comment in the JIRA. Main concerns here:
> > > >
> > > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might
> have 1.0 release someday.
> > > > 2. ML/DL systems that can benefit from columnar format are mostly
> in Python.
> > > > 3. Simple operations, though they benefit from vectorization, might not be
> worth the data exchange overhead.
> > > >
> > > > So would an improved Pandas UDF API be good enough? For
> example, SPARK-26412 (UDF that takes an iterator of Arrow batches).
> > > >
> > > > Sorry that I should join the discussion earlier! Hope it is not too
> late:)
> > > >
> > > > On Fri, Apr 19, 2019 at 1:20 PM  wrote:
> > > > +1 (non-binding) for better columnar data processing support.
> > > >
> > > >
> > > >
> > > > From: Jules Damji 
> > > > Sent: Friday, April 19, 2019 12:21 PM
> > > > To: Bryan Cutler 
> > > > Cc: Dev 
> > > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
> Columnar Processing Support
> > > >
> > > >
> > > >
> > > > + (non-binding)
> > > >
> > > > Sent from my iPhone
> > > >
> > > > Pardon the dumb thumb typos :)
> > > >
> > > >
> > > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler 
> wrote:
> > > >
> > > > +1 (non-binding)
> > > >
> > > >
> > > >
> > > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe 
> wrote:
> > > >
> > > > +1 (non-binding).  Looking forward to seeing better support for
> processing columnar data.
> > > >
> > > >
> > > >
> > > > Jason
> > > >
> > > >
> > > >
> > > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
>  wrote:
> > > >
> > > > Hi everyone,
> > > >
> > > >
> > > >
> > > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
> extended Columnar Processing Support.  The proposal is to extend the
> support to allow for more columnar processing.
> > > >
> > > >
> > > >
> > > > You can find the full proposal in the jira at:
> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
> DISCUSS thread in the dev mailing list.
> > > >
> > > >
> > > >
> > > > Please vote as early as you can, I will leave the vote open until
> next Monday (the 22nd), 2pm CST to give people plenty of time.
> > > >
> > > >
> > > >
> > > > [ ] +1: Accept the proposal as an official SPIP
> > > >
> > > > [ ] +0
> > > >
> > > > [ ] -1: I don't think this is a good idea because ...
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Thanks!
> > > >
> > > > Tom Graves
> > > >
> > >
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-23 Thread Matei Zaharia
 
> > libraries, and these might start to require different versions of our 
> > dependencies. For example, the Guava class exposed in Spark became a 
> > problem when third-party libraries started requiring a new version of 
> > Guava: those new libraries just couldn’t work with Spark. Protobuf was 
> > especially bad because some users wanted to read data stored as Protobufs 
> > (or in a format that uses Protobuf inside), so they needed a different 
> > version of the library in their main data processing code.
> > 
> > If there was some guarantee that this stuff would remain 
> > backward-compatible, we’d be in a much better place. It’s not that hard to
> > keep a storage format backward-compatible: just document the format and 
> > extend it only in ways that don’t break the meaning of old data (for 
> > example, add new version numbers or field types that are read in a 
> > different way). It’s a bit harder for a Java API, but maybe Spark could 
> > just expose byte arrays directly and work on those if the API is not 
> > guaranteed to stay stable (that is, we’d still use our own classes to 
> > manipulate the data internally, and end users could use the Arrow library 
> > if they want it).
> > 
> > Matei
> > 
> > > On Apr 20, 2019, at 8:38 AM, Bobby Evans  wrote:
> > > 
> > > I think you misunderstood the point of this SPIP. I responded to your 
> > > comments in the SPIP JIRA.
> > > 
> > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng  wrote:
> > > I posted my comment in the JIRA. Main concerns here:
> > > 
> > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have 1.0 
> > > release someday.
> > > 2. ML/DL systems that can benefits from columnar format are mostly in 
> > > Python.
> > > 3. Simple operations, though they benefit from vectorization, might not be worth
> > > the data exchange overhead.
> > > 
> > > So would an improved Pandas UDF API be good enough? For example,
> > > SPARK-26412 (UDF that takes an iterator of Arrow batches).
> > > 
> > > Sorry that I should join the discussion earlier! Hope it is not too late:)
> > > 
> > > On Fri, Apr 19, 2019 at 1:20 PM  wrote:
> > > +1 (non-binding) for better columnar data processing support.
> > > 
> > >  
> > > 
> > > From: Jules Damji  
> > > Sent: Friday, April 19, 2019 12:21 PM
> > > To: Bryan Cutler 
> > > Cc: Dev 
> > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
> > > Processing Support
> > > 
> > >  
> > > 
> > > + (non-binding)
> > > 
> > > Sent from my iPhone
> > > 
> > > Pardon the dumb thumb typos :)
> > > 
> > > 
> > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:
> > > 
> > > +1 (non-binding)
> > > 
> > >  
> > > 
> > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:
> > > 
> > > +1 (non-binding).  Looking forward to seeing better support for 
> > > processing columnar data.
> > > 
> > >  
> > > 
> > > Jason
> > > 
> > >  
> > > 
> > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves 
> > >  wrote:
> > > 
> > > Hi everyone,
> > > 
> > >  
> > > 
> > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for 
> > > extended Columnar Processing Support.  The proposal is to extend the 
> > > support to allow for more columnar processing.
> > > 
> > >  
> > > 
> > > You can find the full proposal in the jira at: 
> > > https://issues.apache.org/jira/browse/SPARK-27396. There was also a 
> > > DISCUSS thread in the dev mailing list.
> > > 
> > >  
> > > 
> > > Please vote as early as you can, I will leave the vote open until next 
> > > Monday (the 22nd), 2pm CST to give people plenty of time.
> > > 
> > >  
> > > 
> > > [ ] +1: Accept the proposal as an official SPIP
> > > 
> > > [ ] +0
> > > 
> > > [ ] -1: I don't think this is a good idea because ...
> > > 
> > >  
> > > 
> > >  
> > > 
> > > Thanks!
> > > 
> > > Tom Graves
> > > 
> > 
> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Tom Graves
compatible. That being said, changes to the format are not taken lightly and are
backwards compatible when possible. I think it would be fair to mark the APIs
exposing Arrow data as experimental for the time being, and clearly state the
version that must be used to be compatible in the docs. Also, adding features
like this and SPARK-24579 will probably help adoption of Arrow and accelerate a
1.0 release. Adding the Arrow dev list to CC.

Bryan

On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia wrote:

Okay, that makes sense, but is the Arrow data format stable? If not, we risk
breakage when Arrow changes in the future and some libraries using this
feature begin to use the new Arrow code.

Matei

On Apr 20, 2019, at 1:39 PM, Bobby Evans  wrote:

I want to be clear that this SPIP is not proposing exposing Arrow
APIs/Classes through any Spark APIs. SPARK-24579 is doing that, and because of
the overlap between the two SPIPs I scaled this one back to concentrate just on
the columnar processing aspects. Sorry for the confusion as I didn't update the
JIRA description clearly enough when we adjusted it during the discussion on the
JIRA. As part of the columnar processing, we plan on providing Arrow formatted
data, but that will be exposed through a Spark owned API.

On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia  wrote:

FYI, I’d also be concerned about exposing the Arrow API or format as a
public API if it’s not yet stable. Is stabilization of the API and format
coming soon on the roadmap there? Maybe someone can work with the Arrow
community to make that happen.

We’ve been bitten lots of times by API changes forced by external
libraries even when those were widely popular. For example, we used Guava’s
Optional for a while, which changed at some point, and we also had issues
with Protobuf and Scala itself (especially how Scala’s APIs appear in
Java). API breakage might not be as serious in dynamic languages like Python,
where you can often keep compatibility with old behaviors, but it really hurts
in Java and Scala.

The problem is especially bad for us because of two aspects of how
Spark is used:

1) Spark is used for production data transformation jobs that people
need to keep running for a long time. Nobody wants to make changes to a
job that’s been working fine and computing something correctly for years just
to get a bug fix from the latest Spark release or whatever. It’s much better if
they can upgrade Spark without editing every job.

2) Spark is often used as “glue” to combine data processing code in
other libraries, and these might start to require different versions of
our dependencies. For example, the Guava class exposed in Spark became a
problem when third-party libraries started requiring a new version of
Guava: those new libraries just couldn’t work with Spark. Protobuf was
especially bad because some users wanted to read data stored as Protobufs
(or in a format that uses Protobuf inside), so they needed a different
version of the library in their main data processing code.

If there was some guarantee that this stuff would remain
backward-compatible, we’d be in a much better place. It’s not that hard to
keep a storage format backward-compatible: just document the format and
extend it only in ways that don’t break the meaning of old data (for
example, add new version numbers or field types that are read in a
different way). It’s a bit harder for a Java API, but maybe Spark could
just expose byte arrays directly and work on those if the API is not
guaranteed to stay stable (that is, we’d still use our own classes to
manipulate the data internally, and end users could use the Arrow library
if they want it).

Matei

On Apr 20, 2019, at 8:38 AM, Bobby Evans  wrote:

I think you misunderstood the point of this SPIP. I responded to your
comments in the SPIP JIRA.

On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng  wrote:

I posted my comment in the JIRA. Main concerns here:

1. Exposing third-party Java APIs in Spark is risky. Arrow might have a
1.0 release someday.
2. ML/DL systems that can benefit from columnar format are mostly in
Python.
3. Simple operations, though they benefit from vectorization, might not be
worth the data exchange overhead.

So would an improved Pandas UDF API be good enough? For example,
SPARK-26412 (UDF that takes an iterator of Arrow batches).

Sorry that I should join the discussion earlier! Hope it is not too
late :)

On Fri, Apr 19, 2019 at 1:20 PM  wrote:
+1 (non-binding) for better columnar data processing support.

From: Jules Damji 
Sent: Friday, April 19, 2019 12:21 PM
To: Bryan Cutler 
Cc: Dev 
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar
Processing Support

+ (non-binding)

Sent from my iPhone

Pardon the dumb thumb typos :)

On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:

+1 (non-binding)

On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:

+1 (non-binding). Looking forward to seeing better support for
processing columnar data.

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Bobby Evans
still use our own classes to
>> manipulate the data internally, and end users could use the Arrow library
>> if they want it).
>>
>> Matei
>>
>> On Apr 20, 2019, at 8:38 AM, Bobby Evans  wrote:
>>
>> I think you misunderstood the point of this SPIP. I responded to your
>>
>> comments in the SPIP JIRA.
>>
>> On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng 
>>
>> wrote:
>>
>> I posted my comment in the JIRA. Main concerns here:
>>
>> 1. Exposing third-party Java APIs in Spark is risky. Arrow might have
>>
>> 1.0 release someday.
>>
>> 2. ML/DL systems that can benefit from columnar format are mostly in
>>
>> Python.
>>
>> 3. Simple operations, though they benefit from vectorization, might not be
>>
>> worth the data exchange overhead.
>>
>> So would an improved Pandas UDF API be good enough? For
>>
>> example, SPARK-26412 (UDF that takes an iterator of Arrow batches).
>>
>> Sorry that I should join the discussion earlier! Hope it is not too
>>
>> late:)
>>
>> On Fri, Apr 19, 2019 at 1:20 PM  wrote:
>> +1 (non-binding) for better columnar data processing support.
>>
>> From: Jules Damji 
>> Sent: Friday, April 19, 2019 12:21 PM
>> To: Bryan Cutler 
>> Cc: Dev 
>> Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
>>
>> Columnar Processing Support
>>
>> + (non-binding)
>>
>> Sent from my iPhone
>>
>> Pardon the dumb thumb typos :)
>>
>> On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:
>>
>> +1 (non-binding)
>>
>> On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:
>>
>> +1 (non-binding). Looking forward to seeing better support for
>>
>> processing columnar data.
>>
>> Jason
>>
>> On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
>>
>>  wrote:
>>
>> Hi everyone,
>>
>> I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
>>
>> extended Columnar Processing Support. The proposal is to extend the
>> support to allow for more columnar processing.
>>
>> You can find the full proposal in the jira at:
>>
>> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
>> DISCUSS thread in the dev mailing list.
>>
>> Please vote as early as you can, I will leave the vote open until
>>
>> next Monday (the 22nd), 2pm CST to give people plenty of time.
>>
>> [ ] +1: Accept the proposal as an official SPIP
>>
>> [ ] +0
>>
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Thanks!
>>
>> Tom Graves
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
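The iterator-of-batches UDF idea referenced above (SPARK-26412) is about amortizing per-invocation overhead — expensive setup such as loading a model happens once per partition, while each Arrow batch is processed vectorized. A rough pure-Python sketch of that pattern, with plain lists standing in for Arrow record batches (no pyspark; `scale_batches` and `SETUP_CALLS` are invented names for illustration):

```python
from typing import Iterator, List

# Sketch of the iterator-of-batches UDF pattern (SPARK-26412-style),
# simulated with plain lists standing in for Arrow record batches.

SETUP_CALLS = 0  # counts how often the "expensive" setup runs

def scale_batches(batches: Iterator[List[int]]) -> Iterator[List[int]]:
    global SETUP_CALLS
    SETUP_CALLS += 1                  # e.g. load a model, open a connection
    for batch in batches:
        yield [x * 2 for x in batch]  # vectorized work applied per batch

# One partition's worth of batches flows through a single UDF invocation.
partition = iter([[1, 2], [3, 4], [5]])
out = list(scale_batches(partition))
assert out == [[2, 4], [6, 8], [10]]
assert SETUP_CALLS == 1               # setup amortized across all batches
```

With a per-row or per-batch UDF the setup would run once per call; the iterator form is what makes the data exchange overhead Xiangrui mentions easier to justify.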


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Reynold Xin
really hurts in Java and Scala.
>>>
>>> The problem is especially bad for us because of two aspects of how
>>> Spark is used:
>>>
>>> 1) Spark is used for production data transformation jobs that people
>>> need to keep running for a long time. Nobody wants to make changes to a
>>> job that’s been working fine and computing something correctly for years
>>> just to get a bug fix from the latest Spark release or whatever. It’s much
>>> better if they can upgrade Spark without editing every job.
>>>
>>> 2) Spark is often used as “glue” to combine data processing code in
>>> other libraries, and these might start to require different versions of
>>> our dependencies. For example, the Guava class exposed in Spark became a
>>> problem when third-party libraries started requiring a new version of
>>> Guava: those new libraries just couldn’t work with Spark. Protobuf was
>>> especially bad because some users wanted to read data stored as Protobufs
>>> (or in a format that uses Protobuf inside), so they needed a different
>>> version of the library in their main data processing code.
>>>
>>> If there was some guarantee that this stuff would remain
>>> backward-compatible, we’d be in a much better place. It’s not that hard to
>>> keep a storage format backward-compatible: just document the format and
>>> extend it only in ways that don’t break the meaning of old data (for
>>> example, add new version numbers or field types that are read in a
>>> different way). It’s a bit harder for a Java API, but maybe Spark could
>>> just expose byte arrays directly and work on those if the API is not
>>> guaranteed to stay stable (that is, we’d still use our own classes to
>>> manipulate the data internally, and end users could use the Arrow library
>>> if they want it).
>>>
>>> Matei
>>>
>>>> On Apr 20, 2019, at 8:38 AM, Bobby Evans <reva...@gmail.com> wrote:
>>>>
>>>> I think you misunderstood the point of this SPIP. I responded to your
>>>> comments in the SPIP JIRA.
>>>>
>>>> On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <men...@gmail.com> wrote:
>>>> I posted my comment in the JIRA. Main concerns here:
>>>>
>>>> 1. Exposing third-party Java APIs in Spark is risky. Arrow might have a
>>>> 1.0 release someday.
>>>> 2. ML/DL systems that can benefit from columnar format are mostly in
>>>> Python.
>>>> 3. Simple operations, though they benefit from vectorization, might not
>>>> be worth the data exchange overhead.
>>>>
>>>> So would an improved Pandas UDF API be good enough? For example,
>>>> SPARK-26412 (UDF that takes an iterator of Arrow batches).
>>>>
>>>> Sorry that I should join the discussion earlier! Hope it is not too
>>>> late :)
>>>>
>>>> On Fri, Apr 19, 2019 at 1:20 PM <tcon...@gmail.com> wrote:
>>>> +1 (non-binding) for better columnar data processing support.
>>>>
>>>> From: Jules Damji <dmat...@comcast.net>
>>>> Sent: Friday, April 19, 2019 12:21 PM
>>>> To: Bryan Cutler <cutl...@gmail.com>
>>>> Cc: Dev <dev@spark.apache.org>
>>>> Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
>>>> Columnar Processing Support
>>>>
>>>> + (non-binding)
>>>>
>>>> Sent from my iPhone
>>>>
>>>> Pardon the dumb thumb typos :)
>>>>
>>>> On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cutl...@gmail.com> wrote:
>>>>
>>>> +1 (non-binding)
>>>>
>>>> On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org> wrote:
>>>>
>>>> +1 (non-binding). Looking forward to seeing better support for
>>>> processing columnar data.
>>>>
>>>> Jason
>>>>
>>>> On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
>>>> <tgraves...@yahoo.com.invalid> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
>>>> extended Columnar Processing Support. The proposal is to extend the
>>>> support to allow for more columnar processing.
>>>>
>>>> You can find the full proposal in the jira at:
>>>> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
>>>> DISCUSS thread in the dev mailing list.
>>>>
>>>> Please vote as early as you can, I will leave the vote open until
>>>> next Monday (the 22nd), 2pm CST to give people plenty of time.
>>>>
>>>> [ ] +1: Accept the proposal as an official SPIP
>>>> [ ] +0
>>>> [ ] -1: I don't think this is a good idea because ...
>>>>
>>>> Thanks!
>>>> Tom Graves
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Xiangrui Meng
 when those were widely popular. For example, we used Guava’s
> Optional for a while, which changed at some point, and we also had issues
> with Protobuf and Scala itself (especially how Scala’s APIs appear in
> Java). API breakage might not be as serious in dynamic languages like
> Python, where you can often keep compatibility with old behaviors, but it
> really hurts in Java and Scala.
> >
> > The problem is especially bad for us because of two aspects of how Spark
> is used:
> >
> > 1) Spark is used for production data transformation jobs that people
> need to keep running for a long time. Nobody wants to make changes to a job
> that’s been working fine and computing something correctly for years just
> to get a bug fix from the latest Spark release or whatever. It’s much
> better if they can upgrade Spark without editing every job.
> >
> > 2) Spark is often used as “glue” to combine data processing code in
> other libraries, and these might start to require different versions of our
> dependencies. For example, the Guava class exposed in Spark became a
> problem when third-party libraries started requiring a new version of
> Guava: those new libraries just couldn’t work with Spark. Protobuf was
> especially bad because some users wanted to read data stored as Protobufs
> (or in a format that uses Protobuf inside), so they needed a different
> version of the library in their main data processing code.
> >
> > If there was some guarantee that this stuff would remain
> backward-compatible, we’d be in a much better place. It’s not that hard to
> keep a storage format backward-compatible: just document the format and
> extend it only in ways that don’t break the meaning of old data (for
> example, add new version numbers or field types that are read in a
> different way). It’s a bit harder for a Java API, but maybe Spark could
> just expose byte arrays directly and work on those if the API is not
> guaranteed to stay stable (that is, we’d still use our own classes to
> manipulate the data internally, and end users could use the Arrow library
> if they want it).
> >
> > Matei
> >
> > > On Apr 20, 2019, at 8:38 AM, Bobby Evans  wrote:
> > >
> > > I think you misunderstood the point of this SPIP. I responded to your
> comments in the SPIP JIRA.
> > >
> > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng 
> wrote:
> > > I posted my comment in the JIRA. Main concerns here:
> > >
> > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have
> 1.0 release someday.
> > > 2. ML/DL systems that can benefit from columnar format are mostly in
> Python.
> > > 3. Simple operations, though they benefit from vectorization, might not
> be worth the data exchange overhead.
> > >
> > > So would an improved Pandas UDF API be good enough? For example,
> SPARK-26412 (UDF that takes an iterator of Arrow batches).
> > >
> > > Sorry that I should join the discussion earlier! Hope it is not too
> late:)
> > >
> > > On Fri, Apr 19, 2019 at 1:20 PM  wrote:
> > > +1 (non-binding) for better columnar data processing support.
> > >
> > >
> > >
> > > From: Jules Damji 
> > > Sent: Friday, April 19, 2019 12:21 PM
> > > To: Bryan Cutler 
> > > Cc: Dev 
> > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
> Columnar Processing Support
> > >
> > >
> > >
> > > + (non-binding)
> > >
> > > Sent from my iPhone
> > >
> > > Pardon the dumb thumb typos :)
> > >
> > >
> > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:
> > >
> > > +1 (non-binding)
> > >
> > >
> > >
> > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:
> > >
> > > +1 (non-binding).  Looking forward to seeing better support for
> processing columnar data.
> > >
> > >
> > >
> > > Jason
> > >
> > >
> > >
> > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
>  wrote:
> > >
> > > Hi everyone,
> > >
> > >
> > >
> > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
> extended Columnar Processing Support.  The proposal is to extend the
> support to allow for more columnar processing.
> > >
> > >
> > >
> > > You can find the full proposal in the jira at:
> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
> DISCUSS thread in the dev mailing list.
> > >
> > >
> > >
> > > Please vote as early as you can, I will leave the vote open until next
> Monday (the 22nd), 2pm CST to give people plenty of time.
> > >
> > >
> > >
> > > [ ] +1: Accept the proposal as an official SPIP
> > >
> > > [ ] +0
> > >
> > > [ ] -1: I don't think this is a good idea because ...
> > >
> > >
> > >
> > >
> > >
> > > Thanks!
> > >
> > > Tom Graves
> > >
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Tom Graves
 not that hard to keep a storage format 
> backward-compatible: just document the format and extend it only in ways that 
> don’t break the meaning of old data (for example, add new version numbers or 
> field types that are read in a different way). It’s a bit harder for a Java 
> API, but maybe Spark could just expose byte arrays directly and work on those 
> if the API is not guaranteed to stay stable (that is, we’d still use our own 
> classes to manipulate the data internally, and end users could use the Arrow 
> library if they want it).
> 
> Matei
> 
> > On Apr 20, 2019, at 8:38 AM, Bobby Evans  wrote:
> > 
> > I think you misunderstood the point of this SPIP. I responded to your 
> > comments in the SPIP JIRA.
> > 
> > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng  wrote:
> > I posted my comment in the JIRA. Main concerns here:
> > 
> > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have 1.0 
> > release someday.
> > 2. ML/DL systems that can benefit from columnar format are mostly in 
> > Python.
> > 3. Simple operations, though they benefit from vectorization, might not be 
> > worth the data exchange overhead.
> > 
> > So would an improved Pandas UDF API be good enough? For example, 
> > SPARK-26412 (UDF that takes an iterator of Arrow batches).
> > 
> > Sorry that I should join the discussion earlier! Hope it is not too late:)
> > 
> > On Fri, Apr 19, 2019 at 1:20 PM  wrote:
> > +1 (non-binding) for better columnar data processing support.
> > 
> >  
> > 
> > From: Jules Damji  
> > Sent: Friday, April 19, 2019 12:21 PM
> > To: Bryan Cutler 
> > Cc: Dev 
> > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
> > Processing Support
> > 
> >  
> > 
> > + (non-binding)
> > 
> > Sent from my iPhone
> > 
> > Pardon the dumb thumb typos :)
> > 
> > 
> > On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:
> > 
> > +1 (non-binding)
> > 
> >  
> > 
> > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:
> > 
> > +1 (non-binding).  Looking forward to seeing better support for processing 
> > columnar data.
> > 
> >  
> > 
> > Jason
> > 
> >  
> > 
> > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves  
> > wrote:
> > 
> > Hi everyone,
> > 
> >  
> > 
> > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended 
> > Columnar Processing Support.  The proposal is to extend the support to 
> > allow for more columnar processing.
> > 
> >  
> > 
> > You can find the full proposal in the jira at: 
> > https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS 
> > thread in the dev mailing list.
> > 
> >  
> > 
> > Please vote as early as you can, I will leave the vote open until next 
> > Monday (the 22nd), 2pm CST to give people plenty of time.
> > 
> >  
> > 
> > [ ] +1: Accept the proposal as an official SPIP
> > 
> > [ ] +0
> > 
> > [ ] -1: I don't think this is a good idea because ...
> > 
> >  
> > 
> >  
> > 
> > Thanks!
> > 
> > Tom Graves
> > 
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



  

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Bobby Evans
g code.
>> >
>> > If there was some guarantee that this stuff would remain
>> backward-compatible, we’d be in a much better place. It’s not that hard to
>> keep a storage format backward-compatible: just document the format and
>> extend it only in ways that don’t break the meaning of old data (for
>> example, add new version numbers or field types that are read in a
>> different way). It’s a bit harder for a Java API, but maybe Spark could
>> just expose byte arrays directly and work on those if the API is not
>> guaranteed to stay stable (that is, we’d still use our own classes to
>> manipulate the data internally, and end users could use the Arrow library
>> if they want it).
>> >
>> > Matei
>> >
>> > > On Apr 20, 2019, at 8:38 AM, Bobby Evans  wrote:
>> > >
>> > > I think you misunderstood the point of this SPIP. I responded to your
>> comments in the SPIP JIRA.
>> > >
>> > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng 
>> wrote:
>> > > I posted my comment in the JIRA. Main concerns here:
>> > >
>> > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have
>> 1.0 release someday.
>> > > 2. ML/DL systems that can benefit from columnar format are mostly in
>> Python.
>> > > 3. Simple operations, though they benefit from vectorization, might not
>> be worth the data exchange overhead.
>> > >
>> > > So would an improved Pandas UDF API be good enough? For
>> example, SPARK-26412 (UDF that takes an iterator of Arrow batches).
>> > >
>> > > Sorry that I should join the discussion earlier! Hope it is not too
>> late:)
>> > >
>> > > On Fri, Apr 19, 2019 at 1:20 PM  wrote:
>> > > +1 (non-binding) for better columnar data processing support.
>> > >
>> > >
>> > >
>> > > From: Jules Damji 
>> > > Sent: Friday, April 19, 2019 12:21 PM
>> > > To: Bryan Cutler 
>> > > Cc: Dev 
>> > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
>> Columnar Processing Support
>> > >
>> > >
>> > >
>> > > + (non-binding)
>> > >
>> > > Sent from my iPhone
>> > >
>> > > Pardon the dumb thumb typos :)
>> > >
>> > >
>> > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:
>> > >
>> > > +1 (non-binding)
>> > >
>> > >
>> > >
>> > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:
>> > >
>> > > +1 (non-binding).  Looking forward to seeing better support for
>> processing columnar data.
>> > >
>> > >
>> > >
>> > > Jason
>> > >
>> > >
>> > >
>> > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
>>  wrote:
>> > >
>> > > Hi everyone,
>> > >
>> > >
>> > >
>> > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
>> extended Columnar Processing Support.  The proposal is to extend the
>> support to allow for more columnar processing.
>> > >
>> > >
>> > >
>> > > You can find the full proposal in the jira at:
>> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
>> DISCUSS thread in the dev mailing list.
>> > >
>> > >
>> > >
>> > > Please vote as early as you can, I will leave the vote open until
>> next Monday (the 22nd), 2pm CST to give people plenty of time.
>> > >
>> > >
>> > >
>> > > [ ] +1: Accept the proposal as an official SPIP
>> > >
>> > > [ ] +0
>> > >
>> > > [ ] -1: I don't think this is a good idea because ...
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > Thanks!
>> > >
>> > > Tom Graves
>> > >
>> >
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-20 Thread Bryan Cutler
 have
> 1.0 release someday.
> > > 2. ML/DL systems that can benefit from columnar format are mostly in
> Python.
> > > 3. Simple operations, though they benefit from vectorization, might not
> be worth the data exchange overhead.
> > >
> > > So would an improved Pandas UDF API be good enough? For example,
> SPARK-26412 (UDF that takes an iterator of Arrow batches).
> > >
> > > Sorry that I should join the discussion earlier! Hope it is not too
> late:)
> > >
> > > On Fri, Apr 19, 2019 at 1:20 PM  wrote:
> > > +1 (non-binding) for better columnar data processing support.
> > >
> > >
> > >
> > > From: Jules Damji 
> > > Sent: Friday, April 19, 2019 12:21 PM
> > > To: Bryan Cutler 
> > > Cc: Dev 
> > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
> Columnar Processing Support
> > >
> > >
> > >
> > > + (non-binding)
> > >
> > > Sent from my iPhone
> > >
> > > Pardon the dumb thumb typos :)
> > >
> > >
> > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:
> > >
> > > +1 (non-binding)
> > >
> > >
> > >
> > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:
> > >
> > > +1 (non-binding).  Looking forward to seeing better support for
> processing columnar data.
> > >
> > >
> > >
> > > Jason
> > >
> > >
> > >
> > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
>  wrote:
> > >
> > > Hi everyone,
> > >
> > >
> > >
> > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
> extended Columnar Processing Support.  The proposal is to extend the
> support to allow for more columnar processing.
> > >
> > >
> > >
> > > You can find the full proposal in the jira at:
> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
> DISCUSS thread in the dev mailing list.
> > >
> > >
> > >
> > > Please vote as early as you can, I will leave the vote open until next
> Monday (the 22nd), 2pm CST to give people plenty of time.
> > >
> > >
> > >
> > > [ ] +1: Accept the proposal as an official SPIP
> > >
> > > [ ] +0
> > >
> > > [ ] -1: I don't think this is a good idea because ...
> > >
> > >
> > >
> > >
> > >
> > > Thanks!
> > >
> > > Tom Graves
> > >
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-20 Thread Matei Zaharia
Okay, that makes sense, but is the Arrow data format stable? If not, we risk 
breakage when Arrow changes in the future and some libraries using this feature 
are begin to use the new Arrow code.

Matei

> On Apr 20, 2019, at 1:39 PM, Bobby Evans  wrote:
> 
> I want to be clear that this SPIP is not proposing exposing Arrow 
> APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and because 
> of the overlap between the two SPIPs I scaled this one back to concentrate 
> just on the columnar processing aspects. Sorry for the confusion as I didn't 
> update the JIRA description clearly enough when we adjusted it during the 
> discussion on the JIRA.  As part of the columnar processing, we plan on 
> providing arrow formatted data, but that will be exposed through a Spark 
> owned API.
> 
> On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia  wrote:
> FYI, I’d also be concerned about exposing the Arrow API or format as a public 
> API if it’s not yet stable. Is stabilization of the API and format coming 
> soon on the roadmap there? Maybe someone can work with the Arrow community to 
> make that happen.
> 
> We’ve been bitten lots of times by API changes forced by external libraries 
> even when those were widely popular. For example, we used Guava’s Optional 
> for a while, which changed at some point, and we also had issues with 
> Protobuf and Scala itself (especially how Scala’s APIs appear in Java). API 
> breakage might not be as serious in dynamic languages like Python, where you 
> can often keep compatibility with old behaviors, but it really hurts in Java 
> and Scala.
> 
> The problem is especially bad for us because of two aspects of how Spark is 
> used:
> 
> 1) Spark is used for production data transformation jobs that people need to 
> keep running for a long time. Nobody wants to make changes to a job that’s 
> been working fine and computing something correctly for years just to get a 
> bug fix from the latest Spark release or whatever. It’s much better if they 
> can upgrade Spark without editing every job.
> 
> 2) Spark is often used as “glue” to combine data processing code in other 
> libraries, and these might start to require different versions of our 
> dependencies. For example, the Guava class exposed in Spark became a problem 
> when third-party libraries started requiring a new version of Guava: those 
> new libraries just couldn’t work with Spark. Protobuf was especially bad 
> because some users wanted to read data stored as Protobufs (or in a format 
> that uses Protobuf inside), so they needed a different version of the library 
> in their main data processing code.
> 
> If there was some guarantee that this stuff would remain backward-compatible, 
> we’d be in a much better place. It’s not that hard to keep a storage format 
> backward-compatible: just document the format and extend it only in ways that 
> don’t break the meaning of old data (for example, add new version numbers or 
> field types that are read in a different way). It’s a bit harder for a Java 
> API, but maybe Spark could just expose byte arrays directly and work on those 
> if the API is not guaranteed to stay stable (that is, we’d still use our own 
> classes to manipulate the data internally, and end users could use the Arrow 
> library if they want it).
> 
> Matei
> 
> > On Apr 20, 2019, at 8:38 AM, Bobby Evans  wrote:
> > 
> > I think you misunderstood the point of this SPIP. I responded to your 
> > comments in the SPIP JIRA.
> > 
> > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng  wrote:
> > I posted my comment in the JIRA. Main concerns here:
> > 
> > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have 1.0 
> > release someday.
> > 2. ML/DL systems that can benefit from columnar format are mostly in 
> > Python.
> > 3. Simple operations, though they benefit from vectorization, might not be 
> > worth the data exchange overhead.
> > 
> > So would an improved Pandas UDF API be good enough? For example, 
> > SPARK-26412 (UDF that takes an iterator of Arrow batches).
> > 
> > Sorry that I should join the discussion earlier! Hope it is not too late:)
> > 
> > On Fri, Apr 19, 2019 at 1:20 PM  wrote:
> > +1 (non-binding) for better columnar data processing support.
> > 
> >  
> > 
> > From: Jules Damji  
> > Sent: Friday, April 19, 2019 12:21 PM
> > To: Bryan Cutler 
> > Cc: Dev 
> > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
> > Processing Support
> > 
> >  
> > 
> > + (non-binding)
> > 
> > Sent from my iPhone
> > 
> > Pardon the dumb thumb 

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-20 Thread Bobby Evans
I want to be clear that this SPIP is not proposing exposing Arrow
APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and
because of the overlap between the two SPIPs I scaled this one back to
concentrate just on the columnar processing aspects. Sorry for the
confusion as I didn't update the JIRA description clearly enough when we
adjusted it during the discussion on the JIRA.  As part of the columnar
processing, we plan on providing Arrow-formatted data, but that will be
exposed through a Spark-owned API.
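The "Spark owned API" idea above — callers program against a stable wrapper class while the backing columnar representation stays an internal, swappable detail — can be sketched in miniature. This is a hedged illustration only: the class name, methods, and tuple-backed storage are invented for the example and are not Spark's actual column APIs.

```python
# Hypothetical stable wrapper: callers depend only on this class, while the
# backing storage (a plain tuple here; an Arrow vector in a real system)
# remains an internal detail that can change without breaking callers.
class ColumnBatch:
    def __init__(self, data):
        self._data = tuple(data)  # internal representation, free to change

    def num_rows(self) -> int:
        return len(self._data)

    def get_int(self, row: int) -> int:
        return self._data[row]


batch = ColumnBatch([10, 20, 30])
assert batch.num_rows() == 3
assert batch.get_int(1) == 20
```

The point of the design is that swapping `_data` for a different representation later requires no change to any caller, which is exactly the insulation the SPIP argues for.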

On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia 
wrote:

> FYI, I’d also be concerned about exposing the Arrow API or format as a
> public API if it’s not yet stable. Is stabilization of the API and format
> coming soon on the roadmap there? Maybe someone can work with the Arrow
> community to make that happen.
>
> We’ve been bitten lots of times by API changes forced by external
> libraries even when those were widely popular. For example, we used Guava’s
> Optional for a while, which changed at some point, and we also had issues
> with Protobuf and Scala itself (especially how Scala’s APIs appear in
> Java). API breakage might not be as serious in dynamic languages like
> Python, where you can often keep compatibility with old behaviors, but it
> really hurts in Java and Scala.
>
> The problem is especially bad for us because of two aspects of how Spark
> is used:
>
> 1) Spark is used for production data transformation jobs that people need
> to keep running for a long time. Nobody wants to make changes to a job
> that’s been working fine and computing something correctly for years just
> to get a bug fix from the latest Spark release or whatever. It’s much
> better if they can upgrade Spark without editing every job.
>
> 2) Spark is often used as “glue” to combine data processing code in other
> libraries, and these might start to require different versions of our
> dependencies. For example, the Guava class exposed in Spark became a
> problem when third-party libraries started requiring a new version of
> Guava: those new libraries just couldn’t work with Spark. Protobuf was
> especially bad because some users wanted to read data stored as Protobufs
> (or in a format that uses Protobuf inside), so they needed a different
> version of the library in their main data processing code.
>
> If there was some guarantee that this stuff would remain
> backward-compatible, we’d be in a much better place. It’s not that hard to
> keep a storage format backward-compatible: just document the format and
> extend it only in ways that don’t break the meaning of old data (for
> example, add new version numbers or field types that are read in a
> different way). It’s a bit harder for a Java API, but maybe Spark could
> just expose byte arrays directly and work on those if the API is not
> guaranteed to stay stable (that is, we’d still use our own classes to
> manipulate the data internally, and end users could use the Arrow library
> if they want it).
>
> Matei
>
> > On Apr 20, 2019, at 8:38 AM, Bobby Evans  wrote:
> >
> > I think you misunderstood the point of this SPIP. I responded to your
> comments in the SPIP JIRA.
> >
> > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng  wrote:
> > I posted my comment in the JIRA. Main concerns here:
> >
> > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have
> 1.0 release someday.
> > 2. ML/DL systems that can benefits from columnar format are mostly in
> Python.
> > 3. Simple operations, though benefits vectorization, might not be worth
> the data exchange overhead.
> >
> > So would an improved Pandas UDF API would be good enough? For example,
> SPARK-26412 (UDF that takes an iterator of of Arrow batches).
> >
> > Sorry that I should join the discussion earlier! Hope it is not too
> late:)
> >
> > On Fri, Apr 19, 2019 at 1:20 PM  wrote:
> > +1 (non-binding) for better columnar data processing support.
> >
> >
> >
> > From: Jules Damji 
> > Sent: Friday, April 19, 2019 12:21 PM
> > To: Bryan Cutler 
> > Cc: Dev 
> > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar
> Processing Support
> >
> >
> >
> > + (non-binding)
> >
> > Sent from my iPhone
> >
> > Pardon the dumb thumb typos :)
> >
> >
> > On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:
> >
> > +1 (non-binding)
> >
> >
> >
> > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:
> >
> > +1 (non-binding).  Looking forward to seeing better support for
> processing columnar data.
> >
> >
> >
> > Jason
> >
> >
>

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-20 Thread Matei Zaharia
FYI, I’d also be concerned about exposing the Arrow API or format as a public 
API if it’s not yet stable. Is stabilization of the API and format coming soon 
on the roadmap there? Maybe someone can work with the Arrow community to make 
that happen.

We’ve been bitten lots of times by API changes forced by external libraries 
even when those were widely popular. For example, we used Guava’s Optional for 
a while, which changed at some point, and we also had issues with Protobuf and 
Scala itself (especially how Scala’s APIs appear in Java). API breakage might 
not be as serious in dynamic languages like Python, where you can often keep 
compatibility with old behaviors, but it really hurts in Java and Scala.

The problem is especially bad for us because of two aspects of how Spark is 
used:

1) Spark is used for production data transformation jobs that people need to 
keep running for a long time. Nobody wants to make changes to a job that’s been 
working fine and computing something correctly for years just to get a bug fix 
from the latest Spark release or whatever. It’s much better if they can upgrade 
Spark without editing every job.

2) Spark is often used as “glue” to combine data processing code in other 
libraries, and these might start to require different versions of our 
dependencies. For example, the Guava class exposed in Spark became a problem 
when third-party libraries started requiring a new version of Guava: those new 
libraries just couldn’t work with Spark. Protobuf was especially bad because 
some users wanted to read data stored as Protobufs (or in a format that uses 
Protobuf inside), so they needed a different version of the library in their 
main data processing code.

If there was some guarantee that this stuff would remain backward-compatible, 
we’d be in a much better place. It’s not that hard to keep a storage format 
backward-compatible: just document the format and extend it only in ways that 
don’t break the meaning of old data (for example, add new version numbers or 
field types that are read in a different way). It’s a bit harder for a Java 
API, but maybe Spark could just expose byte arrays directly and work on those 
if the API is not guaranteed to stay stable (that is, we’d still use our own 
classes to manipulate the data internally, and end users could use the Arrow 
library if they want it).

Matei
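The format-versioning suggestion above — document the format and extend it only with new version numbers or field types read in a different way — can be sketched with a tiny binary record format. This is an invented illustration, not anything from Spark: the record layout, field names, and defaults are all assumptions made for the example.

```python
import struct

# Hypothetical two-version record format: v1 stores an int32 value, v2 adds
# an int32 flag. Writers emit the newest version; readers accept every
# version they know, which is what keeps old data readable.
def write_record(value, flag=0, version=2):
    if version == 1:
        return struct.pack("<Bi", 1, value)
    return struct.pack("<Bii", 2, value, flag)


def read_record(buf):
    version = buf[0]  # leading version byte drives the decoding path
    if version == 1:
        (value,) = struct.unpack_from("<i", buf, 1)
        return {"value": value, "flag": 0}  # default for the missing field
    if version == 2:
        value, flag = struct.unpack_from("<ii", buf, 1)
        return {"value": value, "flag": flag}
    raise ValueError(f"unknown format version {version}")


# An old v1 record still round-trips through the new reader.
assert read_record(write_record(7, version=1)) == {"value": 7, "flag": 0}
assert read_record(write_record(7, flag=1)) == {"value": 7, "flag": 1}
```

The same discipline — version byte first, unknown versions rejected loudly, missing fields defaulted — is what makes it safe to extend the format later without breaking jobs that wrote data years earlier.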

> On Apr 20, 2019, at 8:38 AM, Bobby Evans  wrote:
> 
> I think you misunderstood the point of this SPIP. I responded to your 
> comments in the SPIP JIRA.
> 
> On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng  wrote:
> I posted my comment in the JIRA. Main concerns here:
> 
> 1. Exposing third-party Java APIs in Spark is risky. Arrow might have 1.0 
> release someday.
> 2. ML/DL systems that can benefit from columnar format are mostly in Python.
> 3. Simple operations, though they benefit from vectorization, might not be 
> worth the data exchange overhead.
> 
> So would an improved Pandas UDF API be good enough? For example, 
> SPARK-26412 (UDF that takes an iterator of Arrow batches).
> 
> Sorry that I should join the discussion earlier! Hope it is not too late:)
> 
> On Fri, Apr 19, 2019 at 1:20 PM  wrote:
> +1 (non-binding) for better columnar data processing support.
> 
>  
> 
> From: Jules Damji  
> Sent: Friday, April 19, 2019 12:21 PM
> To: Bryan Cutler 
> Cc: Dev 
> Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
> Processing Support
> 
>  
> 
> + (non-binding)
> 
> Sent from my iPhone
> 
> Pardon the dumb thumb typos :)
> 
> 
> On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:
> 
> +1 (non-binding)
> 
>  
> 
> On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:
> 
> +1 (non-binding).  Looking forward to seeing better support for processing 
> columnar data.
> 
>  
> 
> Jason
> 
>  
> 
> On Tue, Apr 16, 2019 at 10:38 AM Tom Graves  
> wrote:
> 
> Hi everyone,
> 
>  
> 
> I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended 
> Columnar Processing Support.  The proposal is to extend the support to allow 
> for more columnar processing.
> 
>  
> 
> You can find the full proposal in the jira at: 
> https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS 
> thread in the dev mailing list.
> 
>  
> 
> Please vote as early as you can, I will leave the vote open until next Monday 
> (the 22nd), 2pm CST to give people plenty of time.
> 
>  
> 
> [ ] +1: Accept the proposal as an official SPIP
> 
> [ ] +0
> 
> [ ] -1: I don't think this is a good idea because ...
> 
>  
> 
>  
> 
> Thanks!
> 
> Tom Graves
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-20 Thread Bobby Evans
I think you misunderstood the point of this SPIP. I responded to your
comments in the SPIP JIRA.

On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng  wrote:

> I posted my comment in the JIRA
> <https://issues.apache.org/jira/browse/SPARK-27396?focusedCommentId=16822367=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16822367>.
> Main concerns here:
>
> 1. Exposing third-party Java APIs in Spark is risky. Arrow might have 1.0
> release someday.
> 2. ML/DL systems that can benefit from columnar format are mostly in
> Python.
> 3. Simple operations, though they benefit from vectorization, might not be
> worth the data exchange overhead.
>
> So would an improved Pandas UDF API be good enough? For example,
> SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> (UDF that
> takes an iterator of Arrow batches).
>
> Sorry that I should join the discussion earlier! Hope it is not too late:)
>
> On Fri, Apr 19, 2019 at 1:20 PM  wrote:
>
>> +1 (non-binding) for better columnar data processing support.
>>
>>
>>
>> *From:* Jules Damji 
>> *Sent:* Friday, April 19, 2019 12:21 PM
>> *To:* Bryan Cutler 
>> *Cc:* Dev 
>> *Subject:* Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
>> Columnar Processing Support
>>
>>
>>
>> + (non-binding)
>>
>> Sent from my iPhone
>>
>> Pardon the dumb thumb typos :)
>>
>>
>> On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:
>>
>> +1 (non-binding)
>>
>>
>>
>> On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:
>>
>> +1 (non-binding).  Looking forward to seeing better support for
>> processing columnar data.
>>
>>
>>
>> Jason
>>
>>
>>
>> On Tue, Apr 16, 2019 at 10:38 AM Tom Graves 
>> wrote:
>>
>> Hi everyone,
>>
>>
>>
>> I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
>> extended Columnar Processing Support.  The proposal is to extend the
>> support to allow for more columnar processing.
>>
>>
>>
>> You can find the full proposal in the jira at:
>> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
>> DISCUSS thread in the dev mailing list.
>>
>>
>>
>> Please vote as early as you can, I will leave the vote open until next
>> Monday (the 22nd), 2pm CST to give people plenty of time.
>>
>>
>>
>> [ ] +1: Accept the proposal as an official SPIP
>>
>> [ ] +0
>>
>> [ ] -1: I don't think this is a good idea because ...
>>
>>
>>
>>
>>
>> Thanks!
>>
>> Tom Graves
>>
>>


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-19 Thread Xiangrui Meng
I posted my comment in the JIRA
<https://issues.apache.org/jira/browse/SPARK-27396?focusedCommentId=16822367=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16822367>.
Main concerns here:

1. Exposing third-party Java APIs in Spark is risky. Arrow might have 1.0
release someday.
2. ML/DL systems that can benefit from columnar format are mostly in
Python.
3. Simple operations, though they benefit from vectorization, might not be
worth the data exchange overhead.

So would an improved Pandas UDF API be good enough? For example,
SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> (UDF that
takes an iterator of Arrow batches).

Sorry that I should join the discussion earlier! Hope it is not too late:)
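The iterator-of-batches UDF shape that SPARK-26412 proposes can be illustrated with a plain-Python stand-in (lists of ints play the role of Arrow record batches; the function name and values are invented for the sketch). Receiving the whole partition as an iterator lets per-partition setup cost be paid once instead of once per batch.

```python
from typing import Iterator, List

# Stand-in for an Arrow-batch UDF: the function receives an iterator of
# batches for a whole partition, so one-time setup (e.g. loading an ML
# model) happens once, not once per batch.
def score_batches(batches: Iterator[List[int]]) -> Iterator[List[int]]:
    model_offset = 100  # pretend this is an expensive model load
    for batch in batches:
        yield [x + model_offset for x in batch]


partition = iter([[1, 2], [3]])
assert list(score_batches(partition)) == [[101, 102], [103]]
```

This amortized-setup property is the main argument for batch iterators over a per-row (or even per-batch) UDF when the work involves heavyweight initialization.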



RE: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-19 Thread tcondie
+1 (non-binding) for better columnar data processing support.

 


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-19 Thread Jules Damji
+ (non-binding)

Sent from my iPhone
Pardon the dumb thumb typos :)



Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-19 Thread Bryan Cutler
+1 (non-binding)



Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-18 Thread Jason Lowe
+1 (non-binding).  Looking forward to seeing better support for processing
columnar data.

Jason



Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-16 Thread Bobby Evans
I am +1; I had better be, since I am proposing the SPIP.

Thanks,

Bobby



[VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-16 Thread Tom Graves
Hi everyone,

I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended
Columnar Processing Support. The proposal is to extend the support to allow
for more columnar processing.

You can find the full proposal in the jira at:
https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS
thread in the dev mailing list.

Please vote as early as you can, I will leave the vote open until next Monday
(the 22nd), 2pm CST to give people plenty of time.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thanks!
Tom Graves