Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-30 Thread Bobby Evans
but it
> really hurts in Java and Scala.
> > >
> > > The problem is especially bad for us because of two aspects of how
> Spark is used:
> > >
> > > 1) Spark is used for production data transformation jobs that people
> need to keep running for a long time. Nobody wants to make changes to a job
> that’s been working fine and computing something correctly for years just
> to get a bug fix from the latest Spark release or whatever. It’s much
> better if they can upgrade Spark without editing every job.
> > >
> > > 2) Spark is often used as “glue” to combine data processing code in
> other libraries, and these might start to require different versions of our
> dependencies. For example, the Guava class exposed in Spark became a
> problem when third-party libraries started requiring a new version of
> Guava: those new libraries just couldn’t work with Spark. Protobuf was
> especially bad because some users wanted to read data stored as Protobufs
> (or in a format that uses Protobuf inside), so they needed a different
> version of the library in their main data processing code.
> > >
> > > If there was some guarantee that this stuff would remain
> backward-compatible, we’d be in a much better place. It’s not that hard to
> keep a storage format backward-compatible: just document the format and
> extend it only in ways that don’t break the meaning of old data (for
> example, add new version numbers or field types that are read in a
> different way). It’s a bit harder for a Java API, but maybe Spark could
> just expose byte arrays directly and work on those if the API is not
> guaranteed to stay stable (that is, we’d still use our own classes to
> manipulate the data internally, and end users could use the Arrow library
> if they want it).
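[Matei's versioning suggestion above can be sketched concretely. The layout below — magic bytes, a version field, and a length-prefixed payload — is purely hypothetical, not Spark's or Arrow's actual format; it only illustrates how a documented header lets an old reader safely ignore fields that a newer writer appends.]

```python
import struct

MAGIC = b"COLB"  # hypothetical magic bytes, not a real Spark format

def write_record(version: int, payload: bytes, extra: bytes = b"") -> bytes:
    # Header: magic, format version (uint16), payload length (uint32).
    # Newer format versions may append extra fields after the payload.
    return MAGIC + struct.pack("<HI", version, len(payload)) + payload + extra

def read_record(buf: bytes) -> bytes:
    if buf[:4] != MAGIC:
        raise ValueError("not a COLB record")
    version, length = struct.unpack_from("<HI", buf, 4)
    if version > 2:
        raise ValueError(f"record version {version} is newer than this reader")
    # Only the documented fields are read; anything past the payload is
    # ignored, so a trailing field added by a newer writer does not break
    # the meaning of old data.
    start = 4 + struct.calcsize("<HI")
    return buf[start:start + length]

# A v2 writer appends a trailing checksum; the reader still decodes the payload.
rec = write_record(2, b"column-data", extra=b"\x00\x01\x02\x03")
print(read_record(rec))  # b'column-data'
```

[The same discipline — read only documented fields, tolerate unknown trailing data — is what makes a storage format extensible without breaking old readers.]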
> > >
> > > Matei
> > >
> > > > On Apr 20, 2019, at 8:38 AM, Bobby Evans  wrote:
> > > >
> > > > I think you misunderstood the point of this SPIP. I responded to
> your comments in the SPIP JIRA.
> > > >
> > > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng 
> wrote:
> > > > I posted my comment in the JIRA. Main concerns here:
> > > >
> > > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might
> have a 1.0 release someday.
> > > > 2. ML/DL systems that can benefit from columnar format are mostly
> in Python.
> > > > 3. Simple operations, though they benefit from vectorization, might
> not be worth the data exchange overhead.
> > > >
> > > > So would an improved Pandas UDF API be good enough? For
> example, SPARK-26412 (a UDF that takes an iterator of Arrow batches).
> > > >
> > > > Sorry, I should have joined the discussion earlier! Hope it is not
> too late :)
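[The iterator-of-batches idea behind SPARK-26412 is worth spelling out. The sketch below uses plain Python lists as stand-ins for Arrow record batches, so it runs without Spark or pyarrow installed; the point it illustrates is that per-partition setup — e.g. loading an ML model — happens once for the whole stream of batches rather than once per batch.]

```python
from typing import Iterator, List

def predict_batches(batches: Iterator[List[float]]) -> Iterator[List[float]]:
    # Expensive setup runs once per partition, not once per batch --
    # the main win of an iterator-style UDF over the one-batch-at-a-time
    # form when each batch would otherwise pay the setup cost again.
    model_bias = 0.5  # stand-in for an expensively loaded model
    for batch in batches:  # each list stands in for one Arrow record batch
        yield [x + model_bias for x in batch]

# Stand-in for one partition's stream of Arrow batches:
partition = iter([[1.0, 2.0], [3.0]])
print(list(predict_batches(partition)))  # [[1.5, 2.5], [3.5]]
```

[With real Arrow data the signature would take and return Arrow batches (or pandas objects built from them), but the amortized-setup structure is the same.]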
> > > >
> > > > On Fri, Apr 19, 2019 at 1:20 PM  wrote:
> > > > +1 (non-binding) for better columnar data processing support.
> > > >
> > > >
> > > >
> > > > From: Jules Damji 
> > > > Sent: Friday, April 19, 2019 12:21 PM
> > > > To: Bryan Cutler 
> > > > Cc: Dev 
> > > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
> Columnar Processing Support
> > > >
> > > >
> > > >
> > > > + (non-binding)
> > > >
> > > > Sent from my iPhone
> > > >
> > > > Pardon the dumb thumb typos :)
> > > >
> > > >
> > > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler 
> wrote:
> > > >
> > > > +1 (non-binding)
> > > >
> > > >
> > > >
> > > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe 
> wrote:
> > > >
> > > > +1 (non-binding).  Looking forward to seeing better support for
> processing columnar data.
> > > >
> > > >
> > > >
> > > > Jason
> > > >
> > > >
> > > >
> > > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
>  wrote:
> > > >
> > > > Hi everyone,
> > > >
> > > >
> > > >
> > > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
> extended Columnar Processing Support.  The proposal is to extend the
> support to allow for more columnar processing.
> > > >
> > > >
> > > >
> > > > You can find the full proposal in the jira at:
> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
> DISCUSS thread in the dev mailing list.
> > > >
> > > >
> > > >
> > > > Please vote as early as you can, I will leave the vote open until
> next Monday (the 22nd), 2pm CST to give people plenty of time.
> > > >
> > > >
> > > >
> > > > [ ] +1: Accept the proposal as an official SPIP
> > > >
> > > > [ ] +0
> > > >
> > > > [ ] -1: I don't think this is a good idea because ...
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Thanks!
> > > >
> > > > Tom Graves
> > > >
> > >
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-23 Thread Matei Zaharia
 



Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Tom Graves
compatible. That being said, changes to the format are not taken lightly
and are backwards compatible when possible. I think it would be fair to
mark the APIs exposing Arrow data as experimental for the time being, and
clearly state the version that must be used to be compatible in the docs.
Also, adding features like this and SPARK-24579 will probably help
adoption of Arrow and accelerate a 1.0 release. Adding the Arrow dev list
to CC.

Bryan

On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia wrote:

Okay, that makes sense, but is the Arrow data format stable? If not, we
risk breakage when Arrow changes in the future and some libraries using
this feature begin to use the new Arrow code.

Matei

On Apr 20, 2019, at 1:39 PM, Bobby Evans  wrote:

I want to be clear that this SPIP is not proposing exposing Arrow
APIs/Classes through any Spark APIs. SPARK-24579 is doing that, and
because of the overlap between the two SPIPs I scaled this one back to
concentrate just on the columnar processing aspects. Sorry for the
confusion as I didn't update the JIRA description clearly enough when we
adjusted it during the discussion on the JIRA. As part of the columnar
processing, we plan on providing Arrow-formatted data, but that will be
exposed through a Spark-owned API.

On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia wrote:

FYI, I’d also be concerned about exposing the Arrow API or format as a
public API if it’s not yet stable. Is stabilization of the API and format
coming soon on the roadmap there? Maybe someone can work with the Arrow
community to make that happen.

We’ve been bitten lots of times by API changes forced by external
libraries even when those were widely popular. For example, we used
Guava’s Optional for a while, which changed at some point, and we also had
issues with Protobuf and Scala itself (especially how Scala’s APIs appear
in Java). API breakage might not be as serious in dynamic languages like
Python, where you can often keep compatibility with old behaviors, but it
really hurts in Java and Scala.



Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Bobby Evans


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Reynold Xin

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Bobby Evans

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Xiangrui Meng


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Tom Graves


  

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Bobby Evans


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-20 Thread Bryan Cutler
>