Just as a note here, if the goal is for the format not to change, why not make 
that explicit in a versioning policy? You can always include a format version 
number and say that future versions may increment the number, but this specific 
version will always be readable in some specific way. You could also put a 
timeline on how long old version numbers will be recognized in the official 
libraries (e.g. 3 years).
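For illustration, here is a minimal sketch of what embedding such a version 
number could look like (the names and the one-byte version header are 
hypothetical, not an existing Spark or Arrow API):

    # Hypothetical versioned payload: the first byte carries the format
    # version, and the policy is that every version <= SUPPORTED_FORMAT_VERSION
    # stays readable for the promised support window.
    SUPPORTED_FORMAT_VERSION = 1

    def encode(payload: bytes) -> bytes:
        return bytes([SUPPORTED_FORMAT_VERSION]) + payload

    def decode(buf: bytes) -> bytes:
        version = buf[0]
        if version > SUPPORTED_FORMAT_VERSION:
            raise ValueError("format version %d is newer than this library "
                             "understands; please upgrade" % version)
        return buf[1:]  # version 1 is the raw payload; later versions dispatch

    assert decode(encode(b"data")) == b"data"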

Matei

> On Apr 22, 2019, at 6:36 AM, Bobby Evans <reva...@gmail.com> wrote:
> 
> Yes, it is technically possible for the layout to change.  No, it is not 
> going to happen.  It is already baked into several different official 
> libraries which are widely used, not just for holding and processing the 
> data, but also for transferring the data between the various 
> implementations.  There would have to be a really serious reason to force 
> an incompatible change at this point.  So in the worst case, we can version 
> the layout and bake that into the API that exposes the internal layout of 
> the data.  That way, code that wants to program against a Java API can do 
> so using the API that Spark provides.  Those who want to interface with 
> something that expects the data in Arrow format will already have to know 
> which version of the format it was programmed against, and in the worst 
> case, if the layout does change, we can support the new layout if needed.
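> 
> A hedged sketch of that worst case (all names here are hypothetical, not a 
> committed Spark API): decoders are registered per layout version and never 
> removed, so code written against layout 1 keeps working even if a layout 2 
> ever appears.
> 
>     # Hypothetical: one decoder per layout version; new layouts are added,
>     # old ones are never removed.
>     _DECODERS = {1: lambda buf: list(buf)}  # layout 1, for illustration
> 
>     def expose_columns(buf: bytes, layout_version: int = 1):
>         if layout_version not in _DECODERS:
>             raise ValueError("unknown layout version %d" % layout_version)
>         return _DECODERS[layout_version](buf)
> 
>     assert expose_columns(b"\x01\x02") == [1, 2]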
> 
> On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cutl...@gmail.com> wrote:
> The Arrow data format is not yet stable, meaning there are no guarantees on 
> backwards/forwards compatibility. Once version 1.0 is released, it will have 
> those guarantees but it's hard to say when that will be. The remaining work 
> to get there can be seen at 
> https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone.
>  So yes, it is a risk that exposing Spark data as Arrow could cause an issue 
> if handled by a different version that is not compatible. That being said, 
> changes to the format are not taken lightly and are backwards compatible 
> when possible. I think it would be fair to mark the APIs exposing Arrow 
> data as experimental for the time being, and to clearly state in the docs 
> which version must be used for compatibility. Also, adding features like this and 
> SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0 
> release. Adding the Arrow dev list to CC.
> 
> Bryan
> 
> On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:
> Okay, that makes sense, but is the Arrow data format stable? If not, we risk 
> breakage when Arrow changes in the future and some libraries using this 
> feature begin to use the new Arrow code.
> 
> Matei
> 
> > On Apr 20, 2019, at 1:39 PM, Bobby Evans <reva...@gmail.com> wrote:
> > 
> > I want to be clear that this SPIP is not proposing exposing Arrow 
> > APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and 
> > because of the overlap between the two SPIPs I scaled this one back to 
> > concentrate just on the columnar processing aspects. Sorry for the 
> > confusion as I didn't update the JIRA description clearly enough when we 
> > adjusted it during the discussion on the JIRA.  As part of the columnar 
> > processing, we plan on providing Arrow-formatted data, but that will be 
> > exposed through a Spark owned API.
> > 
> > On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <matei.zaha...@gmail.com> 
> > wrote:
> > FYI, I’d also be concerned about exposing the Arrow API or format as a 
> > public API if it’s not yet stable. Is stabilization of the API and format 
> > coming soon on the roadmap there? Maybe someone can work with the Arrow 
> > community to make that happen.
> > 
> > We’ve been bitten lots of times by API changes forced by external libraries 
> > even when those were widely popular. For example, we used Guava’s Optional 
> > for a while, which changed at some point, and we also had issues with 
> > Protobuf and Scala itself (especially how Scala’s APIs appear in Java). API 
> > breakage might not be as serious in dynamic languages like Python, where 
> > you can often keep compatibility with old behaviors, but it really hurts in 
> > Java and Scala.
> > 
> > The problem is especially bad for us because of two aspects of how Spark is 
> > used:
> > 
> > 1) Spark is used for production data transformation jobs that people need 
> > to keep running for a long time. Nobody wants to make changes to a job 
> > that’s been working fine and computing something correctly for years just 
> > to get a bug fix from the latest Spark release or whatever. It’s much 
> > better if they can upgrade Spark without editing every job.
> > 
> > 2) Spark is often used as “glue” to combine data processing code in other 
> > libraries, and these might start to require different versions of our 
> > dependencies. For example, the Guava class exposed in Spark became a 
> > problem when third-party libraries started requiring a new version of 
> > Guava: those new libraries just couldn’t work with Spark. Protobuf was 
> > especially bad because some users wanted to read data stored as Protobufs 
> > (or in a format that uses Protobuf inside), so they needed a different 
> > version of the library in their main data processing code.
> > 
> > If there were some guarantee that this stuff would remain 
> > backward-compatible, we’d be in a much better place. It’s not that hard to 
> > keep a storage format backward-compatible: just document the format and 
> > extend it only in ways that don’t break the meaning of old data (for 
> > example, add new version numbers or field types that are read in a 
> > different way). It’s a bit harder for a Java API, but maybe Spark could 
> > just expose byte arrays directly and work on those if the API is not 
> > guaranteed to stay stable (that is, we’d still use our own classes to 
> > manipulate the data internally, and end users could use the Arrow library 
> > if they want it).
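> > 
> > To make the byte-array idea concrete, a rough sketch of the consumer side 
> > (the idea that Spark hands out raw IPC bytes is hypothetical; the pyarrow 
> > calls themselves are real IPC APIs):
> > 
> >     import pyarrow as pa
> > 
> >     # Hypothetical: suppose Spark exposed columnar data only as opaque
> >     # bytes in the Arrow IPC stream format. Users who want Arrow semantics
> >     # opt in to the Arrow library themselves:
> >     def read_batches(stream_bytes):
> >         reader = pa.ipc.open_stream(stream_bytes)
> >         for batch in reader:
> >             yield batch
> > 
> >     # Round-trip demo: produce some IPC bytes, then read them back.
> >     batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ["x"])
> >     sink = pa.BufferOutputStream()
> >     with pa.ipc.new_stream(sink, batch.schema) as writer:
> >         writer.write_batch(batch)
> >     for b in read_batches(sink.getvalue()):
> >         print(b.num_rows)  # -> 3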
> > 
> > Matei
> > 
> > > On Apr 20, 2019, at 8:38 AM, Bobby Evans <reva...@gmail.com> wrote:
> > > 
> > > I think you misunderstood the point of this SPIP. I responded to your 
> > > comments in the SPIP JIRA.
> > > 
> > > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <men...@gmail.com> wrote:
> > > I posted my comment in the JIRA. Main concerns here:
> > > 
> > > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have a 
> > > 1.0 release someday.
> > > 2. ML/DL systems that can benefit from columnar format are mostly in 
> > > Python.
> > > 3. Simple operations, though they benefit from vectorization, might not 
> > > be worth the data exchange overhead.
> > > 
> > > So would an improved Pandas UDF API be good enough? For example, 
> > > SPARK-26412 (a UDF that takes an iterator of Arrow batches).
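> > > 
> > > A rough sketch of what that could look like (the decorator and the 
> > > iterator type hints here are illustrative of the SPARK-26412 direction, 
> > > not a released API):
> > > 
> > >     import pandas as pd
> > >     from typing import Iterator
> > >     from pyspark.sql.functions import pandas_udf
> > > 
> > >     @pandas_udf("long")
> > >     def plus_one(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
> > >         # expensive one-time setup (e.g. loading a model) would go here
> > >         # and be amortized across every Arrow batch the task receives
> > >         for s in batches:
> > >             yield s + 1
> > > 
> > >     # df.select(plus_one("value")) then streams Arrow batches through it.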
> > > 
> > > Sorry I didn't join the discussion earlier! Hope it is not too late :)
> > > 
> > > On Fri, Apr 19, 2019 at 1:20 PM <tcon...@gmail.com> wrote:
> > > +1 (non-binding) for better columnar data processing support.
> > > 
> > > From: Jules Damji <dmat...@comcast.net> 
> > > Sent: Friday, April 19, 2019 12:21 PM
> > > To: Bryan Cutler <cutl...@gmail.com>
> > > Cc: Dev <dev@spark.apache.org>
> > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
> > > Processing Support
> > > 
> > > +1 (non-binding)
> > > 
> > > Sent from my iPhone
> > > 
> > > Pardon the dumb thumb typos :)
> > > 
> > > 
> > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cutl...@gmail.com> wrote:
> > > 
> > > +1 (non-binding)
> > > 
> > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org> wrote:
> > > 
> > > +1 (non-binding).  Looking forward to seeing better support for 
> > > processing columnar data.
> > > 
> > > Jason
> > > 
> > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves 
> > > <tgraves...@yahoo.com.invalid> wrote:
> > > 
> > > Hi everyone,
> > > 
> > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for 
> > > extended Columnar Processing Support.  The proposal is to extend Spark's 
> > > support to allow for more columnar processing.
> > > 
> > > You can find the full proposal in the jira at: 
> > > https://issues.apache.org/jira/browse/SPARK-27396. There was also a 
> > > DISCUSS thread in the dev mailing list.
> > > 
> > > Please vote as early as you can; I will leave the vote open until next 
> > > Monday (the 22nd) at 2pm CST to give people plenty of time.
> > > 
> > > [ ] +1: Accept the proposal as an official SPIP
> > > 
> > > [ ] +0
> > > 
> > > [ ] -1: I don't think this is a good idea because ...
> > > 
> > > Thanks!
> > > 
> > > Tom Graves
> > > 
> > 
> 
> 

