Since it seems we already have plenty to say about Spark 2.0, the answer to
the question "should we start talking about Spark 2.0?" is clearly yes.
But that doesn't mean development of the 1.x branch should stop, or that
there shouldn't be a 1.7 release.

Regarding what should go into the next major version: on the technical level
it's obviously breaking API changes and perhaps some long-awaited
architectural refactoring.

But what I think should be the major change is on the conceptual side: the
realization that interactive, batch and streaming data flows work in
fundamentally different ways, and that building the framework around that
difference will benefit each of them (for example, events instead of
micro-batches in streaming, worker-side intermediate processing in batch, etc.).
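
To make that concrete, here is a minimal Spark 1.x Streaming sketch (assuming
a local socket source on port 9999); everything is forced through the
micro-batch model, and the event-level alternative in the trailing comment is
purely hypothetical, not an existing API:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("microbatch-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))  // latency is bounded below by the 5s batch interval

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD { rdd =>
      // each RDD is one micro-batch, processed as a small batch job
      rdd.filter(_.contains("ERROR")).foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()

    // Hypothetical event-level API, for comparison only:
    // lines.onEvent(line => if (line.contains("ERROR")) alert(line))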

So what is the best place to have a full Spark 2.0 discussion?

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Sun, Nov 8, 2015 at 10:10 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> Yes, that's clearer -- at least to me.
>
> But before going any further, let me note that we are already sliding past
> Sean's opening question of "Should we start talking about Spark 2.0?" to
> actually start talking about Spark 2.0.  I'll try to keep the rest of this
> post at a higher- or meta-level in order to attempt to avoid a somewhat
> premature discussion of detailed 2.0 proposals, since I think that we do
> still need to answer Sean's question and a couple of related questions
> before really diving into the details of 2.0 planning.  The related
> questions that I am talking about are: Is Spark 1.x done except for
> bug-fixing? and What would definitely make us say that we must begin
> working on Spark 2.0?
>
> I'm not going to try to answer my own two questions even though I'm really
> interested in how others will answer them, but I will answer Sean's by
> saying that it is a good time to start talking about Spark 2.0 -- which is
> quite different from saying that we are close to an understanding of what
> will differentiate Spark 2.0 or when we want to deliver it.
>
> On the meta-2.0 discussion, I think that it is useful to break "Things
> that will be different in 2.0" into some distinct categories.  I see at
> least three such categories for openers, although the third will probably
> need to be broken down further.
>
> The first is the simplest, would take almost no time to complete, and
> would have minimal impact on current Spark users.  This is simply getting
> rid of everything that is already marked deprecated in Spark 1.x but that
> we haven't already gotten rid of because of our commitment to maintaining
> API stability within major versions.  There should be no need for
> discussion or apology before getting rid of what is already deprecated --
> it's just gone and it's time to move on.  Kind of a category-1.1 are the
> parts of the current public API that are now marked as Experimental or
> Developer but should become part of the fully-supported public API in 2.0
> -- and there is room for debate here.
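>
> To make the two buckets concrete, here's a hedged sketch (the class and
> method names are placeholders of my own; only the annotations are real):
>
>     import org.apache.spark.annotation.{DeveloperApi, Experimental}
>
>     class ExampleApi {
>       // Category 1: already deprecated in 1.x, so it simply disappears in 2.0.
>       @deprecated("use newCount() instead", "1.4.0")
>       def oldCount(): Long = newCount()
>
>       def newCount(): Long = 0L
>
>       // Category 1.1: currently marked Experimental or DeveloperApi; candidates
>       // for promotion to the fully-supported public API in 2.0 (drop the annotation).
>       @Experimental
>       def experimentalFeature(): Unit = {}
>
>       @DeveloperApi
>       def internalHook(): Unit = {}
>     }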
>
> The next category of things that will be different in 2.0 isn't a lot
> harder to implement, shouldn't take a lot of time to complete, but will
> have some impact on current Spark users.  I'm talking about areas in the
> current code that we know don't work the way we want them to and don't have
> the public API that we would like, but for which there aren't or can't be
> recommended alternatives yet, so the code isn't formally marked as
> deprecated.  Again, these are things that we haven't already changed mostly
> because of the need to maintain API stability in 1.x.  But because these
> haven't already been marked as deprecated, there is potential to catch
> existing Spark users by surprise when the API changes.  We don't guarantee
> API stability across major version number changes, so there isn't any
> reason why we can't make the changes we want, but we should start building
> up a comprehensive list of API changes that will occur in Spark 2.0 to at
> least minimize the amount of surprise for current Spark users.
>
> I don't already have anything like such a comprehensive list, but one
> example of the kind of thing that I am talking about is something that I've
> personally been looking at and regretting of late, and that's the
> complicated relationships among SparkListener, SQLListener, onJobEnd and
> onExecutionEnd.  A lot of this complication is because of the need to
> maintain the public API, so we end up with comments like this (
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala#L58):
> "Ideally, we need to make sure onExecutionEnd happens after onJobStart and
> onJobEnd.  However, onJobStart and onJobEnd run in the listener thread.
> Because we cannot add new SQL event types to SparkListener since it's a
> public API, we cannot guarantee that."  I think it should be pretty obvious
> that we should be making these kinds of architectural and API changes in
> 2.0 -- they are currently causing Spark developers and often Spark users to
> deal with complications that are needed mostly or entirely just to maintain
> the public API.  I know that there are other (and larger) examples of this
> kind of refactoring that others are itching to start doing, but I'll let
> them speak to those specifics as we build up the list of API changes.
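>
> To illustrate what is and isn't visible today, here is a minimal sketch of a
> custom listener against the public 1.x API (the class name is mine, for
> illustration only):
>
>     import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
>
>     // A listener can only observe the event types that SparkListener exposes.
>     class JobBoundaryLogger extends SparkListener {
>       override def onJobStart(jobStart: SparkListenerJobStart): Unit =
>         println(s"job ${jobStart.jobId} started")
>
>       override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
>         println(s"job ${jobEnd.jobId} ended")
>
>       // There is no onExecutionStart/onExecutionEnd here: SQL execution
>       // boundaries are not part of the public SparkListener API in 1.x,
>       // which is why the ordering relative to onJobStart/onJobEnd cannot
>       // be guaranteed.
>     }
>
>     // sc.addSparkListener(new JobBoundaryLogger())  // register on an existing SparkContext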
>
> The third category of things that will be different in 2.0 is the category
> that could get us seriously off track or badly behind schedule.  This is
> the category that includes fundamental architectural changes or significant
> new features and functionality.  Before we go too far afield in exploring
> our wish lists for Spark 2.0, I think we need to try very hard to identify
> which architectural changes are really needed to achieve a minimum viable
> platform that can meet our goals for the complete Spark 2.0 cycle.  This
> starts to get back to my questions of whether Spark 1.x is done and whether
> we really need to start working on Spark 2.0.  If we look back at the total
> Spark ecosystem in the 0.9 timeframe vs. where it is now on the verge of
> 1.6.0, it should be clear that an amazing number of additions and
> refinements have been made to Spark itself, Spark packages, third-party
> tools and applications, etc. -- and all of that was done without requiring
> fundamental changes to Spark's architecture.  What I think that implies is
> that as items are added to our collective wish list for Spark 2.0, we need
> to be asking of each one at least two things: 1) Whether it really requires
> a fundamental change in Spark before this new feature or functionality can
> be implemented; and 2) If it does require a fundamental change, is that
> change (but not necessarily all the new features that need that change)
> something that we are willing to commit to completing before Spark 2.0.0
> can be released?  Or alternatively, is that a fundamental change that we
> can and should put off making for potentially years while the Spark 2.x
> development cycle runs its course?  If wish list items don't require
> fundamental changes, then we shouldn't feel bad about needing to say for
> many of them that they look like good and/or interesting ideas, and things
> that we may very well want to include in Spark, but that they may end up in
> Spark 2.x instead of 2.0.
>
> To finally get back to your posts, Romi, what I think you are talking
> about is the ability to compose things like Spark Jobs, RDD actions and SQL
> Executions without needing to explicitly coordinate the collection of
> intermediate results to the Driver and the redistribution of data to the
> Executors.  This is the kind of thing that is already done in some respects
> in transformations like RDD#sortByKey, but that actually breaks Spark's
> claim that transformations are lazy.  Wanting to be able to compose things
> in Spark in a manner more in line with what functional programmers expect
> and doing so without breaking other expectations of Spark users is
> something that has been on several others' wish lists for awhile now.  A
> few attempts have been made to address the issue within the Spark 1.x
> architecture, and some of the recent additions that Matei has made in
> regard to realizing adaptive DAG scheduling may allow us to push things
> further within Spark 1.x, but this may also be the kind of thing that will
> prompt us to make deeper changes in Spark 2.0.
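>
> For reference, the sortByKey case in a minimal sketch (assuming an existing
> SparkContext named sc):
>
>     val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
>
>     // Nominally a transformation, but building the RangePartitioner samples
>     // the input to compute range boundaries, which launches a job before any
>     // action has been called -- the break in laziness mentioned above.
>     val sorted = pairs.sortByKey()
>
>     // The sort itself still runs when an action is invoked:
>     sorted.collect()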
>
> Where I thought you were going at first is another category three item: Whether
> Spark should be fundamentally changed to allow streams to be handled at the
> event level instead of (or in addition to) micro-batches.
>
> So, from my perspective that is a meta-framework that I think is useful to
> shape the Spark 2.0 discussion, a couple of category three wish list items,
> and a bunch of questions that I'm not even going to try to answer on my own
> -- but I'm looking forward to the Spark 2.0 discussion.
>
> On Sun, Nov 8, 2015 at 8:14 AM, Romi Kuntsman <r...@totango.com> wrote:
>
>> Hi, thanks for the feedback
>> I'll try to explain better what I meant.
>>
>> First we had RDDs, then we had DataFrames, so could the next step be
>> something like stored procedures over DataFrames?
>> So I define the whole calculation flow, even if it includes any "actions"
>> in between, and the whole thing is planned and executed in a super
>> optimized way once I tell it "go!"
>>
>> What I mean by "feels like scripted" is that the results of actions come
>> back to the driver, like they would if you were in front of a command prompt.
>> But often the flow contains many steps with actions in between - multiple
>> levels of aggregation, iterative machine learning algorithms, etc.
>> Sending the whole "workplan" to the Spark framework would be, as I see it,
>> the next step in its evolution, just as stored procedures send logic with
>> many SQL queries to the database.
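>>
>> A rough sketch of the difference, assuming an existing DataFrame df (the
>> Workplan API below is hypothetical, not something that exists in Spark):
>>
>>     // Today: each action brings a result back to the driver before the
>>     // driver decides what to do next.
>>     val counts = df.groupBy("country").count()  // transformation (lazy)
>>     val total  = counts.count()                 // action: result returns to the driver
>>
>>     // Hypothetical "workplan" API: submit the whole multi-action flow once
>>     // and let the framework plan and execute it as a unit.
>>     // val plan = Workplan(df)
>>     //   .step(_.groupBy("country").count())
>>     //   .stepWithAction(_.count())
>>     // plan.run()  // "go!"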
>>
>> Was it more clear this time? :)
>>
>>
>> *Romi Kuntsman*, *Big Data Engineer*
>> http://www.totango.com
>>
>> On Sun, Nov 8, 2015 at 5:59 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> Romi,
>>> unless I am misunderstanding your suggestion, you might be interested in
>>> projects like the new Mahout, where they try to abstract out the engine
>>> with bindings so that they can support multiple engines within a single
>>> platform. I guess Cascading is heading in a similar direction (although no
>>> Spark or Flink there yet, just MR1 and Tez).
>>>
>>> On Sun, Nov 8, 2015 at 6:33 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> Major releases can change APIs, yes. Although Flink is pretty similar
>>>> in broad design and goals, the APIs are quite different in
>>>> particulars. Speaking for myself, I can't imagine merging them, as it
>>>> would either mean significantly changing Spark APIs, or making Flink
>>>> use Spark APIs. It would mean effectively removing one project, which
>>>> seems infeasible.
>>>>
>>>> I am not sure of what you're saying the difference is, but I would not
>>>> describe Spark as primarily for interactive use.
>>>>
>>>> Philosophically, I don't think One Big System to Rule Them All is a
>>>> good goal. One project will never get it all right even within one
>>>> niche. It's actually valuable to have many takes on important
>>>> problems. Hence any problem worth solving gets solved 10 times. Just
>>>> look at all those SQL engines and logging frameworks...
>>>>
>>>> On Sun, Nov 8, 2015 at 10:53 AM, Romi Kuntsman <r...@totango.com>
>>>> wrote:
>>>> > A major release usually means giving up on some API backward
>>>> > compatibility?
>>>> > Can this be used as a chance to merge efforts with Apache Flink
>>>> > (https://flink.apache.org/) and create the one ultimate open source
>>>> > big data processing system?
>>>> > Spark currently feels like it was made for interactive use (like
>>>> > Python and R), and when used for other things (batch/streaming), it
>>>> > feels like scripted interactive use rather than a real standalone
>>>> > application. Maybe some base concepts could be adapted?
>>>> >
>>>> > (I'm not currently a committer, but as a heavy Spark user I'd love to
>>>> > participate in the discussion of what can/should be in Spark 2.0)
>>>> >
>>>> > Romi Kuntsman, Big Data Engineer
>>>> > http://www.totango.com
>>>> >
>>>> > On Fri, Nov 6, 2015 at 2:53 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>> > wrote:
>>>> >>
>>>> >> Hi Sean,
>>>> >>
>>>> >> Happy to see this discussion.
>>>> >>
>>>> >> I'm working on a PoC to run Camel on Spark Streaming. The purpose is
>>>> >> to have an ingestion and integration platform running directly on
>>>> >> Spark Streaming.
>>>> >>
>>>> >> Basically, we would be able to use a Camel Spark DSL like:
>>>> >>
>>>> >> from("jms:queue:foo").choice().when(predicate).to("job:bar")
>>>> >>   .when(predicate).to("hdfs:path").otherwise("file:path")....
>>>> >>
>>>> >> Before a formal proposal (I have to do more work there), I'm just
>>>> >> wondering if such a framework could be a new Spark module (Spark
>>>> >> Integration, for instance, like Spark ML, Spark Streaming, etc.).
>>>> >>
>>>> >> Maybe it could be a good candidate for an addition in a "major"
>>>> >> release like Spark 2.0.
>>>> >>
>>>> >> Just my $0.01 ;)
>>>> >>
>>>> >> Regards
>>>> >> JB
>>>> >>
>>>> >>
>>>> >> On 11/06/2015 01:44 PM, Sean Owen wrote:
>>>> >>>
>>>> >>> Since branch-1.6 is cut, I was going to make version 1.7.0 in JIRA.
>>>> >>> However I've had a few side conversations recently about Spark 2.0,
>>>> >>> and I know I and others have a number of ideas about it already.
>>>> >>>
>>>> >>> I'll go ahead and make 1.7.0, but thought I'd ask, how much other
>>>> >>> interest is there in starting to plan Spark 2.0? is that even on the
>>>> >>> table as the next release after 1.6?
>>>> >>>
>>>> >>> Sean
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Jean-Baptiste Onofré
>>>> >> jbono...@apache.org
>>>> >> http://blog.nanthrax.net
>>>> >> Talend - http://www.talend.com
>>>> >>
>>>> >> ---------------------------------------------------------------------
>>>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> >> For additional commands, e-mail: dev-h...@spark.apache.org
>>>> >>
>>>> >
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>>
>>>>
>>>
>>
>
