Thanks for initiating this discussion. I merged the pull request because it unblocked another major piece of work for Spark 2.0 -- no longer requiring assembly jars -- which is arguably a lot more important than these less frequently used sources. I take full responsibility for that.
I think it's inaccurate to call them "backends", because that makes them sound a lot more serious than they are; in reality they are a bunch of connectors to less frequently used streaming data sources (e.g. MQTT, Flume). But that's not that important here.

Another important factor is that over time, with the development of Structured Streaming, we'd provide a new API for streaming sources that unifies the way to connect to arbitrary sources, and as a result all of these sources will need to be rewritten anyway. This is similar to the RDD -> DataFrame transition for data sources: it was initially painful, but in the long run it provides a much better experience for end users, because they only need to learn a single API for all sources, and it becomes trivial to switch from one source to another without touching business logic. So the truth is that in the long run the existing connectors will be replaced by new ones, and meanwhile they have been causing minor issues here and there in the code base.

Now, issues like these are never black and white. By moving them out, we'd require users to at least change the Maven coordinate in their build file (although things can still be made binary and source compatible). So I made the call and asked the contributor to keep Kafka and Kinesis in, because those are the most widely used (and would be the most contentious), and to move everything else out.

I have personally done enough data sources and third-party packages for Spark on GitHub that I can set up a GitHub repo with CI and Maven publishing in under an hour. I do not expect a lot of changes to these packages, because the APIs have been fairly stable. So what I was optimizing for was to minimize the time we need to spend on these packages, given the (expected) low activity and the shift in focus to Structured Streaming, and also to minimize the chance of breaking user apps, to provide the best user experience. A GitHub repo seems the simplest choice to me.
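To make the migration cost concrete: for a Scala user, moving a connector out of the main repo should amount to a one-line coordinate swap in the build file, roughly along these lines (the "after" group ID and version below are hypothetical placeholders, since the new publishing coordinates haven't been finalized; only the 1.6-era `org.apache.spark` coordinate is the real one):

```scala
// build.sbt -- before: the connector ships as part of Apache Spark itself
libraryDependencies += "org.apache.spark" %% "spark-streaming-mqtt" % "1.6.1"

// after: the same artifact published from its new external home
// (group ID and version here are hypothetical placeholders)
libraryDependencies += "org.spark-project" %% "spark-streaming-mqtt" % "2.0.0"
```

If the artifact name and the package namespace inside the jar stay the same, user code wouldn't need to change at all -- only the coordinate.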
I also made another decision: to provide separate repos (and thus separate issue trackers) on GitHub for these packages. The reason is that these connectors have very disjoint communities; for example, the community that cares about MQTT is likely very different from the community that cares about Akka. Separate repos make each of them much easier to track.

Logistics-wise, things are still in flux. I think it'd make a lot of sense to give existing Spark committers (or at least the ones who have contributed to streaming) write access to the GitHub repos. IMHO, it is not in any of the major Spark-contributing organizations' strategic interest to "own" these projects, especially considering that most of the activity will shift to Structured Streaming.

If someone really feels strongly that we should go through all the overhead of setting up an ASF subproject for these modules that won't work with the new Structured Streaming, and wants to spearhead setting up separate repos (preferably one subproject per connector), CI, a separate JIRA, governance, READMEs, and voting, we can discuss that. Until then, I'd keep the GitHub option open, because IMHO it is what works best for end users (including discoverability, issue tracking, release publishing, ...).

On Thu, Mar 17, 2016 at 1:50 PM, Cody Koeninger <c...@koeninger.org> wrote:
> Anyone can fork apache licensed code. Committers can approve pull
> requests that delete code from asf repos. Because those two things
> happen near each other in time, it's somehow a process violation?
>
> I think the discussion would be better served by concentrating on how
> we're going to solve the problem and move forward.
>
> On Thu, Mar 17, 2016 at 3:13 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
> > I am not referring to code edits - but to migrating submodules and
> > code currently in Apache Spark to 'outside' of it.
> > If I understand correctly, assets from Apache Spark are being moved
> > out of it into thirdparty external repositories - not owned by Apache.
> >
> > At a minimum, dev@ discussion (like this one) should be initiated.
> > As PMC is responsible for the project assets (including code), signoff
> > is required for it IMO.
> >
> > More experienced Apache members might be opine better in case I got it wrong !
> >
> > Regards,
> > Mridul
> >
> > On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger <c...@koeninger.org> wrote:
> >> Why would a PMC vote be necessary on every code deletion?
> >>
> >> There was a Jira and pull request discussion about the submodules that
> >> have been removed so far.
> >>
> >> https://issues.apache.org/jira/browse/SPARK-13843
> >>
> >> There's another ongoing one about Kafka specifically
> >>
> >> https://issues.apache.org/jira/browse/SPARK-13877
> >>
> >> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
> >>>
> >>> I was not aware of a discussion in Dev list about this - agree with most of
> >>> the observations.
> >>> In addition, I did not see PMC signoff on moving (sub-)modules out.
> >>>
> >>> Regards
> >>> Mridul
> >>>
> >>> On Thursday, March 17, 2016, Marcelo Vanzin <van...@cloudera.com> wrote:
> >>>>
> >>>> Hello all,
> >>>>
> >>>> Recently a lot of the streaming backends were moved to a separate
> >>>> project on github and removed from the main Spark repo.
> >>>>
> >>>> While I think the idea is great, I'm a little worried about the
> >>>> execution. Some concerns were already raised on the bug mentioned
> >>>> above, but I'd like to have a more explicit discussion about this so
> >>>> things don't fall through the cracks.
> >>>>
> >>>> Mainly I have three concerns.
> >>>>
> >>>> i. Ownership
> >>>>
> >>>> That code used to be run by the ASF, but now it's hosted in a github
> >>>> repo owned not by the ASF. That sounds a little sub-optimal, if not
> >>>> problematic.
> >>>>
> >>>> ii. Governance
> >>>>
> >>>> Similar to the above; who has commit access to the above repos? Will
> >>>> all the Spark committers, present and future, have commit access to
> >>>> all of those repos? Are they still going to be considered part of
> >>>> Spark and have release management done through the Spark community?
> >>>>
> >>>>
> >>>> For both of the questions above, why are they not turned into
> >>>> sub-projects of Spark and hosted on the ASF repos? I believe there is
> >>>> a mechanism to do that, without the need to keep the code in the main
> >>>> Spark repo, right?
> >>>>
> >>>> iii. Usability
> >>>>
> >>>> This is another thing I don't see discussed. For Scala-based code
> >>>> things don't change much, I guess, if the artifact names don't change
> >>>> (another reason to keep things in the ASF?), but what about python?
> >>>> How are pyspark users expected to get that code going forward, since
> >>>> it's not in Spark's pyspark.zip anymore?
> >>>>
> >>>>
> >>>> Is there an easy way of keeping these things within the ASF Spark
> >>>> project? I think that would be better for everybody.
> >>>>
> >>>> --
> >>>> Marcelo
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>>> For additional commands, e-mail: dev-h...@spark.apache.org
> >>>>