Thanks for initiating this discussion. I merged the pull request because it unblocked another major piece of work for Spark 2.0 -- no longer requiring assembly jars -- which is arguably a lot more important than these less frequently used sources. I take full responsibility for that.
I think it's inaccurate to call them "backends", because that makes them sound a lot more serious than they are; in reality they are a bunch of connectors to less frequently used streaming data sources (e.g. MQTT, Flume). But that's not that important here.

Another important factor is that over time, with the development of Structured Streaming, we'd provide a new API for streaming sources that unifies the way to connect to arbitrary sources, and as a result all of these sources will need to be rewritten anyway. This is similar to the RDD -> DataFrame transition for data sources: it was initially painful, but in the long run it provides a much better experience for end users, because they only need to learn a single API for all sources, and it becomes trivial to switch from one source to another without touching business logic. So the truth is that in the long run the existing connectors will be replaced by new ones, and meanwhile they have been causing minor issues here and there in the code base.

Now, issues like these are never black and white. By moving them out, we'd require users to at least change the Maven coordinate in their build file (although things can still be made binary and source compatible). So I made the call and asked the contributor to keep Kafka and Kinesis in, because those are the most widely used (and would be the most contentious), and to move everything else out.

I have personally done enough data sources and third-party packages for Spark on GitHub that I can set up a GitHub repo with CI and Maven publishing in under an hour. I do not expect a lot of changes to these packages, because the APIs have been fairly stable. So what I was optimizing for was to minimize the time we need to spend on these packages, given the (expected) low activity and the shift in focus to Structured Streaming, and also to minimize the chance of breaking user apps, to provide the best user experience. A GitHub repo seems the simplest choice to me.
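To make the migration cost concrete: for a Scala user, moving a connector out of the main repo should amount to a one-line coordinate swap in the build file, roughly along these lines (the "after" group ID and version below are hypothetical placeholders, since the new publishing coordinates haven't been finalized; only the 1.6-era `org.apache.spark` coordinate is the real one):

```scala
// build.sbt -- before: the connector ships as part of Apache Spark itself
libraryDependencies += "org.apache.spark" %% "spark-streaming-mqtt" % "1.6.1"

// after: the same artifact published from its new external home
// (group ID and version here are hypothetical placeholders)
libraryDependencies += "org.spark-project" %% "spark-streaming-mqtt" % "2.0.0"
```

If the artifact name and the package namespace inside the jar stay the same, user code wouldn't need to change at all -- only the coordinate.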
I also made another decision: to provide separate repos (and thus separate issue trackers) on GitHub for these packages. The reason is that these connectors have very disjoint communities; for example, the community that cares about MQTT is likely very different from the community that cares about Akka. Separate repos make each of them much easier to track.

Logistics-wise, things are still in flux. I think it'd make a lot of sense to give existing Spark committers (or at least the ones who have contributed to streaming) write access to the GitHub repos. IMHO, it is not in any of the major Spark-contributing organizations' strategic interest to "own" these projects, especially considering that most of the activity will shift to Structured Streaming.

If someone really feels strongly that we should go through all the overhead of setting up an ASF subproject for these modules that won't work with the new Structured Streaming, and wants to spearhead setting up separate repos (preferably one subproject per connector), CI, a separate JIRA, governance, READMEs, and voting, we can discuss that. Until then, I'd keep the GitHub option open, because IMHO it is what works best for end users (including discoverability, issue tracking, release publishing, ...).

On Thu, Mar 17, 2016 at 1:50 PM, Cody Koeninger <c...@koeninger.org> wrote:
> Anyone can fork apache licensed code. Committers can approve pull
> requests that delete code from asf repos. Because those two things
> happen near each other in time, it's somehow a process violation?
>
> I think the discussion would be better served by concentrating on how
> we're going to solve the problem and move forward.
>
> On Thu, Mar 17, 2016 at 3:13 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
> > I am not referring to code edits - but to migrating submodules and
> > code currently in Apache Spark to 'outside' of it.
> > If I understand correctly, assets from Apache Spark are being moved
> > out of it into thirdparty external repositories - not owned by Apache.
> >
> > At a minimum, dev@ discussion (like this one) should be initiated.
> > As PMC is responsible for the project assets (including code), signoff
> > is required for it IMO.
> >
> > More experienced Apache members might be opine better in case I got it wrong !
> >
> > Regards,
> > Mridul
> >
> > On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger <c...@koeninger.org> wrote:
> >> Why would a PMC vote be necessary on every code deletion?
> >>
> >> There was a Jira and pull request discussion about the submodules that
> >> have been removed so far.
> >>
> >> https://issues.apache.org/jira/browse/SPARK-13843
> >>
> >> There's another ongoing one about Kafka specifically
> >>
> >> https://issues.apache.org/jira/browse/SPARK-13877
> >>
> >> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
> >>>
> >>> I was not aware of a discussion in Dev list about this - agree with most of
> >>> the observations.
> >>> In addition, I did not see PMC signoff on moving (sub-)modules out.
> >>>
> >>> Regards
> >>> Mridul
> >>>
> >>> On Thursday, March 17, 2016, Marcelo Vanzin <van...@cloudera.com> wrote:
> >>>>
> >>>> Hello all,
> >>>>
> >>>> Recently a lot of the streaming backends were moved to a separate
> >>>> project on github and removed from the main Spark repo.
> >>>>
> >>>> While I think the idea is great, I'm a little worried about the
> >>>> execution. Some concerns were already raised on the bug mentioned
> >>>> above, but I'd like to have a more explicit discussion about this so
> >>>> things don't fall through the cracks.
> >>>>
> >>>> Mainly I have three concerns.
> >>>>
> >>>> i. Ownership
> >>>>
> >>>> That code used to be run by the ASF, but now it's hosted in a github
> >>>> repo owned not by the ASF. That sounds a little sub-optimal, if not
> >>>> problematic.
> >>>>
> >>>> ii. Governance
> >>>>
> >>>> Similar to the above; who has commit access to the above repos? Will
> >>>> all the Spark committers, present and future, have commit access to
> >>>> all of those repos? Are they still going to be considered part of
> >>>> Spark and have release management done through the Spark community?
> >>>>
> >>>>
> >>>> For both of the questions above, why are they not turned into
> >>>> sub-projects of Spark and hosted on the ASF repos? I believe there is
> >>>> a mechanism to do that, without the need to keep the code in the main
> >>>> Spark repo, right?
> >>>>
> >>>> iii. Usability
> >>>>
> >>>> This is another thing I don't see discussed. For Scala-based code
> >>>> things don't change much, I guess, if the artifact names don't change
> >>>> (another reason to keep things in the ASF?), but what about python?
> >>>> How are pyspark users expected to get that code going forward, since
> >>>> it's not in Spark's pyspark.zip anymore?
> >>>>
> >>>>
> >>>> Is there an easy way of keeping these things within the ASF Spark
> >>>> project? I think that would be better for everybody.
> >>>>
> >>>> --
> >>>> Marcelo
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>>> For additional commands, e-mail: dev-h...@spark.apache.org
> >>>>