Re: SPARK-13843 and future of streaming backends

Steve Loughran Sat, 19 Mar 2016 19:41:04 -0700

Spark has hit one of the eternal problems of OSS projects, one hit by: ant, 
maven, hadoop, ... anything with a plugin model.

Take in the plugin: you're in control, but also down for maintenance

Leave out the plugin: other people can maintain it, be more agile, etc.

But you've lost control, and you can't even manage the links. Here I think 
maven suffered the most by keeping stuff in codehaus; migrating off there is 
still hard: not only did they lose the links, they lost the JIRA.

Maven's relationship with codehaus was very tightly coupled, lots of committers 
on both; I don't know how that relationship was handled at a higher level.

On 17 Mar 2016, at 20:51, Hari Shreedharan 
<hshreedha...@cloudera.com> wrote:

I have worked with various ASF projects for 4+ years now. Sure, ASF projects 
can delete code as they feel fit. But this is the first time I have really seen 
code being "moved out" of a project without discussion. I am sure you can do 
this without violating ASF policy, but the explanation for that would be 
convoluted (someone decided to make a copy and then the ASF project deleted 
it?).

+1 for discussion. Dev changes should go to the dev list; PMC for process in general. 
Don't think the ASF will overlook stuff like that.

Might want to raise this issue in the next board report.

FWIW, it may be better to just see if you can get committers to work on these 
projects: recruit the people and say 'please, only work in this area, for now'. 
That gets developers on your team, which is generally considered a metric of 
health in a project.

Or, as Cody Koeninger suggests, having a spark-extras project in the ASF with a 
focus on extras with their own support channel.

Also, moving the code out would break compatibility. AFAIK, there is no way to 
push org.apache.* artifacts directly to maven central. That happens via 
mirroring from the ASF maven repos. Even if you could somehow directly push 
the artifacts to mvn, you really can push to org.apache.* groups only if you 
are part of the repo and acting as an agent of that project (which in this case 
would be Apache Spark). Once you move the code out, even a committer/PMC member 
would not be representing the ASF when pushing the code. I am not sure if there 
is a way to fix this issue.

This topic has cropped up in the general context of third party repos 
publishing artifacts with org.apache names but vendor-specific suffixes (e.g. 
org.apache.hadoop/hadoop-common.5.3-cdh.jar).

Some people were pretty unhappy about this, but the conclusion reached was 
"maven doesn't let you do anything else and still let downstream people use 
it". Furthermore, as all ASF releases are nominally the source releases *not the 
binaries*, you can look at the POMs and say "we've released source code 
designed to publish artifacts to repos; this is 'use as intended'".

People are also free to cut their own full project distributions, etc, etc. For 
example, I stick up the binaries of Windows builds independent of the ASF 
releases; these were originally just those from HDP on Windows installs. Now I 
check out the commit of the specific ASF release on a Windows 2012 VM, do the 
build, and copy the binaries. Free for all to use. But I do suspect that the ASF 
legal protections get a bit blurred here. These aren't ASF binaries, but 
binaries built directly from unmodified ASF releases.

In contrast to sticking stuff into a github repo, the moved artifacts cannot be 
published as org.apache artifacts on maven central. That's non-negotiable as far 
as the ASF are concerned. The process for releasing ASF artifacts there goes 
downstream of the ASF public release process: you stage the artifacts, they are 
part of the vote process, everything with org.apache goes through it.

That said: there is nothing to stop a set of shell org.apache artifacts being 
written which do nothing but contain transitive dependencies on artifacts in 
different groups, such as org.spark-project. The shells would be released by 
the ASF; they pull in the new stuff. And, therefore, it'd be possible to build 
a spark-assembly with the files. (I'm ignoring a loop in the build DAG here; 
playing with git submodules would let someone eliminate this by adding the 
removed libraries under a modified project.)

I think there might be some issues related to package names; you could make a case 
for having public APIs with the original names: they're the API, after all, and 
that's exactly what Apache Harmony did with the java.* packages.

Thanks,
Hari

On Thu, Mar 17, 2016 at 1:13 PM, Mridul Muralidharan 
<mri...@gmail.com> wrote:
I am not referring to code edits - but to migrating submodules and
code currently in Apache Spark to 'outside' of it.
If I understand correctly, assets from Apache Spark are being moved
out of it into thirdparty external repositories - not owned by Apache.

At a minimum, dev@ discussion (like this one) should be initiated.
As PMC is responsible for the project assets (including code), signoff
is required for it IMO.

More experienced Apache members might be able to opine better in case I got it wrong!

Regards,
Mridul

On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger 
<c...@koeninger.org> wrote:
> Why would a PMC vote be necessary on every code deletion?
>
> There was a Jira and pull request discussion about the submodules that
> have been removed so far.
>
> https://issues.apache.org/jira/browse/SPARK-13843
>
> There's another ongoing one about Kafka specifically
>
> https://issues.apache.org/jira/browse/SPARK-13877
>
>
> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan 
> <mri...@gmail.com> wrote:
>>
>> I was not aware of a discussion in Dev list about this - agree with most of
>> the observations.
>> In addition, I did not see PMC signoff on moving (sub-)modules out.
>>
>> Regards
>> Mridul
>>
>>
>>
>> On Thursday, March 17, 2016, Marcelo Vanzin 
>> <van...@cloudera.com> wrote:
>>>
>>> Hello all,
>>>
>>> Recently a lot of the streaming backends were moved to a separate
>>> project on github and removed from the main Spark repo.
>>>
>>> While I think the idea is great, I'm a little worried about the
>>> execution. Some concerns were already raised on the bug mentioned
>>> above, but I'd like to have a more explicit discussion about this so
>>> things don't fall through the cracks.
>>>
>>> Mainly I have three concerns.
>>>
>>> i. Ownership
>>>
>>> That code used to be run by the ASF, but now it's hosted in a github
>>> repo owned not by the ASF. That sounds a little sub-optimal, if not
>>> problematic.
>>>
>>> ii. Governance
>>>
>>> Similar to the above; who has commit access to the above repos? Will
>>> all the Spark committers, present and future, have commit access to
>>> all of those repos? Are they still going to be considered part of
>>> Spark and have release management done through the Spark community?
>>>
>>>
>>> For both of the questions above, why are they not turned into
>>> sub-projects of Spark and hosted on the ASF repos? I believe there is
>>> a mechanism to do that, without the need to keep the code in the main
>>> Spark repo, right?
>>>
>>> iii. Usability
>>>
>>> This is another thing I don't see discussed. For Scala-based code
>>> things don't change much, I guess, if the artifact names don't change
>>> (another reason to keep things in the ASF?), but what about python?
>>> How are pyspark users expected to get that code going forward, since
>>> it's not in Spark's pyspark.zip anymore?
>>>
>>>
>>> Is there an easy way of keeping these things within the ASF Spark
>>> project? I think that would be better for everybody.
>>>
>>> --
>>> Marcelo
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: 
>>> dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: 
>>> dev-h...@spark.apache.org
>>>
>>
