Re: [DISCUSS] Spin out MR, HDFS and YARN as their own TLPs and disband Hadoop umbrella project

Robert Evans Wed, 29 Aug 2012 08:18:06 -0700

I personally am for splitting up the projects.  I think each of the
projects has a lot of potential on its own, and I expect to see them
evolve in new and interesting ways when they are not tied directly
together.

But in order to get there we need to address the issues that made the
first split attempt fail.  First, we need to look at all API calls that
MR, YARN, or HDFS make into common that are not @Stable, and either
promote them to @Stable or remove the need for those calls.  Second,
while we are doing that we need to look at the visibility of those APIs.
How many APIs really need to be @LimitedPrivate, and how many should be
@Public?  How many of the APIs have no designation at all?  Third, we
need to get truly serious about maintaining binary compatibility on
@Stable APIs.  Fourth, we need to start splitting the projects up,
starting with common.  I think it would be cool to call it liBig, but I
digress.  Once common has been split out and is on its own for a few
releases, we start splitting out HDFS, YARN, and MapReduce.  For each of
those we need to do a similar audit between the projects and fix the
interdependencies between them.  These are mostly dependencies between
YARN and MR.
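
To make the first two steps a bit more concrete, here is a rough sketch
of what a first pass of such an audit could look like.  It assumes the
annotations appear in the Java source as @InterfaceStability.* and
@InterfaceAudience.*; the function names (classify, needs_attention) and
the regex approach are purely illustrative, and a text scan like this is
only a heuristic (it would also match annotations inside comments).

```python
import re

# Annotation spellings as they appear in Java source; the regexes and
# function names are illustrative assumptions, not an existing tool.
STABILITY = re.compile(r"@InterfaceStability\.(Stable|Evolving|Unstable)\b")
AUDIENCE = re.compile(r"@InterfaceAudience\.(Public|LimitedPrivate|Private)\b")


def classify(java_source):
    """Return (audience, stability) for the text of one Java source file.

    Only the first annotation of each kind is considered, so this
    describes the top-level class.  None means no designation at all.
    """
    audience = AUDIENCE.search(java_source)
    stability = STABILITY.search(java_source)
    return (
        audience.group(1) if audience else None,
        stability.group(1) if stability else None,
    )


def needs_attention(java_source):
    """True if the class lacks an audience or is not marked Stable."""
    audience, stability = classify(java_source)
    return audience is None or stability != "Stable"
```

Running something like this over the classes in common that MR, YARN,
and HDFS actually import would give a starting list of APIs to either
promote or stop calling.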

As part of this we also need to have a clear set of rules about what it
takes to become a committer or PMC member for the new projects when they
split off.  I am fine with all committers becoming PMC members, but if we
merge the lists now and simply say all previous committers become
committers on the new TLPs, there will be a lot of committers/PMC members
who have no real desire to be on those projects.  I would propose that we
merge the committer lists, but that all committers on the current project
receive an invitation to become a committer on the new projects.  ATM
convinced me that committers know their boundaries and will self-censor.
I believe that many committers will decline to become committers on the
new projects, either because it is outside their area of expertise or
because they are no longer involved with Hadoop and will simply ignore
the invitation.

I fear that just voting and doing an svn copy -m will result in the same
thing that happened last time.  Someone will want to make a large change.
That will require a change to something in common, and because the change
cannot easily be made in a backwards-compatible way, or because it will
take three steps to complete instead of one, we will get frustrated.  If
this happens enough we will really get frustrated and try to merge the
projects back together again.  This is because the projects are too
tightly coupled right now to really stand on their own.  Just look at all
of the security and token work that has been done recently.  It has
touched every single project and it has been a bit of a nightmare.  It
would be even worse if the projects were completely split apart.

I also want us to think about the timing of this.  Do we really want to do
this before 2.0 is GA?  Doing this properly is probably going to be a
several-month effort for one or two people, plus a concerted effort by
everyone else not to break things while they work.  If we have to
rearchitect something so that the APIs can be marked stable, it may take
a lot longer than that.  Is it worth pushing the GA of 2.0 off by an
entire quarter?  For me I would say yes, but I know others have different
opinions, and different schedules.

@Chris,

I can see your desire to do the split now, and then deal with the fallout
as we adapt to the changes.  I think that would work assuming that we are
all completely committed to making the changes necessary.  But the fact
that we are having this discussion at all seems to indicate that we are
not all completely committed to this, and I also feel that dealing with
the fallout is going to take a lot longer if we don't try to address some
of the problems up front.  Putting on my Yahoo! hat, I want to avoid as
many problems and delays as I can, because my customers want a stable
release of Hadoop with the features that are in 2.0.  The longer it is
delayed, the longer we stay on branch-0.23.  A one-quarter delay because
of this I am sure I can swing; more than that and I will start to get
more pressure to pull in new features, which will probably mean that we
then have to fork, which is something that I really do not want to do.

So I am +1 on merging the committer list, and +1 on splitting the
projects.  I would encourage us to at least do some planning and legwork
up front before splitting.  I am even +1 for setting a deadline date on
which the svn copy -m will happen whether we are ready or not.

--Bobby Evans

On 8/28/12 10:50 PM, "Alejandro Abdelnur" <t...@cloudera.com> wrote:

>Chris, thanks for initiating the discussion.
>
>IMO a pre-requisite to this is to figure out how we'll handle the
>following:
>
>* Where does common stuff live?
>* What are the public interfaces of each project (towards the other
>projects)?
>* How do we do development/releases? In tandem? Separate? How will this
>work in practice? Currently we are constantly tweaking things
>inter-projects, sometimes in the same JIRAs, sometimes in follow up
>JIRAs.
>
>Thoughts?
>
>Thxs.
>
>On Tue, Aug 28, 2012 at 7:33 PM, Mattmann, Chris A (388J)
><chris.a.mattm...@jpl.nasa.gov> wrote:
>> [decided to minimize traffic and to simply put this in one thread]
>>
>> Hi Guys,
>>
>> See the recent discussion on these threads:
>>
>> YARN as its own Hadoop "sub project": http://s.apache.org/WW1
>> Maintain a single committer list for the Hadoop project:
>>http://s.apache.org/Owx
>>
>> ...and just pay attention to the Hadoop project over the last 3-4
>>years. It's operating
>> as a single project, that's masking separate communities that
>>themselves are really
>> separate ASF projects.
>>
>> At the ASF, this has been a problem area called "umbrella" projects and
>>over the years,
>> all I've seen from them is wasted bandwidth, artificial barriers and
>>the inventions of
>> new ways to perform process mongering and to reduce the fun in
>>developing software
>> at this fantastic foundation.
>>
>> I've talked about umbrella projects enough. We've diverted conversation
>>enough.
>> Enough people have tried to act like there is some technical mumbo
>>jumbo that is
>> preventing the eventual act of higher power that I myself hope comes
>>should these
>> discussions prove unfruitful through normal means.
>>
>> *these. are. separate. projects.*
>> *there.are.not.blocker.issues.from.spinning.out.these.projects.as.their.own.communities*
>>
>> In this email: http://s.apache.org/rSm
>>
>> And in the 2 subsequent follow ons in that thread, I've outlined a
>>process that I'll copy
>> through below for splitting these projects into their own TLPs:
>>
>> -----snip
>> Process:
>>
>> 0. [DISCUSS] thread for <TLP name> in which you talk about #1 and #2
>>below, potentially draft resolution too.
>>
>> 1. Decide on an initial set of *PMC* members. I urge each new TLP to
>>adopt PMC==C. See reasons I've
>> already discussed.
>>
>> 2. Decide on a chair. Try not to VOTE for this explicitly, see if can
>>be discussed and consensus
>> can be reached (just a thought experiment). VOTE if necessary.
>>
>> 3. [VOTE] thread for <TLP name>
>>
>> 4. Create Project:
>>   a. paste resolution from #0 to board@ or;
>>   b. go to general@incubator and start new Incubator project.
>>
>> 5. infrastructure set up.
>>    MLs moving; new UNIX groups; website setup;
>>    SVN setup like this:
>>
>> svn copy -m "MR TLP." https://svn.apache.org/repos/asf/hadoop/
>>https://svn.apache.org/repos/asf/<insert cool MR name>; or
>> svn copy -m "YARN TLP." https://svn.apache.org/repos/asf/hadoop/
>>https://svn.apache.org/repos/asf/<insert cool YARN name>; or
>> svn copy -m "HDFS TLP." https://svn.apache.org/repos/asf/hadoop/
>>https://svn.apache.org/repos/asf/<insert cool HDFS name>
>>
>> After all 3 have been created run:
>>
>> svn remove -m "Remove Hadoop umbrella TLP. Split into separate
>>projects." https://svn.apache.org/repos/asf/hadoop
>>
>> 6. (TLPs if 4a; Incubator podling if 4b;) proceed, collaborate, operate
>>as distinct communities, and try to solve the code duplication/dependency
>> issues from there.
>>
>> 7. If 4b; then graduate as TLP from Incubator.
>>
>> -----snip
>>
>> So that's my proposal.
>>
>> Thanks guys.
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: chris.a.mattm...@nasa.gov
>> WWW:   http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>
>
>
>-- 
>Alejandro
