Hey Todd,

On Aug 29, 2012, at 5:16 PM, Todd Lipcon wrote:
> On Wed, Aug 29, 2012 at 4:54 PM, Mattmann, Chris A (388J)
> <chris.a.mattm...@jpl.nasa.gov> wrote:
>>
>> Please provide examples that show umbrella projects work.
>
> Hadoop, in its current form?

I don't agree that it's working. That's where you and I differ. And not just you and I -- you and the others who have agreed with me else-thread. Technically, the project is working, for sure. Community-wise, no. I guess we can agree to disagree.

> If we copy-paste forked Common, we'd be doubling our maintenance work
> on this shared code.

Who's "we"? You? Would you expect to be a PMC member/committer on all of the split projects? Also, are you the only person working on the project? And the "we" would include others, right? Who may or may not be committers on the other projects? I'm not proposing an SVN copy and then all PMC members x N projects. Figure out who belongs on the PMCs for the distinct communities that are operating on this hydra.

>> I don't know what else to tell you. I'm not going to go look up all
>> the threads. I'm not Google, nor do I care to. All I can say is that
>> I've seen it before, and so have others. In your own project.
>
> What's one concrete example of where it would be better if we split?

Weaning a project off bad community practices is difficult; I'll agree with you on that. Hopefully, if these new projects went the Incubator route, you could get some other fuddy-duddies like me, who have been around and seen a lot at the Foundation, helping the new projects really understand the community aspects.

> To say that all ASF projects should work the same seems pretty bizarre
> to me.

Please show me where I said that.

> The ASF provides license protection, infrastructure, and a set
> of guidelines for what makes successful projects.

Guidelines which the Apache Hadoop PMC continues not to follow. Technically successful, yes. Community-wise successful, sorta.

> But I don't think it is the foundation's place to dictate what its
> projects should do "from above" if the projects themselves do not see
> a problem.

No, but it is the Foundation's (and its members') responsibility to ensure that its projects operate within that loosely coupled set of principles and guidelines that we call the Apache Way. Apache Hadoop is doing great technically. I'm not so sure about the Apache Way part.

> If the project is so messed up, then maybe some folks should fork it
> into the incubator like you've suggested? What's wrong with the
> anarchic "let the best project succeed" philosophy, which I've also
> heard from Apache?

Yeah, I proposed that too. We'll see if it happens. Concretely, I think all of the current Hadoop "sub-projects" should take a spin through the Incubator and see how they are doing as projects. If nothing is amiss, I'm sure it would be a pretty quick process, right? Add some new PPMC members/committers, make a release or two, make sure all the software is ALv2 and compatible. You guys are already doing that, right?

>> You still point to arguing and contention -- it's more than that,
>> Todd. The project's policies for inclusivity have nothing to do with
>> arguing about technical issues.
>
> I'm absolutely for meritocracy. I just have a high bar for what should
> be considered "merit". Perhaps the PMC as a whole has a high bar. For
> a system that stores my data, I'm pretty happy about that.

You won't be so happy about it when your high bar leaves you as one of the only people in the world maintaining a 100M-line code base.
Especially as you get older, have kids (or not), have a family, go on to do even bigger and better things, and care even less about reading emails like this. You're going to see eventually (as will others) that the way you grow around this Foundation (and in software in general) is to teach others how to do your job and to attract people to your project, not to shoo them away with exclusivity. You call it a "high bar" to "protect your data". I call it "enjoying maintaining the software forever and never taking a vacation". It's called scalability, Todd.

>> Dude, you have to do that regardless; that has nothing to do with
>> *Apache Hadoop*. Take your Cloudera hat off and put your *Apache
>> Software Foundation* hat on. Is your #1 priority developing software
>> here to stitch code back together, turn it into a deliverable for
>> your customers (I'm guessing Cloudera customers, right? B/c Apache
>> doesn't have specific customers?) and to maintain green Jenkins
>> builds?
>
> Yes? I think so? If we do a bad release and it loses substantial data,
> our user base would disappear quite quickly.

Of course, because one bad release kills a project, right? And of course there weren't 30-some-odd releases before that one bad one that someone could roll back to, right? Huh??

>> Also tell me how the 4 SVN commands I suggested will stop you from
>> doing the above? At Apache?
>
> If the projects are on separate release schedules, this means that
> cross-project changes have to be staged across the projects in such a
> way that neither project breaks in the interim.

Because this is what happens with Tomcat, or whatever other dependencies you guys have in your modularized project, right? You guys call up the Tomcat PMC whenever there is a release and make sure that your Hadoop-specific need is included in it, right? Or that they include some bug fix that you really need? C'mon, you know that's not the way stuff works. It's called insulation.
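(To make that concrete: the split I've been suggesting really is on the order of four server-side svn copy operations. I'm not going to dig the exact ones out of the earlier thread, so treat the paths below as purely illustrative -- they are not the actual ASF repository layout -- but roughly:

    # Hypothetical split sketch: server-side copies are cheap in SVN
    # and preserve history. Paths are illustrative only.
    svn copy https://svn.apache.org/repos/asf/hadoop/common/trunk \
             https://svn.apache.org/repos/asf/hadoop-common/trunk \
             -m "Split Common out into its own tree"
    svn copy https://svn.apache.org/repos/asf/hadoop/hdfs/trunk \
             https://svn.apache.org/repos/asf/hadoop-hdfs/trunk \
             -m "Split HDFS out into its own tree"
    svn copy https://svn.apache.org/repos/asf/hadoop/mapreduce/trunk \
             https://svn.apache.org/repos/asf/hadoop-mapreduce/trunk \
             -m "Split MapReduce out into its own tree"
    svn copy https://svn.apache.org/repos/asf/hadoop/yarn/trunk \
             https://svn.apache.org/repos/asf/hadoop-yarn/trunk \
             -m "Split YARN out into its own tree"

The mechanics aren't the hard part. Figuring out the PMCs and the communities is.)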
> In the absence of a reasonable *technical* strategy to release
> independently, and a lot of work to stabilize internal APIs around
> security and IPC in particular, doing it again would cause the same
> problems it caused the first time.

I agree there should be a technical plan to make sure the independent TLPs (or podlings->TLPs, eventually, whatever) sync up or line up -- that would be ideal. What if it doesn't happen? Will the world end? Probably not. Because there are good people hanging around who will get stuff done and make sure new TLP foo/bar technically works great, as they always have.

> It also makes the users' lives much more difficult, or forces them to
> only consume via downstream packagers.

No, it doesn't. That's orthogonal?

> Earlier in this thread, you seemed to think that downstream packagers
> indicated an issue with the community

Nah, I was talking about downstream "companies" and their interests, not packagers.

> : fracturing the releases would only serve to make the ASF download
> page even less useful for someone who just wants to get going fast.

Why is that? Isn't that what *Apache* Bigtop (incubating) is for (which also has an *Apache* download page)?

>> At Cloudera, tell me also how it will stop you?
>
> If the projects were on different release schedules, then we'd be more
> likely to have to do a lot of local patching to get stuff to "fit
> together" right.

+1, this could be the case.

> Version compatibility is a difficult problem - it multiplies the QA
> matrix, complicates deployment, etc.

Yep, agreed.

> It's not insurmountable, but unless there's something to be gained
> (what is it, again, that you think we'd gain, specifically?) I don't
> see why we'd take this additional hassle.

As for the gain, I think what you'd gain is fewer arguments about whom to add to the PMC and how to add them, less maintenance of lame ASF authorization templates within *the same project*, fewer meta-discussions, less company-politics spillover, and hopefully more beer to be shared by all. Note, I said *I think*. I'm only truly psychic sometimes.

>> P.S. I appreciate you and am still one of your biggest fans. Just
>> trying to help you see the bigger picture here and to wear your
>> Apache hat.
>
> Thanks for that. As for Apache vs Cloudera hat: I think they're well
> aligned here. Both hats want the project to be easy for people to
> contribute to, and want to avoid a bunch of wasted time spent on new
> technical issues that this would create. I want to spend that time
> making the product better, for our users' benefit. Whether the users
> are Apache community users, or Cloudera customers, or Facebook's data
> scientists, they all are going to be happier if I spend a month
> improving our HA support compared to spending a month figuring out how
> to release three separate projects which somehow stitch together in a
> reasonable way at runtime without jar conflicts, tons of duplicate
> configuration work, byzantine version dependencies, etc.

That's a fair statement, Todd. But that's why it's not Apache Todd, or Apache Todooop. And it's why there are others at the Foundation that you have to rely on, others within your project that you have to rely on, and why not everyone has the same interests. Some people's interests are in patching HDFS, making it highly scalable, and kicking butt technically. Other people's interests are in discussing what they perceive to be community issues within a project at their Foundation.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++