On Wed, Aug 29, 2012 at 4:54 PM, Mattmann, Chris A (388J)
<chris.a.mattm...@jpl.nasa.gov> wrote:
>
> Please provide examples that show umbrella projects work.
Hadoop, in its current form? The code bases are tightly intertwined. We
pulled out Pig/Hive/HBase because they were substantial codebases that
didn't share much code with the rest, and thus could reasonably be expected
to release independently. We could get HDFS and MR to that point, but we
haven't yet, because they rely so much on Common. If we copy-paste forked
Common, we'd be doubling our maintenance work on this shared code. We
basically did this with the IPC code for HBase, and then we had double the
work to protobuf-ify both HBase and HDFS/MR earlier this year. I know
because I spent a bunch of hours on both.

> I've been at this Foundation a lot longer than you have. I've seen them
> not work and have been involved in ones that don't work. See splits from
> Lucene, the same threads (with different names, different products,
> different software but the exact same issues). See your own splits from
> Hadoop cited elsethread. See the friggin' Apache board minutes discussing
> why umbrella projects are bad.
>
> I don't know what else to tell you. I'm not going to go look up all the
> threads. I'm not Google nor do I care to. All I can say is that I've seen
> it before and so have others. In your own project.

What's one concrete example of where it would be better if we split? I
can't think of any. We'd still have competing interests in HDFS, and we'd
still get in the same arguments.

To say that all ASF projects should work the same seems pretty bizarre to
me. The ASF provides license protection, infrastructure, and a set of
guidelines for what makes successful projects. But I don't think it is the
foundation's place to dictate what its projects should do "from above" if
the projects themselves do not see a problem. If the project is so messed
up, then maybe some folks should fork it into the incubator like you've
suggested? What's wrong with the anarchic "let the best project succeed"
philosophy, which I've also heard from Apache?
> You still point to arguing to contention -- it's more than that Todd. The
> project's policies for inclusivity have nothing to do with arguing about
> technical issues.

I'm absolutely for meritocracy. I just have a high bar for what should be
considered "merit". Perhaps the PMC as a whole has a high bar. For a system
that stores my data, I'm pretty happy about that.

> Dude, you have to do that regardless, that has nothing to do with *Apache
> Hadoop*. Take your Cloudera hat off and put your *Apache Software
> Foundation* hat on. Is your #1 priority developing software here to
> stitch code back together, turn it into a deliverable for your customers
> (I'm guessing Cloudera customers, right? B/c Apache doesn't have specific
> customers?) and to maintain green Jenkins builds?

Yes? I think so? If we do a bad release and it loses substantial data, our
user base would disappear quite quickly.

> Also tell me how the 4 SVN commands I suggested will stop you from doing
> the above?

At Apache? If the projects are on separate release schedules, then
cross-project changes have to be staged across the projects in such a way
that neither project breaks in the interim. All of our internal APIs become
public APIs. We worked like this for around a year during the "project
split" period. It was super complicated, our builds were often red, we
wasted a lot of time, and new users couldn't figure out how to contribute.
In the absence of a reasonable *technical* strategy to release
independently, and a lot of work to stabilize internal APIs around security
and IPC in particular, doing it again would cause the same problems it
caused the first time. It also makes the users' lives much more difficult,
or forces them to only consume via downstream packagers.
Earlier in this thread, you seemed to think that downstream packagers
indicated an issue with the community: fracturing the releases would only
serve to make the ASF download page even less useful for someone who just
wants to get going fast.

> At Cloudera, tell me also how it will stop you?

If the projects were on different release schedules, then we'd be more
likely to have to do a lot of local patching to get stuff to "fit together"
right. Version compatibility is a difficult problem - it multiplies the QA
matrix, complicates deployment, etc. It's not insurmountable, but unless
there's something to be gained (what is it, again, that you think we'd
gain, specifically?) I don't see why we'd take on this additional hassle.

> P.S. I appreciate you and am still one of your biggest fans. Just trying
> to help you see the bigger picture here and to wear your Apache hat.

Thanks for that. As for the Apache vs Cloudera hat: I think they're well
aligned here. Both hats want the project to be easy for people to
contribute to, and want to avoid a bunch of wasted time spent on the new
technical issues that this would create. I want to spend that time making
the product better, for our users' benefit. Whether the users are Apache
community users, or Cloudera customers, or Facebook's data scientists, they
all are going to be happier if I spend a month improving our HA support
rather than a month figuring out how to release three separate projects
which somehow stitch together in a reasonable way at runtime without jar
conflicts, tons of duplicate configuration work, byzantine version
dependencies, etc.

-Todd

--
Todd Lipcon
Software Engineer, Cloudera