Monty Taylor wrote:



On 10/15/2013 08:36 PM, Sean Dague wrote:

> On 10/15/2013 04:54 PM, Vishvananda Ishaya wrote:

>> Hi Everyone,

>>

>> I've been following this conversation and weighing the different

>> sides. This is a tricky issue but I think it is important to decouple

>> further and extend our circle of trust.

>>

>> When nova started it was very easy to do feature development. As it

>> has matured the pace has slowed. This is expected and necessary, but

>> we periodically must make decoupling decisions or we will become mired

>> in overhead. We did this already with cinder and neutron, and we have

>> discussed doing this with virt drivers in the past.

>>

>> We have a large number of people attempting to contribute to small

>> sections of nova and getting frustrated with the process.  The

>> perception of developers is much more important than the actual

>> numbers here. If people are frustrated they are disincentivized to

>> help and it hurts everyone. Suggesting that these contributors need to

>> learn all of nova and help with the review queue is silly and makes us

>> seem elitist. We should make it as easy as possible for new

>> contributors to help.

>>

>> I think our current model is breaking down at our current size and we

>> need to adopt something more similar to the linux model when dealing

>> with subsystems. The hyper-v team is the only one suggesting changes,

>> but there have been similar concerns from the vmware team. I have no

>> doubt that there are similar issues with the PowerVM, Xen, Docker, lxc

>> and even kvm driver contributors.

>

> The Linux kernel process works for a couple of reasons...

>

> 1) the subsystem maintainers have known each other for a solid decade

> (i.e. 3x the lifespan of the OpenStack project), over a history of 10

> years, of people doing the right things, you build trust in their judgment.

>

> *no one* in the Linux tree was given trust first, under the hope that it

> would work out. They had to earn it, hard, by doing community work, and

> not just playing in their corner of the world.

>

> 2) This

> http://www.wired.com/wiredenterprise/2012/06/torvalds-nvidia-linux/ is

> completely acceptable behavior. So when someone has bad code, they are

> flamed to within an inch of their life, repeatedly, until they never

> ever do that again. This is actually a time saving measure in code

> review. It's a lot faster to just call people idiots than to help them

> with line by line improvements in their code, 10, 20, 30, or 40

> iterations in gerrit.

>

> We, as a community have decided, I think rightly, that #2 really isn't

> in our culture. But you can't start cherry picking parts of the Linux

> kernel community without considering how all the parts work together.

> The good and the bad are part of why the whole system works.

>

>> In my opinion, nova-core needs to be willing to trust the subsystem

>> developers and let go of a little bit of control. I frankly don't see

>> the drawbacks.

>

> I actually see huge drawbacks. Culture matters. Having people active

> and willing to work on real core issues matters. The long term health of

> Nova matters.

>

> As the QA PTL I can tell you that when you look at Nova vs. Cinder vs.

> Neutron, you'll see some very clear lines about how long it takes to get

> to the bottom of a race condition, and how many deep races are in each

> of them. I find this directly related to the stance each project has

> taken on whether it's socially acceptable to only work on your own

> vendor code. Nova's insistence up until this point that if you only play

> in your corner, you don't get the same attention is an important incentive

> for people to integrate and work beyond just their boundaries. I think

> diluting this part of the culture would be hugely detrimental to Nova.

>

> Let's take an example that came up today, the compute_diagnostics API.

> This is an area where we've left it completely to the virt drivers to

> vomit up a random dictionary of the day for debugging reasons, and

> stamped it as an API. With a model where we let virt driver authors go

> hide in a corner, that's never going to become an API with any kind of

> contract, and given how much effort we've spent on ensuring RPC

> versioning and message formats, the idea that we are exposing a public

> REST endpoint that's randomly fluctuating data based on date and

> underlying implementation, is a bit saddening.

>

>> I'm leaning towards giving control of the subtree to the team as the

>> best option because it is simple and works with our current QA system.

>> Alternatively, we could split out the driver into a nova subproject (2

>> below) or we could allow them to have a separate branch and do a

>> trusted merge of all changes at the end of the cycle (similar to the

>> linux model).

>>

>> I hope we can come to a solution to the summit that makes all of our

>> contributors want to participate more. I believe that giving people

>> more responsibility inspires them to participate more fully.

>

> I would like nothing more than all our contributors to participate more.

> But more has to mean caring about not only your stuff.

>

> I was called out today in the hyper-v meeting because I had the audacity

> to -1 a hyper-v patch because I wanted some reference in the code

> somewhere to format references, so that the reason for some new random seek

> call would be understood by people down the road -

> http://eavesdrop.openstack.org/meetings/hyper_v/2013/hyper_v.2013-10-15-16.03.log.html

>

>

> As OpenStack grows, the single biggest factor in its success isn't

> going to be a feature in a driver, it's going to be if this crazy

> complicated system holds together. Whether or not we've got a handle on

> the emergent behavior that happens in an asynchronous message based

> system, with 10s of integrated projects, and many dozens of daemons

> cross talking with each other.

>

> I mean seriously, one of the only reasons we made it through to Havana

> RC phase is because we built a search engine based system to build

> statistical frequency analysis of unique failures on our gate resets to

> fully expose the top race conditions that had gotten so bad the gate

> basically locked up. And a bunch of people went all hands on deck to

> drive these out. People jumped across normal project lines to help on

> some of these top bugs, because that's what makes OpenStack a whole system.

>

> Things actually looked *really* bleak for release for a while. All the

> people that helped out and got us through this deserve a huge pat on the

> back. That's what OpenStack is about.

>

> So I feel pretty strongly that optimizing the contribution process for

> people that aren't helping with that larger problem, is the tragedy of

> the commons, and I think entirely the wrong optimization to be made.



I agree strongly with Sean, although I can sympathize with the other

POV. So far in OpenStack, we've consistently valued an attempt at

growing the whole over the perceived velocity (or other) needs of a

subset. I think this has served us well so far, and I'm not sure I see

that we're having such a bad time that we need to ditch it. The last

thing that OpenStack needs ANY more help with is velocity. I mean, let's

be serious - we land WAY more patches in a day than is even close to sane.



Individual developer latency? Yeah. Sometimes that's going to happen.

I'm one of the 4 core infra team members and I've got patches that have

been waiting for 2 other core reviewers for a couple of weeks now. OK,

so I work on something else - there's plenty to do. I'm in core.



The system isn't broken - it's working as designed. When something moves

faster (recently, we rolled out nodepool, and for a week or two Jim was

just rolling it out directly into production because we had to, to keep

up with feature freeze) you get into a bad spot afterwards (I can only

just now be useful in code review for nodepool, because it got away from

me and I wasn't involved).



That's a 4 person team, and 2 weeks of letting up had a noticeable

impact. How about a 20 person core team with hundreds of regular

contributors?



We MUST continue to be vigilant in getting people to care about more

than their specific part, or else this big complex mess is going to come

crashing down around us. Tragedy of the Commons is right - it's really

hard to get your product managers to allow you to spend time working on

something other than your vendor feature, right?



So maybe, just maybe, if we keep working with the system we've built, we

can go back to them and say "if you want the features in quicker, you

should give me 10% or 20% of my time to work on Nova overall. It'll make

Nova healthier, and it'll give us more of a basis on which to push our

specifics." Then it's a simple cost/benefit, and unless your product

manager is nuts, it should be ok.



We're also working on other things to make this better, that still

haven't fully hit yet - so before we go making additional changes, how

about we see if those help? Russell drew a line in the sand about 3rd

party testing for vendor drivers. Since then, a bunch of folks have been

doing a LOT of work in getting those systems up and going and reporting.

The current nova devs have said they'll be much more comfortable

reviewing vendor-specific code because they can see that it's being

tested. I'd like to see the fruits of that happen before we make other

systemic changes.



OK, Monty.  You've convinced me that this really does need more thought and 
work if we want this to have a snowball's chance in hell of keeping the quality 
high.  I am a team player, so I can be naïve about changing cultures.  The 
reality is that the driver guys would need to have at least as high a design-, 
quality-, test- and documentation-centric culture as the Nova and Infra teams 
for a spin-off to work.  I'm not saying they don't, just saying what Sean said, 
they haven't had the time or chance to demonstrate it.  It also sounds like you 
might have some more thoughts on how to evolve the system to keep it moving 
forward without getting away from the core principles.  I'll wait for the 
summit discussions.  I'd love to be there, but won't.



As to Sean's comment about the diags, this is an area the driver developers 
could build their reputation with other teams as well as the Nova team, by 
sitting down together and designing a system and API that is a single system 
that will work for all the virt drivers.  This is the kind of project and 
design needed to scale OpenStack, and keep bringing the driver developers into the 
culture.  Show us you can work together and stretch at least outside of your 
own code to bring consistency, scalability and traceability to the driver 
*interfaces* so those developers interfacing with the drivers don't have to do 
the same work over for every driver.



Enough soap box.  Let's see a plan that keeps the quality and cooperative 
culture thriving without bringing progress to a standstill.



--Rocky



Anywhoo - I get it. I really do. It drives me CRAZY when it takes 2

weeks for a patch to land. It drove me REALLY crazy when it was 2 days

between approve and land when the gate was all borked. But just as I

think we all felt it was not the right choice to relax the gate during

the insanity, but to buckle down, jump in and engage and fix the actual

problems, I also don't think it's right to relax our current review

criteria to mitigate the developer latency issue.



BTW - remember landing patches on giant open source projects before this

one? Anybody ever try to land a patch against MySQL? Google spent 2

years getting one landed. I think we're doing un-terrible - we can

improve, but the sky is definitely not falling.



Monty



_______________________________________________

OpenStack-dev mailing list

OpenStack-dev@lists.openstack.org

http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev