Re: Blameless post mortem

Wilder Rodrigues Mon, 28 Sep 2015 05:51:08 -0700

I agree with the docs stuff, that I said 5 emails ago.

Once things are fixed, I will take the time to understand the code as a whole 
and write the documentation: we will need ir for release purposes anyway.


Cheers,
Wilder


> On 28 Sep 2015, at 14:47, Bharat Kumar <[email protected]> wrote:
> 
> Hi Wilder,
> 
> I am not talking about just the vpc networks. There are many other ares 
> getting effected because of this, some of them are vpn(not implemented) , rvr 
> in isolated networks etc. 
> All i am saying is the design doc will help us understand the complete impact 
> of the changes and deal with them accordingly.
> 
> 
> Regards,
> Bharat.
> 
> 
> On 28-Sep-2015, at 6:02 pm, Wilder Rodrigues <[email protected]> 
> wrote:
> 
>> Only few tests…. 51 tests against a real environment.
>> 
>> At that time Nux also tested it and we tried to get Paul Angus, Geoff and 
>> Rohit from Shape Blue to test it as well. Nux found a couple of issues that 
>> were reported and fixed (see email below).
>> 
>> When I came back from holidays, 4 weeks ago, a PR containing 360 files 
>> changed and almost 4000 lines, which was not even compiling, was merged onto 
>> Master. We have less than a handful of people executing tests against PRs - 
>> so few that I could even name who tests and who doesn’t. But hey, that’s a 
>> blames email. I’m not trying to justify anything, but that handful of 
>> people, who actually care about ACS, are getting quite fedup with this whole 
>> discussion.
>> 
>> Cheers,
>> Wilder
>> 
>> ===========================================================
>> 
>> On 20 Feb 2015, at 10:03, Nux! <[email protected]<mailto:[email protected]>> wrote:
>> 
>> Well, it looks like we were right to test it, found some problems.
>> 
>> 1 - the passwords for instances are not served properly, `wget --header 
>> "DomU_Request: send_my_password" $router:8080` returns blank response. I am 
>> not sure why this happens, though I tried to look around the router.
>> 
>> 2 - in addition to the above, in a redundant VPC the SNAT does not work. 
>> From an instance I can ping the router(s), but not any further than that. 
>> SNAT works fine in a normal/non-vpc network.
>> I'll try to look more into it later today.
>> 
>> Have a nice day :)
>> 
>> Lucian
>> 
>> --
>> Sent from the Delta quadrant using Borg technology!
>> 
>> Nux!
>> www.nux.ro<http://www.nux.ro>
>> 
>> 
>> 
>> 
>> On 28 Sep 2015, at 14:13, Bharat Kumar 
>> <[email protected]<mailto:[email protected]>> wrote:
>> 
>> Hi Sebastien,
>> 
>> You are confused, we are talking about  persistent VR config changes. below 
>> is the pr related to it.
>> https://github.com/apache/cloudstack/pull/118
>> 
>> If you look at it you will notice that there are more than 250 commits and 
>> only a few tests that were run.
>> 
>> Regards,
>> Bharat.
>> 
>> On 28-Sep-2015, at 5:24 pm, Bharat Kumar 
>> <[email protected]<mailto:[email protected]><mailto:[email protected]>>
>>  wrote:
>> 
>> Hi Remi,
>> 
>> Whatever  ever we think  we have discovered are all well known best 
>> practices while developing code in community.
>> I agree that tests need to be run on a new PR,  but i wonder why was this 
>> ignored when merging the VR refactor code. Perhaps we will uncover some more 
>> issues if we investigate this. I believe
>> one of the reasons for this is the complexity and incomplete nature of the 
>> vr refactor change and failing to identify the areas which got effected. If 
>> we had a good documentation i think we cloud have understood the areas that 
>> were getting
>> impacted early on, areas like the vpn ,  iptables, isolated networks rvr   
>> etc  and run the relevant tests. The documentation will also help us focus 
>> on these areas while reviewing  and fixing subsequent issues. Currently no 
>> one knows the areas that got effected
>> due to the vr refactor change, we are seeing issues all over the code.  I 
>> think this is a bigger problem than what we have discussed so far.
>> 
>> I think presently we should stop fixing all the vr refactoring  bugs until 
>> we come up with a  proper document describing the VR refactoring  changes.
>> 
>> I am not suggesting that we should revert the vr refactor code, I am willing 
>> to work on this and fix the issues,  I am only asking if we can get some 
>> documentation.
>> 
>> 
>> Regards,
>> Bharat.
>> 
>> On 28-Sep-2015, at 4:59 pm, Sebastien Goasguen 
>> <[email protected]<mailto:[email protected]><mailto:[email protected]>> wrote:
>> 
>> 
>> On Sep 28, 2015, at 1:14 PM, Remi Bergsma 
>> <[email protected]<mailto:[email protected]><mailto:[email protected]>>
>>  wrote:
>> 
>> Hi Bharat,
>> 
>> 
>> There is only one way to prove a feature works: with tests. That’s why I say 
>> actually _running_ the tests we have today on any new PR, is the most 
>> important thing. Having no documentation is a problem, I agree, but it is 
>> not more important IMHO. If we had the documentation, we still would have 
>> issues if nobody runs the tests and verifies pull requests. Documentation 
>> that is perfect does not automatically lead to perfect implementation. So we 
>> need tests to verify.
>> 
>> If we don’t agree that is also fine. We need to do both anyway and I think 
>> we do agree on that.
>> 
>> 
>> Also we need to move forward. We should have a live chat once 4.6 is out to 
>> discuss all issues/problems and iron out the process.
>> 
>> But reverting the VR refactor is not going to happen. There was ample 
>> discussions on the PR when it was submitted, there was time to review and 
>> raise concerns at that time. It went through quite a few reviews, tests 
>> etc…Maybe the documentation is not good, but the time to raise this concern 
>> I am afraid was six months ago. We can learn from it, but we are not going 
>> to revert it, this would not go cleanly as David mentioned.
>> 
>> So let’s get back to blockers for 4.6, are there still VR related issues 
>> with master ?
>> 
>> 
>> 
>> 
>> Regards,
>> Remi
>> 
>> 
>> 
>> 
>> 
>> 
>> On 28/09/15 12:15, "Bharat Kumar" 
>> <[email protected]<mailto:[email protected]><mailto:[email protected]>>
>>  wrote:
>> 
>> Hi Remi,
>> 
>> i do not agree with “There is no bigger problem”  part of your reply. so I 
>> had to repeat myself to make it more clear, Not because i am not aware of 
>> what this thread is supposed to do.
>> 
>> Regards,
>> Bharat.
>> 
>> On 28-Sep-2015, at 2:51 pm, Remi Bergsma 
>> <[email protected]<mailto:[email protected]><mailto:[email protected]>>
>>  wrote:
>> 
>> Hi Bharat,
>> 
>> I understand your frustrations but we already agreed on this so no need to 
>> repeat. This thread is supposed to list some improvements and learn from it. 
>> Your point has been taken so let’s move on.
>> 
>> We need documentation first, then do a change after which all tests should 
>> pass. Even better is we write (missing) tests before changing stuff so you 
>> know they pass before and after the fact.
>> 
>> When doing reviews, feel free to ask for design documentation if you feel it 
>> is needed.
>> 
>> Regards, Remi
>> 
>> 
>> 
>> On 28/09/15 11:02, "Bharat Kumar" 
>> <[email protected]<mailto:[email protected]><mailto:[email protected]>>
>>  wrote:
>> 
>> Hi Remi,
>> 
>> I never intended to say that we should not run tests, but even before tests 
>> we should have proper documentation. My concern was if a major change is 
>> being introduced it should be properly documented. All the issues which we 
>> are trying to fix are majorly due to VR refactor. If there was a proper 
>> documentation for this we could
>> have fixed this in a better way.  Even to add tests we need to understand 
>> how a particular thing works and what data dose it expect. I think while 
>> fixing the python based code changes this is where most of the people are 
>> facing issues. A proper documentation will help in understanding these in a 
>> better way.
>> 
>> Thanks,
>> Bharat.
>> 
>> On 28-Sep-2015, at 1:57 pm, Remi Bergsma 
>> <[email protected]<mailto:[email protected]><mailto:[email protected]>>
>>  wrote:
>> 
>> Hi Bharat,
>> 
>> There is no bigger problem. We should always run the tests and if we find a 
>> case that isn’t currently covered by the tests we should simply add tests 
>> for it. There’s no way we’ll get a stable master without them. The fact that 
>> they may not cover everything, is no reason to not rely on them. If a 
>> feature is not important enough to write a test for, then the feature is 
>> probably not important anyway. And if it is, then add a test :-)
>> 
>> I do agree on the design documentation requirement for any (major?) change. 
>> I found some design documentations on the subject you mention, but it should 
>> have been more detailed.
>> 
>> Regards,
>> Remi
>> 
>> 
>> 
>> 
>> 
>> 
>> On 28/09/15 09:58, "Bharat Kumar" 
>> <[email protected]<mailto:[email protected]><mailto:[email protected]>>
>>  wrote:
>> 
>> Hi Remi,
>> 
>> Thank you for the Blame less postmortem.
>> 
>> I think there is a bigger problem here than just the review process and 
>> running tests. Even if we run the tests we cannot be sure that every thing 
>> will work as intended. The tests will only give some level of confidence. 
>> The tests may not cover all the use cases.
>> 
>> I think the problem here is that the way major changes to the code base are 
>> dealt with. For example,  VR refactoring was done without discussing the 
>> design implications and the amount of changes it would bring in. I could not 
>> find any design document. The vr refactor changed a lot of code and the way 
>> VR used to work and in my opinion it was incomplete-vpn, isolated networks, 
>> basic networks, iptable rules and rvr in isolated case etc were not 
>> implemented. Most of us are still in the process of understanding this. Even 
>> before reaching this state we had to spend a lot of time fixing issues 
>> mentioned in the thread [Blocker/Critical] VR related Issues.
>> 
>> When a change of this magnitude is being made, we should call out all the 
>> changes and document them properly. This will help people to create better 
>> fixes. Currently when we attempt to fix the isolated vr case it is effecting 
>> the vpc and vice versa. for example pr 738 fixed it for vpc networks but 
>> broke it for isolated case. I believe it is not too late to at least start 
>> documenting the changes now.
>> 
>> Thanks,
>> Bharat.
>> 
>> On 28-Sep-2015, at 10:52 am, Sanjeev N 
>> <[email protected]<mailto:[email protected]><mailto:[email protected]>> 
>> wrote:
>> 
>> I have a concern here. Some of us are actively involved in reviewing the
>> PRs related to marvin tests(Enhancing existing tests/Adding new tests). If
>> we have to test a PR it requires an environment to be created with actual
>> resources and this is going to take lot of time. Some of the tests can run
>> on simulator but most of the tests require real hardware to test. PR
>> submitter is already testing and submitting the test results along with the
>> PR. So is it require to test these PRs by reviewers?
>> 
>> On Sat, Sep 26, 2015 at 1:49 PM, sebgoa 
>> <[email protected]<mailto:[email protected]><mailto:[email protected]>> wrote:
>> 
>> Remi, thanks for the detailed post-mortem, it's a good read and great
>> learning.
>> I hope everyone reads it.
>> 
>> The one thing to emphasize is that we now have a very visible way to get
>> code into master, we have folks investing time to provide review (great),
>> we need the submitters to make due diligence and answer all comments in the
>> reviews.
>> 
>> In another project i work on, nothing can be added to the code without
>> unit tests. I think we could go down the route of asking for new
>> integration tests and unit tests for anything. If not, the PR does not get
>> merged. But let's digest your post-mortem and we can discuss after 4.6.0.
>> 
>> I see that you reverted one commit that was not made by you, that's great.
>> 
>> Let's focus on the blockers now, everything else can wait.
>> 
>> The big bonus of doing what we are doing is that once 4.6.0 is out, we can
>> merge PRs again (assuming they are properly rebased and tested) and we can
>> release 4.6.1 really quickly after.
>> 
>> -sebastien
>> 
>> On Sep 25, 2015, at 9:51 PM, Remi Bergsma 
>> <[email protected]<mailto:[email protected]><mailto:[email protected]>>
>> wrote:
>> 
>> Hi all,
>> 
>> This mail is intended to be blameless. We need to learn something from
>> it. That's why I left out who exactly did what because it’s not relevant.
>> There are multiple examples but it's about the why. Let's learn from this
>> without blaming anyone.
>> 
>> We know we need automated testing. We have integration tests, but we are
>> unable to run all of them on any Pull Request we receive. If we would have
>> that in place, it'd be much easier to spot errors, regression and so on.
>> It'd also be more rewarding to write more tests.
>> 
>> Unfortunately we're not there yet. So, we need to do something else
>> instead until we get there. If we do nothing, we know we have many issues
>> because a master that breaks on a regular basis is the most frustrating
>> things. We said we'd use Pull Requests with at least two humans to review
>> and give their OK for a Pull Request. In the form of LGTM: Looks Good To
>> Me. Ok, so the LGTMs are there because we have no automated testing. Keep
>> that in mind. You are supposed to replace automated testing until it's
>> there.
>> 
>> Since we do this, master got a lot more stable. But every now and then
>> we still have issues. Let's look at how we do manual reviews. Again, this
>> is not to blame anyone. It's to open our eyes and make us realise what
>> we're doing and what results we get out of that.
>> 
>> 
>> Example Pull Request #784:
>> Title: CLOUDSTACK-8799 fixed the default routes
>> 
>> That's nice, it has a Jira id and a short description (as it should be).
>> 
>> The first person comes along and makes a comment:
>> "There was also an issue with VPC VRs" ... "Have you seen this issue?
>> Does your change affects the VPC VR (single/redundant)?"
>> 
>> Actually a good question. Unfortunaly there comes no answer. After a
>> reminder, it was promised to do tests against VPC networks. Great!
>> 
>> The Jenkins builds both succeed and also Travis is green. But how much
>> value does this have? They have the impression to do automated testing, and
>> although you could argue they do, it's far from complete. If it breaks, you
>> know you have an issue. But it doesn’t work the other way around.
>> 
>> Back to our example PR. In the mean time, another commit gets pushed to
>> it: "CLOUDSTACK-8799 fixed for vpc networks." But if you look at the Jira
>> issue, you see it is about redundant virtual routers. The non-VPC ones. So
>> this is vague at best. But a reviewer gives a LGTM because the person could
>> create a VPC. That doesn't have anything to do with the problem being fixed
>> in this PR nor with the comments made earlier. But, at least the person
>> said what he did and we should all do that. What nobody knew back then, was
>> that this broke the default route on VPCs.
>> 
>> Then something strange happens: the two commits from the PR end up on
>> master as direct commits. With just one LGTM and no verification from the
>> person commenting about the linked issue. This happened on Friday September
>> 11th.
>> 
>> That day 21 commits came in, from 7 Pull Request and unfortunately also
>> from some direct commits. We noticed the direct commits and notified the
>> list (http://cloudstack.markmail.org/message/srmszloyipkxml36). As a lot
>> came in at the same time, it was decided not to revert them. Looking back,
>> we should have done it.
>> 
>> From this point on, VPCs were broken as they wouldn't get a default
>> route. So, no public internet access from VMs in VPC tiers, no VPNs
>> working, etc. This was mentioned to the list on Thursday September 15th,
>> after some chats and debugging going on over the weekend (
>> http://cloudstack.markmail.org/message/73ulpu4p75ex24tc)
>> 
>> Here we are, master is broken functionality wise and new Pull Requests
>> come in to fix blockers. But we cannot ever test their proper working,
>> because VPCs are broken in master and so also in the PRs branched off of
>> it. With or without change in the PR.
>> 
>> It starts to escalate as the days go by.
>> 
>> I’ll leave out the bit on how this frustrated people. Although it’s good
>> to know we do not want to be in this situation.
>> 
>> Eventually Wilder and I spent an evening and a day working on a branch
>> where we loaded 7 PRs on top of each other (all VR related) only to find
>> the VPC is still broken. It allowed us to zoom in and find the default
>> route was missing again. We said it worked 3 weeks before, because the same
>> tests that succeeded then, now were broken. We had already fixed this in PR
>> #738 on August 25 so were sure about it.
>> 
>> After some digging we could trace it back to Pull Request #784. Imagine
>> the feeling seeing your own comment there mentioning the previous issue on
>> the default gateways. Fair to say our human review process clearly failed
>> here. Many many hours were spent on this problem over the past two weeks.
>> Could we have prevented this from happening? I think so, yes.
>> 
>> 
>> This example clearly shows why:
>> 
>> - we should use Pull Requests
>> It made the change visible: Great!
>> 
>> - we do reviews and ask for feedback
>> We got feedback and questions: Also great!
>> 
>> - we should always respond to feedback and verify it is resolved, before
>> merging
>> We need to improve here. Even with two reviewers that say LGTM, we
>> should still address any feedback before merging.
>> 
>> - we should have two humans doing a review
>> We need to improve here as well. Not one reviewer, we need two. Really.
>> 
>> - we need to document why we say LGTM.
>> Another improvement. It’s nice to say LGTM, but a review of only 4
>> characters and nothing more is useless. We need to know what was tested and
>> how. Test results, screen shots or anything that shows what's been
>> verified. If you only reviewed the code, also fine but at least say that.
>> Then the next reviewer should do another type of review to get the comlete
>> picture. Remember you're replacing automated testing!
>> 
>> - we should always merge Pull Requests
>> We made it easy, merging is the de facto standard, and it has even more
>> benefits. You can trace commits back to their Pull Request (and find all
>> comments and discussion there: saves time, trust me). It also allows for
>> easier reverting of a Pull Request. We’ll see even more benefits once 4.7
>> is there. Although the intentions to merge the Pull Request were there, it
>> still didn't happen. We should always check before we push. As a committer
>> we just need to be sure.
>> 
>> - we need automated testing!
>> The sooner the better. It’s all about the missing automated testing.
>> After 4.6, we all need to focus on this. Saves a lot of time. And
>> frustrations.
>> 
>> 
>> 
>> We're doing final testing on PR #887 and will merge it soon. From that
>> point on we can look into new issues. Be aware that any PR out there that
>> was created after September 10 needs to be rebased with current master
>> (when #887 is merged). Without that, no serious testing can be done.
>> 
>> Let's be careful what to land on master. I'll only be merging Pull
>> Requests that have had proper reviews with information on what was tested.
>> At least one reviewer needs to actually verify it works (and show the rest
>> of us). We simply cannot assume it will work.
>> 
>> If we do this, I think we can start resolving the remaining blockers
>> one-by-one and go into the first RC round. Please help out where you can so
>> we can make this a success together. Thanks!
>> 
>> Looking forward to the day we have our automated testing in place ;-)
>> 
>> Regards,
>> Remi
>> 
>

Re: Blameless post mortem

Reply via email to