Hi Bharat,

There is no bigger problem. We should always run the tests, and if we find a case that isn't currently covered by them, we should simply add tests for it. There's no way we'll get a stable master without them. The fact that they may not cover everything is no reason not to rely on them. If a feature is not important enough to write a test for, then the feature is probably not important anyway. And if it is, then add a test :-)
I do agree on the design documentation requirement for any (major?) change. I found some design documentation on the subject you mention, but it should have been more detailed.

Regards,
Remi

On 28/09/15 09:58, "Bharat Kumar" <bharat.ku...@citrix.com> wrote:

>Hi Remi,
>
>Thank you for the blameless postmortem.
>
>I think there is a bigger problem here than just the review process and running tests. Even if we run the tests, we cannot be sure that everything will work as intended. The tests will only give some level of confidence; they may not cover all the use cases.
>
>I think the problem here is the way major changes to the code base are dealt with. For example, the VR refactoring was done without discussing the design implications or the amount of change it would bring in. I could not find any design document. The VR refactor changed a lot of code and the way the VR used to work, and in my opinion it was incomplete: VPN, isolated networks, basic networks, iptables rules, RVR in the isolated case, etc. were not implemented. Most of us are still in the process of understanding this. Even before reaching this state we had to spend a lot of time fixing the issues mentioned in the thread "[Blocker/Critical] VR related Issues".
>
>When a change of this magnitude is being made, we should call out all the changes and document them properly. This will help people create better fixes. Currently, when we attempt to fix the isolated VR case it affects the VPC case, and vice versa. For example, PR #738 fixed it for VPC networks but broke it for the isolated case. I believe it is not too late to at least start documenting the changes now.
>
>Thanks,
>Bharat.
>
>On 28-Sep-2015, at 10:52 am, Sanjeev N <sanj...@apache.org> wrote:
>
>> I have a concern here. Some of us are actively involved in reviewing the PRs related to marvin tests (enhancing existing tests/adding new tests). If we have to test a PR, it requires an environment to be created with actual resources, and this is going to take a lot of time. Some of the tests can run on the simulator, but most require real hardware. The PR submitter is already testing and submitting the test results along with the PR. So are reviewers required to test these PRs as well?
>>
>> On Sat, Sep 26, 2015 at 1:49 PM, sebgoa <run...@gmail.com> wrote:
>>
>>> Remi, thanks for the detailed post-mortem, it's a good read and great learning. I hope everyone reads it.
>>>
>>> The one thing to emphasize is that we now have a very visible way to get code into master and folks investing time to provide reviews (great); now we need the submitters to do their due diligence and answer all comments in the reviews.
>>>
>>> In another project I work on, nothing can be added to the code without unit tests. I think we could go down the route of requiring new integration tests and unit tests for everything; if not, the PR does not get merged. But let's digest your post-mortem and we can discuss after 4.6.0.
>>>
>>> I see that you reverted one commit that was not made by you, that's great.
>>>
>>> Let's focus on the blockers now, everything else can wait.
>>>
>>> The big bonus of doing what we are doing is that once 4.6.0 is out, we can merge PRs again (assuming they are properly rebased and tested) and we can release 4.6.1 really quickly after.
>>>
>>> -sebastien
>>>
>>> On Sep 25, 2015, at 9:51 PM, Remi Bergsma <rberg...@schubergphilis.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> This mail is intended to be blameless.
>>>> We need to learn something from it. That's why I left out who exactly did what: it's not relevant. There are multiple examples, but it's about the why. Let's learn from this without blaming anyone.
>>>>
>>>> We know we need automated testing. We have integration tests, but we are unable to run all of them on every Pull Request we receive. If we had that in place, it would be much easier to spot errors, regressions and so on. It would also be more rewarding to write more tests.
>>>>
>>>> Unfortunately we're not there yet, so we need to do something else instead until we get there. If we do nothing, we know we will have many issues, because a master that breaks on a regular basis is one of the most frustrating things there is. We said we'd use Pull Requests, with at least two humans reviewing and giving their OK for a Pull Request in the form of an LGTM: Looks Good To Me. The LGTMs are there because we have no automated testing. Keep that in mind: as a reviewer, you are supposed to replace automated testing until it's there.
>>>>
>>>> Since we started doing this, master has become a lot more stable. But every now and then we still have issues. Let's look at how we do manual reviews. Again, this is not to blame anyone; it's to open our eyes and make us realise what we're doing and what results we get out of it.
>>>>
>>>> Example Pull Request #784:
>>>> Title: CLOUDSTACK-8799 fixed the default routes
>>>>
>>>> That's nice: it has a Jira id and a short description (as it should).
>>>>
>>>> The first person comes along and makes a comment:
>>>> "There was also an issue with VPC VRs" ... "Have you seen this issue? Does your change affect the VPC VR (single/redundant)?"
>>>>
>>>> Actually a good question. Unfortunately, no answer came. After a reminder, it was promised that tests would be run against VPC networks. Great!
>>>>
>>>> Both Jenkins builds succeed and Travis is green as well. But how much value does that have? It gives the impression of automated testing, and although you could argue it is, it's far from complete. If it breaks, you know you have an issue. But it doesn't work the other way around: green does not mean everything works.
>>>>
>>>> Back to our example PR. In the meantime, another commit gets pushed to it: "CLOUDSTACK-8799 fixed for vpc networks." But if you look at the Jira issue, you see it is about redundant virtual routers, the non-VPC ones. So this is vague at best. A reviewer then gives an LGTM because he could create a VPC. That doesn't have anything to do with the problem being fixed in this PR, nor with the comments made earlier. But at least the person said what he did, and we should all do that. What nobody knew back then was that this broke the default route on VPCs.
>>>>
>>>> Then something strange happens: the two commits from the PR end up on master as direct commits, with just one LGTM and no verification from the person commenting about the linked issue. This happened on Friday September 11th.
>>>>
>>>> That day 21 commits came in, from 7 Pull Requests and unfortunately also from some direct commits. We noticed the direct commits and notified the list (http://cloudstack.markmail.org/message/srmszloyipkxml36). As a lot came in at the same time, it was decided not to revert them. Looking back, we should have done so.
>>>>
>>>> From this point on, VPCs were broken as they wouldn't get a default route. So: no public internet access from VMs in VPC tiers, no VPNs working, etc. This was mentioned to the list on Thursday September 15th, after some chats and debugging over the weekend (http://cloudstack.markmail.org/message/73ulpu4p75ex24tc).
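A quick way to verify that symptom, for anyone debugging a similar break, is to inspect the routing table on the affected router VM. A minimal sketch; the gateway address and interface name are examples only:

    # run inside the virtual router
    ip route show default
    # a healthy VR prints something like:
    #   default via 10.1.1.1 dev eth1
    # a broken VR prints nothing at all

The VMs behind such a VR show a matching symptom: they can still reach their gateway (the VR itself), but nothing beyond it.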
>>>> Here we are: master is broken functionality-wise, and new Pull Requests come in to fix blockers. But we cannot properly test them, because VPCs are broken in master and therefore also in the PRs branched off of it, with or without the change in the PR.
>>>>
>>>> It starts to escalate as the days go by.
>>>>
>>>> I'll leave out the bit about how this frustrated people, although it's good to know we do not want to be in this situation again.
>>>>
>>>> Eventually Wilder and I spent an evening and a day working on a branch where we loaded 7 PRs on top of each other (all VR related), only to find the VPC was still broken. It allowed us to zoom in and find that the default route was missing again. We knew it had worked 3 weeks before, because the same tests that succeeded then were now failing. We had already fixed this in PR #738 on August 25, so we were sure about it.
>>>>
>>>> After some digging we could trace it back to Pull Request #784. Imagine the feeling of seeing your own comment there, mentioning the previous issue with the default gateways. Fair to say our human review process clearly failed here. Many, many hours were spent on this problem over the past two weeks. Could we have prevented it? I think so, yes.
>>>>
>>>> This example clearly shows why:
>>>>
>>>> - we should use Pull Requests
>>>> It made the change visible: great!
>>>>
>>>> - we do reviews and ask for feedback
>>>> We got feedback and questions: also great!
>>>>
>>>> - we should always respond to feedback and verify it is resolved before merging
>>>> We need to improve here. Even with two reviewers saying LGTM, we should still address any open feedback before merging.
>>>>
>>>> - we should have two humans doing a review
>>>> We need to improve here as well. Not one reviewer; we need two. Really.
>>>>
>>>> - we need to document why we say LGTM
>>>> Another improvement. It's nice to say LGTM, but a review of only 4 characters and nothing more is useless. We need to know what was tested and how: test results, screenshots or anything else that shows what has been verified. If you only reviewed the code, that's also fine, but at least say so. Then the next reviewer should do another type of review to get the complete picture. Remember, you're replacing automated testing!
>>>>
>>>> - we should always merge Pull Requests
>>>> We made it easy; merging is the de facto standard, and it has even more benefits. You can trace commits back to their Pull Request (and find all comments and discussion there: saves time, trust me). It also allows for easier reverting of a Pull Request. We'll see even more benefits once 4.7 is there. Although the intention to merge this Pull Request was there, it still didn't happen. We should always check before we push; as committers we just need to be sure.
>>>>
>>>> - we need automated testing!
>>>> The sooner the better. It's all about the missing automated testing. After 4.6, we all need to focus on this. It saves a lot of time. And frustration.
>>>>
>>>> We're doing final testing on PR #887 and will merge it soon. From that point on we can look into new issues. Be aware that any PR out there that was created after September 10 needs to be rebased on current master once #887 is merged. Without that, no serious testing can be done.
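The mechanics behind the "always merge" point and the rebase requirement above boil down to a few git commands. A minimal sketch; the branch name pr/784 is only an example:

    # merge a PR with an explicit merge commit, so every commit
    # remains traceable to its Pull Request
    git checkout master
    git merge --no-ff pr/784

    # a merged PR can later be reverted in one go;
    # -m 1 keeps the mainline side of the merge
    git revert -m 1 <merge-commit-sha>

    # bring a stale PR branch up to date with current master
    git checkout pr/784
    git fetch origin
    git rebase origin/master

Direct commits, by contrast, can only be reverted one by one, which is part of why merging is preferred above.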
>>>> Let's be careful about what lands on master. I'll only merge Pull Requests that have had proper reviews, with information on what was tested. At least one reviewer needs to actually verify that it works (and show the rest of us). We simply cannot assume it will work.
>>>>
>>>> If we do this, I think we can start resolving the remaining blockers one by one and go into the first RC round. Please help out where you can, so we can make this a success together. Thanks!
>>>>
>>>> Looking forward to the day we have our automated testing in place ;-)
>>>>
>>>> Regards,
>>>> Remi