Dude, there was nothing friendly about the postmortem you did. It was only partial; we should do a complete postmortem and then draw conclusions.
I think post-mortems like this are of no use if we do not do them completely.

Regards,
Bharat.

On 28-Sep-2015, at 5:39 pm, Remi Bergsma <rberg...@schubergphilis.com> wrote:

> Dude, this is the final friendly email about this. All points have been
> made in previous mails. This has nothing to do with ‘blameless’ and
> ‘learning’ anymore.
>
> Read Seb’s mail. We will move on now.
>
> Regards, Remi
>
> On 28/09/15 13:54, "Bharat Kumar" <bharat.ku...@citrix.com> wrote:
>
>> Hi Remi,
>>
>> Whatever we think we have discovered are all well-known best practices
>> for developing code in a community. I agree that tests need to be run on
>> a new PR, but I wonder why that was ignored when the VR refactor code was
>> merged. Perhaps we will uncover some more issues if we investigate this.
>> I believe one of the reasons for this is the complexity and incomplete
>> nature of the VR refactor change, and the failure to identify the areas
>> it affected. If we had good documentation, I think we could have
>> understood early on which areas were getting impacted (areas like VPN,
>> iptables, isolated networks, RVR, etc.) and run the relevant tests. The
>> documentation would also help us focus on these areas while reviewing
>> and fixing subsequent issues. Currently no one knows which areas were
>> affected by the VR refactor change; we are seeing issues all over the
>> code. I think this is a bigger problem than what we have discussed so
>> far.
>>
>> I think we should presently stop fixing the VR refactoring bugs until we
>> come up with a proper document describing the VR refactoring changes.
>>
>> I am not suggesting that we should revert the VR refactor code. I am
>> willing to work on this and fix the issues; I am only asking if we can
>> get some documentation.
>>
>> Regards,
>> Bharat.
>>
>> On 28-Sep-2015, at 4:59 pm, Sebastien Goasguen <run...@gmail.com> wrote:
>>
>>>
>>>> On Sep 28, 2015, at 1:14 PM, Remi Bergsma <rberg...@schubergphilis.com> wrote:
>>>>
>>>> Hi Bharat,
>>>>
>>>> There is only one way to prove a feature works: with tests. That’s why
>>>> I say actually _running_ the tests we have today on any new PR is the
>>>> most important thing. Having no documentation is a problem, I agree,
>>>> but it is not more important IMHO. If we had the documentation, we
>>>> would still have issues if nobody runs the tests and verifies pull
>>>> requests. Perfect documentation does not automatically lead to a
>>>> perfect implementation. So we need tests to verify.
>>>>
>>>> If we don’t agree, that is also fine. We need to do both anyway, and I
>>>> think we do agree on that.
>>>>
>>>
>>> Also, we need to move forward. We should have a live chat once 4.6 is
>>> out to discuss all issues/problems and iron out the process.
>>>
>>> But reverting the VR refactor is not going to happen. There was ample
>>> discussion on the PR when it was submitted; there was time to review
>>> and raise concerns at that time. It went through quite a few reviews,
>>> tests, etc. Maybe the documentation is not good, but the time to raise
>>> this concern, I am afraid, was six months ago. We can learn from it,
>>> but we are not going to revert it; the revert would not go cleanly, as
>>> David mentioned.
>>>
>>> So let’s get back to blockers for 4.6: are there still VR-related
>>> issues with master?
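For context on why such a revert "would not go cleanly": reverting a months-old merged change is done against its merge commit, and a minimal sketch looks roughly like this (the SHA and the path filter are hypothetical):

    # find the merge commit that brought in the VR refactor
    git log --merges --oneline -- systemvm/
    # revert it, keeping the first parent (master's mainline) as the base
    git revert -m 1 abc1234
    # every later commit that touched the same files now surfaces as a
    # conflict, which is what makes a clean revert impossible at this point
    git status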
>>>
>>>> Regards,
>>>> Remi
>>>>
>>>> On 28/09/15 12:15, "Bharat Kumar" <bharat.ku...@citrix.com> wrote:
>>>>
>>>>> Hi Remi,
>>>>>
>>>>> I do not agree with the “There is no bigger problem” part of your
>>>>> reply, so I had to repeat myself to make it clearer, not because I am
>>>>> unaware of what this thread is supposed to do.
>>>>>
>>>>> Regards,
>>>>> Bharat.
>>>>>
>>>>> On 28-Sep-2015, at 2:51 pm, Remi Bergsma <rberg...@schubergphilis.com> wrote:
>>>>>
>>>>>> Hi Bharat,
>>>>>>
>>>>>> I understand your frustrations, but we already agreed on this, so
>>>>>> there is no need to repeat it. This thread is supposed to list some
>>>>>> improvements and learn from them. Your point has been taken, so
>>>>>> let’s move on.
>>>>>>
>>>>>> We need documentation first, then make a change, after which all
>>>>>> tests should pass. Even better, we write (missing) tests before
>>>>>> changing things, so we know they pass before and after the change.
>>>>>>
>>>>>> When doing reviews, feel free to ask for design documentation if you
>>>>>> feel it is needed.
>>>>>>
>>>>>> Regards, Remi
>>>>>>
>>>>>> On 28/09/15 11:02, "Bharat Kumar" <bharat.ku...@citrix.com> wrote:
>>>>>>
>>>>>>> Hi Remi,
>>>>>>>
>>>>>>> I never intended to say that we should not run tests, but even
>>>>>>> before tests we should have proper documentation. My concern is
>>>>>>> that if a major change is being introduced, it should be properly
>>>>>>> documented. The issues we are trying to fix are mostly due to the
>>>>>>> VR refactor. If there had been proper documentation for it, we
>>>>>>> could have fixed them in a better way. Even to add tests, we need
>>>>>>> to understand how a particular thing works and what data it
>>>>>>> expects. I think this is where most people are facing issues while
>>>>>>> fixing the Python-based code changes. Proper documentation would
>>>>>>> help us understand these things better.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Bharat.
>>>>>>>
>>>>>>> On 28-Sep-2015, at 1:57 pm, Remi Bergsma <rberg...@schubergphilis.com> wrote:
>>>>>>>
>>>>>>>> Hi Bharat,
>>>>>>>>
>>>>>>>> There is no bigger problem. We should always run the tests, and if
>>>>>>>> we find a case that isn’t currently covered by the tests, we
>>>>>>>> should simply add tests for it. There’s no way we’ll get a stable
>>>>>>>> master without them. The fact that they may not cover everything
>>>>>>>> is no reason not to rely on them. If a feature is not important
>>>>>>>> enough to write a test for, then the feature is probably not
>>>>>>>> important anyway. And if it is, then add a test :-)
>>>>>>>>
>>>>>>>> I do agree on the design documentation requirement for any
>>>>>>>> (major?) change. I found some design documentation on the subject
>>>>>>>> you mention, but it should have been more detailed.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Remi
>>>>>>>>
>>>>>>>> On 28/09/15 09:58, "Bharat Kumar" <bharat.ku...@citrix.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Remi,
>>>>>>>>>
>>>>>>>>> Thank you for the blameless postmortem.
>>>>>>>>>
>>>>>>>>> I think there is a bigger problem here than just the review
>>>>>>>>> process and running tests. Even if we run the tests, we cannot be
>>>>>>>>> sure that everything will work as intended. The tests only give
>>>>>>>>> some level of confidence; they may not cover all the use cases.
>>>>>>>>>
>>>>>>>>> I think the problem here is the way major changes to the code
>>>>>>>>> base are dealt with.
>>>>>>>>> For example, the VR refactoring was done without discussing the
>>>>>>>>> design implications and the amount of change it would bring. I
>>>>>>>>> could not find any design document. The VR refactor changed a lot
>>>>>>>>> of code and the way the VR used to work, and in my opinion it was
>>>>>>>>> incomplete: VPN, isolated networks, basic networks, iptables
>>>>>>>>> rules, RVR in the isolated case, etc. were not implemented. Most
>>>>>>>>> of us are still in the process of understanding this. Even before
>>>>>>>>> reaching this state, we had to spend a lot of time fixing the
>>>>>>>>> issues mentioned in the thread [Blocker/Critical] VR related
>>>>>>>>> Issues.
>>>>>>>>>
>>>>>>>>> When a change of this magnitude is made, we should call out all
>>>>>>>>> the changes and document them properly. This will help people
>>>>>>>>> create better fixes. Currently, when we attempt to fix the
>>>>>>>>> isolated VR case it affects the VPC case, and vice versa; for
>>>>>>>>> example, PR 738 fixed it for VPC networks but broke it for the
>>>>>>>>> isolated case. I believe it is not too late to at least start
>>>>>>>>> documenting the changes now.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Bharat.
>>>>>>>>>
>>>>>>>>> On 28-Sep-2015, at 10:52 am, Sanjeev N <sanj...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> I have a concern here. Some of us are actively involved in
>>>>>>>>>> reviewing the PRs related to Marvin tests (enhancing existing
>>>>>>>>>> tests / adding new tests). If we have to test a PR, it requires
>>>>>>>>>> an environment to be created with actual resources, and this is
>>>>>>>>>> going to take a lot of time. Some of the tests can run on the
>>>>>>>>>> simulator, but most of them require real hardware. The PR
>>>>>>>>>> submitter is already testing and submitting the test results
>>>>>>>>>> along with the PR. So is it required that reviewers test these
>>>>>>>>>> PRs as well?
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 26, 2015 at 1:49 PM, sebgoa <run...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Remi, thanks for the detailed post-mortem, it's a good read and
>>>>>>>>>>> great learning. I hope everyone reads it.
>>>>>>>>>>>
>>>>>>>>>>> The one thing to emphasize is that we now have a very visible
>>>>>>>>>>> way to get code into master. We have folks investing time to
>>>>>>>>>>> provide reviews (great); we need the submitters to do due
>>>>>>>>>>> diligence and answer all comments in the reviews.
>>>>>>>>>>>
>>>>>>>>>>> In another project I work on, nothing can be added to the code
>>>>>>>>>>> without unit tests. I think we could go down the route of
>>>>>>>>>>> asking for new integration tests and unit tests for everything;
>>>>>>>>>>> if not, the PR does not get merged. But let's digest your
>>>>>>>>>>> post-mortem and we can discuss after 4.6.0.
>>>>>>>>>>>
>>>>>>>>>>> I see that you reverted one commit that was not made by you,
>>>>>>>>>>> that's great.
>>>>>>>>>>>
>>>>>>>>>>> Let's focus on the blockers now; everything else can wait.
>>>>>>>>>>>
>>>>>>>>>>> The big bonus of doing what we are doing is that once 4.6.0 is
>>>>>>>>>>> out, we can merge PRs again (assuming they are properly rebased
>>>>>>>>>>> and tested) and we can release 4.6.1 really quickly after.
>>>>>>>>>>>
>>>>>>>>>>> -sebastien
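On Sanjeev's point about the simulator: a reviewer without hardware can still exercise part of the Marvin suite against it. A rough sketch of such a run (commands follow the 4.x developer docs from memory; paths and flags may differ per setup):

    # build with the simulator profile and start the management server
    mvn -Pdeveloper -Dsimulator clean install
    mvn -pl client jetty:run -Dsimulator
    # after deploying a simulator zone (see the developer docs),
    # run a smoke test against it
    nosetests --with-marvin --marvin-config=setup/dev/advanced.cfg \
      test/integration/smoke/test_routers.py \
      --zone=Sandbox-simulator --hypervisor=simulator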
>>>>>>>>>>>
>>>>>>>>>>> On Sep 25, 2015, at 9:51 PM, Remi Bergsma <rberg...@schubergphilis.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> This mail is intended to be blameless. We need to learn
>>>>>>>>>>>> something from it. That's why I left out who exactly did what:
>>>>>>>>>>>> it’s not relevant. There are multiple examples, but it's about
>>>>>>>>>>>> the why. Let's learn from this without blaming anyone.
>>>>>>>>>>>>
>>>>>>>>>>>> We know we need automated testing. We have integration tests,
>>>>>>>>>>>> but we are unable to run all of them on every Pull Request we
>>>>>>>>>>>> receive. If we had that in place, it'd be much easier to spot
>>>>>>>>>>>> errors, regressions and so on. It'd also be more rewarding to
>>>>>>>>>>>> write more tests.
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately we're not there yet, so we need to do something
>>>>>>>>>>>> else instead until we get there. If we do nothing, we know
>>>>>>>>>>>> we'll have many issues, because a master that breaks on a
>>>>>>>>>>>> regular basis is the most frustrating thing. We said we'd use
>>>>>>>>>>>> Pull Requests with at least two humans to review and give
>>>>>>>>>>>> their OK, in the form of an LGTM: Looks Good To Me. The LGTMs
>>>>>>>>>>>> are there because we have no automated testing. Keep that in
>>>>>>>>>>>> mind: you are supposed to replace automated testing until it's
>>>>>>>>>>>> there.
>>>>>>>>>>>>
>>>>>>>>>>>> Since we started doing this, master has become a lot more
>>>>>>>>>>>> stable. But every now and then we still have issues. Let's
>>>>>>>>>>>> look at how we do manual reviews. Again, this is not to blame
>>>>>>>>>>>> anyone. It's to open our eyes and make us realise what we're
>>>>>>>>>>>> doing and what results we get out of it.
>>>>>>>>>>>>
>>>>>>>>>>>> Example Pull Request #784:
>>>>>>>>>>>> Title: CLOUDSTACK-8799 fixed the default routes
>>>>>>>>>>>>
>>>>>>>>>>>> That's nice: it has a Jira id and a short description (as it
>>>>>>>>>>>> should).
>>>>>>>>>>>>
>>>>>>>>>>>> The first person comes along and makes a comment:
>>>>>>>>>>>> "There was also an issue with VPC VRs" ... "Have you seen this
>>>>>>>>>>>> issue? Does your change affect the VPC VR (single/redundant)?"
>>>>>>>>>>>>
>>>>>>>>>>>> Actually a good question. Unfortunately no answer came. After
>>>>>>>>>>>> a reminder, it was promised that tests would be run against
>>>>>>>>>>>> VPC networks. Great!
>>>>>>>>>>>>
>>>>>>>>>>>> Both Jenkins builds succeed and Travis is green too. But how
>>>>>>>>>>>> much value does that have? They give the impression of
>>>>>>>>>>>> automated testing, and although you could argue they provide
>>>>>>>>>>>> some, it's far from complete. If a build breaks, you know you
>>>>>>>>>>>> have an issue; but it doesn’t work the other way around.
>>>>>>>>>>>>
>>>>>>>>>>>> Back to our example PR. In the meantime, another commit gets
>>>>>>>>>>>> pushed to it: "CLOUDSTACK-8799 fixed for vpc networks." But if
>>>>>>>>>>>> you look at the Jira issue, you see it is about redundant
>>>>>>>>>>>> virtual routers, the non-VPC ones. So this is vague at best.
>>>>>>>>>>>> But a reviewer gives an LGTM because the person could create a
>>>>>>>>>>>> VPC.
>>>>>>>>>>>> That doesn't have anything to do with the problem being fixed
>>>>>>>>>>>> in this PR, nor with the comments made earlier. But at least
>>>>>>>>>>>> the person said what he did, and we should all do that. What
>>>>>>>>>>>> nobody knew back then was that this broke the default route on
>>>>>>>>>>>> VPCs.
>>>>>>>>>>>>
>>>>>>>>>>>> Then something strange happens: the two commits from the PR
>>>>>>>>>>>> end up on master as direct commits, with just one LGTM and no
>>>>>>>>>>>> verification from the person commenting about the linked
>>>>>>>>>>>> issue. This happened on Friday September 11th.
>>>>>>>>>>>>
>>>>>>>>>>>> That day 21 commits came in, from 7 Pull Requests and
>>>>>>>>>>>> unfortunately also from some direct commits. We noticed the
>>>>>>>>>>>> direct commits and notified the list
>>>>>>>>>>>> (http://cloudstack.markmail.org/message/srmszloyipkxml36). As
>>>>>>>>>>>> a lot came in at the same time, it was decided not to revert
>>>>>>>>>>>> them. Looking back, we should have done so.
>>>>>>>>>>>>
>>>>>>>>>>>> From this point on, VPCs were broken, as they wouldn't get a
>>>>>>>>>>>> default route. So: no public internet access from VMs in VPC
>>>>>>>>>>>> tiers, no VPNs working, etc. This was mentioned to the list on
>>>>>>>>>>>> Thursday September 15th, after some chats and debugging over
>>>>>>>>>>>> the weekend
>>>>>>>>>>>> (http://cloudstack.markmail.org/message/73ulpu4p75ex24tc).
>>>>>>>>>>>>
>>>>>>>>>>>> Here we are: master is broken functionality-wise, and new Pull
>>>>>>>>>>>> Requests come in to fix blockers. But we cannot verify they
>>>>>>>>>>>> work properly, because VPCs are broken in master and therefore
>>>>>>>>>>>> also in the PRs branched off of it, with or without the change
>>>>>>>>>>>> in the PR.
>>>>>>>>>>>>
>>>>>>>>>>>> It starts to escalate as the days go by.
>>>>>>>>>>>>
>>>>>>>>>>>> I’ll leave out the bit on how this frustrated people, although
>>>>>>>>>>>> it’s good to know we do not want to be in this situation.
>>>>>>>>>>>>
>>>>>>>>>>>> Eventually Wilder and I spent an evening and a day working on
>>>>>>>>>>>> a branch where we loaded 7 PRs on top of each other (all VR
>>>>>>>>>>>> related), only to find the VPC was still broken. That allowed
>>>>>>>>>>>> us to zoom in and find the default route was missing again. We
>>>>>>>>>>>> knew it had worked 3 weeks before, because the same tests that
>>>>>>>>>>>> succeeded then were now broken. We had already fixed this in
>>>>>>>>>>>> PR #738 on August 25, so we were sure about it.
>>>>>>>>>>>>
>>>>>>>>>>>> After some digging we could trace it back to Pull Request
>>>>>>>>>>>> #784. Imagine the feeling of seeing your own comment there,
>>>>>>>>>>>> mentioning the previous issue with the default gateways. Fair
>>>>>>>>>>>> to say our human review process clearly failed here. Many,
>>>>>>>>>>>> many hours were spent on this problem over the past two weeks.
>>>>>>>>>>>> Could we have prevented this from happening? I think so, yes.
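The symptom itself is quick to check from the hypervisor host; a minimal sketch (the link-local address is a made-up example; the key path and port are the usual CloudStack system-VM defaults, adjust per setup):

    # log in to the router VM and look for a default route
    ssh -i /root/.ssh/id_rsa.cloud -p 3922 root@169.254.3.35 "ip route show default"
    # a healthy VPC VR prints something like: default via 10.1.1.1 dev eth1
    # a broken one prints nothing, hence no internet access and no VPN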
>>>>>>>>>>>>
>>>>>>>>>>>> This example clearly shows why:
>>>>>>>>>>>>
>>>>>>>>>>>> - we should use Pull Requests
>>>>>>>>>>>> It made the change visible: great!
>>>>>>>>>>>>
>>>>>>>>>>>> - we do reviews and ask for feedback
>>>>>>>>>>>> We got feedback and questions: also great!
>>>>>>>>>>>>
>>>>>>>>>>>> - we should always respond to feedback and verify it is
>>>>>>>>>>>> resolved before merging
>>>>>>>>>>>> We need to improve here. Even with two reviewers saying LGTM,
>>>>>>>>>>>> we should still address any feedback before merging.
>>>>>>>>>>>>
>>>>>>>>>>>> - we should have two humans doing a review
>>>>>>>>>>>> We need to improve here as well. Not one reviewer: we need
>>>>>>>>>>>> two. Really.
>>>>>>>>>>>>
>>>>>>>>>>>> - we need to document why we say LGTM
>>>>>>>>>>>> Another improvement. It’s nice to say LGTM, but a review of
>>>>>>>>>>>> only 4 characters and nothing more is useless. We need to know
>>>>>>>>>>>> what was tested and how: test results, screenshots or anything
>>>>>>>>>>>> else that shows what's been verified. If you only reviewed the
>>>>>>>>>>>> code, that's also fine, but at least say so. Then the next
>>>>>>>>>>>> reviewer should do another type of review to get the complete
>>>>>>>>>>>> picture. Remember, you're replacing automated testing!
>>>>>>>>>>>>
>>>>>>>>>>>> - we should always merge Pull Requests
>>>>>>>>>>>> We made it easy; merging is the de facto standard, and it has
>>>>>>>>>>>> even more benefits. You can trace commits back to their Pull
>>>>>>>>>>>> Request (and find all comments and discussion there: saves
>>>>>>>>>>>> time, trust me). It also allows for easier reverting of a Pull
>>>>>>>>>>>> Request. We’ll see even more benefits once 4.7 is there.
>>>>>>>>>>>> Although the intention to merge the Pull Request was there, it
>>>>>>>>>>>> still didn't happen. We should always check before we push; as
>>>>>>>>>>>> committers we just need to be sure.
>>>>>>>>>>>>
>>>>>>>>>>>> - we need automated testing!
>>>>>>>>>>>> The sooner the better. It’s all about the missing automated
>>>>>>>>>>>> testing. After 4.6, we all need to focus on this. It saves a
>>>>>>>>>>>> lot of time. And frustration.
>>>>>>>>>>>>
>>>>>>>>>>>> We're doing final testing on PR #887 and will merge it soon.
>>>>>>>>>>>> From that point on we can look into new issues. Be aware that
>>>>>>>>>>>> any PR out there that was created after September 10 needs to
>>>>>>>>>>>> be rebased onto current master once #887 is merged; without
>>>>>>>>>>>> that, no serious testing can be done.
>>>>>>>>>>>>
>>>>>>>>>>>> Let's be careful about what lands on master. I'll only be
>>>>>>>>>>>> merging Pull Requests that have had proper reviews with
>>>>>>>>>>>> information on what was tested. At least one reviewer needs to
>>>>>>>>>>>> actually verify that it works (and show the rest of us). We
>>>>>>>>>>>> simply cannot assume it will work.
>>>>>>>>>>>>
>>>>>>>>>>>> If we do this, I think we can start resolving the remaining
>>>>>>>>>>>> blockers one by one and go into the first RC round. Please
>>>>>>>>>>>> help out where you can so we can make this a success together.
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> Looking forward to the day we have our automated testing in
>>>>>>>>>>>> place ;-)
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Remi
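For the rebase step Remi mentions above, a minimal sketch (the remote and branch names here are assumptions):

    # bring a pending PR branch up to date once #887 has landed
    git fetch upstream
    git checkout fix-CLOUDSTACK-XXXX
    git rebase upstream/master
    # re-run the relevant tests, then update the PR
    git push --force-with-lease origin fix-CLOUDSTACK-XXXX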