Hi Wilder,

I think you are taking this the wrong way. I am not pissed because people are 
asking for tests. I am pissed with the blame game that is being played in the 
community. I do not understand why we need to cite PRs and try to blame 
others, people who are trying to fix bugs that were not introduced by them in 
the first place. I hope we do not keep writing this type of “blameless” email 
which blames people anyway. It does not help anyone, but it causes a lot of 
pain to the people who are actively fixing broken stuff. Bugs occur 
unintentionally, due to lack of information or misunderstanding of the code, 
and I am sure we have all made our share of contributions to them at some 
point or other. If we follow this example, we can do a “blameless” postmortem 
of every bug in the ACS JIRA. Instead, we should try to minimise bugs by 
discussing and implementing good processes, without resorting to name calling 
or indirect blaming of people. 

Regards,
Bharat.

On 28-Sep-2015, at 11:07 pm, Wilder Rodrigues <wrodrig...@schubergphilis.com> 
wrote:

> Koushik,
> 
> Please, say my name! Don’t mention stuff like “the person blah blah blah”, 
> please! If you want to start pointing fingers, I can also play that game and 
> pull up references to the PRs which were not tested, were pushed straight to 
> master (when we agreed on not doing so), or got 2 LGTMs without any comments 
> about the test procedure/reports. In addition, I said I don’t know the code 
> as a whole because I was not the one who wrote it. We should not be talking 
> about this.
> 
> Now, please pay some attention to the work we have done over the weekend to 
> get master stable after it was broken because untested PRs got through. It 
> has nothing to do with a lack of documentation, dude! It’s all about changing 
> code and testing it. And if the one changing it doesn’t know what is being 
> done, it’s a matter of being honest, saying so, and asking for help!
> 
> Documentation is also important, as I said before, and I’m glad to help write 
> it, for the code I wrote and the code I did not. That’s community work, from 
> my point of view at least. But if you or Bharat or Raja or anyone else got 
> pissed off with my emails, or anyone else’s, asking for tests, I can’t help 
> with that. I’m already working almost 60 hours per week trying to help the 
> community with many bugs, not only VR-related ones, and getting no freaking 
> “thank you” back! And believe me, I’m not taking it personally. It’s just very 
> frustrating that there are people insisting on this thread for so long. It’s 
> about to become a pissing contest!
> 
> All the energy spent here today, trying to find people to blame, could have 
> been used to fix the blockers!
> 
> Btw, that’s my last email on this thread, but I hope that won’t be the last 
> one in the community. I still believe in ACS, despite those immature 
> discussions.
> 
> 
> Cheers,
> Wilder
> 
> 
> On 28 Sep 2015, at 18:49, Koushik Das <koushik....@citrix.com> wrote:
> 
> inline
> 
> On 28-Sep-2015, at 9:15 PM, Sebastien Goasguen <run...@gmail.com> wrote:
> 
> Let me try to reply,
> 
> On Sep 28, 2015, at 5:17 PM, Koushik Das <koushik....@citrix.com> wrote:
> 
> I had asked for documentation on the persistent VR (PR #118) changes in the 
> context of another discussion, and this is what I got at that time.
> http://dev.cloudstack.apache.narkive.com/MH47etbS/discuss-out-of-band-vr-migration-should-we-reboot-vr-or-not#post39
> 
> 
> That was back in June. So one has to assume that you were happy with the 
> answers you got; otherwise you would have kept pushing for documentation and 
> made noise about it at that time.
> 
> In the context of that discussion, yes, but not in terms of the overall 
> feature. Unfortunately the bugs weren't reported at that time, otherwise I 
> would have made enough noise :).
> 
> 
> Right now, as I see from the discussion, no one fully understands the changes, 
> and there is no documentation available explaining the design, code flow, etc.
> 
> Just like anyone coming to CloudStack from scratch, now or 3 years ago. 
> There is limited or terrible documentation about how the system works, and 
> people pick it up on their own by looking at the code. This is how it’s done 
> in a lot of projects, and a lot of software.
> 
> This is not specific to this issue. If you really want my opinion, our 
> documentation has always sucked, our API documentation is pathetic, and our 
> wiki is useless.
> 
> Are you saying that since our documentation was pathetic 3 years back, we 
> should stick to the same standard? If we applied the same logic to the code 
> base, then why care about PRs or code reviews at all? I feel we as a community 
> need to improve the documentation, starting at least with the new changes. 
> And if the changes impact core pieces of the system, it becomes even more 
> important. If you look at PR #118 you will realise that the person who 
> submitted it is claiming that he is not fully aware of the changes. What do 
> you expect me to make of that?
> 
> 
> Even if someone volunteers to fix an issue, the person may not be sure 
> whether something else will break, given the nature of the code and the 
> current state of the tests. The existing VR scripts (maybe not as appealing 
> as the new .py scripts :) ) were stabilised over multiple releases, and I am 
> sure a lot of manual testing was done as well. Getting the new changes to the 
> same level will take at least a similar amount of testing.
> 
> Given that the 4.6 release is nearing, one option would be to fix the pending 
> issues and release with a disclaimer that VR-related functionality may have 
> regressed since the last release (at this point no one knows what other 
> potential issues may crop up).
> 
> Nope…we don’t. But that is a very general statement: we don’t know what can 
> come up due to this change or any other change…mostly because we have always 
> had bad test coverage, unit or integration or what have you, and because 3 
> years in we still don’t have dedicated CI.
> 
> Or the other option would be to postpone 4.6 indefinitely, fix the 
> documentation, fix the issues, and go ahead with the release once there is 
> consensus on stability.
> 
> This, in my view, is a traditional software engineering release concept. We 
> don’t need it in an open source project like this. We can release; it takes 5 
> minutes to cut a release. We should release and fix, release and fix, release 
> and fix.
> 
> The whole idea of what we are doing is that we can release every time we find 
> and fix a bug. There is no benefit in postponing things.
> 
> 
> Theoretically we can cut any number of releases at any frequency. But since 
> concerns have been raised about the quality aspect, I feel they need to be 
> addressed appropriately. Let users make an informed decision on whether to 
> use a particular release or not. I am not saying that postponing is the only 
> option.
> 
> 
> 
> 
> -Koushik
> 
> 
> On 28-Sep-2015, at 6:20 PM, Wilder Rodrigues <wrodrig...@schubergphilis.com> wrote:
> 
> I agree with the docs stuff; I said that 5 emails ago.
> 
> Once things are fixed, I will take the time to understand the code as a whole 
> and write the documentation; we will need it for release purposes anyway.
> 
> Cheers,
> Wilder
> 
> 
> On 28 Sep 2015, at 14:47, Bharat Kumar <bharat.ku...@citrix.com> wrote:
> 
> Hi Wilder,
> 
> I am not talking about just the VPC networks. There are many other areas 
> getting affected by this; among them are VPN (not implemented), RVR in 
> isolated networks, etc.
> All I am saying is that a design doc will help us understand the complete 
> impact of the changes and deal with them accordingly.
> 
> 
> Regards,
> Bharat.
> 
> 
> On 28-Sep-2015, at 6:02 pm, Wilder Rodrigues <wrodrig...@schubergphilis.com> wrote:
> 
> Only a few tests…? 51 tests were run against a real environment.
> 
> At that time Nux also tested it and we tried to get Paul Angus, Geoff and 
> Rohit from Shape Blue to test it as well. Nux found a couple of issues that 
> were reported and fixed (see email below).
> 
> When I came back from holidays, 4 weeks ago, a PR containing 360 changed 
> files and almost 4000 lines, which did not even compile, was merged onto 
> master. We have fewer than a handful of people executing tests against PRs - 
> so few that I could name who tests and who doesn’t. But hey, that would be a 
> “blame” email. I’m not trying to justify anything, but that handful of 
> people, who actually care about ACS, are getting quite fed up with this whole 
> discussion.
> 
> Cheers,
> Wilder
> 
> ===========================================================
> 
> On 20 Feb 2015, at 10:03, Nux! <n...@li.nux.ro> wrote:
> 
> Well, it looks like we were right to test it; we found some problems.
> 
> 1 - The passwords for instances are not served properly: `wget --header 
> "DomU_Request: send_my_password" $router:8080` returns a blank response. I am 
> not sure why this happens, though I tried to look around on the router.
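> 
> (A rough Python 2 equivalent of that wget check, for anyone who wants to 
> reproduce it. This is a minimal sketch, assuming it runs from inside the 
> affected instance; the router address below is a placeholder.)
> 
>     import urllib2  # Python 2, same era as the VR scripts
> 
>     router = "10.1.1.1"  # placeholder: the VR's guest-side IP
>     req = urllib2.Request("http://%s:8080/" % router,
>                           headers={"DomU_Request": "send_my_password"})
>     body = urllib2.urlopen(req, timeout=10).read()
>     print repr(body)  # an empty response here is the reported bug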
> 
> 2 - In addition to the above, in a redundant VPC the SNAT does not work. 
> From an instance I can ping the router(s), but no further than that. 
> SNAT works fine in a normal/non-VPC network.
> I'll try to look into it more later today.
> 
> Have a nice day :)
> 
> Lucian
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> 
> 
> 
> On 28 Sep 2015, at 14:13, Bharat Kumar <bharat.ku...@citrix.com> wrote:
> 
> Hi Sebastien,
> 
> You are confused; we are talking about the persistent VR config changes. 
> Below is the PR related to it.
> https://github.com/apache/cloudstack/pull/118
> 
> If you look at it you will notice that there are more than 250 commits and 
> that only a few tests were run.
> 
> Regards,
> Bharat.
> 
> On 28-Sep-2015, at 5:24 pm, Bharat Kumar <bharat.ku...@citrix.com> wrote:
> 
> Hi Remi,
> 
> Whatever we think we have discovered are all well-known best practices for 
> developing code in a community.
> I agree that tests need to be run on a new PR, but I wonder why this was 
> ignored when the VR refactor code was merged. Perhaps we will uncover some 
> more issues if we investigate this. I believe
> one of the reasons is the complexity and incomplete nature of the VR refactor 
> change, and the failure to identify the areas that got affected. If we had 
> good documentation, I think we could have understood early on which areas 
> were getting
> impacted, areas like VPN, iptables, RVR in isolated networks, etc., and run 
> the relevant tests. The documentation will also help us focus on these areas 
> while reviewing and fixing subsequent issues. Currently no one knows which 
> areas got affected
> by the VR refactor change; we are seeing issues all over the code. I think 
> this is a bigger problem than what we have discussed so far.
> 
> I think for now we should stop fixing the VR refactoring bugs until we come 
> up with a proper document describing the VR refactoring changes.
> 
> I am not suggesting that we should revert the VR refactor code. I am willing 
> to work on this and fix the issues; I am only asking if we can get some 
> documentation.
> 
> 
> Regards,
> Bharat.
> 
> On 28-Sep-2015, at 4:59 pm, Sebastien Goasguen <run...@gmail.com> wrote:
> 
> 
> On Sep 28, 2015, at 1:14 PM, Remi Bergsma <rberg...@schubergphilis.com> wrote:
> 
> Hi Bharat,
> 
> 
> There is only one way to prove a feature works: with tests. That’s why I say 
> actually _running_ the tests we have today on any new PR is the most 
> important thing. Having no documentation is a problem, I agree, but it is not 
> more important, IMHO. If we had the documentation, we would still have issues 
> if nobody ran the tests and verified pull requests. Documentation that is 
> perfect does not automatically lead to a perfect implementation. So we need 
> tests to verify.
> 
> If we don’t agree, that is also fine. We need to do both anyway, and I think 
> we do agree on that.
> 
> 
> Also we need to move forward. We should have a live chat once 4.6 is out to 
> discuss all issues/problems and iron out the process.
> 
> But reverting the VR refactor is not going to happen. There was ample 
> discussion on the PR when it was submitted; there was time to review and 
> raise concerns then. It went through quite a few reviews, tests, etc… Maybe 
> the documentation is not good, but the time to raise this concern, I am 
> afraid, was six months ago. We can learn from it, but we are not going to 
> revert it; the revert would not go cleanly, as David mentioned.
> 
> So let’s get back to the blockers for 4.6: are there still VR-related issues 
> with master?
> 
> 
> 
> 
> Regards,
> Remi
> 
> 
> 
> 
> 
> 
> On 28/09/15 12:15, "Bharat Kumar" <bharat.ku...@citrix.com> wrote:
> 
> Hi Remi,
> 
> I do not agree with the “There is no bigger problem” part of your reply, so I 
> had to repeat myself to make it clearer, not because I am unaware of what 
> this thread is supposed to do.
> 
> Regards,
> Bharat.
> 
> On 28-Sep-2015, at 2:51 pm, Remi Bergsma <rberg...@schubergphilis.com> wrote:
> 
> Hi Bharat,
> 
> I understand your frustrations, but we already agreed on this, so there is no 
> need to repeat it. This thread is supposed to list some improvements we can 
> learn from. Your point has been taken, so let’s move on.
> 
> We need documentation first, then the change, after which all tests should 
> pass. Even better, we write the (missing) tests before changing stuff, so we 
> know they pass both before and after the fact.
> 
> When doing reviews, feel free to ask for design documentation if you feel it 
> is needed.
> 
> Regards, Remi
> 
> 
> 
> On 28/09/15 11:02, "Bharat Kumar" <bharat.ku...@citrix.com> wrote:
> 
> Hi Remi,
> 
> I never intended to say that we should not run tests, but even before tests 
> we should have proper documentation. My concern was that if a major change is 
> being introduced, it should be properly documented. The issues we are trying 
> to fix are mostly due to the VR refactor. If there were proper documentation 
> for it, we could
> have fixed them in a better way. Even to add tests, we need to understand how 
> a particular thing works and what data it expects. I think this is where most 
> people are facing issues while fixing the Python-based code changes. Proper 
> documentation will help us understand these things in a better 
> way.
> 
> Thanks,
> Bharat.
> 
> On 28-Sep-2015, at 1:57 pm, Remi Bergsma <rberg...@schubergphilis.com> wrote:
> 
> Hi Bharat,
> 
> There is no bigger problem. We should always run the tests, and if we find a 
> case that isn’t currently covered by the tests, we should simply add tests for 
> it. There’s no way we’ll get a stable master without them. The fact that they 
> may not cover everything is no reason not to rely on them. If a feature is 
> not important enough to write a test for, then the feature is probably not 
> important anyway. And if it is, then add a test :-)
> 
> I do agree on the design documentation requirement for any (major?) change. I 
> found some design documentation on the subject you mention, but it should 
> have been more detailed.
> 
> Regards,
> Remi
> 
> 
> 
> 
> 
> 
> On 28/09/15 09:58, "Bharat Kumar" <bharat.ku...@citrix.com> wrote:
> 
> Hi Remi,
> 
> Thank you for the blameless postmortem.
> 
> I think there is a bigger problem here than just the review process and 
> running tests. Even if we run the tests, we cannot be sure that everything 
> will work as intended. The tests only give some level of confidence; they may 
> not cover all the use cases.
> 
> I think the problem here is the way major changes to the code base are dealt 
> with. For example, the VR refactoring was done without discussing the design 
> implications and the amount of change it would bring in. I could not find any 
> design document. The VR refactor changed a lot of code and the way the VR 
> used to work, and in my opinion it was incomplete: VPN, isolated networks, 
> basic networks, iptables rules, RVR in the isolated case, etc. were not 
> implemented. Most of us are still in the process of understanding this. Even 
> before reaching this state we had to spend a lot of time fixing the issues 
> mentioned in the thread [Blocker/Critical] VR related Issues.
> 
> When a change of this magnitude is being made, we should call out all the 
> changes and document them properly. This will help people create better 
> fixes. Currently, when we attempt to fix the isolated VR case it affects the 
> VPC case, and vice versa; for example, PR #738 fixed it for VPC networks but 
> broke it for the isolated case. I believe it is not too late to at least 
> start documenting the changes now.
> 
> Thanks,
> Bharat.
> 
> On 28-Sep-2015, at 10:52 am, Sanjeev N <sanj...@apache.org> wrote:
> 
> I have a concern here. Some of us are actively involved in reviewing the
> PRs related to marvin tests(Enhancing existing tests/Adding new tests). If
> we have to test a PR it requires an environment to be created with actual
> resources and this is going to take lot of time. Some of the tests can run
> on simulator but most of the tests require real hardware to test. PR
> submitter is already testing and submitting the test results along with the
> PR. So is it require to test these PRs by reviewers?
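> 
> (For context: whether a Marvin test needs real hardware is declared in the
> test itself through attribute tags. Below is a minimal sketch of the pattern
> with hypothetical class and test names; required_hardware="false" is what
> marks a test as runnable on the simulator.)
> 
>     from nose.plugins.attrib import attr
>     from marvin.cloudstackTestCase import cloudstackTestCase
>     from marvin.lib.base import Router
> 
>     class TestRouterSketch(cloudstackTestCase):
> 
>         @attr(tags=["advanced", "smoke"], required_hardware="false")
>         def test_router_state(self):
>             # List the virtual routers and check that one is Running.
>             apiclient = self.testClient.getApiClient()
>             routers = Router.list(apiclient, listall=True)
>             self.assertTrue(routers, "expected at least one virtual router")
>             self.assertEqual(routers[0].state, "Running")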
> 
> On Sat, Sep 26, 2015 at 1:49 PM, sebgoa <run...@gmail.com> wrote:
> 
> Remi, thanks for the detailed post-mortem; it's a good read and a great
> lesson.
> I hope everyone reads it.
> 
> The one thing to emphasize is that we now have a very visible way to get
> code into master, and we have folks investing time to provide reviews (great);
> we need the submitters to do their due diligence and answer all comments in
> the reviews.
> 
> In another project I work on, nothing can be added to the code without
> unit tests. I think we could go down the route of asking for new
> integration tests and unit tests for everything; if they are missing, the PR
> does not get merged. But let's digest your post-mortem and we can discuss it
> after 4.6.0.
> 
> I see that you reverted one commit that was not made by you, that's great.
> 
> Let's focus on the blockers now, everything else can wait.
> 
> The big bonus of doing what we are doing is that once 4.6.0 is out, we can
> merge PRs again (assuming they are properly rebased and tested) and we can
> release 4.6.1 really quickly after.
> 
> -sebastien
> 
> On Sep 25, 2015, at 9:51 PM, Remi Bergsma <rberg...@schubergphilis.com> wrote:
> 
> Hi all,
> 
> This mail is intended to be blameless. We need to learn something from
> it. That's why I left out who exactly did what because it’s not relevant.
> There are multiple examples but it's about the why. Let's learn from this
> without blaming anyone.
> 
> We know we need automated testing. We have integration tests, but we are
> unable to run all of them on every Pull Request we receive. If we had that
> in place, it'd be much easier to spot errors, regressions and so on. It'd
> also be more rewarding to write more tests.
> 
> Unfortunately we're not there yet, so we need to do something else instead
> until we get there. If we do nothing, we know we will have many issues,
> because a master that breaks on a regular basis is one of the most
> frustrating things. We said we'd use Pull Requests with at least two humans
> reviewing and giving their OK for each Pull Request, in the form of an LGTM:
> Looks Good To Me. So the LGTMs are there because we have no automated
> testing. Keep that in mind: as a reviewer, you are supposed to replace
> automated testing until it's there.
> 
> Since we started doing this, master has gotten a lot more stable. But every
> now and then we still have issues. Let's look at how we do manual reviews.
> Again, this is not to blame anyone; it's to open our eyes and make us realise
> what we're doing and what results we get out of that.
> 
> 
> Example Pull Request #784:
> Title: CLOUDSTACK-8799 fixed the default routes
> 
> That's nice: it has a Jira ID and a short description (as it should).
> 
> The first person comes along and makes a comment:
> "There was also an issue with VPC VRs" ... "Have you seen this issue?
> Does your change affects the VPC VR (single/redundant)?"
> 
> Actually a good question. Unfortunately, no answer came. After a reminder, it
> was promised that tests would be run against VPC networks. Great!
> 
> Both Jenkins builds succeed and Travis is green as well. But how much value
> does this have? They give the impression of automated testing, and although
> you could argue they provide some, it's far from complete. If the build
> breaks, you know you have an issue, but it doesn’t work the other way around.
> 
> Back to our example PR. In the meantime, another commit gets pushed to it:
> "CLOUDSTACK-8799 fixed for vpc networks." But if you look at the Jira issue,
> you see it is about redundant virtual routers: the non-VPC ones. So this is
> vague at best. Still, a reviewer gives an LGTM because the person could
> create a VPC. That doesn't have anything to do with the problem being fixed
> in this PR, nor with the comments made earlier. But at least the person said
> what he did, and we should all do that. What nobody knew back then was that
> this broke the default route on VPCs.
> 
> Then something strange happens: the two commits from the PR end up on master
> as direct commits, with just one LGTM and no verification from the person who
> commented about the linked issue. This happened on Friday, September 11th.
> 
> That day 21 commits came in, from 7 Pull Requests and unfortunately also
> from some direct commits. We noticed the direct commits and notified the
> list (http://cloudstack.markmail.org/message/srmszloyipkxml36). As a lot
> came in at the same time, it was decided not to revert them. Looking back,
> we should have done so.
> 
> From this point on, VPCs were broken as they wouldn't get a default route.
> So: no public internet access from VMs in VPC tiers, no VPNs working, etc.
> This was mentioned to the list on Thursday, September 15th, after some chats
> and debugging over the weekend (
> http://cloudstack.markmail.org/message/73ulpu4p75ex24tc)
> 
> Here we are: master is broken functionality-wise, and new Pull Requests come
> in to fix blockers. But we cannot verify they work properly, because VPCs are
> broken in master and therefore also in the PRs branched off of it, with or
> without the change in the PR.
> 
> It starts to escalate as the days go by.
> 
> I’ll leave out the bit about how this frustrated people, although it’s good
> to know we do not want to be in this situation again.
> 
> Eventually Wilder and I spent an evening and a day working on a branch where
> we loaded 7 PRs on top of each other (all VR related), only to find the VPC
> was still broken. It allowed us to zoom in and find that the default route
> was missing again. We knew it had worked 3 weeks before, because the same
> tests that succeeded then were now broken. We had already fixed this in PR
> #738 on August 25, so we were sure about it.
> 
> After some digging we could trace it back to Pull Request #784. Imagine the
> feeling of seeing your own comment there mentioning the previous issue with
> the default gateways. It is fair to say our human review process clearly
> failed here. Many, many hours were spent on this problem over the past two
> weeks. Could we have prevented this from happening? I think so, yes.
> 
> 
> This example clearly shows why:
> 
> - we should use Pull Requests
> It made the change visible: Great!
> 
> - we do reviews and ask for feedback
> We got feedback and questions: Also great!
> 
> - we should always respond to feedback and verify it is resolved before
> merging
> We need to improve here. Even with two reviewers who say LGTM, we should
> still address all feedback before merging.
> 
> - we should have two humans doing a review
> We need to improve here as well. Not one reviewer, we need two. Really.
> 
> - we need to document why we say LGTM.
> Another improvement. It’s nice to say LGTM, but a review of only 4 characters
> and nothing more is useless. We need to know what was tested and how: test
> results, screenshots, or anything that shows what's been verified. If you
> only reviewed the code, that's also fine, but at least say so. Then the next
> reviewer can do another type of review to get the complete picture. Remember,
> you're replacing automated testing!
> 
> - we should always merge Pull Requests
> We made it easy: merging is the de facto standard, and it has even more
> benefits. You can trace commits back to their Pull Request (and find all the
> comments and discussion there: saves time, trust me). It also allows for
> easier reverting of a Pull Request. We’ll see even more benefits once 4.7 is
> there. Although the intention to merge the Pull Request was there, it still
> didn't happen. We should always check before we push. As committers we just
> need to be sure.
> 
> - we need automated testing!
> The sooner the better. It’s all about the missing automated testing.
> After 4.6, we all need to focus on this. It saves a lot of time. And
> frustration.
> 
> 
> 
> We're doing final testing on PR #887 and will merge it soon. From that point
> on we can look into new issues. Be aware that any PR out there that was
> created after September 10 needs to be rebased onto current master (once #887
> is merged). Without that, no serious testing can be done.
> 
> Let's be careful about what lands on master. I'll only be merging Pull
> Requests that have had proper reviews, with information on what was tested.
> At least one reviewer needs to actually verify that it works (and show the
> rest of us). We simply cannot assume it will work.
> 
> If we do this, I think we can start resolving the remaining blockers
> one-by-one and go into the first RC round. Please help out where you can so
> we can make this a success together. Thanks!
> 
> Looking forward to the day we have our automated testing in place ;-)
> 
> Regards,
> Remi
> 
