Dude, there was nothing friendly about the postmortem you did. It was only partial; we should do a complete postmortem and then draw conclusions.
I think post-mortems like this are of no use if we do not do them completely.

Regards,
Bharat.

On 28-Sep-2015, at 5:39 pm, Remi Bergsma <rberg...@schubergphilis.com> wrote:

> Dude, this is the final friendly email about this. All points have been
> made in previous mails. This has nothing to do with ‘blameless’ and
> ‘learning’ anymore.
>
> Read Seb’s mail. We will move on now.
>
> Regards, Remi
>
> On 28/09/15 13:54, "Bharat Kumar" <bharat.ku...@citrix.com> wrote:
>
>> Hi Remi,
>>
>> Whatever we think we have discovered are all well-known best practices
>> for developing code in a community. I agree that tests need to be run on
>> a new PR, but I wonder why that was ignored when the VR refactor code was
>> merged. Perhaps we will uncover some more issues if we investigate this.
>> I believe one of the reasons for this is the complexity and incomplete
>> nature of the VR refactor change, and the failure to identify the areas
>> it affected. If we had good documentation, I think we could have
>> understood early on which areas were getting impacted (areas like VPN,
>> iptables, isolated networks, RVR, etc.) and run the relevant tests. The
>> documentation would also help us focus on these areas while reviewing
>> and fixing subsequent issues. Currently no one knows which areas were
>> affected by the VR refactor change; we are seeing issues all over the
>> code. I think this is a bigger problem than what we have discussed so
>> far.
>>
>> I think we should presently stop fixing the VR refactoring bugs until we
>> come up with a proper document describing the VR refactoring changes.
>>
>> I am not suggesting that we should revert the VR refactor code. I am
>> willing to work on this and fix the issues; I am only asking if we can
>> get some documentation.
>>
>> Regards,
>> Bharat.
>>
>> On 28-Sep-2015, at 4:59 pm, Sebastien Goasguen <run...@gmail.com> wrote:
>>
>>>
>>>> On Sep 28, 2015, at 1:14 PM, Remi Bergsma <rberg...@schubergphilis.com> wrote:
>>>>
>>>> Hi Bharat,
>>>>
>>>> There is only one way to prove a feature works: with tests. That’s why
>>>> I say actually _running_ the tests we have today on any new PR is the
>>>> most important thing. Having no documentation is a problem, I agree,
>>>> but it is not more important IMHO. If we had the documentation, we
>>>> would still have issues if nobody runs the tests and verifies pull
>>>> requests. Perfect documentation does not automatically lead to a
>>>> perfect implementation. So we need tests to verify.
>>>>
>>>> If we don’t agree, that is also fine. We need to do both anyway, and I
>>>> think we do agree on that.
>>>>
>>>
>>> Also, we need to move forward. We should have a live chat once 4.6 is
>>> out to discuss all issues/problems and iron out the process.
>>>
>>> But reverting the VR refactor is not going to happen. There was ample
>>> discussion on the PR when it was submitted; there was time to review
>>> and raise concerns at that time. It went through quite a few reviews,
>>> tests, etc. Maybe the documentation is not good, but the time to raise
>>> this concern, I am afraid, was six months ago. We can learn from it,
>>> but we are not going to revert it; the revert would not go cleanly, as
>>> David mentioned.
>>>
>>> So let’s get back to blockers for 4.6: are there still VR-related
>>> issues with master?
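For context on why such a revert "would not go cleanly": reverting a months-old merged change is done against its merge commit, and a minimal sketch looks roughly like this (the SHA and the path filter are hypothetical):

    # find the merge commit that brought in the VR refactor
    git log --merges --oneline -- systemvm/
    # revert it, keeping the first parent (master's mainline) as the base
    git revert -m 1 abc1234
    # every later commit that touched the same files now surfaces as a
    # conflict, which is what makes a clean revert impossible at this point
    git status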
>>>
>>>> Regards,
>>>> Remi
>>>>
>>>> On 28/09/15 12:15, "Bharat Kumar" <bharat.ku...@citrix.com> wrote:
>>>>
>>>>> Hi Remi,
>>>>>
>>>>> I do not agree with the “There is no bigger problem” part of your
>>>>> reply, so I had to repeat myself to make it clearer, not because I am
>>>>> unaware of what this thread is supposed to do.
>>>>>
>>>>> Regards,
>>>>> Bharat.
>>>>>
>>>>> On 28-Sep-2015, at 2:51 pm, Remi Bergsma <rberg...@schubergphilis.com> wrote:
>>>>>
>>>>>> Hi Bharat,
>>>>>>
>>>>>> I understand your frustrations, but we already agreed on this, so
>>>>>> there is no need to repeat it. This thread is supposed to list some
>>>>>> improvements and learn from them. Your point has been taken, so
>>>>>> let’s move on.
>>>>>>
>>>>>> We need documentation first, then make a change, after which all
>>>>>> tests should pass. Even better, we write (missing) tests before
>>>>>> changing things, so we know they pass before and after the change.
>>>>>>
>>>>>> When doing reviews, feel free to ask for design documentation if you
>>>>>> feel it is needed.
>>>>>>
>>>>>> Regards, Remi
>>>>>>
>>>>>> On 28/09/15 11:02, "Bharat Kumar" <bharat.ku...@citrix.com> wrote:
>>>>>>
>>>>>>> Hi Remi,
>>>>>>>
>>>>>>> I never intended to say that we should not run tests, but even
>>>>>>> before tests we should have proper documentation. My concern is
>>>>>>> that if a major change is being introduced, it should be properly
>>>>>>> documented. The issues we are trying to fix are mostly due to the
>>>>>>> VR refactor. If there had been proper documentation for it, we
>>>>>>> could have fixed them in a better way. Even to add tests, we need
>>>>>>> to understand how a particular thing works and what data it
>>>>>>> expects. I think this is where most people are facing issues while
>>>>>>> fixing the Python-based code changes. Proper documentation would
>>>>>>> help us understand these things better.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Bharat.
>>>>>>>
>>>>>>> On 28-Sep-2015, at 1:57 pm, Remi Bergsma <rberg...@schubergphilis.com> wrote:
>>>>>>>
>>>>>>>> Hi Bharat,
>>>>>>>>
>>>>>>>> There is no bigger problem. We should always run the tests, and if
>>>>>>>> we find a case that isn’t currently covered by the tests, we
>>>>>>>> should simply add tests for it. There’s no way we’ll get a stable
>>>>>>>> master without them. The fact that they may not cover everything
>>>>>>>> is no reason not to rely on them. If a feature is not important
>>>>>>>> enough to write a test for, then the feature is probably not
>>>>>>>> important anyway. And if it is, then add a test :-)
>>>>>>>>
>>>>>>>> I do agree on the design documentation requirement for any
>>>>>>>> (major?) change. I found some design documentation on the subject
>>>>>>>> you mention, but it should have been more detailed.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Remi
>>>>>>>>
>>>>>>>> On 28/09/15 09:58, "Bharat Kumar" <bharat.ku...@citrix.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Remi,
>>>>>>>>>
>>>>>>>>> Thank you for the blameless postmortem.
>>>>>>>>>
>>>>>>>>> I think there is a bigger problem here than just the review
>>>>>>>>> process and running tests. Even if we run the tests, we cannot be
>>>>>>>>> sure that everything will work as intended. The tests only give
>>>>>>>>> some level of confidence; they may not cover all the use cases.
>>>>>>>>>
>>>>>>>>> I think the problem here is the way major changes to the code
>>>>>>>>> base are dealt with.
>>>>>>>>> For example, the VR refactoring was done without discussing the
>>>>>>>>> design implications and the amount of change it would bring. I
>>>>>>>>> could not find any design document. The VR refactor changed a lot
>>>>>>>>> of code and the way the VR used to work, and in my opinion it was
>>>>>>>>> incomplete: VPN, isolated networks, basic networks, iptables
>>>>>>>>> rules, RVR in the isolated case, etc. were not implemented. Most
>>>>>>>>> of us are still in the process of understanding this. Even before
>>>>>>>>> reaching this state, we had to spend a lot of time fixing the
>>>>>>>>> issues mentioned in the thread [Blocker/Critical] VR related
>>>>>>>>> Issues.
>>>>>>>>>
>>>>>>>>> When a change of this magnitude is made, we should call out all
>>>>>>>>> the changes and document them properly. This will help people
>>>>>>>>> create better fixes. Currently, when we attempt to fix the
>>>>>>>>> isolated VR case it affects the VPC case, and vice versa; for
>>>>>>>>> example, PR 738 fixed it for VPC networks but broke it for the
>>>>>>>>> isolated case. I believe it is not too late to at least start
>>>>>>>>> documenting the changes now.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Bharat.
>>>>>>>>>
>>>>>>>>> On 28-Sep-2015, at 10:52 am, Sanjeev N <sanj...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> I have a concern here. Some of us are actively involved in
>>>>>>>>>> reviewing the PRs related to Marvin tests (enhancing existing
>>>>>>>>>> tests / adding new tests). If we have to test a PR, it requires
>>>>>>>>>> an environment to be created with actual resources, and this is
>>>>>>>>>> going to take a lot of time. Some of the tests can run on the
>>>>>>>>>> simulator, but most of them require real hardware. The PR
>>>>>>>>>> submitter is already testing and submitting the test results
>>>>>>>>>> along with the PR. So is it required that reviewers test these
>>>>>>>>>> PRs as well?
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 26, 2015 at 1:49 PM, sebgoa <run...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Remi, thanks for the detailed post-mortem, it's a good read and
>>>>>>>>>>> great learning. I hope everyone reads it.
>>>>>>>>>>>
>>>>>>>>>>> The one thing to emphasize is that we now have a very visible
>>>>>>>>>>> way to get code into master. We have folks investing time to
>>>>>>>>>>> provide reviews (great); we need the submitters to do due
>>>>>>>>>>> diligence and answer all comments in the reviews.
>>>>>>>>>>>
>>>>>>>>>>> In another project I work on, nothing can be added to the code
>>>>>>>>>>> without unit tests. I think we could go down the route of
>>>>>>>>>>> asking for new integration tests and unit tests for everything;
>>>>>>>>>>> if not, the PR does not get merged. But let's digest your
>>>>>>>>>>> post-mortem and we can discuss after 4.6.0.
>>>>>>>>>>>
>>>>>>>>>>> I see that you reverted one commit that was not made by you,
>>>>>>>>>>> that's great.
>>>>>>>>>>>
>>>>>>>>>>> Let's focus on the blockers now; everything else can wait.
>>>>>>>>>>>
>>>>>>>>>>> The big bonus of doing what we are doing is that once 4.6.0 is
>>>>>>>>>>> out, we can merge PRs again (assuming they are properly rebased
>>>>>>>>>>> and tested) and we can release 4.6.1 really quickly after.
>>>>>>>>>>>
>>>>>>>>>>> -sebastien
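On Sanjeev's point about the simulator: a reviewer without hardware can still exercise part of the Marvin suite against it. A rough sketch of such a run (commands follow the 4.x developer docs from memory; paths and flags may differ per setup):

    # build with the simulator profile and start the management server
    mvn -Pdeveloper -Dsimulator clean install
    mvn -pl client jetty:run -Dsimulator
    # after deploying a simulator zone (see the developer docs),
    # run a smoke test against it
    nosetests --with-marvin --marvin-config=setup/dev/advanced.cfg \
      test/integration/smoke/test_routers.py \
      --zone=Sandbox-simulator --hypervisor=simulator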
>>>>>>>>>>>
>>>>>>>>>>> On Sep 25, 2015, at 9:51 PM, Remi Bergsma <rberg...@schubergphilis.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> This mail is intended to be blameless. We need to learn
>>>>>>>>>>>> something from it. That's why I left out who exactly did what:
>>>>>>>>>>>> it’s not relevant. There are multiple examples, but it's about
>>>>>>>>>>>> the why. Let's learn from this without blaming anyone.
>>>>>>>>>>>>
>>>>>>>>>>>> We know we need automated testing. We have integration tests,
>>>>>>>>>>>> but we are unable to run all of them on every Pull Request we
>>>>>>>>>>>> receive. If we had that in place, it'd be much easier to spot
>>>>>>>>>>>> errors, regressions and so on. It'd also be more rewarding to
>>>>>>>>>>>> write more tests.
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately we're not there yet, so we need to do something
>>>>>>>>>>>> else instead until we get there. If we do nothing, we know
>>>>>>>>>>>> we'll have many issues, because a master that breaks on a
>>>>>>>>>>>> regular basis is the most frustrating thing. We said we'd use
>>>>>>>>>>>> Pull Requests with at least two humans to review and give
>>>>>>>>>>>> their OK, in the form of an LGTM: Looks Good To Me. The LGTMs
>>>>>>>>>>>> are there because we have no automated testing. Keep that in
>>>>>>>>>>>> mind: you are supposed to replace automated testing until it's
>>>>>>>>>>>> there.
>>>>>>>>>>>>
>>>>>>>>>>>> Since we started doing this, master has become a lot more
>>>>>>>>>>>> stable. But every now and then we still have issues. Let's
>>>>>>>>>>>> look at how we do manual reviews. Again, this is not to blame
>>>>>>>>>>>> anyone. It's to open our eyes and make us realise what we're
>>>>>>>>>>>> doing and what results we get out of it.
>>>>>>>>>>>>
>>>>>>>>>>>> Example Pull Request #784:
>>>>>>>>>>>> Title: CLOUDSTACK-8799 fixed the default routes
>>>>>>>>>>>>
>>>>>>>>>>>> That's nice: it has a Jira id and a short description (as it
>>>>>>>>>>>> should).
>>>>>>>>>>>>
>>>>>>>>>>>> The first person comes along and makes a comment:
>>>>>>>>>>>> "There was also an issue with VPC VRs" ... "Have you seen this
>>>>>>>>>>>> issue? Does your change affect the VPC VR (single/redundant)?"
>>>>>>>>>>>>
>>>>>>>>>>>> Actually a good question. Unfortunately no answer came. After
>>>>>>>>>>>> a reminder, it was promised that tests would be run against
>>>>>>>>>>>> VPC networks. Great!
>>>>>>>>>>>>
>>>>>>>>>>>> Both Jenkins builds succeed and Travis is green too. But how
>>>>>>>>>>>> much value does that have? They give the impression of
>>>>>>>>>>>> automated testing, and although you could argue they provide
>>>>>>>>>>>> some, it's far from complete. If a build breaks, you know you
>>>>>>>>>>>> have an issue; but it doesn’t work the other way around.
>>>>>>>>>>>>
>>>>>>>>>>>> Back to our example PR. In the meantime, another commit gets
>>>>>>>>>>>> pushed to it: "CLOUDSTACK-8799 fixed for vpc networks." But if
>>>>>>>>>>>> you look at the Jira issue, you see it is about redundant
>>>>>>>>>>>> virtual routers, the non-VPC ones. So this is vague at best.
>>>>>>>>>>>> But a reviewer gives an LGTM because the person could create a
>>>>>>>>>>>> VPC.
>>>>>>>>>>>> That doesn't have anything to do with the problem being fixed
>>>>>>>>>>>> in this PR, nor with the comments made earlier. But at least
>>>>>>>>>>>> the person said what he did, and we should all do that. What
>>>>>>>>>>>> nobody knew back then was that this broke the default route on
>>>>>>>>>>>> VPCs.
>>>>>>>>>>>>
>>>>>>>>>>>> Then something strange happens: the two commits from the PR
>>>>>>>>>>>> end up on master as direct commits, with just one LGTM and no
>>>>>>>>>>>> verification from the person commenting about the linked
>>>>>>>>>>>> issue. This happened on Friday September 11th.
>>>>>>>>>>>>
>>>>>>>>>>>> That day 21 commits came in, from 7 Pull Requests and
>>>>>>>>>>>> unfortunately also from some direct commits. We noticed the
>>>>>>>>>>>> direct commits and notified the list
>>>>>>>>>>>> (http://cloudstack.markmail.org/message/srmszloyipkxml36). As
>>>>>>>>>>>> a lot came in at the same time, it was decided not to revert
>>>>>>>>>>>> them. Looking back, we should have done so.
>>>>>>>>>>>>
>>>>>>>>>>>> From this point on, VPCs were broken, as they wouldn't get a
>>>>>>>>>>>> default route. So: no public internet access from VMs in VPC
>>>>>>>>>>>> tiers, no VPNs working, etc. This was mentioned to the list on
>>>>>>>>>>>> Thursday September 15th, after some chats and debugging over
>>>>>>>>>>>> the weekend
>>>>>>>>>>>> (http://cloudstack.markmail.org/message/73ulpu4p75ex24tc).
>>>>>>>>>>>>
>>>>>>>>>>>> Here we are: master is broken functionality-wise, and new Pull
>>>>>>>>>>>> Requests come in to fix blockers. But we cannot verify they
>>>>>>>>>>>> work properly, because VPCs are broken in master and therefore
>>>>>>>>>>>> also in the PRs branched off of it, with or without the change
>>>>>>>>>>>> in the PR.
>>>>>>>>>>>>
>>>>>>>>>>>> It starts to escalate as the days go by.
>>>>>>>>>>>>
>>>>>>>>>>>> I’ll leave out the bit on how this frustrated people, although
>>>>>>>>>>>> it’s good to know we do not want to be in this situation.
>>>>>>>>>>>>
>>>>>>>>>>>> Eventually Wilder and I spent an evening and a day working on
>>>>>>>>>>>> a branch where we loaded 7 PRs on top of each other (all VR
>>>>>>>>>>>> related), only to find the VPC was still broken. That allowed
>>>>>>>>>>>> us to zoom in and find the default route was missing again. We
>>>>>>>>>>>> knew it had worked 3 weeks before, because the same tests that
>>>>>>>>>>>> succeeded then were now broken. We had already fixed this in
>>>>>>>>>>>> PR #738 on August 25, so we were sure about it.
>>>>>>>>>>>>
>>>>>>>>>>>> After some digging we could trace it back to Pull Request
>>>>>>>>>>>> #784. Imagine the feeling of seeing your own comment there,
>>>>>>>>>>>> mentioning the previous issue with the default gateways. Fair
>>>>>>>>>>>> to say our human review process clearly failed here. Many,
>>>>>>>>>>>> many hours were spent on this problem over the past two weeks.
>>>>>>>>>>>> Could we have prevented this from happening? I think so, yes.
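The symptom itself is quick to check from the hypervisor host; a minimal sketch (the link-local address is a made-up example; the key path and port are the usual CloudStack system-VM defaults, adjust per setup):

    # log in to the router VM and look for a default route
    ssh -i /root/.ssh/id_rsa.cloud -p 3922 root@169.254.3.35 "ip route show default"
    # a healthy VPC VR prints something like: default via 10.1.1.1 dev eth1
    # a broken one prints nothing, hence no internet access and no VPN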
>>>>>>>>>>>>
>>>>>>>>>>>> This example clearly shows why:
>>>>>>>>>>>>
>>>>>>>>>>>> - we should use Pull Requests
>>>>>>>>>>>> It made the change visible: great!
>>>>>>>>>>>>
>>>>>>>>>>>> - we do reviews and ask for feedback
>>>>>>>>>>>> We got feedback and questions: also great!
>>>>>>>>>>>>
>>>>>>>>>>>> - we should always respond to feedback and verify it is
>>>>>>>>>>>> resolved before merging
>>>>>>>>>>>> We need to improve here. Even with two reviewers saying LGTM,
>>>>>>>>>>>> we should still address any feedback before merging.
>>>>>>>>>>>>
>>>>>>>>>>>> - we should have two humans doing a review
>>>>>>>>>>>> We need to improve here as well. Not one reviewer: we need
>>>>>>>>>>>> two. Really.
>>>>>>>>>>>>
>>>>>>>>>>>> - we need to document why we say LGTM
>>>>>>>>>>>> Another improvement. It’s nice to say LGTM, but a review of
>>>>>>>>>>>> only 4 characters and nothing more is useless. We need to know
>>>>>>>>>>>> what was tested and how: test results, screenshots or anything
>>>>>>>>>>>> else that shows what's been verified. If you only reviewed the
>>>>>>>>>>>> code, that's also fine, but at least say so. Then the next
>>>>>>>>>>>> reviewer should do another type of review to get the complete
>>>>>>>>>>>> picture. Remember, you're replacing automated testing!
>>>>>>>>>>>>
>>>>>>>>>>>> - we should always merge Pull Requests
>>>>>>>>>>>> We made it easy; merging is the de facto standard, and it has
>>>>>>>>>>>> even more benefits. You can trace commits back to their Pull
>>>>>>>>>>>> Request (and find all comments and discussion there: saves
>>>>>>>>>>>> time, trust me). It also allows for easier reverting of a Pull
>>>>>>>>>>>> Request. We’ll see even more benefits once 4.7 is there.
>>>>>>>>>>>> Although the intention to merge the Pull Request was there, it
>>>>>>>>>>>> still didn't happen. We should always check before we push; as
>>>>>>>>>>>> committers we just need to be sure.
>>>>>>>>>>>>
>>>>>>>>>>>> - we need automated testing!
>>>>>>>>>>>> The sooner the better. It’s all about the missing automated
>>>>>>>>>>>> testing. After 4.6, we all need to focus on this. It saves a
>>>>>>>>>>>> lot of time. And frustration.
>>>>>>>>>>>>
>>>>>>>>>>>> We're doing final testing on PR #887 and will merge it soon.
>>>>>>>>>>>> From that point on we can look into new issues. Be aware that
>>>>>>>>>>>> any PR out there that was created after September 10 needs to
>>>>>>>>>>>> be rebased onto current master once #887 is merged; without
>>>>>>>>>>>> that, no serious testing can be done.
>>>>>>>>>>>>
>>>>>>>>>>>> Let's be careful about what lands on master. I'll only be
>>>>>>>>>>>> merging Pull Requests that have had proper reviews with
>>>>>>>>>>>> information on what was tested. At least one reviewer needs to
>>>>>>>>>>>> actually verify that it works (and show the rest of us). We
>>>>>>>>>>>> simply cannot assume it will work.
>>>>>>>>>>>>
>>>>>>>>>>>> If we do this, I think we can start resolving the remaining
>>>>>>>>>>>> blockers one by one and go into the first RC round. Please
>>>>>>>>>>>> help out where you can so we can make this a success together.
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> Looking forward to the day we have our automated testing in
>>>>>>>>>>>> place ;-)
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Remi
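For the rebase step Remi mentions above, a minimal sketch (the remote and branch names here are assumptions):

    # bring a pending PR branch up to date once #887 has landed
    git fetch upstream
    git checkout fix-CLOUDSTACK-XXXX
    git rebase upstream/master
    # re-run the relevant tests, then update the PR
    git push --force-with-lease origin fix-CLOUDSTACK-XXXX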