Re: [openstack-dev] [tripleo] [ci] recheck impact on CI infrastructure
Hi Emilien and all,

On 16.12.2016 01:26, Emilien Macchi wrote:
> On Thu, Dec 15, 2016 at 12:22 PM, Sven Anderson wrote:
>> Hi all,
>>
>> while I was waiting again for the CI to be fixed and didn't want to
>> torture it with additional rechecks, I wanted to find out, how much of
>> our CI infrastructure we waste with rechecks. My assumption was that
>> every recheck is a waste of resources based on a false negative, because
>> it renders the previous build useless. So I wrote a small script[1] to
>> calculate how many rechecks are made on average per built patch-set. It
>> calculates the number of patch-sets of merged changes that CI was
>> testing (some patch-sets are not, because they were updated before CI
>> started testing), the number of rechecks issued on these patch-sets, and
>> a value "CI-factor", which is the factor by which the rechecks increased
>> the CI runs, that is, without rechecks it would be 1, if every
>> tested patch-set would have exactly one recheck it would be 2.
>
> I see 2 different topics here.
>
> # One is not related to $topic but still worth mentioning:
> "while I was waiting again for the CI to be fixed"
>
> This week has been tough, and many of us burnt our time to resolve
> different complex problems in TripleO CI, mostly related to external
> dependencies (qemu upgrade, centos 7.3 upgrade, tripleo-ci infra,
> etc).
> Resolving these problems is very challenging and you'll notice that
> only a few of us actually work on this task, while a lot of people
> continue to push their features "hoping" that it will pass CI
> sometimes and if not, well, we'll do 'recheck'.
> That is a way of working I would say. I personally can't continue to
> code if the project I'm working on has broken CI.
>
> In a previous experience, I've been working in a team where everyone
> stopped regular work when CI was broken and focus on fixing it.
> I'm not saying everyone should stop their tasks and help, but this
> "wait and see" comment doesn't actually help us to move forward.
> People need to get more involved in CI and be more helpful. I know
> it's difficult, but it's something anyone can learn, like you would
> learn how to write Python code for example.

I think you took my mail the wrong way. I didn't want to say that
anyone is not doing their job right, and I didn't want to complain. I
know how challenging this is. In my previous job I was the person
running the CI (among other things). I just wanted to share the
results, because I think it's interesting what percentage of our CI
infrastructure is "wasted" by rechecks. On the one hand, I wanted to
raise awareness that we should not just blindly "recheck until
verified", and on the other hand, to show how valuable it is to keep
CI stable.

Is it really the case that more CI people would help here? I would
have expected that, as long as we don't do more modularized testing,
it doesn't scale. Would more CI people fix the problems more quickly?
Or is it more like this: the burden could be distributed across more
shoulders, so that it is not always the same people who have to
interrupt their work? The second wouldn't improve the situation, but
would just spread the burden more fairly.

With my post I mainly wanted to provide reliable data, to emphasize
how important a stable CI and the work on it is, and to ask that we
all restrain ourselves from blindly rechecking.

Happy New Year to everyone!

Sven

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [tripleo] [ci] recheck impact on CI infrastructure
On Thu, Dec 15, 2016 at 12:22 PM, Sven Anderson wrote:
> Hi all,
>
> while I was waiting again for the CI to be fixed and didn't want to
> torture it with additional rechecks, I wanted to find out, how much of
> our CI infrastructure we waste with rechecks. My assumption was that
> every recheck is a waste of resources based on a false negative, because
> it renders the previous build useless. So I wrote a small script[1] to
> calculate how many rechecks are made on average per built patch-set. It
> calculates the number of patch-sets of merged changes that CI was
> testing (some patch-sets are not, because they were updated before CI
> started testing), the number of rechecks issued on these patch-sets, and
> a value "CI-factor", which is the factor by which the rechecks increased
> the CI runs, that is, without rechecks it would be 1, if every
> tested patch-set would have exactly one recheck it would be 2.

I see 2 different topics here.

# One is not related to $topic but still worth mentioning:
"while I was waiting again for the CI to be fixed"

This week has been tough, and many of us burnt our time resolving
different complex problems in TripleO CI, mostly related to external
dependencies (qemu upgrade, centos 7.3 upgrade, tripleo-ci infra,
etc).
Resolving these problems is very challenging, and you'll notice that
only a few of us actually work on this task, while a lot of people
continue to push their features "hoping" that they will pass CI
sometime and if not, well, we'll do 'recheck'.
That is a way of working, I would say. I personally can't continue to
code if the project I'm working on has broken CI.

In a previous experience, I worked in a team where everyone
stopped regular work when CI was broken and focused on fixing it.
I'm not saying everyone should stop their tasks and help, but this
"wait and see" comment doesn't actually help us to move forward.
People need to get more involved in CI and be more helpful. I know
it's difficult, but it's something anyone can learn, like you would
learn how to write Python code for example.

# The second one is about the actual $topic and your stats.
Yes, we have been thinking about a way to optimize the way we restart
CI jobs, and this is under discussion:
https://review.openstack.org/#/c/411450/

As you can see, there is some pushback from Clark, who is infra-core,
so we might want to continue the discussion here and see how it goes.

In the long term, our goal is to have more consistency in the way we
test TripleO and to get more adoption of the tools we're developing
for CI, so they are more consumable by anyone in our community.
We also hope to have more people involved when things are broken, and
not always the same folks spending days and evenings "extinguishing
fires".

We are working hard on CI stabilization and consolidation with
multinode scenarios and OVB improvements, but it takes time and
iterations. Any help is highly welcome here.

> The results were not as bad as my feeling, we are below 2 for most of
> the projects I tested. :-) But still, on THT for instance we use 71%
> more resources because of the false negatives. I made monthly
> breakdowns, so you can see a positive trend at least.
>
> Here the results:
>
> Project: tripleo-heat-templates
>
> month   patches   rechecks   CI-factor
> 1       221       102        1.46
> 2       282       300        2.06
> 3       588       567        1.96
> 4       220       253        2.15
> 5       333       242        1.73
> 6       459       325        1.71
> 7       612       390        1.64
> 8       694       442        1.64
> 9       717       440        1.61
> 10      474       316        1.67
> 11      358       189        1.53
> 12      168       80         1.48
> total   5126      3646       1.71
>
> Project: tripleo-common
>
> month   patches   rechecks   CI-factor
> 1       73        29         1.4
> 2       59        48         1.81
> 3       92        101        2.1
> 4       17        19         2.12
> 5       47        27         1.57
> 6       83        46         1.55
> 7       66        26         1.39
> 8       209       102        1.49
> 9       261       129        1.49
> 10      110       51         1.46
> 11      121       47         1.39
> 12      40        19         1.48
> total   1178      644        1.55
>
> Project: tripleo-puppet-elements
>
> month   patches   rechecks   CI-factor
> 1       24        9          1.38
> 2       9         20         3.22
> 3       7         16         3.29
> 4       9         24         3.67
> 5       14        17         2.21
> 6       17        33         2.94
> 7       12        16         2.33
> 8       15        21         2.4
> 9       10        14         2.4
> 10      12        5          1.42
> 11      34        25         1.74
> 12      10        13         2.3
> total   173
Re: [openstack-dev] [tripleo] [ci] recheck impact on CI infrastructure
Neat, thanks Sven!

Here are the nova stats:

http://paste.openstack.org/show/592551/

--diana

On Thu, Dec 15, 2016 at 12:22 PM, Sven Anderson wrote:
> Hi all,
>
> while I was waiting again for the CI to be fixed and didn't want to
> torture it with additional rechecks, I wanted to find out, how much of
> our CI infrastructure we waste with rechecks. My assumption was that
> every recheck is a waste of resources based on a false negative, because
> it renders the previous build useless. So I wrote a small script[1] to
> calculate how many rechecks are made on average per built patch-set. It
> calculates the number of patch-sets of merged changes that CI was
> testing (some patch-sets are not, because they were updated before CI
> started testing), the number of rechecks issued on these patch-sets, and
> a value "CI-factor", which is the factor by which the rechecks increased
> the CI runs, that is, without rechecks it would be 1, if every
> tested patch-set would have exactly one recheck it would be 2.
>
> The results were not as bad as my feeling, we are below 2 for most of
> the projects I tested. :-) But still, on THT for instance we use 71%
> more resources because of the false negatives. I made monthly
> breakdowns, so you can see a positive trend at least.
>
> Here the results:
>
> Project: tripleo-heat-templates
>
> month   patches   rechecks   CI-factor
> 1       221       102        1.46
> 2       282       300        2.06
> 3       588       567        1.96
> 4       220       253        2.15
> 5       333       242        1.73
> 6       459       325        1.71
> 7       612       390        1.64
> 8       694       442        1.64
> 9       717       440        1.61
> 10      474       316        1.67
> 11      358       189        1.53
> 12      168       80         1.48
> total   5126      3646       1.71
>
> Project: tripleo-common
>
> month   patches   rechecks   CI-factor
> 1       73        29         1.4
> 2       59        48         1.81
> 3       92        101        2.1
> 4       17        19         2.12
> 5       47        27         1.57
> 6       83        46         1.55
> 7       66        26         1.39
> 8       209       102        1.49
> 9       261       129        1.49
> 10      110       51         1.46
> 11      121       47         1.39
> 12      40        19         1.48
> total   1178      644        1.55
>
> Project: tripleo-puppet-elements
>
> month   patches   rechecks   CI-factor
> 1       24        9          1.38
> 2       9         20         3.22
> 3       7         16         3.29
> 4       9         24         3.67
> 5       14        17         2.21
> 6       17        33         2.94
> 7       12        16         2.33
> 8       15        21         2.4
> 9       10        14         2.4
> 10      12        5          1.42
> 11      34        25         1.74
> 12      10        13         2.3
> total   173       213        2.23
>
> Project: puppet-tripleo
>
> month   patches   rechecks   CI-factor
> 1       29        23         1.79
> 2       36        68         2.89
> 3       40        44         2.1
> 4       68        74         2.09
> 5       129       43         1.33
> 6       265       206        1.78
> 7       235       118        1.5
> 8       193       130        1.67
> 9       147       123        1.84
> 10      233       159        1.68
> 11      137       86         1.63
> 12      20        5          1.25
> total   1532      1079       1.7
>
>
> [1] https://gist.github.com/ansiwen/e139cbf25bc243d30629e0157fc753ff
[openstack-dev] [tripleo] [ci] recheck impact on CI infrastructure
Hi all,

while I was waiting again for the CI to be fixed and didn't want to
torture it with additional rechecks, I wanted to find out how much of
our CI infrastructure we waste with rechecks. My assumption was that
every recheck is a waste of resources based on a false negative, because
it renders the previous build useless. So I wrote a small script[1] to
calculate how many rechecks are made on average per built patch-set. It
calculates the number of patch-sets of merged changes that CI was
testing (some patch-sets are not tested, because they were updated
before CI started testing), the number of rechecks issued on these
patch-sets, and a value "CI-factor", which is the factor by which the
rechecks increased the CI runs. That is, without rechecks it would be 1;
if every tested patch-set had exactly one recheck, it would be 2.

The results were not as bad as I had feared: we are below 2 for most of
the projects I tested. :-) But still, on THT for instance, we use 71%
more resources because of the false negatives. I made monthly
breakdowns, so you can see a positive trend at least.

Here are the results:

Project: tripleo-heat-templates

month   patches   rechecks   CI-factor
1       221       102        1.46
2       282       300        2.06
3       588       567        1.96
4       220       253        2.15
5       333       242        1.73
6       459       325        1.71
7       612       390        1.64
8       694       442        1.64
9       717       440        1.61
10      474       316        1.67
11      358       189        1.53
12      168       80         1.48
total   5126      3646       1.71

Project: tripleo-common

month   patches   rechecks   CI-factor
1       73        29         1.4
2       59        48         1.81
3       92        101        2.1
4       17        19         2.12
5       47        27         1.57
6       83        46         1.55
7       66        26         1.39
8       209       102        1.49
9       261       129        1.49
10      110       51         1.46
11      121       47         1.39
12      40        19         1.48
total   1178      644        1.55

Project: tripleo-puppet-elements

month   patches   rechecks   CI-factor
1       24        9          1.38
2       9         20         3.22
3       7         16         3.29
4       9         24         3.67
5       14        17         2.21
6       17        33         2.94
7       12        16         2.33
8       15        21         2.4
9       10        14         2.4
10      12        5          1.42
11      34        25         1.74
12      10        13         2.3
total   173       213        2.23

Project: puppet-tripleo

month   patches   rechecks   CI-factor
1       29        23         1.79
2       36        68         2.89
3       40        44         2.1
4       68        74         2.09
5       129       43         1.33
6       265       206        1.78
7       235       118        1.5
8       193       130        1.67
9       147       123        1.84
10      233       159        1.68
11      137       86         1.63
12      20        5          1.25
total   1532      1079       1.7


[1] https://gist.github.com/ansiwen/e139cbf25bc243d30629e0157fc753ff
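
For readers who want to check the numbers: the posted rows are consistent with CI-factor = 1 + rechecks / patches, rounded to two decimals. The following is only a minimal sketch under that assumption. The function name `ci_factor` and the `totals` dictionary are illustrative and not taken from the script at [1], which queries the review system itself and may compute the value differently.

```python
def ci_factor(patches: int, rechecks: int) -> float:
    """Factor by which rechecks multiplied the CI runs.

    Assumed formula: every tested patch-set costs one CI run, and
    every recheck costs one more, so the factor is 1 + rechecks / patches.
    """
    if patches <= 0:
        raise ValueError("patches must be a positive number of tested patch-sets")
    return 1 + rechecks / patches


# "total" rows copied from the tables: (tested patch-sets, rechecks).
totals = {
    "tripleo-heat-templates": (5126, 3646),
    "tripleo-common": (1178, 644),
    "tripleo-puppet-elements": (173, 213),
    "puppet-tripleo": (1532, 1079),
}

for project, (patches, rechecks) in totals.items():
    factor = ci_factor(patches, rechecks)
    # The share of CI runs beyond the first run per patch-set.
    extra_percent = (factor - 1) * 100
    print(f"{project}: CI-factor {factor:.2f}, {extra_percent:.0f}% extra CI runs")
```

With the tripleo-heat-templates totals this gives a factor of about 1.71, which is where the "71% more resources" figure comes from.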