On 06/30/2014 04:22 PM, Jay Pipes wrote:
> Hi Stackers,
>
> Some recent ML threads [1] and a hot IRC meeting today [2] brought up some legitimate questions around how a newly-proposed Stackalytics report page for Neutron External CI systems [3] represented the results of an external CI system as "successful" or not.
>
> First, I want to say that Ilya and all those involved in the Stackalytics program simply want to provide the most accurate information to developers in a format that is easily consumed. While there need to be some changes in how data is shown (and the wording of things like "Tests Succeeded"), I hope that the community knows there isn't any ill intent on the part of Mirantis or anyone who works on Stackalytics. OK, so let's keep the conversation civil -- we're all working towards the same goals of transparency and accuracy. :)
>
> Alright, now, Anita and Kurt Taylor were asking a very poignant question:
>
> "But what does CI tested really mean? Just running tests? Or tested to pass some level of requirements?"
>
> In this nascent world of external CI systems, we have a set of issues that we need to resolve:
>
> 1) All of the CI systems are different.
>
> Some run Bash scripts. Some run Jenkins slaves and devstack-gate scripts. Others run custom Python code that spawns VMs and publishes logs to some public domain.
>
> As a community, we need to decide whether it is worth putting in the effort to create a single, unified, installable and runnable CI system, so that we can legitimately say "all of the external systems are identical, with the exception of the driver code for vendor X being substituted in the Neutron codebase."
>
> If the goal of the external CI systems is to produce reliable, consistent results, I feel the answer to the above is "yes", but I'm interested to hear what others think. Frankly, in the world of benchmarks, it would be unthinkable to say "go ahead and everyone run your own benchmark suite", because you would get wildly different results. A similar problem has emerged here.
>
> 2) There is no mediation or verification that the external CI system is actually testing anything at all.
>
> As a community, we need to decide whether the current system of self-policing should continue. If it should, then language on reports like [3] should be very clear that any numbers derived from such systems should be taken with a grain of salt. Use of the word "Success" should be avoided, as it has connotations (in English, at least) that the result has been verified, which is simply not the case as long as no verification or mediation occurs for any external CI system.
>
> 3) There is no clear indication of what tests are being run, and therefore there is no clear indication of what "success" is.
>
> I think we can all agree that a test has three possible outcomes: pass, fail, and skip. The results of a test suite run are therefore nothing more than the aggregation of which tests passed, which failed, and which were skipped.
>
> As a community, we must document, for each project, the expected set of tests that must be run for each patch merged into the project's source tree. This documentation should be discoverable so that reports like [3] can be crystal-clear on what the data shown actually means. The report is simply displaying the data it receives from Gerrit. The community needs to be proactive in saying "this is what is expected to be tested."
> This alone would allow the report to give information such as "External CI system ABC performed the expected tests. X tests passed. Y tests failed. Z tests were skipped." Likewise, it would also make it possible for the report to give information such as "External CI system DEF did not perform the expected tests.", which is excellent information in and of itself.
>
> ===
>
> In thinking about the likely answers to the above questions, I believe it would be prudent to change the Stackalytics report in question [3] in the following ways:
>
> a. Change the "Success %" column header to "% Reported +1 Votes"
> b. Change the phrase "Green cell - tests ran successfully, red cell - tests failed" to "Green cell - System voted +1, red cell - System voted -1"
>
> and then, when we have more and better data (for example, # tests passed, failed, skipped, etc.), we can provide more detailed information than just "reported +1" or not.
>
> Thoughts?
>
> Best,
> -jay
>
> [1] http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html
> [2] http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html
> [3] http://stackalytics.com/report/ci/neutron/7

Hi Jay:
Thanks for starting this thread. You raise some interesting questions.

The question I had identified as needing definition is "what algorithm do we use to assess the fitness of a third-party CI system?"
http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2014-06-30.log (timestamp 2014-06-30T19:23:40)

This is the question that is top of mind for me.

Thanks Jay,
Anita.

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
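To make the reporting Jay describes under point 3 concrete -- comparing what a CI system actually reported against a documented set of expected tests, then summarizing pass/fail/skip counts -- here is a minimal sketch. The function, the EXPECTED_TESTS list, and the data shapes are hypothetical illustrations, not the Stackalytics or Gerrit API; it simply assumes per-test outcomes are somehow made available by the external system.

# Hypothetical sketch: decide whether a CI system ran the documented
# expected tests and, if so, summarize the outcomes per test.
from collections import Counter

EXPECTED_TESTS = {  # what the project would document as required (illustrative names)
    "tempest.api.network.test_networks",
    "tempest.api.network.test_ports",
    "tempest.api.network.test_subnets",
}

def summarize_ci_run(system_name, reported):
    """reported maps test name -> 'pass' | 'fail' | 'skip'."""
    missing = EXPECTED_TESTS - set(reported)
    if missing:
        return ("External CI system %s did not perform the expected tests "
                "(missing: %s)." % (system_name, ", ".join(sorted(missing))))
    counts = Counter(reported[t] for t in EXPECTED_TESTS)
    return ("External CI system %s performed the expected tests. "
            "%d tests passed. %d tests failed. %d tests were skipped."
            % (system_name, counts["pass"], counts["fail"], counts["skip"]))

print(summarize_ci_run("ABC", {
    "tempest.api.network.test_networks": "pass",
    "tempest.api.network.test_ports": "pass",
    "tempest.api.network.test_subnets": "skip",
}))

The point of the sketch is that "performed the expected tests" becomes a mechanical check against a discoverable list, rather than an inference from a bare +1 vote.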
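The proposed "% Reported +1 Votes" column is likewise just the share of a system's review votes that were +1. A minimal sketch of that arithmetic, again with hypothetical data shapes rather than the real Gerrit or Stackalytics interfaces:

# Hypothetical sketch: compute "% Reported +1 Votes" from the list of
# votes a CI account has left on reviews (each vote is +1 or -1).
def percent_plus_one(votes):
    if not votes:
        return 0.0
    return 100.0 * sum(1 for v in votes if v == 1) / len(votes)

print("%.1f%%" % percent_plus_one([1, 1, -1, 1]))  # -> 75.0%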