Hi:

Nice discussion; as you know, I'm all for increasing the number and
the quality of our tests. Here are my comments on some of the raised
points:
> What I noticed is that we always test our code when coding. The
> problem is that test is manual.

As Sam mentioned, you can get some automation with my little
devscripts. In the realm of testing, see `invenio-retest-demo-site'
which runs the full unit/regression/web test suite, compares the
results against the last run, and warns you of any differences. It
could be extended to run only some tests, like:

  $ invenio-retest-demo-site ./modules/bibknowledge

Or it could run only those modules' tests that had some files changed
by the given branch, much like `invenio-check-branch' does for
tracking code kwalitee changes. So running tests could optionally
become part of `invenio-check-branch', which would be especially
useful when you are on Atlantis. Are you using my little helper tools
in this way, and are you interested in these improvements? If so,
I'll commit some of them.

> For example, we are thinking of requiring the implementation of tests
> for bugs that appear and gets fixed

This is nothing new; it has been `gently suggested' as a `should-have
feature' since ~2006:

  ``This is especially important if a bug had been previously found.
  Then a regression test case should be written to assure that it
  will never reappear.''

  <http://invenio-demo.cern.ch/help/hacking/test-suite#3.1>

If people want to write more tests, then I'm all for it; I think we
have quite a usable ecosystem already... but one has to use it!

> we have 616 successful unit tests out of 616 (!) and 479 successful
> regression tests out of 490 (!), so I don't see Invenio that broken in
> that sense

+1. However, 11 failing regression tests on `master' is still too
much. As I recall, we have only 3 regression tests that are
`historically failing' in this TDD style. (Plus there are a few more
failing because of the First-Day-Of-A-Month problem, plus there may
be some LibreOffice ones on some platforms, etc.) So we should try
once again to clean up `master'.

> (For BibClassify I instead don't know the reason for the failure)

It's been a recurrent issue, see:

  <http://invenio-software.org/ticket/817>

I recall that when I reran the tests, the results were different. I
have not looked into it yet, but it seems like it may be a simple
setUp/tearDown issue, a test case call order issue, or something
similar.

> the decorator approach to marking tests looks very useful.

I fully agree that it would be useful to differentiate
expected-to-fail tests. However, we cannot use decorators to mark
them due to our Python-2.4 minimal version requirement (e.g.
unittest's skip and expectedFailure decorators appeared only in
Python 2.7). Until we upgrade the Python version, I would propose to
simply use a `def FIXME_TDD_foo()' naming technique, like the `def
xtest_foo()' convention we sometimes used in the past, together with
opening Trac tickets. History shows that it may take a lot of time to
implement a TDD-meant feature, so let's differentiate such tests in
this way, which would better address the original problem. I'll
modify the Invenio codebase in this respect.
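For illustration, here is a minimal sketch of what I have in mind;
the class and method names below are made up for the example, not
taken from any actual Invenio module. The trick is simply that
unittest's default TestLoader only collects methods whose names start
with `test', so the renamed method is left out of the run, and this
works on Python 2.4 without any decorators:

    import unittest

    class HypotheticalWebSearchTest(unittest.TestCase):
        """Illustrates the FIXME_TDD_ naming convention (names made up)."""

        def test_existing_feature(self):
            # Starts with 'test', so the default TestLoader collects and
            # runs it as usual.
            self.assertEqual(1 + 1, 2)

        def FIXME_TDD_test_planned_feature(self):
            # Does not start with 'test', so the default TestLoader skips
            # it; the test stays in the code base, is greppable, and would
            # reference its Trac ticket until the feature is implemented.
            self.fail("to be implemented; see the corresponding Trac ticket")

    if __name__ == '__main__':
        unittest.main()

Grepping for the FIXME_TDD_ prefix would then give us the list of
pending TDD-style tests to revisit, each backed by its Trac ticket.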
> Make failure very prominent in the build server (currently tests fail
> and the server reports success)

Yes, I agree. I had mentioned in the past that I'd like to extend the
usage of the red flag on Bitten after we fix all the tests and even
the kwalitee issues, e.g. on 2010-11-30:

  ``If we clear `make kwalitee-check-sql-queries' of false positives,
  then this could be plugged into Bitten reports that would raise a
  red flag and stuff. Anyone to join in a codefest?''

It seems that this may take a very long time, though, so I'll
implement the above-mentioned FIXME technique and I'll configure the
red flag in Bitten to also appear when a test fails. This, together
with using the devscript helpers, should address the feelings that
led to this thread; WDYT?

> We should refuse to commit

That's what we've usually been doing, but sometimes tests fail only
when they are run at a particular hour (ticket:421), sometimes only
when not run repetitively (ticket:817), sometimes as part of an
independent test data change (ticket:842), sometimes only on
Python-2.4 while we quick-integrated production hotfixes using
Python-2.6 only (ticket:715), sometimes only on boxes with low memory
(intbitset with VMM), etc. So stuff happens. (But virtually all of
these are then caught after-merge by Bitten builds.)

Best regards
--
Tibor Simko