Yes, another CPAN Testers post on perl-qa.  Sorry, Andy.

I want to sum up a few things that I took away from the mega-threads
yesterday and propose a series of major changes to CPAN Testers.
Special thanks to chromatic for an off-list (and very civil)
conversation that triggered these thoughts.

__Type I and Type II errors__

In statistics, a Type I error means a "false positive" or "false
alarm".  For CPAN Testers, that's a bogus FAIL report.  A Type II
error means a "false negative", e.g. a bogus PASS report.  Often,
there is a trade-off between these.  If you think about spam filtering
as an example, reducing the chance of spam getting through the filter
(false negatives) tends to increase the odds that legitimate mail gets
flagged as spam (false positives).

Generally, those involved in CPAN Testers have taken the view that
it's better to have false positives (false alarms) than false
negatives (bogus PASS reports).  Moreover, we've tended to believe --
without any real analysis -- that the false positive *ratio* (false
FAILs divided by all FAILs) is low.

But I've never heard a single complaint about a bogus PASS report and
I hear a lot of complaints about bogus FAILs, so it's reasonable to
think that we've got the tradeoff wrong. Moreover, I think the
downside to false positives is actually higher than for false
negatives if we believe that CPAN Testers is primarily a tool to help
authors improve quality rather than a tool to give users a guarantee
about how distributions work on any given platform.

__False positive ratios by author__

Even if the aggregate false positive ratio is low, individual CPAN
authors can experience extraordinarily high false positive ratios.
What I suddenly realized is that the higher the quality of an author's
distributions, the higher the false positive ratio.

Consider a "low quality" author -- one who is prone to portability
errors, missing dependencies, and so on.  Most of that author's FAIL
reports reflect legitimate problems with the distribution.

Now consider a "high quality" author -- one who is careful to write
portable code, well-specified dependencies and so on.  For this
author, most of the FAIL reports only come when a tester has a broken
or misconfigured toolchain  The false positive ratio will approach
100%.

In other words, the *reward* CPAN Testers offers for high quality is
increased annoyance from false FAIL reports, with little benefit in
return.
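
To put hypothetical numbers on that (purely illustrative, not measured
data): suppose a "low quality" distribution draws 50 FAIL reports, 45
of them legitimate and 5 of them toolchain noise; its false positive
ratio is 5/50 = 10%.  A "high quality" distribution with no real
portability problems might draw only 5 FAIL reports, all of them
toolchain noise, for a ratio of 5/5 = 100%.  The absolute number of
bogus FAILs is the same, but the careful author sees nothing *but*
bogus FAILs.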

__Repetition is desensitizing__

From a statistical perspective, having lots of CPAN Testers reports
for a distribution even on a common platform helps improve confidence
in the aggregate result.  Put differently, it helps weed out "outlier"
reports from a tester who happens to have a broken toolchain.

However, from an author's perspective, if a report is legitimate (and
assuming they care), they really only need to hear it once.  Having
more and more testers sending the same FAIL report on platform X is
overkill and gives yet more encouragement for authors to tune out.

So the more successful CPAN Testers is in attracting new testers, the
more duplicate FAIL reports authors are likely to receive, and the
less likely authors are to pay attention to any of them.

__When is a FAIL not a FAIL?__

There are legitimate ways a distribution can be broken such that it
fails during the PL (Makefile.PL/Build.PL) or make phase through no
fault of the tester's toolchain, so it is still valuable to know when
distributions can't build as well as when they don't pass tests.  We
should keep reporting these cases rather than skip them.  On the
other hand, most of the false positives that provoke complaint are
toolchain issues during PL or make/Build.

Right now there is no easy way to tell from the subject of an email
which phase a FAIL report came from.  Removing PL and make/Build
failures from the FAIL category would immediately eliminate a major
source of false positives and decrease the aggregate false positive
ratio for FAIL reports.  That said, as I've shown, while this may
decrease the *incidence* of false positives for high quality authors,
their false positive *ratio* is likely to remain high.

It almost doesn't matter whether we reclassify these as UNKNOWN or
invent new grades.  Either way partitions the FAIL space in a way that
makes it easier for authors to focus on whichever part of the
PL/make/test cycle they care about.

__What we can fix now and what we can't__

Some of these issues can be addressed fairly quickly.

First, we can lower our collective tolerance of false positives -- for
example, stop telling authors to just ignore bogus reports, and
instead find ways to filter those reports out.  We have several places
to do this -- just in the last day we've confirmed that the latest
CPANPLUS dev version doesn't generate Makefile.PLs, and some testers
have upgraded.  BinGOs has just put out CPANPLUS::YACSmoke 0.04, which
filters out these cases even for testers who aren't on the bleeding
edge of CPANPLUS.  We now need to push testers to upgrade.  As we find
new false positives, we need to find new ways to detect and suppress
them.
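
To make "detect and suppress" concrete, here is a purely hypothetical
sketch of the kind of filter a smoker could apply before sending a
report.  It is not how CPANPLUS::YACSmoke actually works; the report
fields and the patterns are invented for illustration.

    # Illustrative only: NOT how CPANPLUS::YACSmoke implements its
    # filtering.  The report fields and patterns are hypothetical.
    sub report_looks_bogus {
        my ($report) = @_;

        # A PL-phase failure on a distribution that ships only a
        # Build.PL is more likely a toolchain artifact than a real
        # problem with the distribution.
        return 1
            if $report->{phase} eq 'PL'
            && $report->{dist_has_build_pl}
            && !$report->{dist_has_makefile_pl};

        # Known-bad output patterns can be screened the same way.
        return 1 if $report->{output} =~ /Can't locate Module::Build/;

        return 0;    # otherwise, let the report go out
    }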

Second, we can reclassify PL/make/Build fails to UNKNOWN.  This won't
break any of the existing reporting infrastructure the way that adding
new grades would.  I can make this change in CPAN::Reporter in a
matter of minutes, and it probably wouldn't be hard to do the same in
CPANPLUS.  Then we need another round of pushing testers to upgrade
their tools.  We could also decide whether UNKNOWN reports should be
copied to authors by default or just sent to the
mailing list.
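
The reclassification itself is essentially a one-line grading policy.
A minimal sketch, assuming a hypothetical grading helper rather than
CPAN::Reporter's actual internals:

    # Not CPAN::Reporter's real code; the function and phase names are
    # hypothetical.  Called only when a phase has already failed.
    sub grade_for_failure {
        my ($phase) = @_;

        # A failure while running Makefile.PL/Build.PL or make/Build
        # says more about the tester's toolchain than about the
        # distribution's tests, so grade it UNKNOWN instead of FAIL.
        return 'UNKNOWN' if $phase eq 'PL' or $phase eq 'make';

        # Only a failure during the test phase itself is graded FAIL.
        return 'FAIL';
    }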

However, as long as the CPAN Testers system has individual testers
emailing authors, there is little we can do to address the problem of
repetition.  One option is to remove that feature from Test::Reporter
so that reports go only to the central list.  With the introduction
of an RSS feed (even if not yet optimal), authors will have a way to
monitor reports.  And from that central source, work can be done to
identify duplicate reports and start screening them out of
notifications.
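
As one illustration of what that screening might look like -- the
field names and in-memory storage here are hypothetical, and a real
version would keep state in the central infrastructure rather than in
a script:

    # Rough sketch of duplicate screening at the central source.
    my %seen;

    sub should_notify_author {
        my ($report) = @_;
        my $key = join '|',
            @{$report}{qw(dist version grade archname perl_version)};
        return !$seen{$key}++;    # true only for the first such report
    }

The idea is simply that an author needs to hear about a given
dist/version/platform failure once; further identical reports add
statistical confidence for the aggregate view but only noise for the
author.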

Once that is more or less reliable, we could restart email
notifications from that central source if people feel that nagging is
critical to improving quality.  Personally, I'm coming around to the
idea that it's not the right way to go culturally for the community.
We should encourage people to use these tools, sign up for RSS or
email alerts, whatever, because they think that quality is important.
If the current nagging approach is alienating significant numbers of
perl-qa members, how can we possibly expect that it's having a
positive influence on everyone else?

Some of these proposals would be easier in CPAN Testers 2.0, which will
provide reports as structured data instead of email text, but if "exit
0" is a straw that is breaking the Perl camel's back now, then we
can't ignore 1.0 to work on 2.0, as I'm not sure anyone will care
anymore by the time it's done.

What we can't do easily is get the testers community to upgrade to
newer versions of the tools.  That is still going to be a matter of
announcements and proselytizing and so on.  But I think we can make a
good case for it, and if we can get the top 10 or so testers to
upgrade across all their testing machines, then I think we'll make a
huge dent in the false positives that are undermining support for CPAN
Testers as a tool for Perl software quality.

I'm interested in feedback on these ideas -- on list or off.  In
particular, I'm now convinced that the "success" of CPAN Testers
prompts the need to move PL/make fails to UNKNOWN and to discontinue
having individual testers copy authors.  I'm open to counter-arguments,
but they'll need to convince me of a better long-run solution to the
problems I identified.

For those who read this to the end, thank you for your attention to
what is surely becoming a tedious subject.

-- David
