The price of synching STDOUT and STDERR

Michael G Schwern Tue, 13 Mar 2007 22:15:41 -0800

Those who don't follow TAP::Parser closely might not know about the issue
that's currently being struggled with.  It has to do with how TAP::Parser
guarantees that STDOUT and STDERR will be in sync, something Test::Harness
does not guarantee.


Test::Harness only captures STDOUT from a test script.  It leaves STDERR alone
and thus it normally goes to the screen.  This can cause subtle problems where
STDOUT and STDERR get out of sync.  STDOUT can be slightly delayed by TH's
having to read it, process it and then spit it back out while STDERR is
displaying to the screen as fast as the test script can print it.  So
sometimes when you use verbose test output, warnings can appear to come before
the test which caused them.  This is a pretty common problem with anything that
tries to capture and then stitch back together STDOUT and STDERR.

Ovid said he had many requests and people begging for him to fix this in
TAP::Parser.  So he did.  TAP::Parser solves this by piping STDOUT and STDERR
together into one stream.  Since it's reading from one stream everything comes
in order and thus can be spit out in order.  Seems simple and everyone is
happy, right?
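
To make the merged-stream idea concrete, here is a minimal sketch of one way
to fold STDERR into STDOUT and read both from a single pipe.  It is not
TAP::Parser's actual code; the helper name merged_lines and the inline
stand-in test script are made up for illustration, and it assumes a Unix-like
shell that understands 2>&1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not TAP::Parser's code: run a shell command with
# STDERR folded into STDOUT and return the lines read from the one pipe.
sub merged_lines {
    my ($command) = @_;
    open my $fh, '-|', "$command 2>&1"
        or die "Cannot run '$command': $!";
    my @lines = <$fh>;
    close $fh;
    return @lines;
}

# A stand-in test script: TAP on STDOUT, a warning on STDERR, junk on STDOUT.
my $script  = q{$| = 1; print "ok 1\n"; print STDERR "a warning\n"; print "junk\n";};
my $command = $^X . " -e '" . $script . "'";

# Each line arrives in the order it was printed, but with no record of
# which handle it came from.
print "read: $_" for merged_lines($command);
```

The reader gets the three lines in the order they were printed, which is the
whole point of merging.  What it does not get is any way to tell that the
middle line came from STDERR.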

There's a price for this.  And I want to make folks aware of this price.  Ovid
is insistent that syncing the streams is very important to TAP::Parser users,
but maybe once they know what they're paying for it folks won't be quite so
insistent and we can remove it.


The first thing it breaks, and probably the most important:  no warnings.  Everything
going to STDERR is eaten by TAP::Parser.  It cannot know the difference
between lines going to STDOUT and STDERR so it cannot tell a warning from some
junk printed to STDOUT.

$ cat ~/tmp/foo.t
#!/usr/bin/perl -w

print "1..1\n";
$foo = 1 + undef;
print "ok 1\n";
print "This is some diagnostic junk that shouldn't be shown.\n";


$ prove ~/tmp/foo.t
/Users/schwern/tmp/foo....Name "main::foo" used only once: possible typo at
/Users/schwern/tmp/foo.t line 4.
Use of uninitialized value in addition (+) at /Users/schwern/tmp/foo.t line 4.
/Users/schwern/tmp/foo....ok
All tests successful.
Files=1, Tests=1,  0 wallclock secs ( 0.00 cusr +  0.00 csys =  0.00 CPU)


$ runtests ~/tmp/foo.t
/Users/schwern/tmp/foo......ok
All tests successful.
Files=1, Tests=1,  0 wallclock secs ( 0.00 cusr +  0.00 csys =  0.00 CPU)


This is really bad; it's a severe loss of test information and cripples the
ability to debug tests.  One alternative is to display all junk lines.  This
will make passing tests noisy.  Noise when there is no problem is bad, for the
same reason that "expected" warnings are bad.  It desensitizes users and
they won't pay attention to the noise when there's a real failure.  Keeping
passing tests quiet is an important principle of a good test suite.


Number two, it displays either *all* diagnostics, meaning diagnostics printed
to STDOUT and failure diagnostics printed to STDERR, or it displays *no*
diagnostics.

$ cat ~/tmp/foo.t
#!/usr/bin/perl -w

$| = 1;

print "1..1\n";
print "# This is some info about the next test which should not be shown\n";
print "not ok 1\n";
print STDERR "# This is a failure diagnostic about the test which failed.\n";


$ prove ~/tmp/foo.t
/Users/schwern/tmp/foo....# This is a failure diagnostic about the test which
failed.
/Users/schwern/tmp/foo....FAILED test 1
        Failed 1/1 tests, 0.00% okay
Failed Test              Stat Wstat Total Fail  List of Failed
-------------------------------------------------------------------------------
/Users/schwern/tmp/foo.t                1    1  1
Failed 1/1 test scripts. 1/1 subtests failed.
Files=1, Tests=1,  0 wallclock secs ( 0.00 cusr +  0.00 csys =  0.00 CPU)
Failed 1/1 test programs. 1/1 subtests failed.


$ runtests ~/tmp/foo.t
/Users/schwern/tmp/foo......# This is some info about the next test which
should not be shown
/Users/schwern/tmp/foo......1/1 # This is a failure diagnostic about the test
which failed.
/Users/schwern/tmp/foo...... Failed 1/1 subtests

Test Summary Report
-------------------
/Users/schwern/tmp/foo.t (Wstat: 0 Tests: 1 Failed: 1)
  Failed tests:  1
Files=1, Tests=1,  0 wallclock secs ( 0.00 cusr +  0.00 csys =  0.00 CPU)


$ runtests -q ~/tmp/foo.t
/Users/schwern/tmp/foo...... Failed 1/1 subtests

Test Summary Report
-------------------
/Users/schwern/tmp/foo.t (Wstat: 0 Tests: 1 Failed: 1)
  Failed tests:  1
Files=1, Tests=1,  0 wallclock secs ( 0.00 cusr +  0.00 csys =  0.00 CPU)


Again, your choice is either noisy passing tests or no failure information.
There is no option to only display the failure diagnostic like prove does.
Losing failure diagnostics is worse:  you can't debug from the test output.


The third problem is how TODO tests are handled.  TAP::Parser has a heuristic
to guess when a set of failure diagnostics are associated with a TODO test so
it doesn't display them.  The heuristic is simply "don't display anything
between the TODO failure and the next test".  This means you can lose important
diagnostic information.
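
A rough sketch of that rule, applied to a merged stream, shows how it can
swallow a diagnostic that belongs to the next test.  This is an illustration
of the rule as stated above, not TAP::Parser's implementation; the function
name displayed_diagnostics and the regexes are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical filter: given the lines of a merged stream, return the
# diagnostic lines that survive "don't display anything between the TODO
# failure and the next test".
sub displayed_diagnostics {
    my @lines    = @_;
    my $suppress = 0;
    my @shown;
    for my $line (@lines) {
        next if $line =~ /^\d+\.\.\d+/;    # skip the plan line
        if ($line =~ /^(?:not )?ok\b/) {
            # A test line: suppress what follows only after a failing TODO.
            $suppress = ($line =~ /^not ok\b.*#\s*TODO\b/i) ? 1 : 0;
            next;
        }
        push @shown, $line unless $suppress;
    }
    return @shown;
}

# The merged stream a test script might produce.  Once merged, the line
# that was printed to STDERR looks like any other diagnostic.
my @merged = (
    "1..1\n",
    "not ok 1 # TODO\n",
    "# This is a failure diagnostic about the test which failed.\n",
    "# They should not be shown because its TODO.\n",
    "# This is about the next failure.  It should be shown.\n",
    "not ok 2\n",
);

my @shown = displayed_diagnostics(@merged);
printf "%d diagnostic line(s) displayed\n", scalar @shown;
```

Under these assumptions nothing is displayed, including the line about the
next failure, because it arrived before the next test line.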

$ cat ~/tmp/foo.t
#!/usr/bin/perl -w

$| = 1;

print "1..1\n";
print "not ok 1 # TODO\n";
print "# This is a failure diagnostic about the test which failed.\n";
print "# They should not be shown because its TODO.\n";
print STDERR "# This is about the next failure.  It should be shown.\n";
print "not ok 2\n";


$ prove ~/tmp/foo.t
/Users/schwern/tmp/foo....# This is about the next failure.  It should be shown.
/Users/schwern/tmp/foo....FAILED test 2
        Failed 1/1 tests, 0.00% okay
Failed Test              Stat Wstat Total Fail  List of Failed
-------------------------------------------------------------------------------
/Users/schwern/tmp/foo.t                1    2  2
Failed 1/1 test scripts. 0/1 subtests failed.
Files=1, Tests=1,  0 wallclock secs ( 0.00 cusr +  0.00 csys =  0.00 CPU)
Failed 1/1 test programs. 0/1 subtests failed.


$ runtests ~/tmp/foo.t
/Users/schwern/tmp/foo...... All 1 subtests passed

Test Summary Report
-------------------
/Users/schwern/tmp/foo.t (Wstat: 0 Tests: 2 Failed: 1)
  Failed tests:  2
  Parse errors: Bad plan.  You planned 1 tests but ran 2.
Files=1, Tests=2,  0 wallclock secs ( 0.00 cusr +  0.00 csys =  0.00 CPU)


There are various additional heuristics proposed to patch up these things, but
they'll always be heuristics and thus will always have uncertainty.
Uncertainty and heuristics should be ruthlessly removed from a testing system.

Another suggestion is that the user should run with -q normally and rerun a
failing test with -v to get all their information.  This is a poor option for
several reasons.  First, a passing test might have a boatload of warnings
which you'll never see and thus won't know to rerun the test.  Second,
heisenbugs can cause a test to fail one moment and pass another.  Thus the
second run will not provide the same information.  Third, rerunning a test is
often slow either because the test itself is slow or because there's a long
delay in having it rerun.  Consider a user who emails you a bug report.  You
have to mail them back and ask them to rerun the test and then wait for their
reply, an awfully long lag time.  It might be impossible to rerun the test as
the results were archived somewhere, for example CPAN Testers, and the same
environment is no longer available.  Finally, a good parser should shield the
user from having to manually eyeball raw test output.

What it boils down to is this:  Displaying STDOUT and STDERR a little out of
sync isn't good and should be fixed, but it doesn't lose us important
information.  Everything that should be displayed is displayed, perhaps
slightly out of order.  The fix, however, *does* lose important
information.  It introduces heuristics and uncertainty to decide what should
and should not be displayed.  You can no longer be sure that your harness is
telling you everything; maybe it swallowed something.  Uncertainty and
heuristics are bad in a testing system.  So to sync up STDOUT and STDERR you
pay a price.  You fix one thing and break three others.  This is not a way
forward.

The simplest solution, and one which unblocks Test::Harness 3.0 because
TAP::Parser breaks compatibility in its current state, is to remove the
syncing feature.
