Hello,

I'm trying to use Jenkins to poll SCM and run a big series of unit and 
integration tests that were parallelized using a Perl program written using 
Log::Log4Perl, Parallel::ForkManager, and IPC::Run.

I have version 1.424.6 on Ubuntu Server 12.04.

Strange lockups on writes to STDOUT and STDERR, and zombie processes happen 
under Jenkins, similar to the leaked process wiki page and some past threads 
discussing issues involving not draining stdout and stderr correctly, but 
somewhat different. The same commands work fine from the jenkins user over 
ssh, and on other users and other machines.

STDOUT and STDERR have been set to flush the buffer after each line is 
written, in an attempt to fix whatever stdout / stderr drain issues may exist 
(see code example below), but it did not seem to affect the behavior.

The workflow looks like:

1) Write some logs to STDOUT and to file.

2) Use ForkManager to loop over a series of unit test file paths.

3) Use IPC::Run to run each one on a 600 second timeout, check the exit 
status, and print out the interesting parts of the test STDOUT and STDERR onto 
the real STDOUT if something breaks.

Issues which happen:

1) I'm getting zombie child processes which don't exit, even though ForkManager 
normally does this fine and hasn't failed before via other means:

2) For whatever reason, the child process writes to STDOUT and STDERR don't 
appear in Jenkins, but they do continue to appear in the Log4Perl disk for a 
while after the test begins.

3) Eventually, things begin to stall, the test runner stops forking children, 
and the build locks up and will never end, even after 17+ hours waiting 
overnight. Running things one at a time with ForkManager's debug mode seems to 
work but it results in a build process too slow to be useful for me.

4) When the stall happens, things look like this:

* nothing in jenkins.log:
$ ls -l /var/log/jenkins/jenkins.log
-rw-r--r-- 1 jenkins jenkins 0 Sep 12 06:30 /var/log/jenkins/jenkins.log

* processes in unusual states, zombies, etc:

$ ps auxww | fgrep jenkins
jenkins    498  0.0  0.0      0     0 ?        Z    17:42   0:00 
[unit_tests.pl] <defunct>
jenkins    584  0.0  0.0      0     0 ?        Z    17:42   0:00 
[unit_tests.pl] <defunct>
jenkins    648  0.0  0.0      0     0 ?        Z    17:42   0:00 
[unit_tests.pl] <defunct>
jenkins    876  0.0  0.0      0     0 ?        Z    17:43   0:00 
[unit_tests.pl] <defunct>
jenkins   1140  0.0  0.0      0     0 ?        Z    17:43   0:00 
[unit_tests.pl] <defunct>
jenkins   1204  0.0  0.0      0     0 ?        Z    17:43   0:00 
[unit_tests.pl] <defunct>
jenkins   1513  0.0  0.0      0     0 ?        Z    17:43   0:00 
[unit_tests.pl] <defunct>
jenkins  22257  0.0  0.0   4392   608 ?        S    17:41   0:00 /bin/sh -xe 
/tmp/hudson997856685355182685.sh
jenkins  22262  0.0  0.0  58160 12192 ?        S    17:41   0:00 /usr/bin/perl 
smoke_tests/unit_tests.pl
jenkins  22263  0.1  0.0  58768 11820 ?        S    17:41   0:00 /usr/bin/perl 
smoke_tests/unit_tests.pl
jenkins  26713  0.0  9.5 12239304 2368828 ?    Ssl  Sep11   7:32 /usr/bin/java 
-jar /usr/share/jenkins/jenkins.war --webroot=/var/run/jenkins/war 
--httpPort=8080 --ajp13Port=-1 --preferredClassLoader=java.net.URLClassLoader 
--argumentsRealm.passwd.mhall=SecureJenkins2012 
--argumentsRealm.roles.mhall=admin --logfile=/var/log/jenkins/jenkins.log
jenkins  31174  0.0  0.0  58780 11428 ?        S    17:42   0:00 /usr/bin/perl 
smoke_tests/unit_tests.pl
jenkins  31175  0.7  0.1 146320 48380 ?        Sl   17:42   0:01 
/usr/local/ruby192/bin/ruby -r ./environment.rb 
/var/lib/jenkins/workspace/studio_security/root2/test/mu/scenario/field/tc_wget.rb
jenkins  31220  0.0  0.0  58768 11588 ?        S    17:42   0:00 /usr/bin/perl 
smoke_tests/unit_tests.pl
jenkins  31229  0.0  0.0      0     0 ?        Z    17:42   0:00 
[unit_tests.pl] <defunct>
jenkins  31988  0.0  0.0  58768 11592 ?        S    17:42   0:00 /usr/bin/perl 
smoke_tests/unit_tests.pl
jenkins  32367  0.0  0.0  58768 11588 ?        S    17:42   0:00 /usr/bin/perl 
smoke_tests/unit_tests.pl

* problematic blocking on writes to STDOUT and STDERR:

$ sudo strace -s 4096 -p 22262
Process 22262 attached - interrupt to quit
read(6, ^C <unfinished ...>

$ sudo strace -s 4096 -p 22263
Process 22263 attached - interrupt to quit
write(1, "[2012-09-19 17:43:14,691] [unit_tests.pl:40] [INFO] running tests 
from [test/mu/scenario/musl]...\n", 98^C <unfinished ...>

$ sudo strace -s 4096 -p 31220
Process 31220 attached - interrupt to quit
write(1, "[2012-09-19 17:42:37,484] [Common.pm:290] [INFO] stderr [02]: -       
 RELOAD_Frame = udp(src: &host_2, dst: &host_3, dst_port: 16428)\n", 135^C 
<unfinished ...>

$ sudo strace -s 4096 -p 31988
Process 31988 attached - interrupt to quit
write(1, "[2012-09-19 17:42:46,049] [Common.pm:272] [WARN] command [ruby -r 
./environment.rb 
/var/lib/jenkins/workspace/studio_security/root2/test/mu/scenario/from_pcap/tc_icmp.rb]
 failed\n", 178^C <unfinished ...>

* the only message present in the Jenkins log, happens when I cancel the 
frozen build:

Sep 19, 2012 6:00:07 PM hudson.model.Run run
INFO: studio_security #21 aborted
java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:503)
    at java.lang.UNIXProcess.waitFor(UNIXProcess.java:210)
    at hudson.Proc$LocalProc.join(Proc.java:319)
    at hudson.Launcher$ProcStarter.join(Launcher.java:336)
    at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:82)
    at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
    at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
    at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:692)
    at hudson.model.Build$RunnerImpl.build(Build.java:177)
    at hudson.model.Build$RunnerImpl.doRun(Build.java:139)
    at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:458)
    at hudson.model.Run.run(Run.java:1376)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
    at hudson.model.ResourceController.execute(ResourceController.java:88)
    at hudson.model.Executor.run(Executor.java:175)

* this one randomly appeared when doing a restart:

java.lang.NullPointerException
    at 
hudson.security.ContainerAuthentication.<init>(ContainerAuthentication.java:64)
    at 
hudson.security.BasicAuthenticationFilter.doFilter(BasicAuthenticationFilter.java:108)
    at 
hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
    at 
hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:76)
    at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:164)
    at winstone.FilterConfiguration.execute(FilterConfiguration.java:195)
    at winstone.RequestDispatcher.doFilter(RequestDispatcher.java:368)
    at 
hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:81)
    at winstone.FilterConfiguration.execute(FilterConfiguration.java:195)
    at winstone.RequestDispatcher.doFilter(RequestDispatcher.java:368)
    at winstone.RequestDispatcher.forward(RequestDispatcher.java:333)
    at 
winstone.RequestHandlerThread.processRequest(RequestHandlerThread.java:244)
    at winstone.RequestHandlerThread.run(RequestHandlerThread.java:150)
    at java.lang.Thread.run(Thread.java:722)

* I tried enabling FINEST level on hudson and jenkins loggers, but didn't see 
anything helpful in the output (probably wasn't looking in the right spot).

I tried using Jenkins to run a skeleton example application like this one 
below and it runs fine.

Also very important: the problem disappears completely when I disable the 
Screen appender writing to STDOUT in my Log4Perl configuration file.

I could really use a helping hand from some pros to debug this some more or 
get more verbosity from Jenkins, etc. I spent a long time looking at it and 
I'm really confused about what could be going on here! :)

Thanks,
Matthew.

Example:

#!/usr/bin/perl

use strict;
use warnings;

use Parallel::ForkManager 0.7.6;

STDERR->autoflush(1);
STDOUT->autoflush(1);

my $successes = 0;

sub run_on_finish {
    my ($pid, $exit, $id, $signal, $core, $d) = @_;
    
    ++$successes unless $exit;
}

my $fm = Parallel::ForkManager->new(12);
$fm->run_on_finish(\&run_on_finish);

for (my $i = 1; $i <= 1000; ++$i) {
    print STDOUT sprintf("Hello, STDOUT! [P] [%04d]\n", $i);
    print STDERR sprintf("Hello, STDERR! [P] [%04d]\n", $i);
    
    $fm->start($i) and next;
    
    my $d = {};
    my $status = 0;
    
    print STDOUT sprintf("Hello, STDOUT! [C] [%04d]\n", $i);
    print STDERR sprintf("Hello, STDERR! [C] [%04d]\n", $i);
    
    $fm->finish($status, $d);
}

$fm->wait_all_children();

print "FM Successes [$successes]\n";

exit(0);

Reply via email to