Hello, I'm trying to use Jenkins to poll SCM and run a big series of unit and integration tests that were parallelized using a Perl program written using Log::Log4Perl, Parallel::ForkManager, and IPC::Run.
I have version 1.424.6 on Ubuntu Server 12.04. Strange lockups on writes to STDOUT and STDERR, and zombie processes happen under Jenkins, similar to the leaked process wiki page and some past threads discussing issues involving not draining stdout and stderr correctly, but somewhat different. The same commands work fine from the jenkins user over ssh, and on other users and other machines. STDOUT and STDERR have been set to flush the buffer after each line is written, in an attempt to fix whatever stdout / stderr drain issues may exist (see code example below), but it did not seem to affect the behavior. The workflow looks like: 1) Write some logs to STDOUT and to file. 2) Use ForkManager to loop over a series of unit test file paths. 3) Use IPC::Run to run each one on a 600 second timeout, check the exit status, and print out the interesting parts of the test STDOUT and STDERR onto the real STDOUT if something breaks. Issues which happen: 1) I'm getting zombie child processes which don't exit, even though ForkManager normally does this fine and hasn't failed before via other means: 2) For whatever reason, the child process writes to STDOUT and STDERR don't appear in Jenkins, but they do continue to appear in the Log4Perl disk for a while after the test begins. 3) Eventually, things begin to stall, the test runner stops forking children, and the build locks up and will never end, even after 17+ hours waiting overnight. Running things one at a time with ForkManager's debug mode seems to work but it results in a build process too slow to be useful for me. 4) When the stall happens, things look like this: * nothing in jenkins.log: $ ls -l /var/log/jenkins/jenkins.log -rw-r--r-- 1 jenkins jenkins 0 Sep 12 06:30 /var/log/jenkins/jenkins.log * processes in unusual states, zombies, etc: $ ps auxww | fgrep jenkins jenkins 498 0.0 0.0 0 0 ? Z 17:42 0:00 [unit_tests.pl] <defunct> jenkins 584 0.0 0.0 0 0 ? Z 17:42 0:00 [unit_tests.pl] <defunct> jenkins 648 0.0 0.0 0 0 ? Z 17:42 0:00 [unit_tests.pl] <defunct> jenkins 876 0.0 0.0 0 0 ? Z 17:43 0:00 [unit_tests.pl] <defunct> jenkins 1140 0.0 0.0 0 0 ? Z 17:43 0:00 [unit_tests.pl] <defunct> jenkins 1204 0.0 0.0 0 0 ? Z 17:43 0:00 [unit_tests.pl] <defunct> jenkins 1513 0.0 0.0 0 0 ? Z 17:43 0:00 [unit_tests.pl] <defunct> jenkins 22257 0.0 0.0 4392 608 ? S 17:41 0:00 /bin/sh -xe /tmp/hudson997856685355182685.sh jenkins 22262 0.0 0.0 58160 12192 ? S 17:41 0:00 /usr/bin/perl smoke_tests/unit_tests.pl jenkins 22263 0.1 0.0 58768 11820 ? S 17:41 0:00 /usr/bin/perl smoke_tests/unit_tests.pl jenkins 26713 0.0 9.5 12239304 2368828 ? Ssl Sep11 7:32 /usr/bin/java -jar /usr/share/jenkins/jenkins.war --webroot=/var/run/jenkins/war --httpPort=8080 --ajp13Port=-1 --preferredClassLoader=java.net.URLClassLoader --argumentsRealm.passwd.mhall=SecureJenkins2012 --argumentsRealm.roles.mhall=admin --logfile=/var/log/jenkins/jenkins.log jenkins 31174 0.0 0.0 58780 11428 ? S 17:42 0:00 /usr/bin/perl smoke_tests/unit_tests.pl jenkins 31175 0.7 0.1 146320 48380 ? Sl 17:42 0:01 /usr/local/ruby192/bin/ruby -r ./environment.rb /var/lib/jenkins/workspace/studio_security/root2/test/mu/scenario/field/tc_wget.rb jenkins 31220 0.0 0.0 58768 11588 ? S 17:42 0:00 /usr/bin/perl smoke_tests/unit_tests.pl jenkins 31229 0.0 0.0 0 0 ? Z 17:42 0:00 [unit_tests.pl] <defunct> jenkins 31988 0.0 0.0 58768 11592 ? S 17:42 0:00 /usr/bin/perl smoke_tests/unit_tests.pl jenkins 32367 0.0 0.0 58768 11588 ? S 17:42 0:00 /usr/bin/perl smoke_tests/unit_tests.pl * problematic blocking on writes to STDOUT and STDERR: $ sudo strace -s 4096 -p 22262 Process 22262 attached - interrupt to quit read(6, ^C <unfinished ...> $ sudo strace -s 4096 -p 22263 Process 22263 attached - interrupt to quit write(1, "[2012-09-19 17:43:14,691] [unit_tests.pl:40] [INFO] running tests from [test/mu/scenario/musl]...\n", 98^C <unfinished ...> $ sudo strace -s 4096 -p 31220 Process 31220 attached - interrupt to quit write(1, "[2012-09-19 17:42:37,484] [Common.pm:290] [INFO] stderr [02]: - RELOAD_Frame = udp(src: &host_2, dst: &host_3, dst_port: 16428)\n", 135^C <unfinished ...> $ sudo strace -s 4096 -p 31988 Process 31988 attached - interrupt to quit write(1, "[2012-09-19 17:42:46,049] [Common.pm:272] [WARN] command [ruby -r ./environment.rb /var/lib/jenkins/workspace/studio_security/root2/test/mu/scenario/from_pcap/tc_icmp.rb] failed\n", 178^C <unfinished ...> * the only message present in the Jenkins log, happens when I cancel the frozen build: Sep 19, 2012 6:00:07 PM hudson.model.Run run INFO: studio_security #21 aborted java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:503) at java.lang.UNIXProcess.waitFor(UNIXProcess.java:210) at hudson.Proc$LocalProc.join(Proc.java:319) at hudson.Launcher$ProcStarter.join(Launcher.java:336) at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:82) at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58) at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19) at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:692) at hudson.model.Build$RunnerImpl.build(Build.java:177) at hudson.model.Build$RunnerImpl.doRun(Build.java:139) at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:458) at hudson.model.Run.run(Run.java:1376) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46) at hudson.model.ResourceController.execute(ResourceController.java:88) at hudson.model.Executor.run(Executor.java:175) * this one randomly appeared when doing a restart: java.lang.NullPointerException at hudson.security.ContainerAuthentication.<init>(ContainerAuthentication.java:64) at hudson.security.BasicAuthenticationFilter.doFilter(BasicAuthenticationFilter.java:108) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:76) at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:164) at winstone.FilterConfiguration.execute(FilterConfiguration.java:195) at winstone.RequestDispatcher.doFilter(RequestDispatcher.java:368) at hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:81) at winstone.FilterConfiguration.execute(FilterConfiguration.java:195) at winstone.RequestDispatcher.doFilter(RequestDispatcher.java:368) at winstone.RequestDispatcher.forward(RequestDispatcher.java:333) at winstone.RequestHandlerThread.processRequest(RequestHandlerThread.java:244) at winstone.RequestHandlerThread.run(RequestHandlerThread.java:150) at java.lang.Thread.run(Thread.java:722) * I tried enabling FINEST level on hudson and jenkins loggers, but didn't see anything helpful in the output (probably wasn't looking in the right spot). I tried using Jenkins to run a skeleton example application like this one below and it runs fine. Also very important: the problem disappears completely when I disable the Screen appender writing to STDOUT in my Log4Perl configuration file. I could really use a helping hand from some pros to debug this some more or get more verbosity from Jenkins, etc. I spent a long time looking at it and I'm really confused about what could be going on here! :) Thanks, Matthew. Example: #!/usr/bin/perl use strict; use warnings; use Parallel::ForkManager 0.7.6; STDERR->autoflush(1); STDOUT->autoflush(1); my $successes = 0; sub run_on_finish { my ($pid, $exit, $id, $signal, $core, $d) = @_; ++$successes unless $exit; } my $fm = Parallel::ForkManager->new(12); $fm->run_on_finish(\&run_on_finish); for (my $i = 1; $i <= 1000; ++$i) { print STDOUT sprintf("Hello, STDOUT! [P] [%04d]\n", $i); print STDERR sprintf("Hello, STDERR! [P] [%04d]\n", $i); $fm->start($i) and next; my $d = {}; my $status = 0; print STDOUT sprintf("Hello, STDOUT! [C] [%04d]\n", $i); print STDERR sprintf("Hello, STDERR! [C] [%04d]\n", $i); $fm->finish($status, $d); } $fm->wait_all_children(); print "FM Successes [$successes]\n"; exit(0);