Hi,

I have a problem with the slave JVM randomly dying on our larger
build machines. When these machines are under heavy load, the following
error tends to pop up at random on the slave connection:

https://pastebin.com/raw/BaY2rJ7G

The likelihood of these errors increases with the number of builds running
on the slave. If I allow only 20 executors it occurs very rarely or never,
while with 100 executors it's quite likely that the slave disconnects
once all executors have a job running on them.

The affected machines are dual-socket systems with two 18-core Xeon CPUs,
making for 36 cores and 72 threads (HT). 384GB of RAM (or more) is installed,
of which 200GB is assigned to a ramdisk (tmpfs). This ramdisk is used for
the Jenkins workspace.

The OS is Debian 8 (Jessie) with the 4.9 kernel from backports. The
Jenkins version is 2.55 and the installed Java version is OpenJDK
1.8.0_121.

The running builds are mostly larger C projects compiled with gcc,
plus some LaTeX documentation.

Since it occurs only with many parallel builds running, this
suggests that we might be hitting some kind of limit that causes the slave
process to be terminated. However, there is nothing in the logs
(journalctl, dmesg) hinting at that, and as far as I know neither the OOM
killer nor ulimit uses SIGTERM for that purpose.

The following limits are reported by `ulimit -a`:
https://pastebin.com/raw/RXXWnc49
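
Note that `ulimit -a` shows the limits of the shell it is run in, which are
not necessarily the limits the running slave JVM inherited. As a cross-check,
here is a minimal Python sketch that reads the limits actually applied to a
process from `/proc/<pid>/limits` on Linux. The function names and the
selection of limits printed are my own assumptions for illustration, not part
of Jenkins or of the setup described above.

```python
import re
import sys
from pathlib import Path


def parse_limits(text):
    """Parse the text of /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        # Columns are padded with runs of spaces; limit names contain
        # single spaces only, so split on two or more.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def report(pid="self"):
    """Print a few limits of one process (hypothetical helper)."""
    path = Path("/proc") / str(pid) / "limits"
    try:
        text = path.read_text()
    except OSError as exc:
        # Not Linux, no such process, or not allowed to read it.
        print(f"cannot read {path}: {exc}")
        return
    limits = parse_limits(text)
    for name in ("Max processes", "Max open files", "Max locked memory"):
        soft, hard = limits.get(name, ("?", "?"))
        print(f"{name}: soft={soft} hard={hard}")


if __name__ == "__main__":
    # Pass the PID of the slave JVM; defaults to this script's own process.
    report(sys.argv[1] if len(sys.argv) > 1 else "self")
```

Run it with the PID of the slave JVM as the only argument, as the same user
that owns the process (or as root), and compare the values with the pastebin
above.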

Does anyone have an idea what the cause might be, or what else I could
look at?
