Bug#667478: condor: RSS memory usage grows continuously for Condor jobs
Package: condor
Version: 7.7.5~dfsg.1-2
Severity: normal

Hi,

We are running a backport of the Debian package of Condor 7.7.5 from experimental on a cluster of Debian stable machines. Since upgrading from 7.7.4 we have noticed an increased memory demand for pretty much all jobs.

I recently ran a week-long job that starts off at 10GB and should not grow significantly in memory over the course of the run (as confirmed with Condor 7.7.4). After the upgrade to 7.7.5 the job continuously increases its memory demands, and I have to kill it after two days when it exceeds 150GB. The continuous growth is not limited to this particular job, although most types of long-running jobs on this machine are Python-based.

Looking into the 7.7.5 changelog I see a number of memory-related changes, but nothing that is a perfect match. I checked that this is not just Condor reporting increasing memory consumption: the respective cluster nodes actually run out of memory, because the job grows and grows.

I'd be glad to get some feedback on what the problem could be and whether there is a workaround.

Thanks.
-- System Information:
Debian Release: 6.0.4
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32-5-amd64 (SMP w/24 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages condor depends on:
ii  adduser            3.112+nmu2            add and remove users and groups
ii  debconf [debconf-2 1.5.36.1              Debian configuration management sy
ii  libc6              2.11.3-3              Embedded GNU C Library: Shared lib
ii  libcgroup1         0.37.1-1~nd60+1       Library to control and monitor con
ii  libclassad3        7.7.5~dfsg.1-2~nd60+1 library for Condor's classads expr
ii  libcomerr2         1.41.12-4stable1      common error description library
ii  libcurl3           7.21.0-2.1+squeeze1   Multi-protocol file transfer libra
ii  libdate-manip-perl 6.11-1                module for manipulating dates
ii  libexpat1          2.0.1-7               XML parsing C library - runtime li
ii  libgcc1            1:4.4.5-8             GCC support library
ii  libglobus-callout0 0.7-6                 Globus Toolkit - Globus Callout Li
ii  libglobus-common0  11.5-2                Globus Toolkit - Common Library
ii  libglobus-ftp-cont 2.11-2                Globus Toolkit - GridFTP Control L
ii  libglobus-gass-tra 4.3-2                 Globus Toolkit - Globus Gass Trans
ii  libglobus-gram-cli 10.4-1                Globus Toolkit - GRAM Client Libra
ii  libglobus-gram-pro 9.7-2                 Globus Toolkit - GRAM Protocol Lib
ii  libglobus-gsi-call 2.7-1                 Globus Toolkit - Globus GSI Callba
ii  libglobus-gsi-cert 6.6-1                 Globus Toolkit - Globus GSI Cert U
ii  libglobus-gsi-cred 3.5-1                 Globus Toolkit - Globus GSI Creden
ii  libglobus-gsi-open 0.14-6                Globus Toolkit - Globus OpenSSL Er
ii  libglobus-gsi-prox 4.5-1                 Globus Toolkit - Globus GSI Proxy
ii  libglobus-gsi-prox 2.3-1                 Globus Toolkit - Globus GSI Proxy
ii  libglobus-gsi-sysc 3.1-2                 Globus Toolkit - Globus GSI System
ii  libglobus-gss-assi 5.9-1                 Globus Toolkit - GSSAPI Assist lib
ii  libglobus-gssapi-e 2.5-7                 Globus Toolkit - GSSAPI Error Libr
ii  libglobus-gssapi-g 7.5-2                 Globus Toolkit - GSSAPI library
ii  libglobus-io3      6.3-8                 Globus Toolkit - uniform I/O inter
ii  libglobus-openssl- 1.3-1                 Globus Toolkit - Globus OpenSSL Mo
ii  libglobus-rsl2     7.2-2                 Globus Toolkit - Resource Specific
ii  libglobus-xio0     2.8-3                 Globus Toolkit - Globus XIO Framew
ii  libgssapi-krb5-2   1.8.3+dfsg-4squeeze5  MIT Kerberos runtime libraries - k
ii  libk5crypto3       1.8.3+dfsg-4squeeze5  MIT Kerberos runtime libraries - C
ii  libkrb5-3          1.8.3+dfsg-4squeeze5  MIT Kerberos runtime libraries
ii  libkrb5support0    1.8.3+dfsg-4squeeze5  MIT Kerberos runtime libraries - S
ii  libldap-2.4-2      2.4.23-7.2            OpenLDAP libraries
ii  libltdl7           2.2.6b-2              A system independent dlopen wrappe
ii  libpcre3           8.02-1.1              Perl 5 Compatible Regular Expressi
ii  libssl0.9.8        0.9.8o-4squeeze7      SSL shared libraries
ii  libstdc++6         4.4.5-8               The GNU Standard C++ Library v3
ii  libuuid1           2.17.2-9              Universally Unique ID library
ii  libvirt0           0.8.3-5+squeeze2      library for interfacing with diffe
ii  libxml2            2.7.8.dfsg-2+squeeze3 GNOME XML library
ii  perl               5.10.1-17squeeze3     Larry Wall's Practical Extraction
ii  zlib1g             1:1.2.3.4.dfsg-3      compression library - runtime

Versions of packages condor recommends:
ii  dmtcp              1.2.4-1
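The figures in the report (about 10GB at start, more than 150GB after two days) correspond to growth of roughly 140GB in 48 hours, or about 2.9GB per hour. When comparing runs under 7.7.4 and 7.7.5, it may help to turn periodic RSS readings into a single growth-rate number. The following is a minimal sketch, not anything shipped with Condor: the function name `growth_rate_kb_per_hour` and the sample values are hypothetical, and it assumes the readings are (seconds, kB) pairs collected by some external sampler.

```python
from typing import Sequence, Tuple


def growth_rate_kb_per_hour(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of RSS (kB) against time (s), scaled to kB/hour.

    `samples` is a sequence of (timestamp_seconds, rss_kb) pairs.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must cover more than one point in time")
    cov = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    # Slope is in kB per second; convert to kB per hour.
    return cov / var_t * 3600.0


if __name__ == "__main__":
    # Synthetic readings for illustration only: one sample per hour.
    readings = [(0.0, 1000.0), (3600.0, 2000.0), (7200.0, 3000.0)]
    print(growth_rate_kb_per_hour(readings))
```

A job that is expected to stay flat should give a rate close to zero; a steadily positive rate under one Condor version but not the other would support the observation in the report.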
Bug#667478: [condor-debian] Bug#667478: condor: RSS memory usage grows continuously for Condor jobs
On Apr 4, 2012, at 6:01 AM, Michael Hanke wrote:

> Package: condor
> Version: 7.7.5~dfsg.1-2
> Severity: normal
>
> Hi,
>
> We are running a backport of the Debian package of Condor 7.7.5 from experimental on a cluster of Debian stable machines. Since upgrading from 7.7.4 we noticed an increased memory demand for pretty much all jobs. I recently ran a week-long job that starts off at 10GB size and should not gain significant memory size throughout the process (as confirmed with Condor 7.7.4). After the upgrade to 7.7.5 the job continuously increases its memory demands and I have to kill it after two days when it exceeds 150GB consumption. However, the continuous growth is not limited to this particular job -- most types of long-running jobs on this machine are Python-based, though.
>
> Looking into the 7.7.5 changelog I see a number of memory-related aspects, but nothing that is a perfect match. I checked that this is not just about Condor reporting increasing memory consumption, but the respective cluster nodes actually run out of memory, because the job grows and grows.
>
> I'd be glad to get some feedback on what the problem could be and if there is a workaround.

I've scanned the changes in Condor 7.7.5 and I also don't see anything that would explain a change in the memory behavior of jobs. I assume you're submitting your jobs under the vanilla universe.

Have you tried logging into the execute nodes and running the programs interactively? That's a good way to test whether something other than Condor changed and is responsible for the difference.

You can also try running condor_ssh_to_job while a job is running to get an interactive session with the same environment as your job. You can examine the environment variables, etc. for any odd settings. You can even submit a sleep job, then use condor_ssh_to_job to start your program interactively in the environment Condor sets up, possibly tweaking environment variables first.
Thanks and regards,
Jaime Frey
UW-Madison Condor Team

--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Bug#667478: [condor-debian] Bug#667478: condor: RSS memory usage grows continuously for Condor jobs
On Apr 4, 2012 5:11 PM, Jaime Frey jf...@cs.wisc.edu wrote:

> I've scanned the changes in Condor 7.7.5 and I also don't see anything that would explain a change in the memory behavior of jobs. I assume you're submitting your jobs under the vanilla universe.

Yes.

> Have you tried logging into the execute nodes and running the programs interactively? That's a good way to test if something other than Condor changed and is responsible for the difference.

Yes, I did that and the problem is not present. This is what made me blame Condor and file this bug.

> You can also try running condor_ssh_to_job while a job is running to get an interactive session with the same environment as your job. You can examine the environment variables, etc. for any odd settings. You can even submit a sleep job, then use condor_ssh_to_job to start your program interactively in the environment Condor sets up, possibly tweaking environment variables first.

I haven't done that yet, and will test this next -- thanks for this suggestion! I will report back if I can replicate the behavior.

Thanks,
Michael
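The sleep-job approach suggested above can be sketched as follows. This is an illustrative example under stated assumptions, not something posted in the thread: the helper name `sleep_job_submit_description`, the one-hour duration, and the log file name are all hypothetical. It only generates a vanilla-universe submit description that runs /bin/sleep, which keeps a slot occupied long enough to attach to it.

```python
def sleep_job_submit_description(seconds: int = 3600) -> str:
    """Return a Condor submit description for a vanilla-universe sleep job.

    The job does nothing but sleep, so an interactive session can be
    opened inside its environment while it runs.
    """
    lines = [
        "universe   = vanilla",
        "executable = /bin/sleep",
        f"arguments  = {seconds}",
        "log        = sleep_job.log",  # hypothetical log file name
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the description; redirect it to a file to submit it.
    print(sleep_job_submit_description(), end="")
```

Saving the output to a file, passing that file to condor_submit, and then running condor_ssh_to_job with the resulting job ID (cluster.proc) should give a shell inside the job's environment, from which the real program can be started by hand and its environment variables inspected or changed.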
Bug#667478: [condor-debian] Bug#667478: condor: RSS memory usage grows continuously for Condor jobs
On Apr 4, 2012, at 10:34 AM, Michael Hanke wrote:

> On Apr 4, 2012 5:11 PM, Jaime Frey jf...@cs.wisc.edu wrote:
>
>> You can also try running condor_ssh_to_job while a job is running to get an interactive session with the same environment as your job. You can examine the environment variables, etc. for any odd settings. You can even submit a sleep job, then use condor_ssh_to_job to start your program interactively in the environment Condor sets up, possibly tweaking environment variables first.
>
> I haven't done that yet, and will test this next -- thanks for this suggestion! I will report back if I can replicate the behavior.

Also, can you confirm that it's the job itself that's bloating in size, and not, say, the condor_starter?

Thanks and regards,
Jaime Frey
UW-Madison Condor Team
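One way to tell the job apart from the condor_starter is to read the resident set size of each process directly from the kernel. The sketch below assumes a Linux execute node, where /proc/<pid>/status carries a `VmRSS:` line in kB; the function names are hypothetical, and the PIDs of the job and of its condor_starter have to be looked up separately (for example with ps).

```python
import re
from typing import Optional

# Matches a line such as "VmRSS:\t  10485760 kB" in /proc/<pid>/status.
_VMRSS = re.compile(r"^VmRSS:\s+(\d+)\s+kB", re.MULTILINE)


def parse_vmrss_kb(status_text: str) -> Optional[int]:
    """Extract the resident set size in kB from /proc/<pid>/status text.

    Returns None when no VmRSS line is present (e.g. kernel threads).
    """
    match = _VMRSS.search(status_text)
    return int(match.group(1)) if match else None


def read_vmrss_kb(pid: int) -> Optional[int]:
    """Read the current RSS in kB of the process with the given PID."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vmrss_kb(status_file.read())
```

Calling `read_vmrss_kb` periodically for both PIDs shows which of the two processes is actually growing.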
Bug#667478: [condor-debian] Bug#667478: condor: RSS memory usage grows continuously for Condor jobs
On Wed, Apr 04, 2012 at 11:03:02AM -0500, Jaime Frey wrote:

> Also, can you confirm that it's the job itself that's bloating in size, and not say the condor_starter?

Yes, it is the job, not the starter.

Michael

--
Michael Hanke
http://mih.voxindeserto.de