Bug#667478: condor: RSS memory usage grows continuously for Condor jobs

2012-04-04 Thread Michael Hanke
Package: condor
Version: 7.7.5~dfsg.1-2
Severity: normal

Hi,

We are running a backport of the Debian package of Condor 7.7.5 from
experimental on a cluster of Debian stable machines. Since upgrading
from 7.7.4 we noticed an increased memory demand for pretty much all
jobs.

I recently ran a week-long job that starts off at 10GB size and should
not gain significant memory size throughout the process (as confirmed
with Condor 7.7.4). After the upgrade to 7.7.5 the job continuously
increases its memory demands and I have to kill it after two days when it
exceeds 150GB consumption. The continuous growth is not limited to this
particular job, although most types of long-running jobs on this machine
are Python-based.

Looking into the 7.7.5 changelog I see a number of memory-related
changes, but nothing that is a perfect match. I checked that this is not
just Condor reporting increasing memory consumption: the respective
cluster nodes actually run out of memory because the job grows and grows.

I'd be glad to get some feedback on what the problem could be and if
there is a workaround.


Thanks.


-- System Information:
Debian Release: 6.0.4
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32-5-amd64 (SMP w/24 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages condor depends on:
ii  adduser             3.112+nmu2             add and remove users and groups
ii  debconf [debconf-2  1.5.36.1               Debian configuration management sy
ii  libc6               2.11.3-3               Embedded GNU C Library: Shared lib
ii  libcgroup1          0.37.1-1~nd60+1        Library to control and monitor con
ii  libclassad3         7.7.5~dfsg.1-2~nd60+1  library for Condor's classads expr
ii  libcomerr2          1.41.12-4stable1       common error description library
ii  libcurl3            7.21.0-2.1+squeeze1    Multi-protocol file transfer libra
ii  libdate-manip-perl  6.11-1                 module for manipulating dates
ii  libexpat1           2.0.1-7                XML parsing C library - runtime li
ii  libgcc1             1:4.4.5-8              GCC support library
ii  libglobus-callout0  0.7-6                  Globus Toolkit - Globus Callout Li
ii  libglobus-common0   11.5-2                 Globus Toolkit - Common Library
ii  libglobus-ftp-cont  2.11-2                 Globus Toolkit - GridFTP Control L
ii  libglobus-gass-tra  4.3-2                  Globus Toolkit - Globus Gass Trans
ii  libglobus-gram-cli  10.4-1                 Globus Toolkit - GRAM Client Libra
ii  libglobus-gram-pro  9.7-2                  Globus Toolkit - GRAM Protocol Lib
ii  libglobus-gsi-call  2.7-1                  Globus Toolkit - Globus GSI Callba
ii  libglobus-gsi-cert  6.6-1                  Globus Toolkit - Globus GSI Cert U
ii  libglobus-gsi-cred  3.5-1                  Globus Toolkit - Globus GSI Creden
ii  libglobus-gsi-open  0.14-6                 Globus Toolkit - Globus OpenSSL Er
ii  libglobus-gsi-prox  4.5-1                  Globus Toolkit - Globus GSI Proxy
ii  libglobus-gsi-prox  2.3-1                  Globus Toolkit - Globus GSI Proxy
ii  libglobus-gsi-sysc  3.1-2                  Globus Toolkit - Globus GSI System
ii  libglobus-gss-assi  5.9-1                  Globus Toolkit - GSSAPI Assist lib
ii  libglobus-gssapi-e  2.5-7                  Globus Toolkit - GSSAPI Error Libr
ii  libglobus-gssapi-g  7.5-2                  Globus Toolkit - GSSAPI library
ii  libglobus-io3       6.3-8                  Globus Toolkit - uniform I/O inter
ii  libglobus-openssl-  1.3-1                  Globus Toolkit - Globus OpenSSL Mo
ii  libglobus-rsl2      7.2-2                  Globus Toolkit - Resource Specific
ii  libglobus-xio0      2.8-3                  Globus Toolkit - Globus XIO Framew
ii  libgssapi-krb5-2    1.8.3+dfsg-4squeeze5   MIT Kerberos runtime libraries - k
ii  libk5crypto3        1.8.3+dfsg-4squeeze5   MIT Kerberos runtime libraries - C
ii  libkrb5-3           1.8.3+dfsg-4squeeze5   MIT Kerberos runtime libraries
ii  libkrb5support0     1.8.3+dfsg-4squeeze5   MIT Kerberos runtime libraries - S
ii  libldap-2.4-2       2.4.23-7.2             OpenLDAP libraries
ii  libltdl7            2.2.6b-2               A system independent dlopen wrappe
ii  libpcre3            8.02-1.1               Perl 5 Compatible Regular Expressi
ii  libssl0.9.8         0.9.8o-4squeeze7       SSL shared libraries
ii  libstdc++6          4.4.5-8                The GNU Standard C++ Library v3
ii  libuuid1            2.17.2-9               Universally Unique ID library
ii  libvirt0            0.8.3-5+squeeze2       library for interfacing with diffe
ii  libxml2             2.7.8.dfsg-2+squeeze3  GNOME XML library
ii  perl                5.10.1-17squeeze3      Larry Wall's Practical Extraction
ii  zlib1g              1:1.2.3.4.dfsg-3       compression library - runtime

Versions of packages condor recommends:
ii  dmtcp 1.2.4-1

Bug#667478: [condor-debian] Bug#667478: condor: RSS memory usage grows continuously for Condor jobs

2012-04-04 Thread Jaime Frey
I've scanned the changes in Condor 7.7.5 and I also don't see anything
that would explain a change in the memory behavior of jobs. I assume 
you're submitting your jobs under the vanilla universe.

Have you tried logging into the execute nodes and running the programs  
interactively? That's a good way to test if something other than Condor 
changed and is responsible for the difference.
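
For a shortened test run, one way to put a number on the interactive
behaviour is to record the peak resident set size of the program once it
exits. The following is only a sketch, assuming a Linux execute node
(where ru_maxrss is reported in KiB); the helper name peak_rss_kib is
made up for illustration and is not part of Condor.

```python
import resource
import subprocess
import sys


def peak_rss_kib(argv):
    """Run argv to completion and return the peak RSS of waited-for children.

    On Linux the value is in KiB. RUSAGE_CHILDREN reports the largest peak
    among all children this process has waited for so far, so call this
    from a fresh interpreter for each program you want to measure.
    """
    subprocess.run(argv, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


# Example: measure a trivial child process (a do-nothing Python interpreter).
print(peak_rss_kib([sys.executable, "-c", "pass"]), "KiB peak RSS")
```

Running the same measurement once from a login shell and once from inside
a Condor job gives two figures that can be compared directly.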

You can also try running condor_ssh_to_job while a job is running to get 
an interactive session with the same environment as your job. You can 
examine the environment variables, etc. for any odd settings. You can even
submit a sleep job, then use condor_ssh_to_job to start your program 
interactively in the environment Condor sets up, possibly tweaking 
environment variables first.
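
One way to look for odd settings is to save the output of `env` from a
normal login shell and from a condor_ssh_to_job session, then compare the
two. The sketch below assumes one NAME=value pair per line (multi-line
values are not handled); the function names and the EXAMPLE_VAR variable
are hypothetical and only serve the illustration.

```python
def parse_env_dump(text):
    """Parse `env` output (one NAME=value per line) into a dict."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '=' at all
            env[name] = value
    return env


def diff_env(interactive, condor):
    """Map each differing variable to (interactive_value, condor_value).

    None marks a variable that is absent on that side.
    """
    names = sorted(set(interactive) | set(condor))
    return {
        name: (interactive.get(name), condor.get(name))
        for name in names
        if interactive.get(name) != condor.get(name)
    }


# Made-up dumps; real ones would be read from the two saved files.
login = parse_env_dump("PATH=/usr/bin\nHOME=/home/user\n")
job = parse_env_dump("PATH=/usr/bin\nHOME=/home/user\nEXAMPLE_VAR=1\n")
print(diff_env(login, job))  # {'EXAMPLE_VAR': (None, '1')}
```

Any variable that shows up only on the Condor side is a candidate to
unset or change before starting the program by hand.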

Thanks and regards,
Jaime Frey
UW-Madison Condor Team







Bug#667478: [condor-debian] Bug#667478: condor: RSS memory usage grows continuously for Condor jobs

2012-04-04 Thread Michael Hanke
On Apr 4, 2012 5:11 PM, Jaime Frey jf...@cs.wisc.edu wrote:

 I've scanned the changes in Condor 7.7.5 and I also don't see anything
 that would explain a change in the memory behavior of jobs. I assume
 you're submitting your jobs under the vanilla universe.

Yes.

 Have you tried logging into the execute nodes and running the programs
 interactively? That's a good way to test if something other than Condor
 changed and is responsible for the difference.

Yes, I did that and the problem is not present. This is what made me blame
Condor and file this bug.

 You can also try running condor_ssh_to_job while a job is running to get
 an interactive session with the same environment as your job. You can
 examine the environment variables, etc. for any odd settings. You can even
 submit a sleep job, then use condor_ssh_to_job to start your program
 interactively in the environment Condor sets up, possibly tweaking
 environment variables first.

I haven't done that yet, and will test this next -- thanks for this
suggestion! I will report back if I can replicate the behavior.

Thanks,

Michael


Bug#667478: [condor-debian] Bug#667478: condor: RSS memory usage grows continuously for Condor jobs

2012-04-04 Thread Jaime Frey
Also, can you confirm that it's the job itself that's bloating in size, and
not, say, the condor_starter?
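
On Linux, a simple way to tell the two apart is to read VmRSS from
/proc/<pid>/status for both the job's process and the condor_starter and
watch which figure grows. This is a minimal sketch under that assumption;
the helper names are invented here, and the PIDs would have to be looked
up on the execute node (for example with ps).

```python
import os
import re


def parse_vmrss_kib(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None.

    Kernel threads have no VmRSS line, hence the None case.
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def rss_kib(pid):
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_vmrss_kib(fh.read())


# Example: report this interpreter's own RSS when /proc is available.
if os.path.exists("/proc/self/status"):
    print(os.getpid(), rss_kib("self"), "kB")
```

Calling rss_kib() for both PIDs at regular intervals and logging the
results shows which process accounts for the growth.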

Thanks and regards,
Jaime Frey
UW-Madison Condor Team



Bug#667478: [condor-debian] Bug#667478: condor: RSS memory usage grows continuously for Condor jobs

2012-04-04 Thread Michael Hanke
On Wed, Apr 04, 2012 at 11:03:02AM -0500, Jaime Frey wrote:
 Also, can you confirm that it's the job itself that's bloating in
 size, and not, say, the condor_starter?

Yes, it is the job, not the starter.

Michael


-- 
Michael Hanke
http://mih.voxindeserto.de


