[gt-user] Condor-g problems

scott fletcher (BITS) Mon, 26 Nov 2007 04:00:14 -0800

Hi,

We are experiencing problems using Globus and Condor-G in our
environment. Our setup is as follows:


We have a Condor cluster. The master is dual-homed, with Condor bound to
the external interface (10.15.x.x network) using the following:

NETWORK_INTERFACE = 10.15.109.77

Globus, both WS and pre-WS, is installed on the master. All versions of
Condor and Globus are installed from the latest stable version of VDT
(VDT 1.8.1: Condor 6.8.6, Globus 4.0.5). The submission node also has
Condor and Globus installed from the latest stable version of VDT and
submits via the master's external interface (10.15.x.x).

The execution nodes are in the 192.168.x.x range and communicate with
Condor through the master's internal interface to its external interface
(the execution nodes are dedicated cluster machines on their own
subnet).

All cluster and submission machines are running RHEL ES4 Update 4.

Condor job submission without Globus works fine. However, when we
submit jobs via Globus we see two problems.

Problem 1
=========
We are experiencing some stability issues with Condor. After submitting
several jobs using Condor-G we sometimes experience matching problems,
shown in the negotiator log as:

11/23 16:39:54     Over submitter resource limit (0) ... only consider startd ranks
11/23 16:39:54     Sending SEND_JOB_INFO/eom
11/23 16:39:54     Getting reply from schedd ...
11/23 16:39:54     Got JOB_INFO command; getting classad/eom
11/23 16:39:54     Request 00017.00000:
11/23 16:39:54       Rejected 17.0 [EMAIL PROTECTED] <10.15.109.77:65269>: no match found
11/23 16:39:54     Sending SEND_JOB_INFO/eom
11/23 16:39:54     Getting reply from schedd ...
11/23 16:39:54     Got NO_MORE_JOBS;  done negotiating
11/23 16:39:54   This schedd hit its scheddlimit.

At this point, even if we revert to submitting jobs directly to Condor,
we get the same message. The only thing that seems to fix it is a reboot.
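
A minimal Python sketch for tallying such rejections from a negotiator
log is shown here. It assumes each log entry sits on a single line in
the file and follows the "Rejected <job> <owner> <addr>: <reason>" shape
seen in the excerpt; count_rejections is a hypothetical helper name, not
part of Condor.

```python
import re
from collections import Counter

# Assumed entry shape, taken from the excerpt in this post:
#   <date> <time>  Rejected <cluster.proc> <owner> <ip:port>: <reason>
REJECT_RE = re.compile(r"Rejected\s+(\d+\.\d+)\s.*>:\s*(.+)$")


def count_rejections(lines):
    """Count 'Rejected' entries, keyed by (job id, reason)."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line.rstrip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_rejections.py /path/to/NegotiatorLog
    with open(sys.argv[1]) as log:
        for (job, reason), n in count_rejections(log).items():
            print(job, reason, n)
```

Running this before and after the matching failures start may show
whether every job is rejected for the same reason once the problem
appears.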

Problem 2
=========
When we submit a job to the master node, it arrives, runs as expected,
and exits. On the submission node, however, the job appears idle until
about a minute after it has actually finished (we have only tried short
jobs of about 10 seconds so far, not long ones). It then shows as
running, for several times longer than the job actually took to run,
and then exits. On the master we see the following in the GRAM log, at
around the time the submission node should be getting its status
updated:

11/26 09:32:41 JMI: poll: seeking: https://example.com:64002/13793/1196069371/
11/26 09:32:41 JMI: poll_fast: ******** Failed to find https://example.com/13793/1196069371/
11/26 09:32:41 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts)
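
To see how often this happens, the "seeking" and "Failed to find"
entries can be paired up with a short script. This is only a sketch: it
assumes each GRAM log entry is on one line in the file and matches the
wording of the excerpt, and failed_polls is a hypothetical helper name.

```python
import re

# Patterns follow the wording of the GRAM log excerpt in this post.
SEEK_RE = re.compile(r"JMI: poll: seeking:\s*(\S+)")
FAIL_RE = re.compile(r"JMI: poll_fast: \*+ Failed to find\s*(\S+)")


def failed_polls(lines):
    """Return (sought_url, missing_url) pairs for each failed fast poll."""
    pairs = []
    sought = None  # most recent URL the job manager said it was seeking
    for line in lines:
        seek = SEEK_RE.search(line)
        if seek:
            sought = seek.group(1)
            continue
        fail = FAIL_RE.search(line)
        if fail:
            pairs.append((sought, fail.group(1)))
    return pairs


if __name__ == "__main__":
    import sys

    # Usage: python failed_polls.py /path/to/gram_job_mgr.log
    with open(sys.argv[1]) as log:
        for sought, missing in failed_polls(log):
            flag = "" if sought == missing else "  (URLs differ)"
            print(f"sought={sought} missing={missing}{flag}")
```

Printing both URLs side by side makes it easy to check whether the
contact string that was sought is the same one reported as missing.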

We have also tried compiling the latest stable version of Globus and
using the latest stable version of Condor (i.e. non-VDT), but with the
same results in every case.

Does anyone have any idea what would cause either of these problems or
even an idea where to start looking? Please let me know if you need any
more logs/config files as there are lots and I didn't want to just
include a lot of non-helpful information.

Thanks,

Scott

