Hi,

We are experiencing problems using Globus and Condor-G in our environment. Our setup is as follows:
We have a Condor cluster. The master is dual homed, with Condor bound to the external interface on the 10.15.x.x network using the following:

    NETWORK_INTERFACE = 10.15.109.77

Globus is installed on the master, both WS and pre-WS. All versions of Condor and Globus are installed from the latest stable version of VDT (VDT 1.8.1 - Condor 6.8.6, Globus 4.0.5). The submission node also has Condor and Globus installed from the latest stable version of VDT and submits via the master's external interface on 10.15.x.x. The execution nodes are in the 192.168.x.x range and communicate with Condor through the master's internal interface to its external interface (the execution nodes are dedicated cluster machines on their own subnet). All cluster and submission machines are running RHEL ES4 Update 4.

Condor job submission without Globus works fine; however, when we submit jobs via Globus we see two problems.

Problem 1
=========

We are experiencing some stability issues with Condor. After submitting several jobs using Condor-G we sometimes get matching problems, shown in the negotiator log as:

    11/23 16:39:54 Over submitter resource limit (0) ... only consider startd ranks
    11/23 16:39:54 Sending SEND_JOB_INFO/eom
    11/23 16:39:54 Getting reply from schedd ...
    11/23 16:39:54 Got JOB_INFO command; getting classad/eom
    11/23 16:39:54 Request 00017.00000:
    11/23 16:39:54 Rejected 17.0 [EMAIL PROTECTED] <10.15.109.77:65269>: no match found
    11/23 16:39:54 Sending SEND_JOB_INFO/eom
    11/23 16:39:54 Getting reply from schedd ...
    11/23 16:39:54 Got NO_MORE_JOBS; done negotiating
    11/23 16:39:54 This schedd hit its scheddlimit.

At this point, even if we revert to submitting jobs directly to Condor, we get the same message. The only thing that seems to fix it is a reboot.
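For anyone who wants to check their own negotiator log for the same pattern, here is a minimal Python sketch that counts the rejections and the "scheddlimit" messages. The function name summarize_negotiator_log and the regular expression are our own assumptions, based only on the log excerpt above; they are not part of Condor, and other Condor versions may format these lines differently.

```python
import re
from collections import Counter

# Matches lines such as:
#   "11/23 16:39:54 Rejected 17.0 <user> <ip:port>: no match found"
# The layout is assumed from the excerpt in this post only.
REJECT_RE = re.compile(r"Rejected (\d+\.\d+) .*: (.+)$")
LIMIT_MARKER = "This schedd hit its scheddlimit"

# Sample lines copied from the negotiator log excerpt above.
SAMPLE = [
    "11/23 16:39:54 Request 00017.00000:",
    "11/23 16:39:54 Rejected 17.0 [EMAIL PROTECTED] <10.15.109.77:65269>: no match found",
    "11/23 16:39:54 Got NO_MORE_JOBS; done negotiating",
    "11/23 16:39:54 This schedd hit its scheddlimit.",
]


def summarize_negotiator_log(lines):
    """Return (Counter of (job_id, reason) rejections, number of scheddlimit hits)."""
    rejections = Counter()
    limit_hits = 0
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            job_id, reason = match.group(1), match.group(2).strip()
            rejections[(job_id, reason)] += 1
        elif LIMIT_MARKER in line:
            limit_hits += 1
    return rejections, limit_hits


if __name__ == "__main__":
    rejected, hits = summarize_negotiator_log(SAMPLE)
    for (job_id, reason), count in rejected.items():
        print(f"job {job_id}: {reason} (x{count})")
    print(f"scheddlimit messages: {hits}")
```

To run it against a real log, pass an open file object instead of SAMPLE; the path of the negotiator log depends on the local Condor configuration.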
Problem 2
=========

When we submit a job to the master node, it arrives, runs as expected and then exits. On the submission node, however, the job appears idle until about a minute after it has actually finished (on short jobs lasting 10 seconds; we have not really tried any long ones yet). It then shows as running, for several times longer than the job actually took to run, and then exits.

On the master we see the following in the GRAM log, at around the time the submission node should be getting its status updated:

    11/26 09:32:41 JMI: poll: seeking: https://example.com:64002/13793/1196069371/
    11/26 09:32:41 JMI: poll_fast: ******** Failed to find https://example.com/13793/1196069371/
    11/26 09:32:41 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts)

We have also tried compiling the latest stable version of Globus and using the latest stable version of Condor (i.e. non-VDT), with the same results.

Does anyone have any idea what would cause either of these problems, or where to start looking? Please let me know if you need any more logs or config files; there are a lot of them and I didn't want to include a lot of unhelpful information.

Thanks,
Scott
