Hi Charles,

As you may recall, my previous GRAM2 test failed because the underlying Condor setup did not have a shared $HOME on NFS.
After configuring network shared accounts and a shared $HOME (/home/xyz) for Condor, reading and writing files under the shared $HOME now works fine (tested with bash) from every host.
Condor submit tests with vanilla jobs also work OK, without any need to specify file transfers.
However, the GRAM2 test now fails with a different result. I think that the GRAM2 jobmanager decides that the job is a failure (see gram_job_mgr_5378.log below) and returns the error before passing any job to Condor.
Previously, all failed jobs were at least queued with Condor, so it may have been obvious that something was wrong with the Condor installation/configuration. That has now been fixed, and the error looks different: no job is queued with Condor (not even one that sits there unexecuted), and there are no Condor log entries.
I have no clue what may be wrong. Can you help?

Thanks,
Yoichi

------------------------------------------------------------------------------------------------------------------------
DETAILS:
------------------------------------------------------------------------------------------------------------------------
As before:

grid1: Condor Submit node
grid2: Condor Manager node / GRAM2
grid4: Condor Execute node
------------------------------------------------------------------------------------------------------------------------
# su - yoichi
$ myproxy-logon -s grid2
Enter MyProxy pass phrase:
A credential has been received for user yoichi in /tmp/x509up_u500.

(just ping the gatekeeper)
$ globusrun -a -r grid2.ramscommunity.org/jobmanager-condor

GRAM Authentication test successful

$ globus-job-run grid2.ramscommunity.org/jobmanager-condor /bin/hostname
GRAM Job submission failed because data transfer to the server failed (error code 10)
------------------------------------------------------------------------------------------------------------------------
Unlike with the previous error, the Condor queue is now empty.
------------------------------------------------------------------------------------------------------------------------
$ condor_q

-- Submitter: grid1.ramscommunity.org : <137.111.246.175:9629> : grid1.ramscommunity.org
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
------------------------------------------------------------------------------------------------------------------------
/home/yoichi/.globus/job/grid2.ramscommunity.org is empty (no log file in it).
/usr/local/globus/var/ on grid1 (submit node) has not changed.

/usr/local/globus/var/globus-gatekeeper.log on grid2 (GRAM service)
------------------------------------------------------------------------------------------------------------------------
...
TIME: Sun Oct 19 18:06:26 2008 PID: 5329 -- Notice: 6: globus-gatekeeper pid=5329 starting at Sun Oct 19 18:06:26 2008
TIME: Sun Oct 19 18:06:26 2008 PID: 5329 -- Notice: 6: Got connection 137.111.246.175 at Sun Oct 19 18:06:26 2008
TIME: Sun Oct 19 18:06:26 2008 PID: 5329 -- Notice: 5: Authenticated globus user: /O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/OU=ramscommunity.org/CN=Yoichi Takayama
TIME: Sun Oct 19 18:06:26 2008 PID: 5329 -- Notice: 0: GRID_SECURITY_HTTP_BODY_FD=6
TIME: Sun Oct 19 18:06:26 2008 PID: 5329 -- Notice: 5: Requested service: jobmanager-condor
TIME: Sun Oct 19 18:06:26 2008 PID: 5329 -- Notice: 5: Authorized as local user: yoichi
TIME: Sun Oct 19 18:06:26 2008 PID: 5329 -- Notice: 5: Authorized as local uid: 500
TIME: Sun Oct 19 18:06:26 2008 PID: 5329 -- Notice: 5: and local gid: 500
TIME: Sun Oct 19 18:06:26 2008 PID: 5329 -- Notice: 0: executing /usr/local/globus/libexec/globus-job-manager
TIME: Sun Oct 19 18:06:26 2008 PID: 5329 -- Notice: 0: GRID_SECURITY_CONTEXT_FD=9
TIME: Sun Oct 19 18:06:26 2008 PID: 5329 -- Notice: 0: Child 5330 started
------------------------------------------------------------------------------------------------------------------------
The only log I can find is, e.g., /home/yoichi/gram_job_mgr_5378.log
------------------------------------------------------------------------------------------------------------------------
10/19 18:18:59 JM: TARGET_GLOBUS_LOCATION = /usr/local/globus
10/19 18:18:59 JM: Security context imported
10/19 18:18:59 JM: Adding new callback contact (url=https://grid1.ramscommunity.org:49812/, mask=1048575)
10/19 18:18:59 JM: Added successfully
10/19 18:18:59 Pre-parsed RSL string: &("rsl_substitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:37726" ) ) ("stderr" = $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
10/19 18:18:59 <<<<<Job Request RSL
&("rsl_substitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:37726" ) )("stderr" = $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job Request RSL
10/19 18:18:59 <<<<<Job Request RSL (canonical)
&("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:37726" ) )("stderr" = $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job Request RSL (canonical)
10/19 18:18:59 JM: Evaluating RSL Value
10/19 18:18:59 JM: Evaluated RSL Value to GLOBUSRUN_GASS_URL
10/19 18:18:59 JM: Evaluating RSL Value
10/19 18:18:59 JM: Evaluated RSL Value to https://grid1.ramscommunity.org:37726
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_MAKE_SCRATCHDIR
10/19 18:18:59 <<<<<Job RSL
&("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" ) ) ("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:37726" ) )("stderr" = $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job RSL
10/19 18:18:59 <<<<<Job RSL (post-eval)
&("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" ) ) ("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:37726" ) )("stderr" = "https://grid1.ramscommunity.org:37726/dev/stderr" ) ("stdout" = "https://grid1.ramscommunity.org:37726/dev/stdout" ) ("executable" = "/bin/hostname" )
>>>>>Job RSL (post-eval)
Adding default RSL of proxy_timeout = 60
Adding default RSL of dry_run = no
Adding default RSL of gram_my_job = collective
Adding default RSL of job_type = multiple
Adding default RSL of count = 1
Adding default RSL of stdin = /dev/null
Adding default RSL of directory = $(HOME)
10/19 18:18:59 <<<<<Job RSL (post-validation)
&("directory" = $("HOME") )("stdin" = "/dev/null" )("count" = "1" ) ("job_type" = "multiple" )("gram_my_job" = "collective" )("dry_run" = "no" )("proxy_timeout" = "60" )("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" ) )("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:37726" ) ) ("stderr" = "https://grid1.ramscommunity.org:37726/dev/stderr" ) ("stdout" = "https://grid1.ramscommunity.org:37726/dev/stdout" ) ("executable" = "/bin/hostname" )
>>>>>Job RSL (post-validation)
10/19 18:18:59 <<<<<Job RSL (post-validation-eval)
&("directory" = "/home/yoichi" )("stdin" = "/dev/null" )("count" = "1" )("job_type" = "multiple" )("gram_my_job" = "collective" ) ("dry_run" = "no" )("proxy_timeout" = "60" )("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" ) )("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:37726" ) ) ("stderr" = "https://grid1.ramscommunity.org:37726/dev/stderr" ) ("stdout" = "https://grid1.ramscommunity.org:37726/dev/stdout" ) ("executable" = "/bin/hostname" )
>>>>>Job RSL (post-validation-eval)
10/19 18:18:59 JMI: Getting RSL output value
10/19 18:18:59 JMI: Processing output positions
10/19 18:18:59 JMI: Getting RSL output value
10/19 18:18:59 JMI: Processing output positions
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_REMOTE_IO_FILE_CREATE
10/19 18:18:59 JM: Opening output destinations
10/19 18:18:59 JM: stdout goes to /home/yoichi/.globus/job/grid2.ramscommunity.org/5378.1224400739/stdout
10/19 18:18:59 JM: stderr goes to /home/yoichi/.globus/job/grid2.ramscommunity.org/5378.1224400739/stderr
10/19 18:18:59 JM: Opening https://grid1.ramscommunity.org:37726/dev/stdout
10/19 18:18:59 JM: Opened GASS handle 1.
10/19 18:18:59 JM: exiting globus_l_gram_job_manager_output_destination_open()
10/19 18:18:59 JM: Opening https://grid1.ramscommunity.org:37726/dev/stderr
10/19 18:18:59 JM: Opened GASS handle 2.
10/19 18:18:59 JM: exiting globus_l_gram_job_manager_output_destination_open()
10/19 18:18:59 stdout or stderr is being used, starting to poll
10/19 18:18:59 JM: Finished opening output destinations
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CLOSE_OUTPUT
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_PRE_FILE_CLEAN_UP
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_FILE_CLEAN_UP
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_SCRATCH_CLEAN_UP
10/19 18:18:59 JMI: testing job manager scripts for type condor exist and permissions are ok.
10/19 18:18:59 JMI: completed script validation: job manager type is condor.
10/19 18:18:59 JMI: cmd = cache_cleanup
Sun Oct 19 18:18:59 2008 JM_SCRIPT: New Perl JobManager created.
Sun Oct 19 18:18:59 2008 JM_SCRIPT: Using jm supplied job dir: /home/yoichi/.globus/job/grid2.ramscommunity.org/5378.1224400739
Sun Oct 19 18:18:59 2008 JM_SCRIPT: Using jm supplied job dir: /home/yoichi/.globus/job/grid2.ramscommunity.org/5378.1224400739
Sun Oct 19 18:18:59 2008 JM_SCRIPT: cache_cleanup(enter)
Sun Oct 19 18:18:59 2008 JM_SCRIPT: Cleaning files in job dir /home/yoichi/.globus/job/grid2.ramscommunity.org/5378.1224400739
Sun Oct 19 18:18:59 2008 JM_SCRIPT: Removed 3 files from /home/yoichi/.globus/job/grid2.ramscommunity.org/5378.1224400739
Sun Oct 19 18:18:59 2008 JM_SCRIPT: cache_cleanup(exit)
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CACHE_CLEAN_UP
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_RESPONSE
10/19 18:18:59 JM: before sending to client: rc=0 (Success)
10/19 18:18:59 Job Manager State Machine (exiting): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
10/19 18:18:59 JM: in globus_gram_job_manager_reporting_file_remove()
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
10/19 18:18:59 JM: in globus_gram_job_manager_reporting_file_remove()
10/19 18:18:59 JM: exiting globus_gram_job_manager.
------------------------------------------------------------------------------------------------------------------------
None of the Condor logs seem to contain anything about this job request.

On grid2 (GRAM2 host), /usr/local/globus/var/globus-condor.log has no entry for this job request.
On grid4 (Execute node), there are no records of this request in the Condor StartLog.
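
In case it helps when reading the job manager log, here is a minimal Python sketch that pulls out the state machine transitions and reports the last state entered before the first *FAILED* state. It assumes the log keeps the "Job Manager State Machine (entering): ..." format shown above; the function names and the inline sample are made up for illustration and are not part of Globus.

```python
import re

# Matches log records such as:
#   10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED
# (format assumed from the gram_job_mgr log excerpt in this message)
STATE_RE = re.compile(
    r"Job Manager State Machine \((entering|exiting)\): "
    r"GLOBUS_GRAM_JOB_MANAGER_STATE_([A-Z_]+)"
)


def state_transitions(log_text):
    """Return (direction, state) pairs in the order they appear in the log."""
    return [(m.group(1), m.group(2)) for m in STATE_RE.finditer(log_text)]


def last_state_before_failure(transitions):
    """Return the state entered just before the first *FAILED* state, or None."""
    previous = None
    for direction, state in transitions:
        if "FAILED" in state:
            return previous
        if direction == "entering":
            previous = state
    return None


# A few records copied from the log above, used as a small sample.
sample = """\
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_MAKE_SCRATCHDIR
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_REMOTE_IO_FILE_CREATE
10/19 18:18:59 JM: Finished opening output destinations
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED
"""

print(last_state_before_failure(state_transitions(sample)))
```

On the sample this prints REMOTE_IO_FILE_CREATE, i.e. the job manager goes into EARLY_FAILED right after opening the output destinations, which would be consistent with the failure happening before anything is handed to Condor.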
--------------------------------------------------------------------------
Yoichi Takayama, PhD
Senior Research Fellow
RAMP Project
MELCOE (Macquarie E-Learning Centre of Excellence)
MACQUARIE UNIVERSITY

Phone: +61 (0)2 9850 9073
Fax: +61 (0)2 9850 6527
www.mq.edu.au
www.melcoe.mq.edu.au/projects/RAMP/
--------------------------------------------------------------------------
MACQUARIE UNIVERSITY: CRICOS Provider No 00002J

This message is intended for the addressee named and may contain confidential information. If you are not the intended recipient, please delete it and notify the sender. Views expressed in this message are those of the individual sender, and are not necessarily the views of Macquarie E-Learning Centre Of Excellence (MELCOE) or Macquarie University.
