Hi Charles,

As you may recall, previously my GRAM2 test failed because the underlying Condor did not have a shared $HOME on NFS.

After configuring network shared accounts and a shared $HOME (/home/xyz) for Condor, reading and writing files in the shared $HOME now works fine from every host (tested with bash).
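A check of this kind can be scripted roughly as follows. This is only a sketch, assuming bash and password-less ssh between the hosts. The function name `check_shared_home` and the `REMOTE_SH` variable are made up for illustration.

```shell
#!/bin/bash
# Check that a file written on this host is visible on other hosts that are
# supposed to mount the same home directory.
# REMOTE_SH is the command used to reach a host (default: ssh); it is a knob
# introduced for this sketch, not part of any existing tool.
check_shared_home() {
    local dir=$1; shift
    local marker="$dir/.shared_home_check.$$"
    local token="check-$$-$RANDOM"
    local status=0 host

    # Write a marker file locally ...
    echo "$token" > "$marker" || return 1

    # ... and try to read it back on every other host.
    for host in "$@"; do
        if [ "$(${REMOTE_SH:-ssh} "$host" cat "$marker" 2>/dev/null)" = "$token" ]; then
            echo "$host: OK"
        else
            echo "$host: cannot read $marker"
            status=1
        fi
    done

    rm -f "$marker"
    return $status
}

# Example (host names as listed under DETAILS below):
#   check_shared_home "$HOME" grid2 grid4
```

The function returns non-zero if any host fails to see the marker file, so it can be used in a larger test script.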

Also, Condor submit tests with vanilla jobs work OK without any need to specify file transfers.
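A vanilla-universe test of this kind can be sketched as below. It is illustrative only and not necessarily the exact submit file used: the function name and the output, error and log file names are placeholders. Setting `should_transfer_files = NO` is what makes the job rely on the shared file system.

```shell
#!/bin/bash
# Write a minimal Condor submit description for a vanilla-universe job that
# relies on the shared $HOME instead of Condor's file transfer mechanism.
# The output/error/log file names are placeholders.
write_vanilla_submit() {
    cat > "$1" <<'EOF'
universe              = vanilla
executable            = /bin/hostname
output                = hostname.out
error                 = hostname.err
log                   = hostname.log
should_transfer_files = NO
queue
EOF
}

# Example, run from a directory under the shared $HOME:
#   write_vanilla_submit hostname.sub
#   condor_submit hostname.sub
```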

However, the GRAM2 test now fails with a different result.

I think that the GRAM2 jobmanager decides that the job is a failure (see gram_job_mgr_5378.log below) and returns the error before passing any job to Condor.

Previously, all the failed jobs were queued in Condor, so it was perhaps obvious that something was wrong with the Condor installation or configuration. Those problems are now fixed and the error looks different: no job sits in the Condor queue without being executed, and there are no Condor log entries at all.

I have no clue what may be wrong.

Can you help?

Thanks,
Yoichi


------------------------------------------------------------------------------------------------------------------------
DETAILS:
------------------------------------------------------------------------------------------------------------------------

As before:

grid1: Condor Submit node
grid2: Condor Manager node/ GRAM2
grid4: Condor Execute node


------------------------------------------------------------------------------------------------------------------------
# su - yoichi

$ myproxy-logon -s grid2

Enter MyProxy pass phrase:

A credential has been received for user yoichi in /tmp/x509up_u500.

(just ping the gatekeeper)
$ globusrun -a -r grid2.ramscommunity.org/jobmanager-condor

GRAM Authentication test successful

$ globus-job-run grid2.ramscommunity.org/jobmanager-condor /bin/hostname

GRAM Job submission failed because data transfer to the server failed (error code 10)
------------------------------------------------------------------------------------------------------------------------


Unlike with the previous error, the Condor queue is now empty.


------------------------------------------------------------------------------------------------------------------------
$ condor_q

-- Submitter: grid1.ramscommunity.org : <137.111.246.175:9629> : grid1.ramscommunity.org
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
------------------------------------------------------------------------------------------------------------------------


/home/yoichi/.globus/job/grid2.ramscommunity.org is empty (no log file in it).

/usr/local/globus/var/ on grid1 (submit node) has not changed.

/usr/local/globus/var/globus-gatekeeper.log on grid2 (GRAM service):

------------------------------------------------------------------------------------------------------------------------
...
TIME: Sun Oct 19 18:06:26 2008
PID: 5329 -- Notice: 6: globus-gatekeeper pid=5329 starting at Sun Oct 19 18:06:26 2008

TIME: Sun Oct 19 18:06:26 2008
PID: 5329 -- Notice: 6: Got connection 137.111.246.175 at Sun Oct 19 18:06:26 2008

TIME: Sun Oct 19 18:06:26 2008
PID: 5329 -- Notice: 5: Authenticated globus user: /O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/OU=ramscommunity.org/CN=Yoichi Takayama
TIME: Sun Oct 19 18:06:26 2008
PID: 5329 -- Notice: 0: GRID_SECURITY_HTTP_BODY_FD=6
TIME: Sun Oct 19 18:06:26 2008
PID: 5329 -- Notice: 5: Requested service: jobmanager-condor
TIME: Sun Oct 19 18:06:26 2008
PID: 5329 -- Notice: 5: Authorized as local user: yoichi
TIME: Sun Oct 19 18:06:26 2008
PID: 5329 -- Notice: 5: Authorized as local uid: 500
TIME: Sun Oct 19 18:06:26 2008
PID: 5329 -- Notice: 5:           and local gid: 500
TIME: Sun Oct 19 18:06:26 2008
PID: 5329 -- Notice: 0: executing /usr/local/globus/libexec/globus-job-manager
TIME: Sun Oct 19 18:06:26 2008
PID: 5329 -- Notice: 0: GRID_SECURITY_CONTEXT_FD=9
TIME: Sun Oct 19 18:06:26 2008
PID: 5329 -- Notice: 0: Child 5330 started
------------------------------------------------------------------------------------------------------------------------


The only log I can find is /home/yoichi/gram_job_mgr_5378.log:


------------------------------------------------------------------------------------------------------------------------
10/19 18:18:59 JM: TARGET_GLOBUS_LOCATION = /usr/local/globus
10/19 18:18:59 JM: Security context imported
10/19 18:18:59 JM: Adding new callback contact (url=https://grid1.ramscommunity.org:49812/ , mask=1048575)
10/19 18:18:59 JM: Added successfully
10/19 18:18:59 Pre-parsed RSL string: &("rsl_substitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:37726" ) ) ("stderr" = $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
10/19 18:18:59
<<<<<Job Request RSL
&("rsl_substitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:37726" ) )("stderr" = $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job Request RSL
10/19 18:18:59
<<<<<Job Request RSL (canonical)
&("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:37726" ) )("stderr" = $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job Request RSL (canonical)
10/19 18:18:59 JM: Evaluating RSL Value
10/19 18:18:59 JM: Evaluated RSL Value to GLOBUSRUN_GASS_URL
10/19 18:18:59 JM: Evaluating RSL Value
10/19 18:18:59 JM: Evaluated RSL Value to https://grid1.ramscommunity.org:37726
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_MAKE_SCRATCHDIR
10/19 18:18:59
<<<<<Job RSL
&("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" ) ) ("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:37726" ) )("stderr" = $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
>>>>>Job RSL
10/19 18:18:59
<<<<<Job RSL (post-eval)
&("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" ) ) ("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:37726" ) )("stderr" = "https://grid1.ramscommunity.org:37726/dev/stderr" ) ("stdout" = "https://grid1.ramscommunity.org:37726/dev/stdout" ) ("executable" = "/bin/hostname" )
>>>>>Job RSL (post-eval)
Adding default RSL of proxy_timeout = 60
Adding default RSL of dry_run = no
Adding default RSL of gram_my_job = collective
Adding default RSL of job_type = multiple
Adding default RSL of count = 1
Adding default RSL of stdin = /dev/null
Adding default RSL of directory = $(HOME)
10/19 18:18:59
<<<<<Job RSL (post-validation)
&("directory" = $("HOME") )("stdin" = "/dev/null" )("count" = "1" ) ("job_type" = "multiple" )("gram_my_job" = "collective" )("dry_run" = "no" )("proxy_timeout" = "60" )("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" ) )("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:37726" ) ) ("stderr" = "https://grid1.ramscommunity.org:37726/dev/stderr" ) ("stdout" = "https://grid1.ramscommunity.org:37726/dev/stdout" ) ("executable" = "/bin/hostname" )
>>>>>Job RSL (post-validation)
10/19 18:18:59
<<<<<Job RSL (post-validation-eval)
&("directory" = "/home/yoichi" )("stdin" = "/dev/null" )("count" = "1" )("job_type" = "multiple" )("gram_my_job" = "collective" ) ("dry_run" = "no" )("proxy_timeout" = "60" )("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" ) )("rslsubstitution" = ("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:37726" ) ) ("stderr" = "https://grid1.ramscommunity.org:37726/dev/stderr" ) ("stdout" = "https://grid1.ramscommunity.org:37726/dev/stdout" ) ("executable" = "/bin/hostname" )
>>>>>Job RSL (post-validation-eval)
10/19 18:18:59 JMI: Getting RSL output value
10/19 18:18:59 JMI: Processing output positions
10/19 18:18:59 JMI: Getting RSL output value
10/19 18:18:59 JMI: Processing output positions
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_REMOTE_IO_FILE_CREATE
10/19 18:18:59 JM: Opening output destinations
10/19 18:18:59 JM: stdout goes to /home/yoichi/.globus/job/grid2.ramscommunity.org/5378.1224400739/stdout
10/19 18:18:59 JM: stderr goes to /home/yoichi/.globus/job/grid2.ramscommunity.org/5378.1224400739/stderr
10/19 18:18:59 JM: Opening https://grid1.ramscommunity.org:37726/dev/stdout
10/19 18:18:59 JM: Opened GASS handle 1.
10/19 18:18:59 JM: exiting globus_l_gram_job_manager_output_destination_open()
10/19 18:18:59 JM: Opening https://grid1.ramscommunity.org:37726/dev/stderr
10/19 18:18:59 JM: Opened GASS handle 2.
10/19 18:18:59 JM: exiting globus_l_gram_job_manager_output_destination_open()
10/19 18:18:59 stdout or stderr is being used, starting to poll
10/19 18:18:59 JM: Finished opening output destinations
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CLOSE_OUTPUT
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_PRE_FILE_CLEAN_UP
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_FILE_CLEAN_UP
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_SCRATCH_CLEAN_UP
10/19 18:18:59 JMI: testing job manager scripts for type condor exist and permissions are ok.
10/19 18:18:59 JMI: completed script validation: job manager type is condor.
10/19 18:18:59 JMI: cmd = cache_cleanup
Sun Oct 19 18:18:59 2008 JM_SCRIPT: New Perl JobManager created.
Sun Oct 19 18:18:59 2008 JM_SCRIPT: Using jm supplied job dir: /home/yoichi/.globus/job/grid2.ramscommunity.org/5378.1224400739
Sun Oct 19 18:18:59 2008 JM_SCRIPT: Using jm supplied job dir: /home/yoichi/.globus/job/grid2.ramscommunity.org/5378.1224400739
Sun Oct 19 18:18:59 2008 JM_SCRIPT: cache_cleanup(enter)
Sun Oct 19 18:18:59 2008 JM_SCRIPT: Cleaning files in job dir /home/yoichi/.globus/job/grid2.ramscommunity.org/5378.1224400739
Sun Oct 19 18:18:59 2008 JM_SCRIPT: Removed 3 files from /home/yoichi/.globus/job/grid2.ramscommunity.org/5378.1224400739
Sun Oct 19 18:18:59 2008 JM_SCRIPT: cache_cleanup(exit)
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CACHE_CLEAN_UP
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_RESPONSE
10/19 18:18:59 JM: before sending to client: rc=0 (Success)
10/19 18:18:59 Job Manager State Machine (exiting): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
10/19 18:18:59 JM: in globus_gram_job_manager_reporting_file_remove()
10/19 18:18:59 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
10/19 18:18:59 JM: in globus_gram_job_manager_reporting_file_remove()
10/19 18:18:59 JM: exiting globus_gram_job_manager.
------------------------------------------------------------------------------------------------------------------------
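
One quick way to see where the job manager gives up is to pull just the state names out of the log above. The following is a minimal sketch, assuming bash with standard grep and awk. The function names are made up for illustration, and the log path is the one quoted above.

```shell
#!/bin/bash
# List every job manager state named in a GRAM2 job manager log, in order.
# grep -o prints each match on its own line, so this also works when
# several log entries have been run together on one line.
jm_states() {
    grep -o 'GLOBUS_GRAM_JOB_MANAGER_STATE_[A-Z_]*' "$1"
}

# Print the state that was entered just before the first EARLY_FAILED state.
jm_last_state_before_failure() {
    jm_states "$1" | awk '
        /_STATE_EARLY_FAILED/ { print prev; exit }
        { prev = $0 }
    '
}

# Example (path as quoted above):
#   jm_last_state_before_failure /home/yoichi/gram_job_mgr_5378.log
```

For the log above, the last state before the failure is GLOBUS_GRAM_JOB_MANAGER_STATE_REMOTE_IO_FILE_CREATE, which suggests that the failure happens while the output destinations are being set up, before anything is handed to Condor.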


None of the Condor logs seem to contain anything about this job request.

On grid2 (GRAM2 host), /usr/local/globus/var/globus-condor.log has no entry for this job request.

On grid4 (Execute node), there are no records of this request in the Condor StartLog.


--------------------------------------------------------------------------
Yoichi Takayama, PhD
Senior Research Fellow
RAMP Project
MELCOE (Macquarie E-Learning Centre of Excellence)
MACQUARIE UNIVERSITY

Phone: +61 (0)2 9850 9073
Fax: +61 (0)2 9850 6527
www.mq.edu.au
www.melcoe.mq.edu.au/projects/RAMP/
--------------------------------------------------------------------------
MACQUARIE UNIVERSITY: CRICOS Provider No 00002J

This message is intended for the addressee named and may contain confidential information. If you are not the intended recipient, please delete it and notify the sender. Views expressed in this message are those of the individual sender, and are not necessarily the views of Macquarie E-Learning Centre Of Excellence (MELCOE) or Macquarie University.
