After opening GridFTP ports on grid1, I still get the same error message
with either globus-job-submit or globus-job-run.
grid1 submit -> grid2 GRAM server -> grid4 execute
$ globus-job-submit grid2.ramscommunity.org/jobmanager-condor /bin/hostname
GRAM Job submission failed because data transfer to the server failed (error code 10)
$ globus-job-run grid2.ramscommunity.org/jobmanager-condor /bin/hostname
GRAM Job submission failed because data transfer to the server failed (error code 10)
I don't get a job ID back, so I cannot trace the job with
globus-job-status, although I can see that it polls until the execute node
becomes available and then deletes all of its files.
Although the error message suggests that the request never reached the
server or the execute node, the job does reach grid2: I see it in the
gatekeeper and jobmanager logs. As the jobmanager log shows, it was sent
to grid4 for execution, but nothing was returned. Finally, the Condor
StartLog on grid4 shows that the job executed successfully.
On grid1, I see a GRAM log for this request for a very short time before
it is cleaned up. It leaves no file in
/home/yoichi/.globus/job/grid2.ramscommunity.org, either. So I can't tell
what's going on.
If I run globus-job-run, the GRAM log remains, so the attached GRAM log is
from globus-job-run, not from globus-job-submit. It may or may not show
the same error that globus-job-submit runs into, because the two commands
seem to behave differently.
Any suggestions?
The last option I can try is to install Globus Toolkit on the Condor
execute node, which I thought was not necessary.
Thanks,
Yoichi
------------------------------------------------------------------------------------------------------------------------
DETAILS
------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------
Firewall
Firewall on grid1:
ACCEPT tcp -- anywhere anywhere state NEW tcp
dpt:gsiftp (2811)
ACCEPT tcp -- anywhere anywhere state NEW tcp
dpt:7512 (myproxy-server) - should not be open
ACCEPT tcp -- anywhere anywhere state NEW tcp
dpt:pcsync-https (8443)
ACCEPT udp -- anywhere anywhere state NEW udp
dpt:gsiftp (2811)
ACCEPT udp -- anywhere anywhere state NEW udp
dpt:pcsync-https (8443)
ACCEPT tcp -- anywhere anywhere state NEW tcp
dpts:40000:41000
ACCEPT tcp -- anywhere anywhere state NEW tcp
dpt:webcache (8080)
ACCEPT udp -- anywhere anywhere state NEW udp
dpt:webcache (8080)
Firewall on grid2
ACCEPT tcp -- anywhere anywhere state NEW tcp
dpt:gsiftp (2811)
ACCEPT tcp -- anywhere anywhere state NEW tcp
dpt:myproxy-server (7512)
ACCEPT tcp -- anywhere anywhere state NEW tcp
dpt:pcsync-https (8443)
ACCEPT udp -- anywhere anywhere state NEW udp
dpt:gsiftp (2811)
ACCEPT udp -- anywhere anywhere state NEW udp
dpt:pcsync-https (8443)
ACCEPT tcp -- anywhere anywhere state NEW tcp
dpts:40000:41000
ACCEPT tcp -- anywhere anywhere state NEW tcp
dpt:gsigatekeeper (2119)
ACCEPT udp -- anywhere anywhere state NEW udp
dpt:gsigatekeeper (2119)
on grid4: (ports newly opened)
ACCEPT tcp -- anywhere anywhere state NEW tcp
dpt:gsiftp (2811)
ACCEPT udp -- anywhere anywhere state NEW udp
dpt:gsiftp (2811)
ACCEPT tcp -- anywhere anywhere state NEW tcp
dpts:40000:41000
------------------------------------------------------------------------------------------------------------------------
GridFTP xinetd configuration on grid4: (newly created)
$ vi /etc/inetd.d/gridftp
service gsiftp
{
instances = 100
socket_type = stream
wait = no
user = root
env += GLOBUS_LOCATION=/usr/local/globus
env += LD_LIBRARY_PATH=/usr/local/globus/lib
env += GLOBUS_TCP_PORT_RANGE=40000,41000
server = /usr/local/globus/sbin/globus-gridftp-server
server_args = -i
log_on_success += DURATION
disable = no
}
------------------------------------------------------------------------------------------------------------------------
Gatekeeper log
On grid2:
$ vi /usr/local/globus/var/globus-gatekeeper.log
...
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 6: globus-gatekeeper pid=9827 starting at Tue Oct 21
13:13:04 2008
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 6: Got connection 137.111.246.175 at Tue Oct 21
13:13:04 2008
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 5: Authenticated globus user:
/O=Grid/OU=GlobusTest/OU=simpleCA-grid2.ramscommunity.org/OU=ramscommunity.org/CN=Yoichi
Takayama
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 0: GRID_SECURITY_HTTP_BODY_FD=6
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 5: Requested service: jobmanager-condor
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 5: Authorized as local user: yoichi
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 5: Authorized as local uid: 500
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 5: and local gid: 500
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 0: executing
/usr/local/globus/libexec/globus-job-manager
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 0: GRID_SECURITY_CONTEXT_FD=9
TIME: Tue Oct 21 13:13:04 2008
PID: 9827 -- Notice: 0: Child 9828 started
------------------------------------------------------------------------------------------------------------------------
jobmanager log
on grid2: (this is actually from globus-job-submit earlier. Somehow
globus-job-run does not make any entry)
$ vi /usr/local/globus/var/globus-condor.log
<c>
<a n="MyType"><s>SubmitEvent</s></a>
<a n="EventTypeNumber"><i>0</i></a>
<a n="MyType"><s>SubmitEvent</s></a>
<a n="EventTime"><s>2008-10-21T13:02:12</s></a>
<a n="Cluster"><i>40</i></a>
<a n="Proc"><i>0</i></a>
<a n="Subproc"><i>0</i></a>
<a n="SubmitHost"><s><137.111.246.176:9646></s></a>
</c>
<c>
<a n="MyType"><s>ExecuteEvent</s></a>
<a n="EventTypeNumber"><i>1</i></a>
<a n="MyType"><s>ExecuteEvent</s></a>
<a n="EventTime"><s>2008-10-21T13:02:15</s></a>
<a n="Cluster"><i>40</i></a>
<a n="Proc"><i>0</i></a>
<a n="Subproc"><i>0</i></a>
<a n="ExecuteHost"><s><137.111.246.250:9649></s></a>
</c>
<c>
<a n="MyType"><s>JobTerminatedEvent</s></a>
<a n="EventTypeNumber"><i>5</i></a>
<a n="MyType"><s>JobTerminatedEvent</s></a>
<a n="EventTime"><s>2008-10-21T13:02:15</s></a>
<a n="Cluster"><i>40</i></a>
<a n="Proc"><i>0</i></a>
<a n="Subproc"><i>0</i></a>
<a n="TerminatedNormally"><b v="t"/></a>
<a n="ReturnValue"><i>0</i></a>
<a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
<a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
<a n="TotalLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
<a n="TotalRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
<a n="SentBytes"><r>0.000000000000000E+00</r></a>
<a n="ReceivedBytes"><r>0.000000000000000E+00</r></a>
<a n="TotalSentBytes"><r>0.000000000000000E+00</r></a>
<a n="TotalReceivedBytes"><r>0.000000000000000E+00</r></a>
</c>
------------------------------------------------------------------------------------------------------------------------
Condor log
on grid4: (the job from globus-job-submit seems to have executed
successfully, but the job from globus-job-run does not appear)
$ cat StartLog
...
10/21 13:02:12 match_info called
10/21 13:02:12 Received match <137.111.246.250:9649>#1224479079#12#...
10/21 13:02:12 State change: match notification protocol successful
10/21 13:02:12 Changing state: Unclaimed -> Matched
10/21 13:02:12 Request accepted.
10/21 13:02:12 Remote owner is [EMAIL PROTECTED]
10/21 13:02:12 State change: claiming protocol successful
10/21 13:02:12 Changing state: Matched -> Claimed
10/21 13:02:15 Got activate_claim request from shadow
(<137.111.246.176:9657>)
10/21 13:02:15 Remote job ID is 40.0
10/21 13:02:15 Got universe "VANILLA" (5) from request classad
10/21 13:02:15 State change: claim-activation protocol successful
10/21 13:02:15 Changing activity: Idle -> Busy
10/21 13:02:15 Called deactivate_claim_forcibly()
10/21 13:02:15 Starter pid 29441 exited with status 0
10/21 13:02:15 State change: starter exited
10/21 13:02:15 Changing activity: Busy -> Idle
10/21 13:02:15 State change: received RELEASE_CLAIM command
10/21 13:02:15 Changing state and activity: Claimed/Idle ->
Preempting/Vacating
10/21 13:02:15 State change: No preempting claim, returning to owner
10/21 13:02:15 Changing state and activity: Preempting/Vacating ->
Owner/Idle
10/21 13:02:15 State change: IS_OWNER is false
10/21 13:02:15 Changing state: Owner -> Unclaimed
10/21 13:15:20 State change: RunBenchmarks is TRUE
10/21 13:15:20 Changing activity: Idle -> Benchmarking
10/21 13:15:26 State change: benchmarks completed
10/21 13:15:26 Changing activity: Benchmarking -> Idle
StarterLog also shows that the globus-job-submit job ran successfully.
There is no sign of the globus-job-run job.
$ cat StarterLog
...
10/21 13:02:15 Using config source:
/nfs/software/condor/7.0.4/etc/condor_config
10/21 13:02:15 Using local config sources:
10/21 13:02:15 /scratch/condor/condor_config.local
10/21 13:02:15 DaemonCore: Command Socket at <137.111.246.250:9622>
10/21 13:02:15 Done setting resource limits
10/21 13:02:15 Communicating with shadow <137.111.246.176:9645>
10/21 13:02:15 Submitting machine is "grid2.ramscommunity.org"
10/21 13:02:15 setting the orig job name in starter
10/21 13:02:15 setting the orig job iwd in starter
10/21 13:02:15 Job 40.0 set to execute immediately
10/21 13:02:15 Starting a VANILLA universe job with ID: 40.0
10/21 13:02:15 IWD: /home/yoichi
10/21 13:02:15 Output file:
/home/yoichi/.globus/job/grid2.ramscommunity.org/9796.1224554532/stdout
10/21 13:02:15 Error file:
/home/yoichi/.globus/job/grid2.ramscommunity.org/9796.1224554532/stderr
10/21 13:02:15 About to exec /bin/hostname
10/21 13:02:15 Create_Process succeeded, pid=29442
10/21 13:02:15 Process exited, pid=29442, status=0
10/21 13:02:15 Got SIGQUIT. Performing fast shutdown.
10/21 13:02:15 ShutdownFast all jobs.
------------------------------------------------------------------------------------------------------------------------
GRAM log (this is from globus-job-run, not from globus-job-submit.
globus-job-submit polls Condor until it becomes available, but deletes the
files immediately after Condor has executed the request, so it is
impossible to see the GRAM log or output file for it)
On grid1: gram log
$ cat gram_job_mgr_9828.log
10/21 13:13:04 JM: TARGET_GLOBUS_LOCATION = /usr/local/globus
10/21 13:13:04 JM: Security context imported
10/21 13:13:04 JM: Adding new callback contact
(url=https://grid1.ramscommunity.org:56799/, mask=1048575)
10/21 13:13:04 JM: Added successfully
10/21 13:13:04 Pre-parsed RSL string: &("rsl_substitution" =
("GLOBUSRUN_GASS_URL" "https://grid1.ramscommunity.org:36421" ) )("stderr"
= $("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" =
$("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
10/21 13:13:04
<<<<<Job Request RSL
&("rsl_substitution" = ("GLOBUSRUN_GASS_URL"
"https://grid1.ramscommunity.org:36421" ) )("stderr" =
$("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" =
$("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
Job Request RSL
10/21 13:13:04
<<<<<Job Request RSL (canonical)
&("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
"https://grid1.ramscommunity.org:36421" ) )("stderr" =
$("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" =
$("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
Job Request RSL (canonical)
10/21 13:13:04 JM: Evaluating RSL Value
10/21 13:13:04 JM: Evaluated RSL Value to GLOBUSRUN_GASS_URL
10/21 13:13:04 JM: Evaluating RSL Value
10/21 13:13:04 JM: Evaluated RSL Value to
https://grid1.ramscommunity.org:36421
10/21 13:13:04 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_MAKE_SCRATCHDIR
10/21 13:13:04
<<<<<Job RSL
&("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" )
)("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
"https://grid1.ramscommunity.org:36421" ) )("stderr" =
$("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" =
$("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/hostname" )
Job RSL
10/21 13:13:04
<<<<<Job RSL (post-eval)
&("environment" = ("HOME" "/home/yoichi" ) ("LOGNAME" "yoichi" )
)("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
"https://grid1.ramscommunity.org:36421" ) )("stderr" =
"https://grid1.ramscommunity.org:36421/dev/stderr" )("stdout" =
"https://grid1.ramscommunity.org:36421/dev/stdout" )("executable" =
"/bin/hostname" )
Job RSL (post-eval)
Adding default RSL of proxy_timeout = 60
Adding default RSL of dry_run = no
Adding default RSL of gram_my_job = collective
Adding default RSL of job_type = multiple
Adding default RSL of count = 1
Adding default RSL of stdin = /dev/null
Adding default RSL of directory = $(HOME)
10/21 13:13:04
<<<<<Job RSL (post-validation)
&("directory" = $("HOME") )("stdin" = "/dev/null" )("count" = "1"
)("job_type" = "multiple" )("gram_my_job" = "collective" )("dry_run" =
"no" )("proxy_timeout" = "60" )("environment" = ("HOME" "/home/yoichi" )
("LOGNAME" "yoichi" ) )("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
"https://grid1.ramscommunity.org:36421" ) )("stderr" =
"https://grid1.ramscommunity.org:36421/dev/stderr" )("stdout" =
"https://grid1.ramscommunity.org:36421/dev/stdout" )("executable" =
"/bin/hostname" )
Job RSL (post-validation)
10/21 13:13:04
<<<<<Job RSL (post-validation-eval)
&("directory" = "/home/yoichi" )("stdin" = "/dev/null" )("count" = "1"
)("job_type" = "multiple" )("gram_my_job" = "collective" )("dry_run" =
"no" )("proxy_timeout" = "60" )("environment" = ("HOME" "/home/yoichi" )
("LOGNAME" "yoichi" ) )("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
"https://grid1.ramscommunity.org:36421" ) )("stderr" =
"https://grid1.ramscommunity.org:36421/dev/stderr" )("stdout" =
"https://grid1.ramscommunity.org:36421/dev/stdout" )("executable" =
"/bin/hostname" )
Job RSL (post-validation-eval)
10/21 13:13:04 JMI: Getting RSL output value
10/21 13:13:04 JMI: Processing output positions
10/21 13:13:04 JMI: Getting RSL output value
10/21 13:13:04 JMI: Processing output positions
10/21 13:13:04 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_REMOTE_IO_FILE_CREATE
10/21 13:13:04 JM: Opening output destinations
10/21 13:13:04 JM: stdout goes to
/home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184/stdout
10/21 13:13:04 JM: stderr goes to
/home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184/stderr
10/21 13:13:04 JM: Opening
https://grid1.ramscommunity.org:36421/dev/stdout
10/21 13:13:04 JM: Opened GASS handle 1.
10/21 13:13:04 JM: exiting
globus_l_gram_job_manager_output_destination_open()
10/21 13:13:04 JM: Opening
https://grid1.ramscommunity.org:36421/dev/stderr
10/21 13:13:04 JM: Opened GASS handle 2.
10/21 13:13:04 JM: exiting
globus_l_gram_job_manager_output_destination_open()
10/21 13:13:04 stdout or stderr is being used, starting to poll
10/21 13:13:04 JM: Finished opening output destinations
10/21 13:13:04 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED
10/21 13:13:04 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CLOSE_OUTPUT
10/21 13:13:04 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_PRE_FILE_CLEAN_UP
10/21 13:13:04 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_FILE_CLEAN_UP
10/21 13:13:04 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_SCRATCH_CLEAN_UP
10/21 13:13:04 JMI: testing job manager scripts for type condor exist and
permissions are ok.
10/21 13:13:04 JMI: completed script validation: job manager type is
condor.
10/21 13:13:04 JMI: cmd = cache_cleanup
Tue Oct 21 13:13:04 2008 JM_SCRIPT: New Perl JobManager created.
Tue Oct 21 13:13:04 2008 JM_SCRIPT: Using jm supplied job dir:
/home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184
Tue Oct 21 13:13:04 2008 JM_SCRIPT: Using jm supplied job dir:
/home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184
Tue Oct 21 13:13:04 2008 JM_SCRIPT: cache_cleanup(enter)
Tue Oct 21 13:13:04 2008 JM_SCRIPT: Cleaning files in job dir
/home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184
Tue Oct 21 13:13:04 2008 JM_SCRIPT: Removed 3 files from
/home/yoichi/.globus/job/grid2.ramscommunity.org/9828.1224555184
Tue Oct 21 13:13:04 2008 JM_SCRIPT: cache_cleanup(exit)
10/21 13:13:04 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CACHE_CLEAN_UP
10/21 13:13:04 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_RESPONSE
10/21 13:13:04 JM: before sending to client: rc=0 (Success)
10/21 13:13:04 Job Manager State Machine (exiting):
GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
10/21 13:13:04 JM: in globus_gram_job_manager_reporting_file_remove()
10/21 13:13:04 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
10/21 13:13:04 JM: in globus_gram_job_manager_reporting_file_remove()
10/21 13:13:04 JM: exiting globus_gram_job_manager.
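One thing that can be checked from the log above: the callback contact and
the GASS URL on grid1 use ports 56799 and 36421, while the firewall
listing for grid1 opens TCP 40000:41000. The sketch below only compares
those numbers. It assumes the firewall drops anything not listed, which
the listing does not show, and port_in_range is a hypothetical helper
name, not a Globus tool.

```shell
#!/bin/sh
# Print "inside" if PORT lies within LOW..HIGH (inclusive), else "outside".
# port_in_range is a hypothetical helper written for this check.
port_in_range() {
    if [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]; then
        echo inside
    else
        echo outside
    fi
}

# Ports copied from the GRAM log (callback contact and GASS URL);
# range copied from the grid1 firewall listing (dpts:40000:41000).
for port in 56799 36421; do
    echo "$port: $(port_in_range "$port" 40000 41000)"
done
```

If both ports are reported as outside the range, one possible experiment
is to export GLOBUS_TCP_PORT_RANGE=40000,41000 in the shell on grid1
before submitting (the same variable already set in the xinetd file on
grid4) and see whether the logged ports move into the open range. Whether
this is the actual cause of error 10 is not established here.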
--------------------------------------------------------------------------
Yoichi Takayama, PhD
Senior Research Fellow
RAMP Project
MELCOE (Macquarie E-Learning Centre of Excellence)
MACQUARIE UNIVERSITY
Phone: +61 (0)2 9850 9073
Fax: +61 (0)2 9850 6527
www.mq.edu.au
www.melcoe.mq.edu.au/projects/RAMP/
--------------------------------------------------------------------------
On 21/10/2008, at 2:18 AM, Charles Bacon wrote:
I don't know what's wrong. The error 155 in the gram log you show
suggests that it was unable to transfer the output back to the client,
but I don't know why it's showing up as an error 10 in the client instead
of the error 155 I see in the logs on the server side. It seems possible
that you've got a firewall that's preventing the jobmanager from
contacting the client. You could test that theory by using
globus-job-submit instead of globus-job-run, then running
globus-job-status to see the results. That method shouldn't involve
callbacks. If that works, you could then try globus-job-get-output to
retrieve the results.
Charles
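
The sequence suggested above (globus-job-submit, then globus-job-status,
then globus-job-get-output) could be scripted roughly as follows. This is
a minimal sketch, not a tested procedure. The function name
submit_and_fetch and the variables SUBMIT_CMD, STATUS_CMD, OUTPUT_CMD,
MAX_POLLS and POLL_INTERVAL are hypothetical, and it assumes that
globus-job-submit prints the job contact on stdout and that
globus-job-status reports DONE or FAILED when the job finishes.

```shell
#!/bin/sh
# Command names are overridable so the flow can be dry-run with stand-ins.
: "${SUBMIT_CMD:=globus-job-submit}"
: "${STATUS_CMD:=globus-job-status}"
: "${OUTPUT_CMD:=globus-job-get-output}"

# Usage: submit_and_fetch CONTACT_STRING EXECUTABLE
submit_and_fetch() {
    # Assumption: a successful submit prints the job contact URL on stdout.
    contact=$("$SUBMIT_CMD" "$1" "$2") || return 1
    echo "job contact: $contact" >&2

    i=0
    while [ "$i" -lt "${MAX_POLLS:-60}" ]; do
        state=$("$STATUS_CMD" "$contact")
        echo "state: $state" >&2
        case "$state" in
            DONE)   "$OUTPUT_CMD" "$contact"; return ;;  # fetch stdout
            FAILED) return 1 ;;
        esac
        i=$((i + 1))
        sleep "${POLL_INTERVAL:-5}"
    done
    return 1  # gave up after MAX_POLLS status checks
}
```

With the Globus client tools on the PATH, it would be called as
submit_and_fetch grid2.ramscommunity.org/jobmanager-condor /bin/hostname.
In the situation described in this thread the submit step itself fails
with error code 10, so the function would return before polling.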