On Mar 6, 2008, at 11:44 AM, Silviu Popescu wrote:
Hi,
I've installed gt4.0.6 containers on several hosts ( 15 ) on Fedora
8 and there is a strange behavior on 2 of the hosts when submitting
gram jobs.
The only problem is it takes too long to return the job status.
e.g: a /bin/date takes 2 minutes (and 4 minutes with staging) on
the 2 hosts while on the other hosts it takes about 2 seconds. GridFTP
works fine.
When submitting between the 2 hosts it works fine ( or from those
hosts to other hosts ).
[EMAIL PROTECTED] time globusrun-ws -submit -F https://141.85.1.217:8443/wsrf/services/ManagedJobFactoryService
-c /bin/date
Submitting job...Done.
Job ID: uuid:0116ac5e-eb9b-11dc-a8db-0018f39fc34f
Termination time: 03/07/2008 16:33 GMT
--------------------here I wait long time------------------------
Current job state: Done
Destroying job...Done.
real 2m0.403s
[EMAIL PROTECTED] time globusrun-ws -submit -F https://141.85.1.208:8443/wsrf/services/ManagedJobFactoryService
-c /bin/date
Submitting job...Done.
Job ID: uuid:2b7b5fea-eb9a-11dc-bb86-0018f39fc34f
Termination time: 03/07/2008 16:27 GMT
Current job state: Active
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
real 0m1.639s
I think it may be some configuration error because I was configuring
the 2 hosts at the same time.
If anybody has any idea, please help me. It might be a stupid thing
that I can't figure out.
Thanks,
Silviu Popescu
What's likely happening is that the container is failing to deliver
notifications to your globusrun-ws client. As a fallback, globusrun-ws
polls the container for job status every minute, so the client can
normally recover from missed notifications. That is why job
submissions seem slow: you are seeing poll-based client behavior
instead of event-driven service behavior.
There are a few possible causes for notification failures (off the top
of my head):
- globusrun-ws is using host authorization on the notification
consumer and the container is connecting with an IP address that
doesn't resolve to the name that is in the container's certificate
- globusrun-ws is using a host name or address in the EPR of its
notification consumer which the container cannot contact. (Has an
internal IP address or is behind a firewall)
If it's the first case, you should be able to work around it by
explicitly setting the expected subject name to that of the
container's certificate with the -subject command-line option. You can
find the container's certificate subject by running
globusrun-ws -submit -self -F .... and then passing the name of the
remote entity reported in the error message as the argument to the
-subject command-line option.
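
As a rough sketch of the -subject workaround, the command could be assembled like this. The subject DN below is a made-up placeholder (the real one is whatever the -self error message reports), and the factory URL is simply the slow host from the original report. The script only prints the command so it can be checked before running.

```shell
#!/bin/sh
# Hypothetical subject DN: replace it with the remote entity name
# printed in the error message of the -self attempt.
CONTAINER_SUBJECT="/O=Grid/OU=Example/CN=host/container.example.org"

# Factory URL of one of the slow hosts, taken from the original report.
FACTORY_URL="https://141.85.1.217:8443/wsrf/services/ManagedJobFactoryService"

# Build the command line and print it for review instead of executing it.
SUBMIT_CMD="globusrun-ws -submit -subject \"$CONTAINER_SUBJECT\" -F $FACTORY_URL -c /bin/date"
echo "$SUBMIT_CMD"
```

If the printed line looks right, paste it into the shell. The quotes around the subject matter because a DN can contain spaces.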
If it's the second case, you should be able to work around by setting
GLOBUS_HOSTNAME to an externally visible IP address for the host and
maybe setting GLOBUS_TCP_PORT_RANGE as needed to work through a
firewall. (See http://dev.globus.org/wiki/FirewallHowTo)
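
A minimal sketch of the second workaround, to be run in the shell on the client before submitting. The address 192.0.2.10 and the port range 40000,40100 are illustrative assumptions only. Use the client's real externally visible address and a range that your firewall actually lets the container reach.

```shell
#!/bin/sh
# Assumed example values: 192.0.2.10 is a documentation-only address and
# the port range is arbitrary. Substitute values valid for your site.
export GLOBUS_HOSTNAME="192.0.2.10"
export GLOBUS_TCP_PORT_RANGE="40000,40100"

# Show what the client environment now contains.
echo "GLOBUS_HOSTNAME=$GLOBUS_HOSTNAME"
echo "GLOBUS_TCP_PORT_RANGE=$GLOBUS_TCP_PORT_RANGE"

# Then submit from this same shell as before, for example:
# globusrun-ws -submit -F https://141.85.1.217:8443/wsrf/services/ManagedJobFactoryService -c /bin/date
```

The variables only affect processes started from the shell that exported them, so the submission has to happen in the same session.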
Joe