I just found this page:
http://www.globus.org/api/c-globus-4.0/globus_scheduler_event_generator/html/group__seg__api.html
So it *is* a failure code... wow. Okay, now can someone point me to
a list of such failure codes so I can choose something more generic
or appropriate than what we're currently using? Thanks again,
Adam
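For reference, the bare 5 discussed in this thread can be given a name in the SEG source so the intent is visible at the call site. The following is only a minimal sketch: the enum, seg_failure_label, and format_failure_log are hypothetical helpers, not part of the Globus API, and the only value taken from this thread is 5, which WS-GRAM reported as "Invalid executable path". Any other code should be checked against the GRAM protocol constants shipped with the installed toolkit.

```c
#include <assert.h>
#include <stdio.h>

/* The only value confirmed by this thread: WS-GRAM turned failure code 5
 * into an "Invalid executable path" fault. The enum name is hypothetical. */
enum seg_failure_code {
    SEG_FAILURE_EXECUTABLE_NOT_FOUND = 5
};

/* Human-readable label for a failure code, for the SEG's own log file. */
static const char *seg_failure_label(int failure_code)
{
    switch (failure_code) {
    case SEG_FAILURE_EXECUTABLE_NOT_FOUND:
        return "invalid executable path";
    default:
        /* Codes not listed here must be looked up in the toolkit headers. */
        return "unlisted code";
    }
}

/* Formats the log line written next to the SEG event; returns the
 * snprintf() result (number of characters the full line needs). */
static int format_failure_log(char *buf, size_t len,
                              const char *wu_name, int failure_code)
{
    assert(buf != NULL && wu_name != NULL);
    return snprintf(buf, len, "Set %s to failed (code %d: %s)",
                    wu_name, failure_code, seg_failure_label(failure_code));
}
```

With this in place the call would read globus_scheduler_event_failed(ts, wu_name, SEG_FAILURE_EXECUTABLE_NOT_FOUND), and the log line would state which code was sent.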
On Fri, Apr 4, 2008 at 10:39 AM, Adam Bazinet
<[EMAIL PROTECTED]> wrote:
Thanks for the replies, guys. I think there is an actual underlying
job failure (my fault for not finding it earlier) and the error
message is a complete red herring.
Basically, we have a SEG that we wrote from scratch for BOINC jobs,
and in the case that a job fails for some reason the following code
gets executed:
// Job failed: report the failure to the SEG with failure code 5.
time_t ts = time(NULL);  // event timestamp
globus_scheduler_event_failed(ts, wu_name, 5);
fprintf(fp, "%s : Set %s to failed\n", asctime(t), wu_name);
break;
Now, my guess is that the "Invalid executable path" error reported
earlier is just some kind of default error message, mainly because I
don't see any place in our code where we specify the error message,
unless that "5" as the last argument to
globus_scheduler_event_failed means something specific. Does anyone
know what that last argument means? I looked quickly here
(http://www.globus.org/toolkit/docs/development/3.9.4/execution/wsgram/developer/scheduler-tutorial-seg.html)
but didn't see it.
Thanks,
Adam
On Wed, Apr 2, 2008 at 9:54 AM, Stuart Martin <[EMAIL PROTECTED]>
wrote:
Adam,
If we are dealing with a tmp dir, a cluster, and many jobs/lots of
activity on the cluster, this could be a problem with the shared
file system. Maybe occasionally, the compute host where the job is
run cannot see the tmp dir / executable?
-Stu
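The shared-file-system hypothesis above could be probed from a compute host with a small check like the one below. This is only a sketch under stated assumptions: wait_for_executable, the number of attempts, and the delay are hypothetical and not part of Globus or BOINC. It relies on the POSIX calls access() and sleep().

```c
#include <errno.h>
#include <unistd.h>

/* Polls until `path` is visible and executable from this host.
 * Returns 0 on success, otherwise the errno from the last failed check
 * (ENOENT if no check was made because attempts <= 0). */
static int wait_for_executable(const char *path, int attempts,
                               unsigned delay_seconds)
{
    int last_errno = ENOENT;
    for (int i = 0; i < attempts; i++) {
        if (access(path, X_OK) == 0) {
            return 0;               /* the path is visible right now */
        }
        last_errno = errno;         /* remember why the check failed */
        if (i + 1 < attempts) {
            sleep(delay_seconds);   /* give a slow shared mount time */
        }
    }
    return last_errno;
}
```

If the check fails on the first attempt but succeeds on a later one, that would point toward intermittent visibility of the shared directory rather than a genuinely missing file.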
On Apr 1, 2008, at 11:41 PM, [EMAIL PROTECTED] wrote:
Hi Adam,
I can't say much about it right now, but at first glance it looks
to me like the application causes the problem. I have to add that
I don't know BOINC and probably didn't understand all the details.
Can you describe the role of hmmpfam a bit more:
* this is not the main executable, right?
* it is called by the main executable somehow under certain
conditions?
* if so: you said that hmmpfam actually should not be used at all.
In what situation could the executable call hmmpfam (I love
that word ... :-) )?
Martin
Hi,
We are experiencing a strange problem that is causing jobs to fail,
albeit somewhat randomly and infrequently. At any given time, we may
have 100 active GRAM jobs on a given resource. All of these jobs
submit fine and there are usually no immediate failures. However,
every so often, one will fail, and this can be days after it was
submitted, with the following error:
[EMAIL PROTECTED]:/export/grid_files/260600020.09477316738932795> globusrun-ws -status -j jobEPR.txt
Current job state: Failed
globusrun-ws: Job failed: Invalid executable path
"/export/scratch/applications/a5671f0138bc65dc700001aa80a3f378/hmmpfam".
ProcessDied
Now, the resource we are submitting to is unique in that we are not
actually transferring in the hmmpfam executable; that is just a dummy
path, and our custom BOINC job manager does not attempt to make use
of it, as BOINC executables live elsewhere. So until recently, the
executable specified on that path never existed, and so the error
*kinda* made sense; what didn't make sense is why it happened
randomly. In an attempt to make this problem go away, I now have the
BOINC job manager create a dummy executable on that path when the job
is submitted, but it doesn't look like that has helped because the
error is still popping up. *Now* the error message certainly doesn't
make sense if taken at face value, because that has been a "valid
path", technically speaking, for the lifetime of the job in question
-- yet the job still failed. If it helps, I'll attach debug output
below, though I wasn't able to glean any additional information from
it. Does anyone have a guess as to why this would happen so randomly
and infrequently, or happen in the first place?
This one is costing us big time because when a job fails, Globus
deletes all the output collected thus far, and these are large
batches of work.
Thanks!
Adam
[EMAIL PROTECTED]:/export/grid_files/260600020.09477316738932795> globusrun-ws -debug -status -j jobEPR.txt
=== REQUEST MESSAGE (length 816) (time 1206627634.506184000) ===
<ns00:Envelope
xmlns:ns00="http://schemas.xmlsoap.org/soap/
envelope/"><ns00:Header></
ns00:Header><ns00:Body><ns01:GetMultipleResourceProperties
xmlns:ns01="
http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-ResourceProperties-1.2-draft-01.xsd
"><ns01:ResourceProperty
xmlns:ns02="http://www.globus.org/namespaces/2004/10/gram/job/
types">ns02:state</ns01:ResourceProperty><ns01:ResourceProperty
xmlns:ns02="http://www.globus.org/namespaces/2004/10/gram/job/
types">ns02:holding</ns01:ResourceProperty><ns01:ResourceProperty
xmlns:ns03="http://www.globus.org/namespaces/2004/10/gram/job/
faults">ns03:fault</ns01:ResourceProperty><ns01:ResourceProperty
xmlns:ns02="http://www.globus.org/namespaces/2004/10/gram/job/types
">ns02:exitCode</ns01:ResourceProperty></
ns01:GetMultipleResourceProperties></ns00:Body></ns00:Envelope>
----------------------------------------------
=== RESPONSE MESSAGE (length 6399) (time 1206627634.546965000) ===
<soapenv:Envelope
xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="
http://www.w3.org/2001/XMLSchema-instance" xmlns:wsa="
http://schemas.xmlsoap.org/ws/2004/03/
addressing"><soapenv:Header><wsa:MessageID
soapenv:mustUnderstand="0">uuid:f6d14500-fc08-11dc-
a90a-8a2384ad991f</wsa:MessageID><wsa:To
soapenv:mustUnderstand="0">
http://schemas.xmlsoap.org/ws/2004/03/addressing/role/anonymous</
wsa:To><wsa:Action
soapenv:mustUnderstand="0">
http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-ResourceProperties/GetMultipleResourcePropertiesResponse
</wsa:Action><wsa:From
soapenv:mustUnderstand="0" xmlns:ns4="
http://www.globus.org/namespaces/2004/10/gram/job"><wsa:Address>
https://128.8.120.35:8443/wsrf/services/ManagedExecutableJobService</
wsa:Address><wsa:ReferenceProperties><ns4:ResourceID
xmlns:ns4="http://www.globus.org/namespaces/2004/10/gram/
job">a1253c20-fa72-11dc-a908-8a2384ad991f</ns4:ResourceID></
wsa:ReferenceProperties></wsa:From><wsa:RelatesTo
RelationshipType="wsa:Reply"
soapenv:mustUnderstand="0">uuid:f6d04560-fc08-11dc-
b962-000f1f66888a</wsa:RelatesTo></
soapenv:Header><soapenv:Body><GetMultipleResourcePropertiesResponse
xmlns="
http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-ResourceProperties-1.2-draft-01.xsd
"><ns1:state
xmlns:ns1="http://www.globus.org/namespaces/2004/10/gram/job/
types">Failed</ns1:state><ns2:holding
xmlns:ns2="http://www.globus.org/namespaces/2004/10/gram/job/
types">false</ns2:holding><ns3:fault
xmlns:ns3="http://www.globus.org/namespaces/2004/10/gram/job/
faults"><ns3:invalidPathFault><ns4:Timestamp
xmlns:ns4="
http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd
">2008-03-27T06:09:37.203Z</ns4:Timestamp><ns5:Originator xmlns:ns5="
http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd
"><wsa:Address>
https://128.8.120.35:8443/wsrf/services/ManagedJobFactoryService</
wsa:Address><wsa:ReferenceProperties><ns6:ResourceID
xmlns:ns6="http://www.globus.org/namespaces/2004/10/gram/
job">a1253c20-fa72-11dc-a908-8a2384ad991f</ns6:ResourceID></
wsa:ReferenceProperties><wsa:ReferenceParameters/></
ns5:Originator><ns7:Description
xmlns:ns7="
http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd
">Invalid
executable path
"/export/scratch/applications/a5671f0138bc65dc700001aa80a3f378/
hmmpfam".</ns7:Description><ns8:FaultCause
xmlns:ns8="
http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd
"><ns8:Timestamp>2008-03-27T06:09:37.203Z</
ns8:Timestamp><ns8:ErrorCode
dialect="http://www.globus.org/fault/stacktrace">
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(
NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
DelegatingConstructorAccessorImpl.java:27)
at
java.lang.reflect.Constructor.newInstance(Constructor.java:494)
at java.lang.Class.newInstance0(Class.java:350)
at java.lang.Class.newInstance(Class.java:303)
at
org.globus.exec.utils.FaultUtils.makeFault(FaultUtils.java:485)
at org.globus.exec.utils.FaultUtils.createInvalidPathFault(
FaultUtils.java:129)
at
org.globus.exec.service.exec.StateMachine.createFaultFromErrorCode(
StateMachine.java:3184)
at org.globus.exec.service.exec.StateMachine.processWaitingForStateChangesState(StateMachine.java:1652)
at sun.reflect.GeneratedMethodAccessor6202.invoke(Unknown
Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.globus.exec.service.exec.StateMachine.processState(
StateMachine.java:328)
at org.globus.exec.service.exec.RunThread.run(RunThread.java:85)
</ns8:ErrorCode><ns8:Description>
org.globus.exec.generated.InvalidPathFaultType</ns8:Description></
ns8:FaultCause><ns9:FaultCause
xmlns:ns9="
http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd
"><ns9:Timestamp>2008-03-27T06:09:37.203Z
</ns9:Timestamp><ns9:Description>ProcessDied</
ns9:Description><ns9:FaultCause><ns9:Timestamp>2008-03-27T06:09:
37.207Z</ns9:Timestamp><ns9:ErrorCode dialect="
http://www.globus.org/fault/stacktrace">java.lang.Exception:
ProcessDied
at
org.globus.exec.service.exec.StateMachine.createFaultFromErrorCode(
StateMachine.java:3127)
at org.globus.exec.service.exec.StateMachine.processWaitingForStateChangesState(StateMachine.java:1652)
at sun.reflect.GeneratedMethodAccessor6202.invoke(Unknown
Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.globus.exec.service.exec.StateMachine.processState(
StateMachine.java:328)
at org.globus.exec.service.exec.RunThread.run(RunThread.java:85)
</ns9:ErrorCode><ns9:Description>java.lang.Exception
</ns9:Description></
ns9:FaultCause><ns9:FaultCause><ns9:Timestamp>2008-03-27T06:09:
37.207Z</ns9:Timestamp><ns9:ErrorCode dialect="
http://www.globus.org/fault/stacktrace">
at
org.globus.wsrf.utils.FaultHelper.toBaseFault(FaultHelper.java
:282)
at
org.globus.exec.utils.FaultUtils.makeFault(FaultUtils.java:505)
at org.globus.exec.utils.FaultUtils.createInvalidPathFault(
FaultUtils.java:129)
at
org.globus.exec.service.exec.StateMachine.createFaultFromErrorCode(
StateMachine.java:3184)
at org.globus.exec.service.exec.StateMachine.processWaitingForStateChangesState(StateMachine.java:1652)
at sun.reflect.GeneratedMethodAccessor6202.invoke(Unknown
Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.globus.exec.service.exec.StateMachine.processState(
StateMachine.java:328)
at org.globus.exec.service.exec.RunThread.run(RunThread.java:85)
</
ns9:ErrorCode><ns9:Description>org.oasis.wsrf.faults.BaseFaultType</
ns9:Description></ns9:FaultCause></
ns9:FaultCause><ns3:stateWhenFailureOccurred>Active</
ns3:stateWhenFailureOccurred><ns3:command>submit</
ns3:command><ns3:gt2ErrorCode>5</
ns3:gt2ErrorCode><ns3:attribute>executable</ns3:attribute><ns3:path>/
export/scratch/applications/a5671f0138bc65dc700001aa80a3f378/
hmmpfam</ns3:path></ns3:invalidPathFault></ns3:fault><ns10:exitCode
xmlns:ns10="http://www.globus.org/namespaces/2004/10/gram/job/types
">5</ns10:exitCode></GetMultipleResourcePropertiesResponse></
soapenv:Body></soapenv:Envelope>
----------------------------------------------
Current job state: Failed
globusrun-ws: Job failed: Invalid executable path
"/export/scratch/applications/a5671f0138bc65dc700001aa80a3f378/hmmpfam".
ProcessDied
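The debug output above carries a gt2ErrorCode element holding 5, the same number this job's fault is built from. When sifting through many saved -debug dumps, that value can be pulled out with a small helper like the sketch below. It is a plain substring search, not an XML parser, and extract_gt2_error_code is a hypothetical name; it assumes the opening tag appears before the closing tag and that the digits follow the opening tag directly, as they do in the dump above.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the integer that follows the first "gt2ErrorCode>" in `text`,
 * or -1 if the tag or a number after it is not found. */
static int extract_gt2_error_code(const char *text)
{
    static const char tag[] = "gt2ErrorCode>";
    const char *p = strstr(text, tag);
    if (p == NULL) {
        return -1;                  /* no such element in this dump */
    }
    p += sizeof tag - 1;            /* step past the tag text */

    char *end;
    long value = strtol(p, &end, 10);
    if (end == p) {
        return -1;                  /* tag present but no digits follow */
    }
    return (int)value;
}
```

Matching on the tag name without its namespace prefix keeps the helper working when the prefix (ns3 in this dump) differs between responses.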