Thanks for the replies, guys.  I think there is an actual underlying job
failure (my fault for not finding it earlier) and that the error message is
a complete red herring.  Basically, we have a SEG (Scheduler Event
Generator) that we wrote from scratch for BOINC jobs, and whenever a job
fails, the following code gets executed:

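// job failed
time_t ts = time(NULL);  // event timestamp; was passed on uninitialized
// the last argument (5) is the failure code reported for this job
globus_scheduler_event_failed(ts, wu_name, 5);
// t is a struct tm * set outside this excerpt
fprintf(fp, "%s : Set %s to failed\n", asctime(t), wu_name);
break;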

Now, it's my guess that the "Invalid executable path" error generated before
is just some kind of default error message, mainly because I don't see any
place in our code where we specify the error message, unless that "5" as the
last argument to globus_scheduler_event_failed means something specific.
Does anyone know what that last argument means?  I looked quickly here (
http://www.globus.org/toolkit/docs/development/3.9.4/execution/wsgram/developer/scheduler-tutorial-seg.html)
but didn't see it.
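
For reference, here is a minimal, self-contained sketch of how that failure branch might look with the timestamp set before it is used. It is only an illustration under stated assumptions: report_failed_stub, struct failed_event, SEG_FAILURE_CODE and mark_job_failed are hypothetical names, and the stub merely stands in for globus_scheduler_event_failed, which the real SEG library provides. The one thing taken from the debug output quoted below is that the fault carries gt2ErrorCode 5 and exitCode 5, the same value we pass as the last argument.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical name for the literal 5 passed in the excerpt above. */
enum { SEG_FAILURE_CODE = 5 };

/* Records what the stub was last called with (illustration only). */
struct failed_event {
    time_t timestamp;
    char   job_id[64];
    int    failure_code;
};
static struct failed_event last_event;

/* Stand-in for globus_scheduler_event_failed(); it only stores its arguments. */
static int report_failed_stub(time_t ts, const char *job_id, int failure_code)
{
    assert(job_id != NULL);
    last_event.timestamp = ts;
    strncpy(last_event.job_id, job_id, sizeof last_event.job_id - 1);
    last_event.job_id[sizeof last_event.job_id - 1] = '\0';
    last_event.failure_code = failure_code;
    return 0;
}

/* Failure branch: set the timestamp first, report the event, then log it. */
static int mark_job_failed(FILE *fp, const char *wu_name)
{
    char stamp[32];
    time_t ts = time(NULL);
    struct tm *utc = gmtime(&ts);

    if (utc == NULL ||
        strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", utc) == 0) {
        return -1;
    }
    report_failed_stub(ts, wu_name, SEG_FAILURE_CODE);
    /* strftime output has no trailing newline, unlike asctime(). */
    fprintf(fp, "%s : Set %s to failed\n", stamp, wu_name);
    return 0;
}
```

Calling mark_job_failed(stdout, "wu_example") would write one log line and leave failure code 5 in last_event, which makes it easy to check what value actually gets handed to the reporting call.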

Thanks,
Adam


On Wed, Apr 2, 2008 at 9:54 AM, Stuart Martin <[EMAIL PROTECTED]> wrote:

> Adam,
>
> If we are dealing with a tmp dir, a cluster, and many jobs/lots of
> activity on the cluster, this could be a problem with the shared file
> system.  Maybe occasionally, the compute host where the job is run cannot
> see the tmp dir / executable?
>
> -Stu
>
>
> On Apr 1, 2008, at 11:41 PM, [EMAIL PROTECTED] wrote:
>
> > Hi Adam,
> >
> > I can't say much about it right now, but at first glance it looks
> > to me like the application is causing the problem. I should add that
> > I don't know BOINC and probably didn't understand all the details.
> >
> > Can you describe the role of hmmpfam a bit more:
> > * This is not the main executable, right?
> > * Is it called by the main executable somehow under certain conditions?
> > * If so: you said that hmmpfam actually should not be used at all.
> >   In what situation could the executable call hmmpfam (I love
> >   that word ... :-) )?
> >
> > Martin
> >
> >
> > > Hi,
> > >
> > > We are experiencing a strange problem that is causing jobs to fail,
> > > albeit somewhat randomly and infrequently.  At any given time, we may
> > > have 100 active GRAM jobs on a given resource.  All of these jobs
> > > submit fine and there are usually no immediate failures.  However,
> > > every so often, one will fail, and this can be days after it was
> > > submitted, with the following error:
> > >
> > > [EMAIL PROTECTED]:/export/grid_files/260600020.09477316738932795> globusrun-ws -status -j jobEPR.txt
> > > Current job state: Failed
> > > globusrun-ws: Job failed: Invalid executable path
> > > "/export/scratch/applications/a5671f0138bc65dc700001aa80a3f378/hmmpfam".
> > > ProcessDied
> > >
> > > Now, the resource we are submitting to is unique in that we are not
> > > actually transferring in the hmmpfam executable; that is just a dummy
> > > path, and our custom BOINC job manager does not attempt to make use of
> > > it, as BOINC executables live elsewhere.  So until recently, the
> > > executable specified on that path never existed, and so the error
> > > *kinda* made sense; what didn't make sense is why it happened
> > > randomly.  In an attempt to make this problem go away, I now have the
> > > BOINC job manager create a dummy executable on that path when the job
> > > is submitted, but it doesn't look like that has helped because the
> > > error is still popping up.  *Now* the error message certainly doesn't
> > > make sense if taken at face value, because that has been a "valid
> > > path", technically speaking, for the lifetime of the job in question
> > > -- yet the job still failed.  If it helps, I'll attach debug output
> > > below, though I wasn't able to glean any additional information from
> > > it.  Does anyone have a guess as to why this would happen so randomly
> > > and infrequently, or happen in the first place?
> > >
> > > This one is costing us big time because when a job fails, Globus
> > > deletes all the output collected thus far, and these are large batches
> > > of work.
> > >
> > > Thanks!
> > > Adam
> > >
> > > [EMAIL PROTECTED]:/export/grid_files/260600020.09477316738932795> globusrun-ws -debug -status -j jobEPR.txt
> > >
> > > === REQUEST MESSAGE (length 816) (time 1206627634.506184000) ===
> > >
> > <ns00:Envelope
> >
> > > xmlns:ns00="http://schemas.xmlsoap.org/soap/
> > > envelope/"><ns00:Header></ns00:Header><ns00:Body><ns01:GetMultipleResourceProperties
> > >
> > xmlns:ns01="
> >
> > >
> > > http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-ResourceProperties-1.2-draft-01.xsd
> > > "><ns01:ResourceProperty
> > >
> > xmlns:ns02="http://www.globus.org/namespaces/2004/10/gram/job/
> > types">ns02:state</ns01:ResourceProperty><ns01:ResourceProperty
> > xmlns:ns02="http://www.globus.org/namespaces/2004/10/gram/job/
> > types">ns02:holding</ns01:ResourceProperty><ns01:ResourceProperty
> > xmlns:ns03="http://www.globus.org/namespaces/2004/10/gram/job/
> > faults">ns03:fault</ns01:ResourceProperty><ns01:ResourceProperty
> > xmlns:ns02="http://www.globus.org/namespaces/2004/10/gram/job/types
> >
> > ">ns02:exitCode</ns01:ResourceProperty></ns01:GetMultipleResourceProperties></ns00:Body></ns00:Envelope>
> > ----------------------------------------------
> >
> > >
> > > === RESPONSE MESSAGE (length 6399) (time 1206627634.546965000) ===
> > >
> > <soapenv:Envelope
> >
> > > xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/";
> > > xmlns:xsd="http://www.w3.org/2001/XMLSchema"; xmlns:xsi="
> > > http://www.w3.org/2001/XMLSchema-instance"; xmlns:wsa="
> > > http://schemas.xmlsoap.org/ws/2004/03/
> > > addressing"><soapenv:Header><wsa:MessageID
> > >
> >
> > soapenv:mustUnderstand="0">uuid:f6d14500-fc08-11dc-a90a-8a2384ad991f</wsa:MessageID><wsa:To
> > soapenv:mustUnderstand="0">
> >
> > > http://schemas.xmlsoap.org/ws/2004/03/addressing/role/anonymous
> > > </wsa:To><wsa:Action
> > >
> > soapenv:mustUnderstand="0">
> >
> > >
> > > http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-ResourceProperties/GetMultipleResourcePropertiesResponse
> > > </wsa:Action><wsa:From
> > >
> > soapenv:mustUnderstand="0" xmlns:ns4="
> >
> > > http://www.globus.org/namespaces/2004/10/gram/job";><wsa:Address>
> > >
> > https://128.8.120.35:8443/wsrf/services/ManagedExecutableJobService
> > </wsa:Address><wsa:ReferenceProperties><ns4:ResourceID
> > xmlns:ns4="http://www.globus.org/namespaces/2004/10/gram/
> > job">a1253c20-fa72-11dc-a908-8a2384ad991f</ns4:ResourceID></wsa:ReferenceProperties></wsa:From><wsa:RelatesTo
> > RelationshipType="wsa:Reply"
> >
> > >
> > > soapenv:mustUnderstand="0">uuid:f6d04560-fc08-11dc-b962-000f1f66888a</wsa:RelatesTo></soapenv:Header><soapenv:Body><GetMultipleResourcePropertiesResponse
> > >
> > xmlns="
> >
> > >
> > > http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-ResourceProperties-1.2-draft-01.xsd
> > > "><ns1:state
> > >
> > xmlns:ns1="http://www.globus.org/namespaces/2004/10/gram/job/
> > types">Failed</ns1:state><ns2:holding
> > xmlns:ns2="http://www.globus.org/namespaces/2004/10/gram/job/
> > types">false</ns2:holding><ns3:fault
> > xmlns:ns3="http://www.globus.org/namespaces/2004/10/gram/job/
> > faults"><ns3:invalidPathFault><ns4:Timestamp
> > xmlns:ns4="
> >
> > >
> > > http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd
> > >
> > ">2008-03-27T06:09:37.203Z</ns4:Timestamp><ns5:Originator xmlns:ns5="
> >
> > http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd
> > "><wsa:Address>
> >
> > > https://128.8.120.35:8443/wsrf/services/ManagedJobFactoryService
> > > </wsa:Address><wsa:ReferenceProperties><ns6:ResourceID
> > >
> > xmlns:ns6="http://www.globus.org/namespaces/2004/10/gram/
> > job">a1253c20-fa72-11dc-a908-8a2384ad991f</ns6:ResourceID></wsa:ReferenceProperties><wsa:ReferenceParameters/></ns5:Originator><ns7:Description
> > xmlns:ns7="
> >
> > >
> > > http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd
> > > ">Invalid
> > >
> > executable path
> >
> > >
> > > &quot;/export/scratch/applications/a5671f0138bc65dc700001aa80a3f378/hmmpfam&quot;.</ns7:Description><ns8:FaultCause
> > >
> > xmlns:ns8="
> >
> > >
> > > http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd
> > >
> > "><ns8:Timestamp>2008-03-27T06:09:37.203Z</ns8:Timestamp><ns8:ErrorCode
> > dialect="http://www.globus.org/fault/stacktrace";>
> >
> > >       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > Method)
> > >       at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> > > NativeConstructorAccessorImpl.java:39)
> > >       at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> > > DelegatingConstructorAccessorImpl.java:27)
> > >       at
> > >
> > java.lang.reflect.Constructor.newInstance(Constructor.java:494)
> >
> > >       at java.lang.Class.newInstance0(Class.java:350)
> > >       at java.lang.Class.newInstance(Class.java:303)
> > >       at
> > >
> > org.globus.exec.utils.FaultUtils.makeFault(FaultUtils.java:485)
> >
> > >       at org.globus.exec.utils.FaultUtils.createInvalidPathFault(
> > > FaultUtils.java:129)
> > >       at
> > > org.globus.exec.service.exec.StateMachine.createFaultFromErrorCode(
> > >
> > StateMachine.java:3184)
> >
> > >       at
> > >
> > > org.globus.exec.service.exec.StateMachine.processWaitingForStateChangesState
> > >
> > (StateMachine.java:1652)
> >
> > >       at sun.reflect.GeneratedMethodAccessor6202.invoke(Unknown
> > >
> > Source)
> >
> > >       at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> > > DelegatingMethodAccessorImpl.java:25)
> > >       at java.lang.reflect.Method.invoke(Method.java:585)
> > >       at org.globus.exec.service.exec.StateMachine.processState(
> > > StateMachine.java:328)
> > >       at org.globus.exec.service.exec.RunThread.run(RunThread.java:85)
> > > </ns8:ErrorCode><ns8:Description>
> > >
> > > org.globus.exec.generated.InvalidPathFaultType</ns8:Description></ns8:FaultCause><ns9:FaultCause
> > >
> > xmlns:ns9="
> >
> > >
> > > http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd
> > >
> > "><ns9:Timestamp>2008-03-27T06:09:37.203Z
> >
> > >
> > > </ns9:Timestamp><ns9:Description>ProcessDied</ns9:Description><ns9:FaultCause><ns9:Timestamp>2008-03-27T06:09:
> > >
> > 37.207Z</ns9:Timestamp><ns9:ErrorCode dialect="
> >
> > > http://www.globus.org/fault/stacktrace";>java.lang.Exception:
> > > ProcessDied
> > >       at
> > > org.globus.exec.service.exec.StateMachine.createFaultFromErrorCode(
> > >
> > StateMachine.java:3127)
> >
> > >       at
> > >
> > > org.globus.exec.service.exec.StateMachine.processWaitingForStateChangesState
> > >
> > (StateMachine.java:1652)
> >
> > >       at sun.reflect.GeneratedMethodAccessor6202.invoke(Unknown
> > >
> > Source)
> >
> > >       at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> > > DelegatingMethodAccessorImpl.java:25)
> > >       at java.lang.reflect.Method.invoke(Method.java:585)
> > >       at org.globus.exec.service.exec.StateMachine.processState(
> > > StateMachine.java:328)
> > >       at org.globus.exec.service.exec.RunThread.run(RunThread.java:85)
> > > </ns9:ErrorCode><ns9:Description>java.lang.Exception
> > >
> > > </ns9:Description></ns9:FaultCause><ns9:FaultCause><ns9:Timestamp>2008-03-27T06:09:
> > >
> > 37.207Z</ns9:Timestamp><ns9:ErrorCode dialect="
> >
> > > http://www.globus.org/fault/stacktrace";>
> > >       at
> > >
> > org.globus.wsrf.utils.FaultHelper.toBaseFault(FaultHelper.java
> >
> > > :282)
> > >       at
> > >
> > org.globus.exec.utils.FaultUtils.makeFault(FaultUtils.java:505)
> >
> > >       at org.globus.exec.utils.FaultUtils.createInvalidPathFault(
> > > FaultUtils.java:129)
> > >       at
> > > org.globus.exec.service.exec.StateMachine.createFaultFromErrorCode(
> > >
> > StateMachine.java:3184)
> >
> > >       at
> > >
> > > org.globus.exec.service.exec.StateMachine.processWaitingForStateChangesState
> > >
> > (StateMachine.java:1652)
> >
> > >       at sun.reflect.GeneratedMethodAccessor6202.invoke(Unknown
> > >
> > Source)
> >
> > >       at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> > > DelegatingMethodAccessorImpl.java:25)
> > >       at java.lang.reflect.Method.invoke(Method.java:585)
> > >       at org.globus.exec.service.exec.StateMachine.processState(
> > > StateMachine.java:328)
> > >       at org.globus.exec.service.exec.RunThread.run(RunThread.java:85)
> > >
> > > </ns9:ErrorCode><ns9:Description>org.oasis.wsrf.faults.BaseFaultType</ns9:Description></ns9:FaultCause></ns9:FaultCause><ns3:stateWhenFailureOccurred>Active</ns3:stateWhenFailureOccurred><ns3:command>submit</ns3:command><ns3:gt2ErrorCode>5</ns3:gt2ErrorCode><ns3:attribute>executable</ns3:attribute><ns3:path>/export/scratch/applications/a5671f0138bc65dc700001aa80a3f378/hmmpfam</ns3:path></ns3:invalidPathFault></ns3:fault><ns10:exitCode
> > >
> > xmlns:ns10="http://www.globus.org/namespaces/2004/10/gram/job/types
> >
> > ">5</ns10:exitCode></GetMultipleResourcePropertiesResponse></soapenv:Body></soapenv:Envelope>
> > ----------------------------------------------
> >
> > > Current job state: Failed
> > > globusrun-ws: Job failed: Invalid executable path
> > > "/export/scratch/applications/a5671f0138bc65dc700001aa80a3f378/hmmpfam".
> > > ProcessDied
> > >
> > >
> >
> >
> >
> >
>
