> I'm quite sure that this solves your problem. Please let me know.

Yes, it does. Thank you.
Regards,
Yuriy

On Wed, Aug 27, 2008 at 08:55:05AM -0500, Martin Feller wrote:
> Ok, I think I see it now. You are hitting a combination of generous
> locking and a potential for an infinite loop in which your container
> happily cycles.
>
> This situation can happen if your job wants to fetch a non-existing
> credential (probably destroyed earlier) from the delegation service
> and then, because the credential does not exist anymore, tries to
> delete the user proxy file created from that credential earlier, which
> does not exist either, because it was probably deleted when the
> credential was destroyed.
> A not completely uncommon situation, I guess, and we handle that badly.
>
> I'll have to check how this should best be fixed. The fix should
> then also find its way into the VDT. I'll open a bug for that.
>
> A quick fix to keep you going is:
> Replace
>
>     -delete)
>         # proxyfile should exist
>         exec rm "$PROXYFILE"
>         exit $?
>         ;;
>
> with
>
>     -delete)
>         if [ -e "$PROXYFILE" ]; then
>             exec rm "$PROXYFILE"
>             exit $?
>         else
>             exit 0
>         fi
>         ;;
>
> in $GLOBUS_LOCATION/libexec/globus-gram-local-proxy-tool
>
> (A patch would have been nicer, but I don't know if our versions of
> that file are the same.)
>
> I'm quite sure that this solves your problem. Please let me know.
>
> Martin
>
>
> Yuriy wrote:
>>> A process per thread, what operating system is that?
>>
>> CentOS 4.4, paravirtualized.
>>
>>> Do you see a thread-dump in the container log after stopping the container?
>>
>> The thread dump seemed to appear in container.log instead of
>> container-real.log.
>>
>> Regards,
>> Yuriy
>>
>> On Mon, Aug 25, 2008 at 11:55:26PM -0500, Martin Feller wrote:
>>> Ok, unfortunately the plus of logging did not provide a plus of insight.
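Martin's quick fix above, lifted out of the quoting into a self-contained sketch. `delete_proxy` is a made-up wrapper name, and the real script uses `exec rm` (which replaces the shell); plain `rm` is used here so the function can be called more than once:

```shell
# Guarded delete: succeed when the proxy file is already gone, instead
# of failing and sending the container into its retry loop.
delete_proxy() {
    if [ -e "$PROXYFILE" ]; then
        rm "$PROXYFILE"
    else
        # Credential was already destroyed, so the proxy file is
        # already gone: nothing to do, report success.
        return 0
    fi
}
```

The whole point of the fix is the `else` branch: the old code unconditionally ran `rm` and propagated its failure, which kept the job cycling.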
>>>
>>> After the following line, processing for the job stops:
>>>
>>> 2008-08-26 16:21:23,909 DEBUG utils.DelegatedCredential
>>> [RunQueueThread_13,getDelegatedCredential:116] checking for existing
>>> credential listener
>>>
>>> I don't see the thread-dump in the container log.
>>> A process per thread, what operating system is that?
>>> I once debugged GRAM on such a system, and the thread-dump didn't show up in
>>> the container log either, but was only printed when the JVM went down.
>>> Did you stop the container before you grabbed the logfile?
>>>
>>> So, sorry for that, we need to retry, but this time please
>>> stop the container after the kill -QUIT, before you grab the logfile.
>>>
>>> Same as always:
>>> * stop the container
>>> * clean up the persistence directory
>>> * put your persisted job in place
>>> * start the container (keep the additional debugging, it does not hurt)
>>> * submit a job
>>> * when it hangs, do the kill -QUIT
>>> * stop the container
>>>
>>> Do you see a thread-dump in the container log after stopping the container?
>>> If so, please send it.
>>> If not, ... I hope you'll see it.
>>>
>>> Martin
>>>
>>>
>>> Yuriy wrote:
>>>> Ok, new log attached.
>>>> I had 168 java processes running, all with the
>>>> same command line:
>>>>
>>>> /opt/vdt/jdk1.5/bin/java
>>>> -Dlog4j.configuration=container-log4j.properties -Xmx512M
>>>> -Dorg.globus.wsrf.container.persistence.dir=/opt/vdt/vdt-app-data/globus/persisted
>>>> -DGLOBUS_LOCATION=/opt/vdt/globus
>>>> -Djava.endorsed.dirs=/opt/vdt/globus/endorsed
>>>> -DX509_CERT_DIR=/opt/vdt/globus/TRUSTED_CA
>>>> -DGLOBUS_TCP_PORT_RANGE=40000,41000
>>>> -Djava.security.egd=file:///dev/urandom -classpath
>>>> /opt/vdt/globus/lib/bootstrap.jar:/opt/vdt/globus/lib/cog-url.jar:/opt/vdt/globus/lib/axis-url.jar
>>>> org.globus.bootstrap.Bootstrap
>>>> org.globus.wsrf.container.ServiceContainer -p 8443
>>>>
>>>> I was not sure which one was the container's id, so I executed
>>>>
>>>> pkill -QUIT java
>>>>
>>>>
>>>> Regards,
>>>> Yuriy
>>>>
>>>>
>>>> On Mon, Aug 25, 2008 at 10:48:05PM -0500, Martin Feller wrote:
>>>>> Hm, processing all of a sudden seems to stop.
>>>>>
>>>>> I need more information, I think:
>>>>> Please stop the container and add the following line to
>>>>> $GLOBUS_LOCATION/container-log4j.properties:
>>>>>
>>>>> log4j.category.org.globus=DEBUG
>>>>>
>>>>> Then put your problematic persistence data in place, restart the
>>>>> container, and submit a job. If the job hangs again, please create
>>>>> a thread-dump of the container by calling
>>>>>
>>>>> kill -QUIT <containerProcessId>
>>>>>
>>>>> and send the container logfile once more.
>>>>>
>>>>> Thanks, Martin
>>>>>
>>>>> Yuriy wrote:
>>>>>> Sorry, last time I edited log4j.properties instead of
>>>>>> container-log4j.properties. New log attached.
>>>>>>
>>>>>> On Mon, Aug 25, 2008 at 08:35:35AM -0500, Martin Feller wrote:
>>>>>>> GRAM was not started in debug mode, as far as I can tell from the
>>>>>>> logfile you attached.
>>>>>>> I can't see much from that log.
>>>>>>>
>>>>>>> Are you sure you have a line like this in
>>>>>>> $GLOBUS_LOCATION/container-log4j.properties?
>>>>>>>
>>>>>>> log4j.category.org.globus.exec.service=DEBUG
>>>>>>>
>>>>>>> (no # at the beginning of the line)
>>>>>>>
>>>>>>> It looks like this is GT from the VDT. I'm not 100% sure whether the
>>>>>>> VDT enables debug logging for GRAM in a different way, but I don't
>>>>>>> think so. Check with the admins if you are not sure about that.
>>>>>>> As an example, I attached a container log with 1 persisted job in
>>>>>>> the persistence directory. That's how it should look if GRAM has
>>>>>>> debug logging enabled.
>>>>>>> Please retry and send the log again.
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>>> Yuriy wrote:
>>>>>>>> On Fri, Aug 22, 2008 at 01:40:00PM -0500, Martin Feller wrote:
>>>>>>>>> Please try the following:
>>>>>>>>>
>>>>>>>>> 1. In the situation when the job hangs:
>>>>>>>>> How about submitting a job in batch mode (globusrun-ws -submit -b
>>>>>>>>> -o job.epr ...) and querying for the job status instead of
>>>>>>>>> listening for notifications (globusrun-ws -status -j job.epr)?
>>>>>>>>> Does the job status change after a while? (I don't expect it, but
>>>>>>>>> just to make sure.)
>>>>>>>>
>>>>>>>> No, still "Unsubmitted".
>>>>>>>>
>>>>>>>>> 2. Shut down the container, enable debug logging in GRAM4
>>>>>>>>> (uncomment # log4j.category.org.globus.exec.service=DEBUG in
>>>>>>>>> $GLOBUS_LOCATION/container-log4j.properties), clean up the
>>>>>>>>> persistence directory, move the problematic persisted job into
>>>>>>>>> the persistence directory, start the container, and submit a job.
>>>>>>>>> Please send the container logfile then.
>>>>>>>>
>>>>>>>> Log file attached. I had to increase the termination time of that
>>>>>>>> job to the 26th; otherwise the file is silently removed and jobs
>>>>>>>> can be submitted as usual.
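The "is the property present and uncommented" check that comes up repeatedly in this thread can be scripted. A small sketch; `gram_debug_enabled` is a hypothetical helper name, while the property name and file path are as quoted above:

```shell
# Succeed only if the GRAM debug line exists in the given log4j
# properties file and is not commented out with a leading '#'.
gram_debug_enabled() {
    grep -q '^log4j\.category\.org\.globus\.exec\.service=DEBUG' "$1"
}

# Typical use:
#   gram_debug_enabled "$GLOBUS_LOCATION/container-log4j.properties" \
#       && echo "GRAM debug logging on" || echo "GRAM debug logging off"
```

Anchoring the pattern at the start of the line is what distinguishes the active property from the shipped, commented-out `#log4j.category...` form.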
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Yuriy
>>>>>>>>
>>>>>>>>> Thanks, Martin
>>>>>>>>>
>>>>>>>>> Yuriy wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am having very strange problems with Globus GRAM.
>>>>>>>>>>
>>>>>>>>>> Submission of a job with globusrun-ws hangs at the "Unsubmitted"
>>>>>>>>>> job state. I tried to submit a job from two different machines,
>>>>>>>>>> with the same result:
>>>>>>>>>>
>>>>>>>>>> globusrun-ws -submit -J -S -F ng2.auckland.ac.nz:8443 -Ft Fork \
>>>>>>>>>>     -o test.epr -c /bin/echo "hello"
>>>>>>>>>> Delegating user credentials...Done.
>>>>>>>>>> Submitting job...Done.
>>>>>>>>>> Job ID: uuid:6eeadb2c-6ffa-11dd-a2f7-00163e000005
>>>>>>>>>> Termination time: 08/23/2008 03:28 GMT
>>>>>>>>>> Current job state: Unsubmitted
>>>>>>>>>>
>>>>>>>>>> A sample java program (attached) and the CoG client
>>>>>>>>>> (cog-job-submit) work normally.
>>>>>>>>>>
>>>>>>>>>> A Globus restart does not help, unless I remove the persisted
>>>>>>>>>> directory (persisted is on a local partition). I figured out that
>>>>>>>>>> a single file of type ManagedExecutableJobResourceStateType causes
>>>>>>>>>> the problem (xml attached). When I remove this file and restart
>>>>>>>>>> Globus, globusrun-ws works normally. When I copy this file back
>>>>>>>>>> into persisted/ManagedExecutableJobResourceState and restart
>>>>>>>>>> Globus, it breaks again. My Globus breaks every 3-7 days, so there
>>>>>>>>>> are other job resources that cause this problem.
>>>>>>>>>>
>>>>>>>>>> The Globus version is 4.0.7, from VDT 1.10.
>>>>>>>>>>
>>>>>>>>>> What is going on here?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Yuriy
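Yuriy's bisection in the original report (remove one file under persisted/ManagedExecutableJobResourceState, restart, retest) is a little safer if the state files are moved aside rather than deleted, so any one of them can be copied back for retesting. A sketch under that assumption; `quarantine_persisted_jobs` is a made-up helper, and the directory layout is taken from the thread:

```shell
# Move all persisted job-state files into a quarantine directory so the
# container starts clean; files can then be restored one at a time to
# find the one that hangs the container.
quarantine_persisted_jobs() {
    persist_dir=$1     # e.g. /opt/vdt/vdt-app-data/globus/persisted
    quarantine_dir=$2
    mkdir -p "$quarantine_dir"
    for f in "$persist_dir"/ManagedExecutableJobResourceState/*; do
        [ -e "$f" ] || continue   # glob matched nothing
        mv "$f" "$quarantine_dir"/
    done
}
```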

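On the 168 identical java processes earlier in the thread: instead of `pkill -QUIT java`, the container JVM can be picked out by the distinctive `org.globus.wsrf.container.ServiceContainer` token in its command line. A sketch; `find_pid` is a hypothetical helper built on `pgrep -f`:

```shell
# Print the PID of the first process whose full command line matches
# the given pattern (pgrep -f matches against the whole command line,
# not just the executable name).
find_pid() {
    pgrep -f "$1" | head -n 1
}

# Intended use, based on the command line shown in the thread:
#   kill -QUIT "$(find_pid 'org.globus.wsrf.container.ServiceContainer')"
```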