Ok, I think I see it now. You are hitting a combination of generous
locking and a potential infinite loop in which your container
happily cycles.

This situation can happen if your job tries to fetch a non-existing
credential (probably destroyed earlier) from the delegation service.
Because the credential no longer exists, the job then tries to delete
the user proxy file that was created from that credential earlier. That
file does not exist either, because it was probably deleted when the
credential was destroyed.
Not a completely uncommon situation, I guess, and we handle it badly.

I'll have to check how this is best fixed. The fix should
then also find its way into the VDT. I'll open a bug for that.

A quick fix to keep you going:
Replace

    -delete)
        # proxyfile should exist
        exec rm "$PROXYFILE"
        exit $?
        ;;

by

    -delete)
        if [ -e "$PROXYFILE" ]; then
            exec rm "$PROXYFILE"
            exit $?
        else
            exit 0
        fi
        ;;

in $GLOBUS_LOCATION/libexec/globus-gram-local-proxy-tool

(A patch would have been nicer, but I don't know whether our versions
of that file are the same.)

I'm quite sure that this solves your problem. Please let me know.

Martin


Yuriy wrote:
A process per thread, what operating system is that?

CentOS 4.4 paravirtualized.

Do you see a thread-dump in the container log after stopping the container?

Thread dump seemed to appear in container.log instead of
container-real.log

Regards,
Yuriy

On Mon, Aug 25, 2008 at 11:55:26PM -0500, Martin Feller wrote:
Ok, unfortunately the extra logging did not provide extra insight.

After the following line, processing for the job stops:

2008-08-26 16:21:23,909 DEBUG utils.DelegatedCredential [RunQueueThread_13,getDelegatedCredential:116] checking for existing credential listener

I don't see the thread-dump in the container log.
A process per thread, what operating system is that?
I once debugged Gram on such a system, and the thread-dump didn't show up in
the container log either, but was only printed when the JVM went down.
Did you stop the container before you grabbed the logfile?

So, sorry for that; we need to retry, but this time please stop the
container after the kill -QUIT, before you grab the logfile:

Same as always:
 * stop the container
 * clean up the persistence directory
 * put your persisted job in place
 * start the container (keep the additional debugging, it does not hurt)
 * submit a job
 * when it hangs, do the kill -QUIT
 * stop the container

Do you see a thread-dump in the container log after stopping the container?
If so, please send.
If not, ... I hope you'll see it.

Martin


Yuriy wrote:
Ok, new log attached. I had 168 java processes running, all with the
same command line:

/opt/vdt/jdk1.5/bin/java
-Dlog4j.configuration=container-log4j.properties -Xmx512M
-Dorg.globus.wsrf.container.persistence.dir=/opt/vdt/vdt-app-data/globus/persisted
-DGLOBUS_LOCATION=/opt/vdt/globus
-Djava.endorsed.dirs=/opt/vdt/globus/endorsed
-DX509_CERT_DIR=/opt/vdt/globus/TRUSTED_CA
-DGLOBUS_TCP_PORT_RANGE=40000,41000
-Djava.security.egd=file:///dev/urandom -classpath
/opt/vdt/globus/lib/bootstrap.jar:/opt/vdt/globus/lib/cog-url.jar:/opt/vdt/globus/lib/axis-url.jar
org.globus.bootstrap.Bootstrap
org.globus.wsrf.container.ServiceContainer -p 8443

Not sure which one is the container id, so I executed

pkill -QUIT java


Regards,
Yuriy


On Mon, Aug 25, 2008 at 10:48:05PM -0500, Martin Feller wrote:
Hm, processing seems to stop all of a sudden.

I think I need more information:
Please stop the container and add the following line to
$GLOBUS_LOCATION/container-log4j.properties

    log4j.category.org.globus=DEBUG

Then put your problematic persistence data in place, restart the container and
submit a job. If the job hangs again, please create a thread-dump of the 
container
by calling

    kill -QUIT <containerProcessId>

and send the container logfile once more.

Thanks, Martin

Yuriy wrote:
Sorry, last time I edited log4j.properties instead of container-log4j.properties. New log attached.

On Mon, Aug 25, 2008 at 08:35:35AM -0500, Martin Feller wrote:
Gram was not started in debug mode, as far as I can tell from the
logfile you attached. I can't see much from that log.

Are you sure you have a line like this in
$GLOBUS_LOCATION/container-log4j.properties?

    log4j.category.org.globus.exec.service=DEBUG

(no # at the beginning of the line)

Looks like this is the GT from the VDT. I'm not 100% sure whether you
enable debug logging for GRAM in a different way there, but I don't
think so. Check with the admins if you are not sure.
As an example I attached a container log with one persisted job in the
persistence directory. That's how it should look if GRAM has debug
logging enabled.
Please retry and send the log again.

Martin

Yuriy wrote:
On Fri, Aug 22, 2008 at 01:40:00PM -0500, Martin Feller wrote:
Please try the following:

1. In the situation when the job hangs:
   How about submitting a job in batch mode (globusrun-ws -submit -b -o job.epr ...)
   and querying for the job status instead of listening for notifications
   (globusrun-ws -status -j job.epr)?
   Does the job status change after a while? (I don't expect it, but just to make sure.)

No, still "unsubmitted"


2. Shut down the container, enable debug logging in Gram4
   (uncomment # log4j.category.org.globus.exec.service=DEBUG in
   $GLOBUS_LOCATION/container-log4j.properties), clean up the persistence
   directory, move the problematic persisted job into the persistence
   directory, start the container, and submit a job.
   Please send the container logfile then.

Log file attached. I had to increase the termination time of that job
to the 26th; otherwise the file is silently removed and jobs can be
submitted as usual.
Regards,
Yuriy




Thanks, Martin


Yuriy wrote:
Hi,

I am having very strange problems with globus GRAM.


Submission of a job with globusrun-ws hangs at the "Unsubmitted" state.
I tried to submit a job from two different machines with the same
result.

globusrun-ws -submit -J -S -F ng2.auckland.ac.nz:8443 -Ft Fork -o test.epr -c /bin/echo "hello"
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:6eeadb2c-6ffa-11dd-a2f7-00163e000005
Termination time: 08/23/2008 03:28 GMT
Current job state: Unsubmitted


Sample java program (attached) and CoG client
(cog-job-submit) work normally.


A Globus restart does not help unless I remove the persisted
directory. The persisted directory is on a local partition. I figured
out that a single file in ManagedExecutableJobResourceState causes the
problem (xml attached). When I remove this file and restart Globus,
globusrun-ws works normally. When I copy this file back into
persisted/ManagedExecutableJobResourceState and restart Globus, it
breaks again. My Globus breaks every 3-7 days, so there are other job
resources that cause this problem.
The Globus version is 4.0.7 from VDT 1.10.

What is going on here?

Regards,
Yuriy

