Ok, I think I see it now. You are hitting a combination of generous
locking and the potential for an infinite loop in which your container
happily cycles.
This situation can happen if your job tries to fetch a non-existing
credential (probably destroyed earlier) from the delegation service.
Because the credential no longer exists, the job then tries to delete
the user proxy file that was created from that credential earlier, and
that file does not exist either, because it was probably deleted when
the credential was destroyed.
A not completely uncommon situation, I guess, and we handle it badly.
I'll have to check how this is best fixed. The fix should then also
find its way into the VDT; I'll open a bug for that.
A quick fix to keep you going is:
Replace
-delete)
    # proxyfile should exist
    exec rm "$PROXYFILE"
    exit $?
    ;;
by
-delete)
    if [ -e "$PROXYFILE" ]; then
        exec rm "$PROXYFILE"
        exit $?
    else
        exit 0
    fi
    ;;
in $GLOBUS_LOCATION/libexec/globus-gram-local-proxy-tool
(A patch would have been nicer, but I don't know if our versions of
that file are the same.)
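As a standalone sketch of why the guard matters (the path below is hypothetical, chosen so that it does not exist): rm on an already-deleted file fails, and in the cycle described above that failure keeps the delete step from ever succeeding.

```shell
# Hypothetical path; deliberately does not exist.
PROXYFILE="/tmp/no-such-proxy.$$"

rm "$PROXYFILE" 2>/dev/null
echo "unguarded delete: $?"        # non-zero: rm failed on a missing file

if [ -e "$PROXYFILE" ]; then
    rm "$PROXYFILE"
else
    :                              # already gone: nothing left to do
fi
echo "guarded delete: $?"          # 0: 'already deleted' counts as success
```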
I'm quite sure that this solves your problem. Please let me know.
Martin
Yuriy wrote:
A process per thread, what operating system is that?
CentOS 4.4, paravirtualized.
Do you see a thread-dump in the container log after stopping the container?
The thread dump seemed to appear in container.log instead of
container-real.log.
Regards,
Yuriy
On Mon, Aug 25, 2008 at 11:55:26PM -0500, Martin Feller wrote:
Ok, unfortunately the additional logging did not provide additional insight.
After the following line, the processing for the job stops:
2008-08-26 16:21:23,909 DEBUG utils.DelegatedCredential
[RunQueueThread_13,getDelegatedCredential:116] checking for existing credential
listener
I don't see the thread-dump in the container log.
A process per thread, what operating system is that?
I once debugged GRAM on such a system, and the thread-dump didn't show up in
the container log either; it was only printed when the JVM went down.
Did you stop the container before you grabbed the logfile?
So, sorry for that, but we need to retry. This time, please stop the
container after the kill -QUIT and before you grab the logfile:
Same as always:
* stop the container
* clean up the persistence directory
* put your persisted job in place
* start the container (keep the additional debugging, it does not hurt)
* submit a job
* when it hangs do the kill -QUIT
* stop the container
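The steps above can be sketched as a shell session. The paths and the globus-start-container/globus-stop-container helper scripts are assumptions based on the VDT layout seen elsewhere in this thread, and saved-job.xml is a placeholder name; adjust to your installation.

```shell
# Sketch only: guarded so it is a no-op on machines without a container.
PERSISTED=${PERSISTED:-/opt/vdt/vdt-app-data/globus/persisted}
STATE_DIR="$PERSISTED/ManagedExecutableJobResourceState"

if command -v globus-stop-container >/dev/null 2>&1; then
    globus-stop-container                    # 1. stop the container
    rm -rf "${PERSISTED:?}"/*                # 2. clean up the persistence directory
    cp saved-job.xml "$STATE_DIR/"           # 3. put the persisted job in place
    globus-start-container &                 # 4. start (extra debugging stays on)
    # 5. submit a job; 6. when it hangs, request a thread dump:
    kill -QUIT "$(pgrep -f ServiceContainer)"
    globus-stop-container                    # 7. stop, then grab the container log
fi
```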
Do you see a thread-dump in the container log after stopping the container?
If so, please send.
If not, ... I hope you'll see it
Martin
Yuriy wrote:
Ok, new log attached. I had 168 java processes running, all with the
same command line:
/opt/vdt/jdk1.5/bin/java
-Dlog4j.configuration=container-log4j.properties -Xmx512M
-Dorg.globus.wsrf.container.persistence.dir=/opt/vdt/vdt-app-data/globus/persisted
-DGLOBUS_LOCATION=/opt/vdt/globus
-Djava.endorsed.dirs=/opt/vdt/globus/endorsed
-DX509_CERT_DIR=/opt/vdt/globus/TRUSTED_CA
-DGLOBUS_TCP_PORT_RANGE=40000,41000
-Djava.security.egd=file:///dev/urandom -classpath
/opt/vdt/globus/lib/bootstrap.jar:/opt/vdt/globus/lib/cog-url.jar:/opt/vdt/globus/lib/axis-url.jar
org.globus.bootstrap.Bootstrap
org.globus.wsrf.container.ServiceContainer -p 8443
Not sure which one was the container's process id, so I executed
pkill -QUIT java
Regards,
Yuriy
On Mon, Aug 25, 2008 at 10:48:05PM -0500, Martin Feller wrote:
Hm, processing all of a sudden seems to stop.
I think I need more information:
Please stop the container and add the following line to
$GLOBUS_LOCATION/container-log4j.properties
log4j.category.org.globus=DEBUG
Then put your problematic persistence data in place, restart the container,
and submit a job. If the job hangs again, please create a thread-dump of
the container by calling
kill -QUIT <containerProcessId>
and send the container logfile once more.
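When several java processes are running, the container's PID can be picked out by matching its main class on the command line rather than signalling every JVM; a sketch (org.globus.wsrf.container.ServiceContainer is the GT4 container's main class, visible in the process listing earlier in this thread):

```shell
# Match the container by its main class; pkill -QUIT java would signal
# every JVM on the machine, not just the container.
pid=$(pgrep -f 'org.globus.wsrf.container.ServiceContainer')
if [ -n "$pid" ]; then
    kill -QUIT $pid    # the JVM writes the thread dump to its stdout/log
else
    echo "no container process found"
fi
```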
Thanks, Martin
Yuriy wrote:
Sorry, last time I edited log4j.properties instead of
container-log4j.properties. New log attached.
On Mon, Aug 25, 2008 at 08:35:35AM -0500, Martin Feller wrote:
GRAM was not started in debug mode, as far as I can tell from the logfile you
attached. I can't see much from that log.
Are you sure you have a line like this in
$GLOBUS_LOCATION/container-log4j.properties?
log4j.category.org.globus.exec.service=DEBUG
(no # at the beginning of the line)
Looks like this is GT from the VDT. I'm not 100% sure whether you enable
debug logging for GRAM in a different way there, but I don't think so.
Check with the admins if you are not sure about that.
As an example I attached a container log with 1 persisted job in the
persistence directory. That's how it should look if GRAM has debug logging
enabled.
Please retry and send the log again.
Martin
Yuriy wrote:
On Fri, Aug 22, 2008 at 01:40:00PM -0500, Martin Feller wrote:
Please try the following:
1. In the situation when the job hangs:
How about submitting a job in batch mode (globusrun-ws -submit -b -o job.epr
...) and querying for the job status instead of listening for notifications
(globusrun-ws -status -j job.epr)?
Does the job status change after a while? (I don't expect it, but just to
make sure.)
No, still "unsubmitted"
2. Shut down the container, enable debug logging in GRAM4
(uncomment # log4j.category.org.globus.exec.service=DEBUG in
$GLOBUS_LOCATION/container-log4j.properties), clean up the persistence
directory, move the problematic persisted job into the persistence
directory, start the container, and submit a job.
Please send the container logfile then.
Log file attached. I had to increase the termination time of that job to
the 26th; otherwise that file is silently removed and jobs can be submitted
as usual.
Regards,
Yuriy
Thanks, Martin
Yuriy wrote:
Hi,
I am having very strange problems with Globus GRAM.
Submission of a job with globusrun-ws hangs at the "Job Unsubmitted"
message. I tried to submit a job from two different machines, with the
same result.
globusrun-ws -submit -J -S -F ng2.auckland.ac.nz:8443 -Ft Fork -o test.epr -c
/bin/echo "hello"
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:6eeadb2c-6ffa-11dd-a2f7-00163e000005
Termination time: 08/23/2008 03:28 GMT
Current job state: Unsubmitted
Sample java program (attached) and CoG client
(cog-job-submit) work normally.
A Globus restart does not help unless I remove the persisted
directory (persisted is on a local partition). I figured out that a single
file of type ManagedExecutableJobResourceStateType causes the problem (xml
attached). When I remove this file and restart Globus, globusrun-ws
works normally. When I copy this file back into
persisted/ManagedExecutableJobResourceState and restart Globus, it
breaks again. My Globus breaks every 3-7 days, so there are other job
resources that cause this problem.
The Globus version is 4.0.7, from VDT 1.10.
What is going on here?
Regards,
Yuriy