> I'm quite sure that this solves your problem. Please let me know.

Yes, it does. Thank you.
Regards,
Yuriy

On Wed, Aug 27, 2008 at 08:55:05AM -0500, Martin Feller wrote:
> Ok, I think I see it now. You are hitting a combination of generous
> locking and a potential for an infinite loop in which your container
> happily cycles.
>
> This situation can happen if your job wants to fetch a non-existing
> credential (probably destroyed earlier) from the delegation service
> and then, because the credential does not exist anymore, tries to
> delete the user proxy file created from that credential earlier, which
> does not exist either, because it was probably deleted when the
> credential was destroyed.
> A not completely uncommon situation, I guess, and we handle that badly.
>
> I'll have to check how this should best be fixed. The fix should
> then also find its way into the VDT. I'll open a bug for that.
>
> A quick fix to keep you going is:
> Replace
>
>     -delete)
>         # proxyfile should exist
>         exec rm "$PROXYFILE"
>         exit $?
>         ;;
>
> with
>
>     -delete)
>         if [ -e "$PROXYFILE" ]; then
>             exec rm "$PROXYFILE"
>             exit $?
>         else
>             exit 0
>         fi
>         ;;
>
> in $GLOBUS_LOCATION/libexec/globus-gram-local-proxy-tool
>
> (A patch would have been nicer, but I don't know if our versions of
> that file are the same.)
>
> I'm quite sure that this solves your problem. Please let me know.
>
> Martin
>
>
> Yuriy wrote:
>>> A process per thread, what operating system is that?
>>
>> CentOS 4.4, paravirtualized.
>>
>>> Do you see a thread-dump in the container log after stopping the container?
>>
>> The thread dump seemed to appear in container.log instead of
>> container-real.log.
>>
>> Regards,
>> Yuriy
>>
>> On Mon, Aug 25, 2008 at 11:55:26PM -0500, Martin Feller wrote:
>>> Ok, unfortunately the plus of logging did not provide a plus of insight.
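Martin's quick fix above, lifted out of the quoting into a self-contained sketch. `delete_proxy` is a made-up wrapper name, and the real script uses `exec rm` (which replaces the shell); plain `rm` is used here so the function can be called more than once:

```shell
# Guarded delete: succeed when the proxy file is already gone, instead
# of failing and sending the container into its retry loop.
delete_proxy() {
    if [ -e "$PROXYFILE" ]; then
        rm "$PROXYFILE"
    else
        # Credential was already destroyed, so the proxy file is
        # already gone: nothing to do, report success.
        return 0
    fi
}
```

The whole point of the fix is the `else` branch: the old code unconditionally ran `rm` and propagated its failure, which kept the job cycling.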
>>>
>>> After the following line, processing for the job stops:
>>>
>>> 2008-08-26 16:21:23,909 DEBUG utils.DelegatedCredential
>>> [RunQueueThread_13,getDelegatedCredential:116] checking for existing
>>> credential listener
>>>
>>> I don't see the thread-dump in the container log.
>>> A process per thread, what operating system is that?
>>> I once debugged GRAM on such a system, and the thread-dump didn't show up in
>>> the container log either, but was only printed when the JVM went down.
>>> Did you stop the container before you grabbed the logfile?
>>>
>>> So, sorry for that, we need to retry, but this time please
>>> stop the container after the kill -QUIT, before you grab the logfile.
>>>
>>> Same as always:
>>> * stop the container
>>> * clean up the persistence directory
>>> * put your persisted job in place
>>> * start the container (keep the additional debugging, it does not hurt)
>>> * submit a job
>>> * when it hangs, do the kill -QUIT
>>> * stop the container
>>>
>>> Do you see a thread-dump in the container log after stopping the container?
>>> If so, please send it.
>>> If not, ... I hope you'll see it.
>>>
>>> Martin
>>>
>>>
>>> Yuriy wrote:
>>>> Ok, new log attached.
>>>> I had 168 java processes running, all with the
>>>> same command line:
>>>>
>>>> /opt/vdt/jdk1.5/bin/java
>>>> -Dlog4j.configuration=container-log4j.properties -Xmx512M
>>>> -Dorg.globus.wsrf.container.persistence.dir=/opt/vdt/vdt-app-data/globus/persisted
>>>> -DGLOBUS_LOCATION=/opt/vdt/globus
>>>> -Djava.endorsed.dirs=/opt/vdt/globus/endorsed
>>>> -DX509_CERT_DIR=/opt/vdt/globus/TRUSTED_CA
>>>> -DGLOBUS_TCP_PORT_RANGE=40000,41000
>>>> -Djava.security.egd=file:///dev/urandom -classpath
>>>> /opt/vdt/globus/lib/bootstrap.jar:/opt/vdt/globus/lib/cog-url.jar:/opt/vdt/globus/lib/axis-url.jar
>>>> org.globus.bootstrap.Bootstrap
>>>> org.globus.wsrf.container.ServiceContainer -p 8443
>>>>
>>>> I was not sure which one was the container's id, so I executed
>>>>
>>>> pkill -QUIT java
>>>>
>>>>
>>>> Regards,
>>>> Yuriy
>>>>
>>>>
>>>> On Mon, Aug 25, 2008 at 10:48:05PM -0500, Martin Feller wrote:
>>>>> Hm, processing all of a sudden seems to stop.
>>>>>
>>>>> I need more information, I think:
>>>>> Please stop the container and add the following line to
>>>>> $GLOBUS_LOCATION/container-log4j.properties:
>>>>>
>>>>> log4j.category.org.globus=DEBUG
>>>>>
>>>>> Then put your problematic persistence data in place, restart the
>>>>> container, and submit a job. If the job hangs again, please create
>>>>> a thread-dump of the container by calling
>>>>>
>>>>> kill -QUIT <containerProcessId>
>>>>>
>>>>> and send the container logfile once more.
>>>>>
>>>>> Thanks, Martin
>>>>>
>>>>> Yuriy wrote:
>>>>>> Sorry, last time I edited log4j.properties instead of
>>>>>> container-log4j.properties. New log attached.
>>>>>>
>>>>>> On Mon, Aug 25, 2008 at 08:35:35AM -0500, Martin Feller wrote:
>>>>>>> GRAM was not started in debug mode, as far as I can tell from the
>>>>>>> logfile you attached.
>>>>>>> I can't see much from that log.
>>>>>>>
>>>>>>> Are you sure you have a line like this in
>>>>>>> $GLOBUS_LOCATION/container-log4j.properties?
>>>>>>>
>>>>>>> log4j.category.org.globus.exec.service=DEBUG
>>>>>>>
>>>>>>> (no # at the beginning of the line)
>>>>>>>
>>>>>>> It looks like this is GT from the VDT. I'm not 100% sure whether the
>>>>>>> VDT enables debug logging for GRAM in a different way, but I don't
>>>>>>> think so. Check with the admins if you are not sure about that.
>>>>>>> As an example, I attached a container log with 1 persisted job in
>>>>>>> the persistence directory. That's how it should look if GRAM has
>>>>>>> debug logging enabled.
>>>>>>> Please retry and send the log again.
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>>> Yuriy wrote:
>>>>>>>> On Fri, Aug 22, 2008 at 01:40:00PM -0500, Martin Feller wrote:
>>>>>>>>> Please try the following:
>>>>>>>>>
>>>>>>>>> 1. In the situation when the job hangs:
>>>>>>>>> How about submitting a job in batch mode (globusrun-ws -submit -b
>>>>>>>>> -o job.epr ...) and querying for the job status instead of
>>>>>>>>> listening for notifications (globusrun-ws -status -j job.epr)?
>>>>>>>>> Does the job status change after a while? (I don't expect it, but
>>>>>>>>> just to make sure.)
>>>>>>>>
>>>>>>>> No, still "Unsubmitted".
>>>>>>>>
>>>>>>>>> 2. Shut down the container, enable debug logging in GRAM4
>>>>>>>>> (uncomment # log4j.category.org.globus.exec.service=DEBUG in
>>>>>>>>> $GLOBUS_LOCATION/container-log4j.properties), clean up the
>>>>>>>>> persistence directory, move the problematic persisted job into
>>>>>>>>> the persistence directory, start the container, and submit a job.
>>>>>>>>> Please send the container logfile then.
>>>>>>>>
>>>>>>>> Log file attached. I had to increase the termination time of that
>>>>>>>> job to the 26th; otherwise the file is silently removed and jobs
>>>>>>>> can be submitted as usual.
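The "is the property present and uncommented" check that comes up repeatedly in this thread can be scripted. A small sketch; `gram_debug_enabled` is a hypothetical helper name, while the property name and file path are as quoted above:

```shell
# Succeed only if the GRAM debug line exists in the given log4j
# properties file and is not commented out with a leading '#'.
gram_debug_enabled() {
    grep -q '^log4j\.category\.org\.globus\.exec\.service=DEBUG' "$1"
}

# Typical use:
#   gram_debug_enabled "$GLOBUS_LOCATION/container-log4j.properties" \
#       && echo "GRAM debug logging on" || echo "GRAM debug logging off"
```

Anchoring the pattern at the start of the line is what distinguishes the active property from the shipped, commented-out `#log4j.category...` form.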
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Yuriy
>>>>>>>>
>>>>>>>>> Thanks, Martin
>>>>>>>>>
>>>>>>>>> Yuriy wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am having very strange problems with Globus GRAM.
>>>>>>>>>>
>>>>>>>>>> Submission of a job with globusrun-ws hangs at the "Unsubmitted"
>>>>>>>>>> job state. I tried to submit a job from two different machines,
>>>>>>>>>> with the same result:
>>>>>>>>>>
>>>>>>>>>> globusrun-ws -submit -J -S -F ng2.auckland.ac.nz:8443 -Ft Fork \
>>>>>>>>>>     -o test.epr -c /bin/echo "hello"
>>>>>>>>>> Delegating user credentials...Done.
>>>>>>>>>> Submitting job...Done.
>>>>>>>>>> Job ID: uuid:6eeadb2c-6ffa-11dd-a2f7-00163e000005
>>>>>>>>>> Termination time: 08/23/2008 03:28 GMT
>>>>>>>>>> Current job state: Unsubmitted
>>>>>>>>>>
>>>>>>>>>> A sample java program (attached) and the CoG client
>>>>>>>>>> (cog-job-submit) work normally.
>>>>>>>>>>
>>>>>>>>>> A Globus restart does not help, unless I remove the persisted
>>>>>>>>>> directory (persisted is on a local partition). I figured out that
>>>>>>>>>> a single file of type ManagedExecutableJobResourceStateType causes
>>>>>>>>>> the problem (xml attached). When I remove this file and restart
>>>>>>>>>> Globus, globusrun-ws works normally. When I copy this file back
>>>>>>>>>> into persisted/ManagedExecutableJobResourceState and restart
>>>>>>>>>> Globus, it breaks again. My Globus breaks every 3-7 days, so there
>>>>>>>>>> are other job resources that cause this problem.
>>>>>>>>>>
>>>>>>>>>> The Globus version is 4.0.7, from VDT 1.10.
>>>>>>>>>>
>>>>>>>>>> What is going on here?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Yuriy
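Yuriy's bisection in the original report (remove one file under persisted/ManagedExecutableJobResourceState, restart, retest) is a little safer if the state files are moved aside rather than deleted, so any one of them can be copied back for retesting. A sketch under that assumption; `quarantine_persisted_jobs` is a made-up helper, and the directory layout is taken from the thread:

```shell
# Move all persisted job-state files into a quarantine directory so the
# container starts clean; files can then be restored one at a time to
# find the one that hangs the container.
quarantine_persisted_jobs() {
    persist_dir=$1     # e.g. /opt/vdt/vdt-app-data/globus/persisted
    quarantine_dir=$2
    mkdir -p "$quarantine_dir"
    for f in "$persist_dir"/ManagedExecutableJobResourceState/*; do
        [ -e "$f" ] || continue   # glob matched nothing
        mv "$f" "$quarantine_dir"/
    done
}
```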

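On the 168 identical java processes earlier in the thread: instead of `pkill -QUIT java`, the container JVM can be picked out by the distinctive `org.globus.wsrf.container.ServiceContainer` token in its command line. A sketch; `find_pid` is a hypothetical helper built on `pgrep -f`:

```shell
# Print the PID of the first process whose full command line matches
# the given pattern (pgrep -f matches against the whole command line,
# not just the executable name).
find_pid() {
    pgrep -f "$1" | head -n 1
}

# Intended use, based on the command line shown in the thread:
#   kill -QUIT "$(find_pid 'org.globus.wsrf.container.ServiceContainer')"
```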