Steve White wrote:

Art,

What happened when you set the iptables rule?
My iptables on gavosrv1 were pristine, so I ran
iptables -A OUTPUT -p tcp -m tcp --dport 113 --tcp-flags SYN,RST,ACK SYN -j REJECT --reject-with tcp-reset

That seemed to get rid of the worst cases.
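A guess at why the rule helps, consistent with the fix: the remote side presumably attempts an ident lookup (auth, TCP port 113), and when those SYNs are silently dropped the lookup has to wait out a full timeout, whereas REJECT with tcp-reset makes it fail immediately. A minimal Python sketch of the difference between the two failure modes (hostnames and ports here are illustrative only):

```python
import socket
import time

def connect_latency(host, port, timeout=5.0):
    """Time a TCP connect attempt. A peer that answers with RST
    ("connection refused") fails almost instantly; silently dropped
    packets stall until the full timeout elapses."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except ConnectionRefusedError:
        outcome = "refused"   # peer sent RST -> near-instant failure
    except socket.timeout:
        outcome = "timeout"   # packets dropped -> whole timeout is spent
    return outcome, time.monotonic() - start

# Find a local port with no listener: bind an ephemeral port, then free it.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
free_port = probe.getsockname()[1]
probe.close()

# A closed port answers with RST, so the failure is essentially instant --
# the behaviour the iptables REJECT rule above forces for port 113.
outcome, elapsed = connect_latency("127.0.0.1", free_port)
print(outcome, round(elapsed, 3))
```

The same probe pointed at a host that silently drops port-113 SYNs would instead report "timeout" after the full five seconds, which multiplied over several lookups matches the tens of seconds seen before the rule.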

Before -> after:

from gavosrv1.mpe.mpg.de to gavosrv1.mpe.mpg.de:
0m30.865s -> 0m13.191s

from gavosrv1.mpe.mpg.de to udo-gt03.grid.tu-dortmund.de:
0m14.026s -> 0m14.516s

from udo-gt03 to gavosrv1.mpe.mpg.de:
1m0.671s -> 0m13.384s

from udo-gt03 to udo-gt03.grid.tu-dortmund.de:
0m8.521s -> 0m11.580s

This is a huge improvement: instead of being 100 times slower than gsissh, globusrun-ws is now a mere 20 times slower.

Thanks.
Art


On 21.07.08, Arthur Carlson wrote:
Ioan Raicu wrote:
Hi,
You are forgetting that in real Grid deployments, the majority of the wait time will be in queue wait times in batch schedulers.
Actually, I'm not. I said, "For production of my application even a minute of latency is not a big deal, but it's a pain during development and debugging."
For example, in some logs I looked at from 2005 from SDSC, I recall seeing queue wait times of 6 hours on average over a 1-year period. So, having some extra latency on the order of 1 to 60 seconds is not a big deal when your average job lengths are hours or more. Now, what you are asking for is interactive response times (ideally <1 sec).
All I want is what other guys get. Denny is getting 3-4 s. Stu is getting 1.5 s. So why do I have to wait between 10 and 60 seconds to do "nothing"? Your figures of 10 to 60 seconds on a PBS system don't seem too relevant: I get the longest latencies on gavosrv1, which is a simple workstation that hardly sees any traffic. Finally, the fact that I can do "nothing" using gsissh in less than a second tells me that something is rotten with globusrun-ws.
The only way you will achieve that kind of response time is through multi-level scheduling, or via dedicated resources (where the resources are always on and ready to serve your requests). The multi-level scheduling is referring to acquiring resources in bulk, where the latency is not so critical, but then managing those resources with more latency sensitive techniques. In our work with Falkon, we are able to get sub 1 second latencies for fine grained applications via this multi-level scheduling approach. Others probably have other similar techniques to enable this. With the high cost of submitting a GRAM job, from the GT security overheads, to the polling intervals of GRAM, to the batch scheduler overheads, to the polling intervals of the LRM, to the queue times due to contention, I don't believe you will be able to use GRAM in a naive sense for interactive applications, where the response you need is in the sub 1 sec range. If you want more info on Falkon, see http://dev.globus.org/wiki/Incubator/Falkon.
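[The multi-level idea described above can be sketched in a few lines. This is an illustration only, not Falkon's actual code: pay the expensive acquisition cost once to get standing workers, then dispatch individual tasks over cheap in-memory queues.]

```python
import queue
import threading
import time

def acquire_worker(tasks, results):
    """Stand-in for an expensive one-time acquisition (e.g. a batch job
    that starts a long-lived executor). Once running, each task costs
    only a queue operation, not a full scheduler round trip."""
    def loop():
        while True:
            task = tasks.get()
            if task is None:      # shutdown sentinel
                break
            results.put(task())   # per-task dispatch is latency-cheap
    t = threading.Thread(target=loop)
    t.start()
    return t

tasks, results = queue.Queue(), queue.Queue()
workers = [acquire_worker(tasks, results) for _ in range(4)]

# Dispatch 100 fine-grained tasks through the standing pool.
start = time.monotonic()
for i in range(100):
    tasks.put(lambda i=i: i * i)
outputs = sorted(results.get() for _ in range(100))
elapsed = time.monotonic() - start

for _ in workers:
    tasks.put(None)               # shut the pool down
print(len(outputs), round(elapsed, 3))
```
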
I'm sure Falkon is a nifty system, but it's not appropriate for my needs. At the present time, I don't see any reason to even use globusrun-ws when I can get the job done up to 100 times faster with gsissh.

Regards,
Art
Ioan

Arthur Carlson wrote:
In the thread "Globus not for real-time application?", a number of users discuss whether it is realistic or not to get latencies below 1 second. Sounds like paradise. I am seeing latencies of up to a minute!

My workstation, gavosrv1.mpe.mpg.de, no longer the newest, has Globus Toolkit (GT) 4.0.5 installed. When I use globusrun-ws to go from this machine back to itself, ... well, just look:

[EMAIL PROTECTED] ~]$ time globusrun-ws -submit -s -F gavosrv1 -c /bin/true
 Delegating user credentials...Done.
 Submitting job...Done.
 Job ID: uuid:52f0f962-54e1-11dd-a56f-0007e914d571
 Termination time: 07/19/2008 15:51 GMT
 Current job state: Active
 Current job state: CleanUp-Hold
 Current job state: CleanUp
 Current job state: Done
 Destroying job...Done.
 Cleaning up any delegated credentials...Done.

 real    0m24.327s
 user    0m1.242s
 sys     0m0.113s

Note that "user" and "sys" times are reasonable. Almost all of this time passes between "CleanUp" and "Done". It can't just be checking credentials because gsissh is done in a jiffy:

 [EMAIL PROTECTED] ~]$ time gsissh -p 2222 gavosrv1 /bin/true

 real    0m0.649s
 user    0m0.134s
 sys     0m0.020s

Maybe that is already enough for someone to see where the problem lies. I can also point out that all (at least many) of the machines in our grid (AstroGrid-D) seem to be affected, but to varying degrees. Here is a little matrix of tests:

from gavosrv1.mpe.mpg.de to gavosrv1.mpe.mpg.de: 0m27.235s
from gavosrv1.mpe.mpg.de to titan.ari.uni-heidelberg.de: 0m14.324s
from gavosrv1.mpe.mpg.de to udo-gt03.grid.tu-dortmund.de: 0m8.823s
from titan to gavosrv1.mpe.mpg.de: 0m57.208s
from titan to titan.ari.uni-heidelberg.de: 0m16.875s
from titan to udo-gt03.grid.tu-dortmund.de: 0m27.225s
from udo-gt03 to gavosrv1.mpe.mpg.de: 1m5.221s
from udo-gt03 to titan.ari.uni-heidelberg.de: 0m12.905s
from udo-gt03 to udo-gt03.grid.tu-dortmund.de: 0m6.952s
Please tell me I am doing something really stupid. For production of my application even a minute of latency is not a big deal, but it's a pain during development and debugging. Right now I am using gsissh instead of globusrun-ws just to work around this.
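The matrix above can be collected with a small timing helper like the following sketch (the globusrun-ws invocation is the one from the transcript; since those grid hosts are not generally reachable, a trivial command stands in below):

```python
import subprocess
import sys
import time

def time_submit(cmd):
    """Wall-clock an external command, like `time` in the transcripts above."""
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True)
    return proc.returncode, time.monotonic() - start

# With Globus installed, one would time, per target host:
#   ["globusrun-ws", "-submit", "-s", "-F", host, "-c", "/bin/true"]
# Here a trivial no-op command stands in for illustration.
rc, elapsed = time_submit([sys.executable, "-c", "pass"])
print(rc, round(elapsed, 3))
```

Looping that over each (source, target) pair and printing the elapsed times reproduces the matrix format used above.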

Thanks for the lift,
Art Carlson
AstroGrid-D Project
Max-Planck-Institut für extraterrestrische Physik, Garching, Germany




--
Dr. Arthur Carlson
Max-Planck-Institut fuer extraterrestrische Physik
Giessenbachstrasse, 85748 Garching, Germany
Phone: (+49 89) 30000-3357
E-Mail: [EMAIL PROTECTED]
