Martin Feller wrote:

If the job is the one from below (globusrun-ws -submit -s -F gavosrv1 -c /bin/true), then no file cleanup with RFT is involved. I occasionally saw similar behavior on one of the ISSGC08 machines, where simple jobs seemed to take a nap between CleanUp and Done. Unfortunately I didn't follow up on this and cannot reproduce it right now, but I
remember that the file system on the headnode was very slow at that time.
The only thing I can think of is the cache-cleanup step, in which certain files are removed.

Arthur:
* Is the file system on the headnode that contains the user homes very slow (a busy NFS server, for instance)?

I don't know of any reason the file system should be slow. (I also don't know how to test it.) Note that several different machines are affected in a similar way.
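One crude way to test it (my own sketch, not part of the original exchange; the file name is just an example) is to time a synced write to the suspect file system:

```shell
# Rough write-latency probe: time a 64 MB synced write to the home
# directory. On a healthy local disk this finishes in a second or two;
# on a badly overloaded NFS mount it can take far longer.
probe="$HOME/gram-fs-probe.tmp"
time dd if=/dev/zero of="$probe" bs=1M count=64 conv=fsync
rm -f "$probe"
```

Repeating this a few times during a slow job run would at least show whether file-system latency correlates with the CleanUp-to-Done stall.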

* Do you get a better performance if you run
     globusrun-ws -submit -s -F gavosrv1 -f job.xml
  with the following job description stored in job.xml ?

<job>
  <executable>/bin/true</executable>
  <stdout>/tmp/stdout</stdout>
  <stderr>/tmp/stderr</stderr>
</job>

Yes, that helps noticeably, another factor of 2:

from gavosrv1.mpe.mpg.de to gavosrv1.mpe.mpg.de: 0m5.815s
from gavosrv1.mpe.mpg.de to udo-gt03.grid.tu-dortmund.de: 0m4.712s
from udo-gt03 to gavosrv1.mpe.mpg.de: 0m5.692s
from udo-gt03 to udo-gt03.grid.tu-dortmund.de: 0m3.661s

This is no longer terribly far removed from the numbers other people were reporting, and is getting into a range where I probably wouldn't bother complaining. What is happening here? I get output back through streaming in either case (with /bin/date instead of /bin/true). And I still don't understand why gsissh, which also has to do authentication, is ten times faster.

--Art Carlson



Martin


Charles Bacon wrote:

Interesting - I remember some discussion like that on this list, I think, but what does that rule achieve?

For the original user: delays like that are not normal. The activity in the Cleanup->Done phase is an RFT job that deletes the files associated with the job. Is your GRAM server configured to use a local RFT server? Is the GridFTP server local to the machine running the container? Do you notice slow results using globus-url-copy from the machine to itself?


Charles

On Jul 21, 2008, at 4:39 AM, Steve White wrote:

Art,

As I understand it, your application runs a single process in the "fork"
job manager.  So you are referring to the latency in running a single
simple process, rather than to that in submission to a batch system.

I now remember that last September, Thomas Brüsemeister pointed out to
us a work-around for a similar problem, at least regarding file transfers.
It was to add the following 'iptables' rule:

iptables -A OUTPUT -p tcp --syn --dport 113 -j REJECT --reject-with tcp-reset

We implemented this on many of our systems at AIP and observed a big
improvement in some kinds of latency. Now I see that on some of them
the setting has been lost (after system upgrades, etc.).
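As I understand it, the rule helps because port 113 is the ident/auth service, which some servers probe when accepting a connection. A firewall that silently DROPs those probes forces the remote side to wait out a multi-second TCP timeout per connection, whereas REJECTing with a TCP reset makes the probe fail immediately. The difference is easy to see by timing a connection attempt to a closed port (a closed port behaves like REJECT):

```shell
# A connection attempt to a closed (or REJECTed) port fails almost
# instantly; against a port whose SYNs are silently DROPped, the same
# attempt would hang until the 5-second timeout below expires.
start=$(date +%s)
timeout 5 bash -c 'exec 3<>/dev/tcp/127.0.0.1/113' 2>/dev/null
end=$(date +%s)
echo "ident probe returned after $((end - start))s"
```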

Would this improve things for your application?

Cheers!


On 20.07.08, Arthur Carlson wrote:

In the thread "Globus not for real-time application?", a number of users discuss whether it is realistic or not to get latencies below 1 second.
Sounds like paradise. I am seeing latencies of up to a minute!

My workstation, gavosrv1.mpe.mpg.de, no longer the newest, has Globus
Toolkit 4.0.5 installed. When I use globusrun-ws to go from this machine
back to itself, ... but just look:

[EMAIL PROTECTED] ~]$ time globusrun-ws -submit -s -F gavosrv1 -c /bin/true
  Delegating user credentials...Done.
  Submitting job...Done.
  Job ID: uuid:52f0f962-54e1-11dd-a56f-0007e914d571
  Termination time: 07/19/2008 15:51 GMT
  Current job state: Active
  Current job state: CleanUp-Hold
  Current job state: CleanUp
  Current job state: Done
  Destroying job...Done.
  Cleaning up any delegated credentials...Done.

  real    0m24.327s
  user    0m1.242s
  sys     0m0.113s

Note that "user" and "sys" times are reasonable. Almost all of this time
passes between "CleanUp" and "Done". It can't just be checking
credentials because gsissh is done in a jiffy:

  [EMAIL PROTECTED] ~]$ time gsissh -p 2222 gavosrv1 /bin/true

  real    0m0.649s
  user    0m0.134s
  sys     0m0.020s

Maybe that is already enough for someone to see where the problem lies.
I can also point out that all (at least many) of the machines in our
grid (AstroGrid-D) seem to be affected, but to varying degrees. Here is
a little matrix of tests:

from gavosrv1.mpe.mpg.de to gavosrv1.mpe.mpg.de: 0m27.235s
from gavosrv1.mpe.mpg.de to titan.ari.uni-heidelberg.de: 0m14.324s
from gavosrv1.mpe.mpg.de to udo-gt03.grid.tu-dortmund.de: 0m8.823s

from titan to gavosrv1.mpe.mpg.de: 0m57.208s
from titan to titan.ari.uni-heidelberg.de: 0m16.875s
from titan to udo-gt03.grid.tu-dortmund.de: 0m27.225s

from udo-gt03 to gavosrv1.mpe.mpg.de: 1m5.221s
from udo-gt03 to titan.ari.uni-heidelberg.de: 0m12.905s
from udo-gt03 to udo-gt03.grid.tu-dortmund.de: 0m6.952s

Please tell me I am doing something really stupid. For production use of my application, even a minute of latency is not a big deal, but it's a pain during development and debugging. Right now I am using gsissh instead of
globusrun-ws just to work around this.

Thanks for the lift,
Art Carlson
AstroGrid-D Project
Max-Planck-Institut für extraterrestrische Physik, Garching, Germany


--
| - - - - - - - - - - - - - - - - - - - - - - - - -
| Steve White                       +49(331)7499-202
| e-Science / AstroGrid-D           Zi. 35 Bg. 20
| - - - - - - - - - - - - - - - - - - - - - - - - -
| Astrophysikalisches Institut Potsdam (AIP)
| An der Sternwarte 16, D-14482 Potsdam
|
| Vorstand: Prof. Dr. Matthias Steinmetz, Peter A. Stolz
|
| Stiftung privaten Rechts, Stiftungsverzeichnis Brandenburg: III/7-71-026
| - - - - - - - - - - - - - - - - - - - - - - - - -




--
Dr. Arthur Carlson
Max-Planck-Institut fuer extraterrestrische Physik
Giessenbachstrasse, 85748 Garching, Germany
Phone: (+49 89) 30000-3357
E-Mail: [EMAIL PROTECTED]
