Let me see if I understand this correctly. Are you running a task farming
grid in which each compute resource (e.g. 1 CPU, or 1 node) has GRAM
installed and waiting to receive work? Or do you have a gateway node
that has GRAM configured and waiting for work, which then gets passed
down to another LRM (e.g. BOINC) to dispatch to the remote CPUs? So,
there are 2 paradigms here: 1-to-1 GRAM submission and 1-to-many GRAM
submission. The 1-to-1 GRAM submission is what I was referring to below,
when I said that it's OK to have 1~60 sec latencies if your jobs are
hours long each. Note that GRAM parallelizes quite well, so if your
submission client is multi-threaded, you should be able to get around
1 job/sec throughput (which translates to about 1 sec amortized latency
per job).
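To illustrate the amortization argument, here is a minimal Python sketch.
It does not call GRAM at all: submit_job is a hypothetical stand-in that
just sleeps for a fixed latency, and the thread counts and latency values
are made-up numbers chosen only to show the effect of overlapping
submissions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def submit_job(job_id, latency=0.05):
    """Hypothetical stand-in for one submission; sleeps to mimic latency."""
    time.sleep(latency)
    return job_id


def submit_all(n_jobs, n_threads, latency=0.05):
    """Submit n_jobs using n_threads; return (job ids, amortized latency per job)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map preserves input order, so the ids come back as 0..n_jobs-1.
        done = list(pool.map(lambda j: submit_job(j, latency), range(n_jobs)))
    elapsed = time.perf_counter() - start
    return done, elapsed / n_jobs


if __name__ == "__main__":
    _, serial = submit_all(n_jobs=20, n_threads=1)
    _, threaded = submit_all(n_jobs=20, n_threads=10)
    print(f"amortized latency, 1 thread:   {serial:.3f} s/job")
    print(f"amortized latency, 10 threads: {threaded:.3f} s/job")
```

Each individual submission still takes the full latency; only the average
cost per job drops, because many submissions are in flight at once.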
The 1-to-many GRAM submission is the trickier one. Instead of running GRAM
on each remote CPU (i.e. as a server), in Falkon we decided to make the
remote CPUs clients, which communicate back to a GT4 instance to
collect work. This also spared us from having to run a full GT4 on each
remote CPU.
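The pull model described above can be sketched roughly as follows. This
is not Falkon's actual code: the in-process queue stands in for the
central GT4 service, the threads stand in for remote CPUs, and the names
worker and run_pull_model as well as the "square the input" task are
hypothetical, chosen only for illustration.

```python
import queue
import threading


def worker(worker_id, tasks, results):
    """A 'remote CPU' acting as a client: it pulls work until told to stop."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel: no more work for this worker
            break
        results.put((worker_id, task, task * task))


def run_pull_model(n_workers, task_inputs):
    """Start workers that pull from a shared queue; return all results."""
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(i, tasks, results))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for x in task_inputs:
        tasks.put(x)
    for _ in threads:
        tasks.put(None)  # one stop sentinel per worker
    for t in threads:
        t.join()
    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected


if __name__ == "__main__":
    for worker_id, task, value in sorted(run_pull_model(4, range(10)), key=lambda r: r[1]):
        print(f"task {task} -> {value} (worker {worker_id})")
```

The dispatcher never needs to know where the workers are or push anything
to them; each worker simply asks for the next task when it is free, which
is why nothing server-like has to run on the worker side.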
Can you give us more details about your deployment: the network
topology, the number of CPUs, the LRM used (e.g. BOINC, PBS, Condor),
whether the client submitting to GRAM is multi-threaded, the number of
jobs injected into the system over a given period of time, the
min/average/max job run times, and how much control you have over the
various pieces (which ones you could change if you needed to)? A better
understanding of your deployment will help us point you to a solution
that is right for you!
Ioan
Alexander Beck-Ratzka wrote:
On Sunday, 20 July 2008 18:19:57 Ioan Raicu wrote:
Hi,
You are forgetting that in real Grid deployments, the majority of the
wait time will be in queue wait times in batch schedulers. For example,
in some logs I looked at from 2005 from SDSC, I recall seeing queue wait
times of 6 hours on average over a 1 year period. So, having some extra
latency on the order of 1~60 seconds is not a big deal when your average
job lengths are hours, or more.
This might be right for your use case. However, there are also other use
cases in the grid world. We are running [EMAIL PROTECTED] as a task farming
application on the resources of D-Grid, and we consume about 100,000
CPU hours per day. So it is really a production application. Because we are
submitting hundreds of jobs, the latency cannot be neglected, and it would be
really helpful to reduce it to below 1 second. If you look at the network
traffic caused by globusrun-ws -submit, you can see that there are a lot of
communication cycles (I think there are 9) between the submitting host and
the execution host. Is this really necessary? SOAP only requires one...
So please note: there is no single "real Grid deployment" in the sense you
describe. I think this problem will become even more bothersome once a
scheduler such as GridWay comes into play.
Cheers
Alexander
--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: [EMAIL PROTECTED]
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================