Hi Denny,
With the multi-level scheduling approach (what Stu called "pilot" jobs)
that Falkon uses (which builds on top of GT4, and makes extensive use of
web services), you can get single-task (aka job) latencies of
100-500 ms depending on the security settings used, but submission
parallelizes quite well, so if you submit many short tasks, the
amortized latency is on the order of 1-10 ms. We have run workloads with 100ms
task execution times on 100s of CPUs with extremely good efficiency
(90%+). By the time you hit 1 second tasks, we can get 95% utilization
(on 100s of CPUs), and with several second tasks, we can get 99%+
utilization. Our original paper on Falkon has a nice figure that shows
efficiency as a function of number of CPUs and task lengths
(http://people.cs.uchicago.edu/~iraicu/publications/2007_SC07_Falkon.pdf,
Figure 6). Also, Falkon's web page with all related papers, mailing
lists, and source can be found at
http://dev.globus.org/wiki/Incubator/Falkon.
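As a back-of-envelope check on those figures, efficiency is roughly task
time divided by task time plus the amortized per-task dispatch overhead.
A minimal sketch (the overhead values below are assumed for illustration,
not measurements from the paper):

```python
def efficiency(task_s: float, overhead_s: float) -> float:
    """Fraction of wall-clock time spent on useful work when every
    task pays a fixed (amortized) dispatch overhead."""
    return task_s / (task_s + overhead_s)

# Assumed amortized overhead of ~10 ms per task (illustrative only).
for task_s in (0.1, 1.0, 5.0):
    print(f"{task_s:4.1f} s tasks -> efficiency "
          f"{efficiency(task_s, 0.010):.1%}")
```

With ~10 ms of amortized overhead, 100 ms tasks land a little above 90%
and multi-second tasks approach 99%+, consistent in shape with Figure 6
of the SC07 paper.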
If you have any other questions, let me know.
Ioan
Stuart Martin wrote:
Hi Denny,
For a simple /bin/date job without delegation, staging, or cleanup,
submitted to Fork, our performance measurements for 4.0.7 were ~1.5
seconds. So you are close to our results. The difference could be
the testing hosts. Another possibility is that the first job
submitted to a container incurs some service activation costs. So
subsequent jobs should perform better. Was the job below the first
one submitted to the container?
Authentication is costly, and the GRAM service also maintains the job
info/state in a file on disk. And then there is the execution of the
application itself. When profiling, we have not seen any obvious
bottlenecks. So, I think ~1.5 seconds is the cost of the GRAM service.
I'm not sure if this fits your scenario, but for a client that is
managing 1K/10K/100K <1 second execution jobs, methods have been
implemented to submit a "pilot" job through gram. The pilot job
starts up under the user account on the remote compute resource and
connects back to the client. The client then sends jobs directly to
the pilot service (not through GRAM); GRAM is used only to bootstrap
this service on the remote compute resource. Condor-G does this through
glide-ins. Falkon is another implementation that has proven to scale
very well and has some impressive results. More can be read here:
http://dev.globus.org/wiki/Incubator/Falkon
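The pattern above can be sketched as a toy model in Python (hypothetical
structure; the real glide-in/Falkon protocols involve network channels,
security, and scheduling that are elided here). The point is that the
expensive bootstrap happens once, and subsequent tasks flow to the pilot
over a cheap direct channel:

```python
import queue
import threading

def pilot_worker(tasks: queue.Queue, results: list) -> None:
    """The 'pilot': launched once (in reality via a single GRAM job),
    then executes many tasks with no further GRAM submissions."""
    while True:
        task = tasks.get()
        if task is None:          # sentinel: client is done
            break
        results.append(task())    # run the task directly

tasks: queue.Queue = queue.Queue()
results: list = []
pilot = threading.Thread(target=pilot_worker, args=(tasks, results))
pilot.start()                     # one expensive bootstrap...

for i in range(5):
    tasks.put(lambda i=i: i * i)  # ...then many cheap task dispatches
tasks.put(None)
pilot.join()
print(results)                    # [0, 1, 4, 9, 16]
```

A thread and an in-process queue stand in for the remote pilot and its
connection back to the client; the per-task cost is a queue operation
rather than a full GRAM submission, which is where the amortization
comes from.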
Cheers,
-Stu
On Jul 10, 2008, at 1:50 AM, <[EMAIL PROTECTED]> wrote:
Hi all,
I found that it costs 3-4 s on average for Globus to execute a simple
job, and a little longer when there are data stage-in and stage-out.
As in the example below, the real elapsed time is 0m2.510s, but the
user CPU time is just 0m0.430s. What do you think the extra time is
spent on: Globus authentication? Network communication?
My other question is: does this mean Globus is not suitable for
real-time applications (less than 1 s response time)?
Example:
-bash-3.00$ time globusrun-ws -submit -c /bin/true
Submitting job...Done.
Job ID: uuid:a877ba4c-4e47-11dd-9443-224466880045
Termination time: 07/11/2008 06:15 GMT
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
real 0m2.510s
user 0m0.430s
sys 0m0.030s
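For what it's worth, the transcript above already bounds where the time
goes: only user + sys is spent in the client process itself, and the
remainder is waiting on authentication, network round trips, and
service-side work. A quick back-of-envelope (values copied from the
`time` output above):

```python
# Values from the `time` output above, in seconds.
real = 2.510   # wall-clock time for the whole submission
user = 0.430   # client-side CPU time in user mode
sys_ = 0.030   # client-side CPU time in kernel mode

# Everything not accounted for by client CPU time is spent outside
# the client: auth handshakes, network, and the GRAM service.
overhead = real - (user + sys_)
print(f"time outside the client process: {overhead:.3f} s")
```

So roughly 2 of the 2.5 seconds are not client CPU time at all, which
matches the authentication/service-cost explanation in the replies.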
Regards,
Denny