Hi Denny,
With the multi-level scheduling approach that Falkon uses (what Stu called "pilot" jobs), which builds on top of GT4 and makes extensive use of web services, you can get single-task (a.k.a. job) latencies of 100~500 ms, depending on the security used. Things parallelize quite well, though, so if you submit many short tasks, the amortized latencies are on the order of 1~10 ms. We have run workloads with 100 ms task execution times on 100s of CPUs with extremely good efficiency (90%+). By the time you hit 1-second tasks, we can get 95% utilization (on 100s of CPUs), and with tasks of several seconds, 99%+ utilization. Our original Falkon paper has a nice figure that shows efficiency as a function of the number of CPUs and the task length (http://people.cs.uchicago.edu/~iraicu/publications/2007_SC07_Falkon.pdf, Figure 6). Falkon's web page, with all related papers, mailing lists, and source, can be found at http://dev.globus.org/wiki/Incubator/Falkon.
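A back-of-the-envelope model shows where numbers like these come from (a sketch only, not Falkon's measured results; it ignores everything except a fixed amortized per-task dispatch overhead): efficiency is roughly work / (work + overhead).

```python
# Rough efficiency model for amortized per-task dispatch overhead.
# This is an illustration, not Falkon's actual measurements: real systems
# have additional costs that this single-parameter model ignores.

def efficiency(task_ms, overhead_ms):
    """Fraction of wall time spent on useful work."""
    return task_ms / (task_ms + overhead_ms)

# With ~10 ms amortized overhead per task (upper end of the 1~10 ms range):
print(f"{efficiency(100, 10):.1%}")   # 100 ms tasks -> 90.9%
print(f"{efficiency(1000, 10):.1%}")  # 1 s tasks    -> 99.0%
print(f"{efficiency(5000, 10):.1%}")  # 5 s tasks    -> 99.8%
```

The trend matches the reported figures: overhead is negligible once tasks run for a second or more, and even 100 ms tasks stay above 90% efficiency.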

If you have any other questions, let me know.

Ioan

Stuart Martin wrote:
Hi Denny,

For a simple /bin/date job without delegation, staging, cleanup, submitted to Fork, our performance measurements for 4.0.7 were ~1.5 seconds. So you are close to our results. The difference could be the testing hosts. Another possibility is that the first job submitted to a container incurs some service activation costs. So subsequent jobs should perform better. Was the below job the first one submitted to the container?

Authentication is costly, but the gram service also maintains the job info/state in a file on disk, and then there is the execution of the application itself. When profiling, we have not seen any obvious bottlenecks, so I think ~1.5 seconds is simply the cost of the gram service.

I'm not sure if this fits your scenario, but for a client managing 1K/10K/100K jobs with <1 second execution times, methods have been implemented to submit a "pilot" job through gram. The pilot job starts up under the user's account on the remote compute resource and connects back to the client. The client then sends jobs directly to the pilot service (not through gram); gram is used only to bootstrap the service on the remote compute resource. Condor-G does this through glide-ins. Falkon is another implementation that has proven to scale very well and has some impressive results. More can be read here: http://dev.globus.org/wiki/Incubator/Falkon
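The pilot pattern Stu describes can be sketched in a few lines (a toy in-process illustration only; in a real system like Falkon the pilot runs on the remote resource and connects back over the network, and all names below are made up):

```python
import queue
import threading

def run_pilot(task_names):
    """Toy pilot-job pattern: the pilot is started once (standing in for
    the one-time gram submission), then tasks are dispatched to it
    directly via a queue, avoiding per-task gram overhead."""
    tasks = queue.Queue()
    results = []

    def pilot():
        # The pilot loops, pulling tasks until it sees the shutdown sentinel.
        while True:
            task = tasks.get()
            if task is None:
                break
            results.append(f"ran {task}")  # stand-in for real execution

    worker = threading.Thread(target=pilot)
    worker.start()                  # one-time pilot "submission"
    for name in task_names:
        tasks.put(name)             # cheap per-task dispatch
    tasks.put(None)                 # shut the pilot down
    worker.join()
    return results

print(run_pilot(["task-0", "task-1", "task-2"]))
# -> ['ran task-0', 'ran task-1', 'ran task-2']
```

The point of the pattern is visible in the structure: the expensive step (starting the worker) happens once, while each task costs only a queue operation.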

Cheers,
-Stu

On Jul 10, 2008, at 1:50 AM, <[EMAIL PROTECTED]> wrote:

Hi all,

I found that it takes 3-4 s on average for Globus to execute a simple job, and a little longer when there is data stage-in and stage-out. As in the example below, the real (wall-clock) time is 0m2.510s, but the user CPU time is just 0m0.430s. What do you think the extra time is used for: Globus authentication? Network communication?

My other question is: does this mean Globus is not suitable for real-time applications (less than 1 s response time)?

Example:
-bash-3.00$ time globusrun-ws -submit -c /bin/true
Submitting job...Done.
Job ID: uuid:a877ba4c-4e47-11dd-9443-224466880045
Termination time: 07/11/2008 06:15 GMT
Current job state: CleanUp
Current job state: Done
Destroying job...Done.

real    0m2.510s
user    0m0.430s
sys     0m0.030s

Regards,
Denny
