Hi,
demingyin wrote:
Hi all,
Thank you all for your replies.
I'm using the default GRAM-fork.
My idea of using the Grid to achieve real-time results comes from Google's
search service. It is said that every search request is processed by about
1,000 machines in Google's data centres, and yet the result is usually
returned within one second.
In my case, the average cost of 3-4 s is measured on one Grid node (a
supercomputer, using an IGTF-accredited X.509 Certificate Authority) of Grid
Australia. On my PC (Globus 4.0.5 binary under Debian 3.1r0a "Sarge") it
costs 2-3 s; I suspect the difference comes from communication cost.
Either way, the point is that it seems to cost at least 2 s.
This is true when you use GRAM4 with fork, but fork executes the search
request on the local machine that runs GRAM. In practice, GRAM would
interface with lower-level LRMs such as Condor, PBS, SGE, etc.,
which means the latency increases further. On an idle local
cluster running GRAM4 and PBS, we see latencies of 10 to 60 seconds
to execute a no-op program (i.e. sleep 0). This gets
compounded even more when the cluster is busy and jobs have to wait in
the LRM's queue; I have seen traces from various clusters that show
job queue times in the 7+ hour range. All this makes Grids difficult
to use for applications that require real-time, low-latency
interactions.
The results of the Falkon-like light-weight multi-level scheduling approach
are really good. But my question is, given that:
1. the authentication cost still exists (I can't change the security
solution), and
2. the application execution time is more or less fixed,
can Falkon, submitting through GRAM with the light-weight scheduler,
dramatically reduce the total time to under 1 second, including the
authentication cost and the application execution time?
That is exactly what it can do for you! You incur the higher cost once,
at the time that you acquire the initial set of resources via GRAM and
your favorite LRM. Then, once Falkon is started and managing your
resources, any single request with a work payload of a few hundred
milliseconds should complete end to end in less than a second. The
actual overheads will vary with the security mechanism you use and the
CPU speed of the machines involved, but they should all be under 1
second on an idle system. As load and concurrency increase, you may see
overheads increase as well. If you need to support some QoS guarantee
(e.g. requests handled in less than 1 s), then you might need to
implement some way to reject requests once there are too many
concurrent ones.
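The rejection-based QoS idea above can be sketched with a simple admission
controller. This is a hypothetical illustration only, not part of Falkon;
the class name and the concurrency limit are my own:

```python
import threading

class AdmissionController:
    """Reject new requests once too many are in flight, so that
    accepted requests can still meet a sub-second latency target."""

    def __init__(self, max_concurrent):
        self._slots = threading.Semaphore(max_concurrent)

    def try_submit(self, task):
        # Non-blocking acquire: if every slot is taken, reject
        # immediately instead of queueing and blowing the latency budget.
        if not self._slots.acquire(blocking=False):
            return None  # rejected; caller can retry or fail fast
        try:
            return task()
        finally:
            self._slots.release()

ctl = AdmissionController(max_concurrent=2)
accepted = ctl.try_submit(lambda: "done")          # slot free: runs
nested = AdmissionController(max_concurrent=1)
rejected = nested.try_submit(lambda: nested.try_submit(lambda: "x"))
print(accepted, rejected)  # → done None
```

The key design choice is the non-blocking acquire: a bounded queue would
also work, but queueing trades rejection for added latency, which defeats
the QoS goal here.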
I have also tested Condor as the local scheduler; it seems to be
high-throughput, but not highly efficient for medium-scale data volumes.
It probably still cannot give you the sub 1 second latency that you are
looking for.
Does anyone know of other light-weight Grid middleware that can handle both
the security and the scheduling jobs?
Falkon and Condor glide-ins are the only generic methods that let you do
multi-level scheduling. There might be other solutions out there, but
they are usually tightly coupled to a specific application.
Cheers,
Ioan
Regards,
Denny (Deming Yin)
-----Original Message-----
From: Stuart Martin [mailto:[EMAIL PROTECTED]
Sent: Friday, 11 July 2008 12:51 AM
To: <[EMAIL PROTECTED]>
Cc: Stuart Martin; [email protected]
Subject: Re: [gt-user] Globus not for real-time application?
Hi Denny,
For a simple /bin/date job without delegation, staging, cleanup,
submitted to Fork, our performance measurements for 4.0.7 were ~1.5
seconds. So you are close to our results. The difference could be
the testing hosts. Another possibility is that the first job
submitted to a container incurs some service activation costs. So
subsequent jobs should perform better. Was the below job the first
one submitted to the container?
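One way to check the activation-cost hypothesis is to time the same
submission several times and compare the first (cold) run against the rest.
A rough sketch in Python, using a placeholder command; substitute your
actual invocation, e.g. the globusrun-ws command from the transcript below:

```python
import subprocess
import sys
import time

# Placeholder payload for illustration; in practice this would be
# something like ["globusrun-ws", "-submit", "-c", "/bin/true"].
CMD = [sys.executable, "-c", "pass"]

def timed_run(cmd):
    """Wall-clock time of one submission, in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

times = [timed_run(CMD) for _ in range(5)]
warm_avg = sum(times[1:]) / len(times[1:])
print(f"cold: {times[0]:.3f}s  warm avg: {warm_avg:.3f}s")
```

If the first run is consistently slower than the warm average, service
activation is a plausible explanation for part of the observed latency.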
Authentication is costly, but also the gram service maintains the job
info/state in a file on disk. And then there is the execution of the
application. When profiling, we have not seen any obvious
bottlenecks. So, I think 1.5 seconds is the cost of the gram service.
I'm not sure if this fits your scenario, but for a client that is
managing 1K/10K/100K <1 second execution jobs, methods have been
implemented to submit a "pilot" job through gram. The pilot job
starts up under the user account on the remote compute resource and
connects back to the client. The client then sends jobs directly to
the pilot service (not through gram). gram is used to bootstrap this
service on the remote compute resource. Condor-G does this through
glide-ins. Falkon is another implementation that has proven to scale
very well and has some impressive results. More can be read here:
http://dev.globus.org/wiki/Incubator/Falkon
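The pilot-job pattern Stu describes can be sketched in miniature. Here
threads and in-memory queues stand in for the GRAM-bootstrapped pilot and
its direct connection back to the client; all names are illustrative, not
Falkon's or Condor-G's actual API:

```python
import queue
import threading

def start_pilot(tasks, results):
    """Stand-in for the pilot service: started once (via GRAM, in the
    real system), then pulls work directly from the client."""
    def worker():
        while True:
            task = tasks.get()
            if task is None:        # shutdown sentinel
                break
            results.put(task())     # run the short payload directly
    thread = threading.Thread(target=worker)
    thread.start()
    return thread

tasks, results = queue.Queue(), queue.Queue()
pilot = start_pilot(tasks, results)   # one-time, expensive bootstrap
for i in range(3):                    # later tasks bypass GRAM entirely
    tasks.put(lambda i=i: i * i)
tasks.put(None)
pilot.join()
outputs = sorted(results.get() for _ in range(3))
print(outputs)  # → [0, 1, 4]
```

The point of the pattern is visible in the structure: GRAM (and its
per-job authentication cost) appears only in the one-time bootstrap, while
each subsequent task pays only the cheap direct-dispatch cost.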
Cheers,
-Stu
On Jul 10, 2008, at 1:50 AM, <[EMAIL PROTECTED]> wrote:
Hi all,
I found that it costs 3-4 s on average for Globus to execute a simple
job, and a little longer when there is data stage-in and stage-out.
As in the example below, the real (wall-clock) time is 0m2.510s, but
the user CPU time is just 0m0.430s. What do you think the extra time
is spent on: Globus authentication? Network communication?
My other question is: does this mean Globus is not suitable for real-
time applications (response time under 1 s)?
Example:
-bash-3.00$ time globusrun-ws -submit -c /bin/true
Submitting job...Done.
Job ID: uuid:a877ba4c-4e47-11dd-9443-224466880045
Termination time: 07/11/2008 06:15 GMT
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
real 0m2.510s
user 0m0.430s
sys 0m0.030s
Regards,
Denny
--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: [EMAIL PROTECTED]
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================