We have a modest cluster of about 540 compute nodes running upwards of
15k jobs/day.

The dedicated qmaster pushes about 10-20 MB/sec on average for everything.
Assuming a perfectly even spread across all nodes, that works out to only
a few tens of KB/sec per compute node.
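
A quick back-of-envelope check, using the averages above (just a sketch;
plug in your own numbers):

    # Rough per-node share of qmaster traffic, assuming an even spread.
    nodes = 540
    for total_mb_per_sec in (10, 20):
        per_node_kb = total_mb_per_sec * 1024.0 / nodes
        print("%d MB/s total -> ~%.0f KB/s per node"
              % (total_mb_per_sec, per_node_kb))
    # 10 MB/s total -> ~19 KB/s per node
    # 20 MB/s total -> ~38 KB/s per node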

The node count recently grew by about 20%, which had only a minor impact
on overall traffic.  A lot of this traffic also comes from DRMAA
connections in polling loops, alongside the job dispatch traffic.  Our
various update settings aren't terribly aggressive: 40 seconds for load
reports, 5 minutes for max_unheard, etc.
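
For reference, those two intervals live in the global cluster
configuration (see sge_conf(5)); ours look roughly like this:

    $ qconf -sconf global | egrep 'load_report_time|max_unheard'
    load_report_time             00:00:40
    max_unheard                  00:05:00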


While the qmaster is multi-threaded, parts of it, like the scheduling
loop, are CPU bound, and those are more likely to become a bottleneck
before network traffic does.
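
If you want to confirm where the time actually goes, the scheduler
configuration has a profiling knob (check sched_conf(5) for your
release; if memory serves, the per-cycle timing ends up in the qmaster
messages file):

    $ qconf -msconf
    ...
    params                            PROFILE=1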

On the compute node side, the same data link also carries this minimal
SGE traffic.

Basically:  I think you don't have to worry unless you have a really
big cluster.

The reporting/accounting logs can be disk-heavy if you push a lot of
jobs, and they get quite large.  On the other hand, disk space is cheap
and the logs compress well.

(Note that the reporting log appears to be a superset of the accounting
log, so you could enable just the reporting log.  Unfortunately, most of
the scripts and tools people have written expect the native accounting
format.)
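
For what it's worth, the native accounting file is just colon-delimited
text, one record per finished job, with '#' comment lines at the top,
which is why it's so easy to script against.  A rough Python sketch;
the path and field indices (owner, ru_wallclock) are from accounting(5)
as I remember it, so double-check them on your version:

    from collections import defaultdict

    # Hypothetical path; the real file lives under
    # $SGE_ROOT/$SGE_CELL/common/.
    ACCT = "/opt/sge/default/common/accounting"

    wallclock_by_owner = defaultdict(float)

    with open(ACCT) as fh:
        for line in fh:
            if line.startswith("#"):
                continue                      # comment/header lines
            fields = line.rstrip("\n").split(":")
            owner = fields[3]                 # job owner (assumed index)
            wallclock = float(fields[13])     # ru_wallclock in seconds (assumed index)
            wallclock_by_owner[owner] += wallclock

    for owner, secs in sorted(wallclock_by_owner.items()):
        print("%-12s %8.1f hours" % (owner, secs / 3600.0))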

On Fri, Sep 25, 2015 at 08:48:49AM -0700, Skylar Thompson wrote:
The only time I've ever seen GE itself consume a significant amount of
resources is when the reporting log is on and you're trying to push through
thousands of jobs per second. The reporting log is of marginal value for us
so I just turned it off. The accounting log is still a bit of disk load but
it's useful enough that I left it on.

I don't think there's any practical upper limit to the number of nodes you
want in a queue or in the cluster as a whole. It's really up to the tasks
you're trying to run, and the policies/SLAs you've set for the cluster.

On Thu, Sep 24, 2015 at 10:04:06PM +0000, Lane, William wrote:
If a cluster is running on a relatively slow networking backbone (say
gigabit Ethernet or 10 gigabit Ethernet as opposed to InfiniBand), is
there any commonly accepted point at which increasing the number of
nodes in a queue negatively affects the performance of the queue? Is
there any general rule about how many nodes to have in a queue based on
a given network backbone?


--
-- Skylar Thompson (skyl...@u.washington.edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine

--
Jesse Becker (Contractor)
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
