On 06.09.2007 at 18:27, Chris Dagdigian wrote:
{ Declaration of bias; I run the http://gridengine.info site in my
spare time ... }
I'm quite familiar with both LSF and SGE, using both products in my
professional work and helping clients with queue system selection,
deployment, application integration and training. I'm less
familiar with PBS/Torque/etc. having only run those in small
virtualized lab environments. At the time when I was looking at
open source solutions, none of the PBS variants supported array
jobs so I went with SGE and never looked back.
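
For readers who have not used array jobs: a single submission such as
"qsub -t 1-3 job.sh" starts one task per index, and SGE exports the
index to each task in the SGE_TASK_ID environment variable. A minimal
sketch of how a task might pick its own input follows. The function
name task_input and the file names are hypothetical, chosen only for
illustration.

```python
import os


def task_input(inputs, env=os.environ):
    """Return the input assigned to the current array task.

    SGE sets SGE_TASK_ID for each task of an array job. With a
    submission like `qsub -t 1-3 job.sh` the IDs run from 1 to 3,
    so the ID is shifted by one to index a Python list.
    """
    task_id = int(env["SGE_TASK_ID"])
    return inputs[task_id - 1]


# Hypothetical input list; each array task processes exactly one entry.
files = ["a.fasta", "b.fasta", "c.fasta"]
```

Each task would then call task_input(files) and work only on the file
it gets back, so one submission covers the whole input set.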
Another thing is Tight Integration of parallel runs, which is
available in PBS/Torque for LAM-MPI and OpenMPI, but not for HP-MPI,
Linda or PVM. You can still run those under PBS/Torque of course,
but the slave processes are not controlled by the queuing system,
nor will you get correct accounting. SGE offers an rsh replacement
called qrsh which supports these parallel environments as well.
-- Reuti
The current state of the art is quite good. For 90% of use cases
and end-user requirements you really can't go wrong with any of the
available products.
Everything out there (open source or commercial) is capable of
doing the standard sort of "policy based resource management on
distributed systems" that we all care about.
So with all products capable of doing just about everything you
would need, making an actual product selection comes down to areas
other than the functionality of the queueing core.
Things like:
- Administrative burden (if keeping PBS from falling over requires
a full-time employee, the cost of LSF looks far more attractive,
for instance ...)
- Cost
- Quality of support
- Quality of technical documentation
- Quality of training / professional services
- Layered products that enhance base functionality
Platform LSF is the gold standard: low administrative burden, great
documentation/support, and resiliency features that competitors
still have a tough time matching, all wrapped up with additional
(at extra cost of course) layered products that nobody else can
really touch. The downside? Cost, of course. In particular the
current Linux pricing model punishes you for putting more than 4GB
of RAM into a compute node or using a non X86/X86_64 architecture
-- in both cases you'll get bounced out of the "cheap" license
category and into a far more expensive one where the cost of the
software license is in the same ballpark as the cost of the server
hardware.
Platform will happily sell you additional layered products that can
do things like:
- Tight integration with FlexLM license servers; more powerful than
the standard load sensor (SGE) and elim (LSF) methods that people
do "for free"
- Seriously hardcore reporting and analytic tools suitable for the
largest enterprises
- Tight integration with parallel environments and high-speed
interconnects (plus support for these environments, which is
non-trivial)
- SLA-aware scheduling
- Multi-cluster aware scheduling
- etc. etc.
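
As an aside on the "for free" load sensor method mentioned in the
first item: an SGE load sensor is a small long-running program. The
execution daemon writes a line to its stdin whenever it wants a
report, and the sensor answers with "begin", one or more
"host:resource:value" lines, and "end"; the line "quit" tells it to
exit. The sketch below assumes that protocol. The resource name
lic_free and the probe count_free_licenses are hypothetical
placeholders, not part of SGE or FlexLM.

```python
def count_free_licenses():
    """Hypothetical placeholder: a real sensor would ask the license
    server how many licenses are currently free."""
    return 4


def run_load_sensor(stdin, stdout, host, resource="lic_free",
                    probe=count_free_licenses):
    """Answer every request line with one load report until 'quit'."""
    for line in stdin:
        if line.strip() == "quit":
            break
        # One report block per request: begin / host:resource:value / end.
        stdout.write("begin\n")
        stdout.write(f"{host}:{resource}:{probe()}\n")
        stdout.write("end\n")
        # Flush so the daemon sees the report immediately.
        stdout.flush()
```

A real sensor script would finish by calling
run_load_sensor(sys.stdin, sys.stdout, socket.gethostname()) and would
replace the probe with an actual query of the license server.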
The base version of LSF also ships with a basic reporting module
and a tomcat-driven web interface that is suitable for users
(submit and monitor jobs) as well as admins (manage queues and
hosts). SGE in particular does not really have anything like this
except for ARCo on the reporting side, and ARCo is no match for even
the "free" reporting module you get with LSF 7.x.
That said, it's been my experience that the vast majority of the
"market" does not need, and will likely never need, some of the
advanced/enterprise level add-ons that integrate so cleanly with
the base Platform LSF products.
So this drops me back down into my original argument that just
about any of the available products will perform well at doing what
you need. The key advice I have is to understand that everyone is
pretty good at the basic functions so you'll have to make your
selection decision based on some of the other criteria I tried to
list above.
My general rule of thumb for new projects is to start with the
assumption that I'll be using Grid Engine. Then, once I have a more
formal understanding of the work-flows and customer requirements, it
may become clear that Platform LSF is a better choice.
For all of 2007, my rough guess is that I've worked on 20+ Grid
Engine systems and deployed LSF just once, for a large enterprise
customer.
My $.02 of course!
Regards,
Chris (posting from my non-corporate address)
On Sep 6, 2007, at 5:30 AM, andrew holway wrote:
Hi,
We are trying to work out the differences between these queue
systems.
Can anyone shed any light? Pros and Cons...
SGE, Torque (with Maui), PBSPro and LSF
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf