Hi Mark,

I'll respond to you via separate e-mail because I do not want to misuse this mailing list with too much commercial messaging. On behalf of other readers who may have similar requirements, however, I would like to provide a brief overview of Univa's product line-up and how it relates to your requirements:

 * On top of Univa Grid Engine (which the Sun Grid Engine team, now
   working for Univa, has evolved over the past 5 years) we offer
   Universal Resource Broker (URB) as an add-on. It allows you to run
   "frameworks" such as the Jenkins Mesos framework on top of Univa
   Grid Engine. This gives you the flexibility and dynamism of these
   frameworks while providing full Univa Grid Engine policy control and
   the ability to mix and match diverse workloads (inside and outside
   such frameworks)
 * You can also consider direct Grid Engine Jenkins integrations like
   the one John McGhee has pointed out
 * In May/June we are going to release Univa Grid Engine Container
   Edition which will allow you to run Docker containers as a
   first-class workload in a Grid Engine cluster
 * We also provide an enhanced version of Kubernetes called Navops
   (navops.io) which augments Google's Kubernetes with sophisticated
   policy management derived from our scheduling IP. Navops is targeted
   towards micro-service architectures. Sharing resources between a
   Navops and a Univa Grid Engine environment will also be possible,
   allowing micro-service and more traditional workloads to be blended
 * If you have a large number of tasks with very short run-times, as
   is typical for certain test use cases, then Univa Short Jobs might
   be worth a look. It allows you to run extreme-throughput workloads
   with high efficiency on top of Univa Grid Engine. Tasks can have
   run-times down to a few milliseconds, and you can run 20,000 or
   more tasks per second even in a relatively small cluster
 * All products can run inside VMs or on cloud nodes, and we have a
   product called UniCloud which can flex cluster sizes dynamically and
   supports automated cloud bursting. It seems you are already covering
   part of this with your use of Vagrant + Ansible, however
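For context on the throughput point above: even stock Grid Engine expresses this kind of workload as a plain array job. A minimal sketch only; the script name and task counts are made up for illustration:

```shell
# Submit a test suite as a Grid Engine array job (illustrative names).
# -t 1-20000 creates 20,000 tasks, one per test case;
# -tc 500 caps the number of tasks running concurrently at 500.
qsub -t 1-20000 -tc 500 run_one_test.sh
# Inside run_one_test.sh, $SGE_TASK_ID selects which test case to run.
```

Short Jobs is aimed at the regime where the per-task dispatch overhead of this plain approach becomes the bottleneck.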

Hope this helps. If there are questions of general interest, we can certainly discuss them here; I will be in touch with you directly for anything else.

Cheers,

Fritz

Dr. Mark Asbach wrote:
Hi S(o)GE users,

I need some advice :-)

During my Ph.D. times, I discovered Sun Grid Engine and used it to run 
distributed machine learning jobs on a (then) medium-sized cluster (96 CPUs). I 
liked it. Now, a couple of years later, I am again looking for a scheduling and 
resource allocation system like SGE for a similar purpose. Unfortunately, SGE 
seems to be pretty dead. In addition, I have similar but not identical needs 
stemming from continuous integration and from running (micro-)web services. 
Ideally, I would like a simple, integrated solution and not a complex monster 
built from many large parts.

Here's what I'm trying to accomplish:

- Run custom jobs for machine learning / data analysis. When I have an idea, I 
write a job and run it. Usually, the same job is only run a few times. Jobs will 
span multiple hosts and might require OpenMP + MPI. This is where SGE was really 
good in the past. The crowd seems to have shifted to running everything on 
Hadoop, although that setup would be really inefficient for my purposes. I 
usually just need a couple of CPUs (< 100).

- Run frequent identical jobs for continuous integration. We have a Jenkins 
running, but it is lacking in some regards. Resource allocation and scheduling 
are more or less non-existent. For example, I cannot define resources for things 
like attached mobile devices that can be used by only one job at a time on a 
multi-core Mac. These are things already solved with SGE, but SGE itself 
does not cover the main aspects of CI, i.e. the collection and analysis of the 
build data.

- Run (micro-)services. We have a couple of services that need to run 
continuously. Some need to be scaled up and down in the number of 
parallel instances. This is where people are now using Docker and (also quite 
complex) resource allocation and scheduling systems like Kubernetes.
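The device-scheduling problem in the CI point above maps directly onto Grid Engine's consumable complexes. A sketch of the idea; the complex name, host name, and job script are all made up:

```shell
# Model an attached mobile device as a consumable Grid Engine resource.
# 1. Define a consumable complex via "qconf -mc" by appending a line:
#    mobile_device  md  INT  <=  YES  YES  0  0
# 2. Grant one unit of it to the Mac that has the device attached:
qconf -aattr exechost complex_values mobile_device=1 build-mac-01
# 3. Jobs request the device; the scheduler then runs at most one
#    such job on that host at any time:
qsub -l mobile_device=1 ui_test.sh
```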

All three sorts of tasks compete for the same resources and suffer the same 
problem of provisioning/configuring the workers to fulfill a job's 
requirements. We're using Vagrant + Ansible to provision VMs for our machine 
learning tasks and I would like to extend this to the other problems as well. 
The resource allocation is still somewhat manual in our case. I would really 
like to cut down the complexity of our setup.
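On the provisioning side, a Vagrant/Ansible-built VM can be folded into a Grid Engine cluster once it boots. A rough sketch, assuming the playbook already installs and starts the execd; hostname, group name, and file contents are illustrative:

```shell
# Register a freshly provisioned VM as a Grid Engine execution host.
vagrant up worker-03                # Vagrant + Ansible provision the VM
cat > worker-03.conf <<'EOF'
hostname         worker-03
load_scaling     NONE
complex_values   NONE
user_lists       NONE
xuser_lists      NONE
projects         NONE
xprojects        NONE
usage_scaling    NONE
report_variables NONE
EOF
qconf -Ae worker-03.conf            # add it as an execution host
qconf -aattr hostgroup hostlist worker-03 @allhosts  # existing queues pick it up
```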

It would be great if you could point me to any helpful information, ideas, or 
projects that could help me solve this.

Best,
Mark
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

--

Fritz Ferstl | CTO and Business Development, EMEA
Univa Corporation <http://www.univa.com/> | The Data Center Optimization Company
E-Mail: [email protected] | Mobile: +49.170.819.7390




