On Thu, 24 May 2012, Rayson Ho wrote:
...
1) It's not a "drop in" replacement. If you upgrade gridengine on an existing system, activating your cgroup code will cause an immediate change in the behaviour of jobs, without the user altering their submission flags. People don't tend to like that sort of thing.

Can you let me know (based on your understanding of memsw) what the difference is?

The existing h_vmem implementation uses the setrlimit definition of virtual memory. It restricts the amount of virtual address space used by a job: each process individually via setrlimit, and all processes together via the PDC, which enforces the limit with a default 1-second polling interval. In principle this isn't a limited shared resource that jobs need to fight over, but in practice it is the best cross-platform, enforceable estimate of usage of a shared resource.
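To illustrate what I mean (just a sketch, not the actual shepherd code): under the old scheme, a process that asks for more address space than its rlimit allows gets a clean, catchable failure at allocation time. Something along these lines, with an arbitrary 256MB cap:

/* Sketch only, not the real shepherd code: cap this process's virtual
 * address space at 256MB, then try to malloc 1GB.  The malloc fails
 * immediately with a clean error, rather than the job dying later. */
#include <sys/resource.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    struct rlimit rl = { .rlim_cur = 256UL << 20,   /* 256MB soft limit */
                         .rlim_max = 256UL << 20 }; /* 256MB hard limit */
    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }

    void *p = malloc(1UL << 30);        /* ask for 1GB of address space */
    if (p == NULL) {
        /* Predictable, catchable failure at allocation time. */
        fprintf(stderr, "malloc: %s\n", strerror(errno));
        return 2;
    }
    free(p);
    return 0;
}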

memsw does its best to account for total actual memory and swap usage, as reported by the kernel. Anything that's shared between processes (e.g. shared libraries) gets accounted against the cgroup that originally brought it into memory. This is a limited shared resource that jobs need to fight over.
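For comparison, here's roughly how I imagine a memsw cap gets applied - purely a hypothetical sketch assuming the cgroup v1 memory controller mounted at /sys/fs/cgroup/memory and a made-up job cgroup path; I haven't seen your actual code:

/* Hypothetical sketch only: apply a RAM+swap cap to a job's cgroup via
 * the cgroup v1 memory controller.  Assumes the controller is mounted
 * at /sys/fs/cgroup/memory and that the job cgroup (a made-up path
 * here) already exists; this is not Grid Scheduler's actual code. */
#include <stdio.h>

static int write_limit(const char *path, unsigned long long bytes)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "%llu\n", bytes);
    return fclose(f);
}

int main(void)
{
    const unsigned long long gb = 1ULL << 30;

    /* memory.limit_in_bytes has to be set first: the kernel refuses a
     * memsw limit that is lower than the plain memory limit. */
    write_limit("/sys/fs/cgroup/memory/sge/job.1/memory.limit_in_bytes",
                2 * gb);
    write_limit("/sys/fs/cgroup/memory/sge/job.1/memory.memsw.limit_in_bytes",
                2 * gb);
    return 0;
}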

Different.


...
2) It's removing functionality. The old behaviour allows you to ensure that
a job will fail if it mallocs something that's too big. Either it runs or it
doesn't - and it provides a decent error code you can handle, rather than just
dying. That could be important to some people. In contrast, the new behaviour
will permit the malloc, but then stop the job at some less predictable point in the
future, when it uses more than the permitted amount of memory.

When you allocate memory, the kernel should back the allocation with
something in the non-overcommitting case (not the default). If the
kernel is overcommitting the VM and the system runs out of memory
pages, the Linux OOM killer picks a process and kills it.

So even if malloc returns a non-NULL pointer, the kernel can still
kill the process due to VM overcommit.

Absolutely, the kernel is perfectly able to enforce either case. I'm just worried about how this appears to userspace.
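To spell out the userspace view I'm worried about (again, only a sketch): the allocation itself appears to succeed, and the failure only turns up later, while the pages are being touched, when the memsw limit or the OOM killer bites - with no error code for the program to handle:

/* Sketch of how the new behaviour appears to userspace when the kernel
 * overcommits: the allocation "succeeds", and any failure arrives later,
 * while the pages are being touched - typically as a SIGKILL from the
 * OOM killer or the memsw limit, with no error code to handle. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t sz = (size_t)8 << 30;        /* 8GB: likely to be overcommitted */
    char *p = malloc(sz);
    if (p == NULL) {                    /* may still fail, e.g. if overcommit
                                           is disabled or the request is
                                           wildly bigger than RAM+swap */
        perror("malloc");
        return 1;
    }
    memset(p, 0xff, sz);                /* faulting the pages in is what
                                           actually consumes memory */
    puts("survived");
    free(p);
    return 0;
}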

On a cluster with a mixed workload, it may or may not be appropriate to configure the kernel so that it doesn't overcommit memory. Being able to set the virtual memory rlimit means you can do this on a per-job basis. More importantly, it means you can do it without having to get an admin involved (which I think is everyone's best-case scenario).


...
In our cgroups code, we are setting the Unix rlimit as well - I have
seen code that queries the current process limit and acts slightly
differently based on the limit. So maybe it is a matter of giving some
small headroom to handle the job shells, etc.?

For a normal single-process job, the memsw usage should be very close
to the h_vmem of each individual process.

If you'll allow me to be frank, this sounds like the worst of both worlds.

memsw usage and virtual address space usage can diverge wildly under fairly common use cases:

* Processes running in 64-bit mode. Comparing the VIRT column in "top" with the "RES", "SHR" and "SWAP" columns shows that an ordinary process can easily be out by ~100MB. This gets worse as more shared libraries are loaded, and it matters in a world where cores per CPU are increasing and so memory per core is under pressure to drop.

* Processes that mmap files (the sketch after this list illustrates this case).

* System V IPC shared memory.

I'm sure that clever people can point out more.
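For instance, a quick sketch of the mmap case (an untouched anonymous mapping stands in for a big file mapping here): the virtual address space that h_vmem sees balloons by 4GB, while the resident usage that memsw would account barely moves. Compare VmSize and VmRSS in /proc/self/status:

/* Sketch of the mmap case: a 4GB mapping inflates the virtual address
 * space (what h_vmem/RLIMIT_AS sees) while resident memory (closer to
 * what memsw accounts) barely moves.  Compare VmSize and VmRSS before
 * and after the mapping. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void show(const char *tag)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return;
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "VmSize", 6) == 0 || strncmp(line, "VmRSS", 5) == 0)
            printf("%s %s", tag, line);
    fclose(f);
}

int main(void)
{
    show("before:");
    /* A 4GB anonymous mapping, never touched: no physical pages used. */
    void *p = mmap(NULL, (size_t)4 << 30, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    show("after: ");
    munmap(p, (size_t)4 << 30);
    return 0;
}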


When there's a better way to limit usage of shared resources, we've got no business enforcing a virtual address space limit on all processes in all cases, as virtual address space isn't the shared resource that jobs have to fight over.

I absolutely agree that sometimes it is appropriate and there should be a mechanism to do so, but I think it should be a separate knob for users and admins to tweak. Otherwise, you're conflating two distinct concepts that do different things.


This feature has great promise in getting our clusters to make far better use of our RAM and CPUs. I'm not saying there aren't other causes as well, but I've often looked at clusters and seen CPUs sitting idle because extra h_vmem has been requested purely to get more virtual address space. As the admin of a cluster with around 500 users, I find it scary to increase the maximum h_vmem much beyond the amount of RAM that exists, because the proportion by which h_vmem differs from memsw can vary wildly from job to job.

The upshot is that I generally sigh when I look at cluster ganglia graphs.


...
In my view, using a new set of attributes (I don't care what they're
called), rather than overloading old ones, avoids all of these issues.

Agreed. But we don't know what the new resource limits should be
called, or what their precise definitions should be.

It may not be defined by POSIX, but we're free to make up a pragmatic working definition :)

It could be:

Actual memory usage (typically RAM+swap), as measured by the available operating-system-specific mechanism.

With this definition, we are free to use whatever mechanism is available. If a better one appears, we can migrate to it.
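On Linux, for example, one such mechanism (just a sketch of the idea, not a proposal for the implementation) is summing VmRSS and VmSwap from /proc/<pid>/status; the cgroup memsw usage counter would obviously be another, and better, source:

/* One possible Linux mechanism for "actual memory usage": sum VmRSS and
 * VmSwap from /proc/<pid>/status.  Just a sketch of the idea - the
 * cgroup memsw usage counter would be another (and better) source. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Returns RAM+swap usage of a process in kB, or -1 on error. */
static long rss_plus_swap_kb(int pid)
{
    char path[64], line[256];
    long rss = -1, swap = 0;
    FILE *f;

    snprintf(path, sizeof path, "/proc/%d/status", pid);
    if ((f = fopen(path, "r")) == NULL)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0)
            sscanf(line + 6, "%ld", &rss);
        else if (strncmp(line, "VmSwap:", 7) == 0)
            sscanf(line + 7, "%ld", &swap);
    }
    fclose(f);
    return (rss < 0) ? -1 : rss + swap;
}

int main(void)
{
    printf("self: %ld kB\n", rss_plus_swap_kb(getpid()));
    return 0;
}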

The name we call it is a detail.


...
You clearly have a more complete and advanced implementation than what I've
done or was intending to do. You therefore have priority, and I presumably
have little to offer you (believe it or not, this is great news for me!)

Yup, we have been planning for the cgroups integration and
architecting the structure since 2009 :)

Don't get me wrong, I'm extremely pleased that you're doing all this, but I wonder why the user interface hasn't been publicly discussed until now.

I know that your primary concern is to satisfy your clients, and that's good and proper, but the great thing about this list is that there's generally someone on it who has already hit the next issue you'll have to deal with.


That's why we left out the SSH integration code and we only checked in
a workaround for the 4GB memory accounting fix - we did those
deliberately.

Out of curiosity, why do people still want the SSH integration code?


...
Don't worry about it - I guess we did not advertise our business model
as we were just system integrators and consultants.

And I thought everyone here knew us pretty well - even when Grid
Engine was under Sun's control, we were contributing code and
suggestions, answering questions on the list, and even reviewing
Sun's design docs, code, and feature changes.

Oh, I'm aware of you chaps :)

I've seen lots of great stuff and it's appreciated - I just have a terrible memory for keeping track of who is using which business model. I guess it comes from working for a university ;)


...
We are getting the word out, and besides the Gompute User Group
Meeting, in other conferences people mention Grid Scheduler as the
open source Grid Engine in their talks. There are other ways to make $
but since Oracle gave us the maintainership to maintain open source
Grid Engine we have the obligation to maintain it in an *Open Source*
way.
...

I'm very reluctant to end on this, but here goes; that paragraph should have at least one of:

  s/as the$/as an/

Or:

  s/^open source/commercial open source/

Otherwise, regardless of the other details surrounding the situation, it could be interpreted as being discourteous.

Sincere best wishes,

Mark
--
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.di...@leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------