On Thu, 24 May 2012, Rayson Ho wrote:
...
1) It's not a "drop in" replacement. If you upgrade gridengine on an existing system, activating your cgroup code will cause an immediate change in the behaviour of jobs, without the user altering their submission flags. People don't tend to like that sort of thing.

Can you let me know (based on your understanding of memsw) what the difference is?

The existing h_vmem implementation uses the setrlimit definition of virtual memory. It restricts the amount of virtual address space used by a job: each process individually via setrlimit, and all processes together via the PDC, which enforces the limit with a default 1-second polling interval. In principle this isn't a limited shared resource that jobs need to fight over, but in practice it is the best cross-platform, enforceable estimate of usage of a shared resource.
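To illustrate what I mean (just a sketch, not the actual shepherd code): under the old scheme, a process that asks for more address space than its rlimit allows gets a clean, catchable failure at allocation time. Something along these lines, with an arbitrary 256MB cap:

/* Sketch only, not the real shepherd code: cap this process's virtual
 * address space at 256MB, then try to malloc 1GB.  The malloc fails
 * immediately with a clean error, rather than the job dying later. */
#include <sys/resource.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    struct rlimit rl = { .rlim_cur = 256UL << 20,   /* 256MB soft limit */
                         .rlim_max = 256UL << 20 }; /* 256MB hard limit */
    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }

    void *p = malloc(1UL << 30);        /* ask for 1GB of address space */
    if (p == NULL) {
        /* Predictable, catchable failure at allocation time. */
        fprintf(stderr, "malloc: %s\n", strerror(errno));
        return 2;
    }
    free(p);
    return 0;
}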

memsw does its best to account for total actual memory and swap usage, as reported by the kernel. Anything that's shared between processes (e.g. shared libraries) gets accounted against the cgroup that originally brought it into memory. This is a limited shared resource that jobs need to fight over.
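For comparison, here's roughly how I imagine a memsw cap gets applied - purely a hypothetical sketch assuming the cgroup v1 memory controller mounted at /sys/fs/cgroup/memory and a made-up job cgroup path; I haven't seen your actual code:

/* Hypothetical sketch only: apply a RAM+swap cap to a job's cgroup via
 * the cgroup v1 memory controller.  Assumes the controller is mounted
 * at /sys/fs/cgroup/memory and that the job cgroup (a made-up path
 * here) already exists; this is not Grid Scheduler's actual code. */
#include <stdio.h>

static int write_limit(const char *path, unsigned long long bytes)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "%llu\n", bytes);
    return fclose(f);
}

int main(void)
{
    const unsigned long long gb = 1ULL << 30;

    /* memory.limit_in_bytes has to be set first: the kernel refuses a
     * memsw limit that is lower than the plain memory limit. */
    write_limit("/sys/fs/cgroup/memory/sge/job.1/memory.limit_in_bytes",
                2 * gb);
    write_limit("/sys/fs/cgroup/memory/sge/job.1/memory.memsw.limit_in_bytes",
                2 * gb);
    return 0;
}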

Different.


...
2) It's removing functionality. The old behaviour allows you to ensure that
a job will fail if it mallocs something that's too big. Either it runs or it
doesn't - and it provides a decent error code you can handle, rather than just
dying. That could be important to some people. In contrast, the new behaviour
will permit the malloc, but then stop the job at some less predictable point in the
future, when it uses more than the permitted amount of memory.

When you allocate memory, the kernel should back the allocation with
something in the non-overcommitting case (not the default). If the
kernel is overcommitting the VM and the system runs out of memory
pages, the Linux OOM killer picks a process and kills it.

So even if malloc returns a non-NULL pointer, the kernel can still
kill the process due to VM overcommit.

Absolutely, the kernel is perfectly able to enforce either case. I'm just worried about how this appears to userspace.
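To spell out the userspace view I'm worried about (again, only a sketch): the allocation itself appears to succeed, and the failure only turns up later, while the pages are being touched, when the memsw limit or the OOM killer bites - with no error code for the program to handle:

/* Sketch of how the new behaviour appears to userspace when the kernel
 * overcommits: the allocation "succeeds", and any failure arrives later,
 * while the pages are being touched - typically as a SIGKILL from the
 * OOM killer or the memsw limit, with no error code to handle. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t sz = (size_t)8 << 30;        /* 8GB: likely to be overcommitted */
    char *p = malloc(sz);
    if (p == NULL) {                    /* may still fail, e.g. if overcommit
                                           is disabled or the request is
                                           wildly bigger than RAM+swap */
        perror("malloc");
        return 1;
    }
    memset(p, 0xff, sz);                /* faulting the pages in is what
                                           actually consumes memory */
    puts("survived");
    free(p);
    return 0;
}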

On a cluster with a mixed workload, it may or may not be appropriate to configure the kernel so that it doesn't overcommit memory. Being able to set the virtual memory rlimit means you can do this on a per-job basis. More importantly, it means you can do it without having to get an admin involved (which I think is everyone's best-case scenario).


...
In our cgroups code, we are setting the Unix rlimit as well - I have
seen code that queries the current process limit and acts slightly
differently based on the limit. So maybe it is a matter of giving some
small headroom to handle the job shells, etc.?

For a normal single-process job, the memsw usage should be very close
to the h_vmem of each individual process.

If you'll allow me to be frank, this sounds like the worst of both worlds.

memsw usage and virtual address space usage can diverge wildly under fairly common use cases:

* Processes running in 64-bit mode. Comparing the VIRT column in "top" with the "RES", "SHR" and "SWAP" columns shows that an ordinary process can easily be out by ~100MB. This gets worse as more shared libraries are loaded, and it matters in a world where cores per CPU are increasing and so memory per core is under pressure to drop.

* Processes that mmap files (the sketch after this list illustrates this case).

* System V IPC shared memory.

I'm sure that clever people can point out more.
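For instance, a quick sketch of the mmap case (an untouched anonymous mapping stands in for a big file mapping here): the virtual address space that h_vmem sees balloons by 4GB, while the resident usage that memsw would account barely moves. Compare VmSize and VmRSS in /proc/self/status:

/* Sketch of the mmap case: a 4GB mapping inflates the virtual address
 * space (what h_vmem/RLIMIT_AS sees) while resident memory (closer to
 * what memsw accounts) barely moves.  Compare VmSize and VmRSS before
 * and after the mapping. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void show(const char *tag)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return;
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "VmSize", 6) == 0 || strncmp(line, "VmRSS", 5) == 0)
            printf("%s %s", tag, line);
    fclose(f);
}

int main(void)
{
    show("before:");
    /* A 4GB anonymous mapping, never touched: no physical pages used. */
    void *p = mmap(NULL, (size_t)4 << 30, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    show("after: ");
    munmap(p, (size_t)4 << 30);
    return 0;
}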


When there's a better way to limit usage of shared resources, we've got no business enforcing a virtual address space limit on all processes in all cases, as virtual address space isn't the shared resource that jobs have to fight over.

I absolutely agree that sometimes it is appropriate and there should be a mechanism to do so, but I think it should be a separate knob for users and admins to tweak. Otherwise, you're conflating two distinct concepts that do different things.


This feature has great promise in getting our clusters to make far better use of our RAM and CPUs. I'm not saying there aren't other causes as well, but I've often looked at clusters and seen CPUs sitting idle because extra h_vmem has been requested purely to get more virtual address space. As the admin of a cluster with around 500 users, I find it scary to increase the maximum h_vmem much beyond the amount of RAM that exists, because the proportion by which h_vmem differs from memsw can vary wildly from job to job.

The upshot is that I generally sigh when I look at cluster ganglia graphs.


...
In my view, using a new set of attributes (I don't care what they're
called), rather than overloading old ones, avoids all of these issues.

Agreed. But we don't know what the new resource limits should be
called, or what their precise definitions should be.

It may not be defined by POSIX, but we're free to make up a pragmatic working definition :)

It could be:

Actual memory usage (typically RAM+swap), as measured by the available operating-system-specific mechanism.

With this definition, we are free to use whatever mechanism is available. If a better one appears, we can migrate to it.
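On Linux, for example, one such mechanism (just a sketch of the idea, not a proposal for the implementation) is summing VmRSS and VmSwap from /proc/<pid>/status; the cgroup memsw usage counter would obviously be another, and better, source:

/* One possible Linux mechanism for "actual memory usage": sum VmRSS and
 * VmSwap from /proc/<pid>/status.  Just a sketch of the idea - the
 * cgroup memsw usage counter would be another (and better) source. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Returns RAM+swap usage of a process in kB, or -1 on error. */
static long rss_plus_swap_kb(int pid)
{
    char path[64], line[256];
    long rss = -1, swap = 0;
    FILE *f;

    snprintf(path, sizeof path, "/proc/%d/status", pid);
    if ((f = fopen(path, "r")) == NULL)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0)
            sscanf(line + 6, "%ld", &rss);
        else if (strncmp(line, "VmSwap:", 7) == 0)
            sscanf(line + 7, "%ld", &swap);
    }
    fclose(f);
    return (rss < 0) ? -1 : rss + swap;
}

int main(void)
{
    printf("self: %ld kB\n", rss_plus_swap_kb(getpid()));
    return 0;
}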

The name we call it is a detail.


...
You clearly have a more complete and advanced implementation than what I've
done or was intending to do. You therefore have priority, and I presumably
have little to offer you (believe it or not, this is great news for me!)

Yup, we have been planning for the cgroups integration and
architecting the structure since 2009 :)

Don't get me wrong, I'm extremely pleased that you're doing all this, but I wonder why the user interface hasn't been publicly discussed until now.

I know that your primary concern is to satisfy your clients, and that's good and proper, but the great thing about this list is that there's generally someone on it who has already hit the next issue you'll have to deal with.


That's why we left out the SSH integration code and we only checked in
a workaround for the 4GB memory accounting fix - we did those
deliberately.

Out of curiosity, why do people still want the SSH integration code?


...
Don't worry about it - I guess we did not advertise our business model
as we were just system integrators and consultants.

And I thought everyone here knew us pretty well - even when Grid
Engine was under Sun's control, we were contributing code and
suggestions, answering questions on the list, and even reviewing
Sun's design docs, code, and feature changes.

Oh, I'm aware of you chaps :)

I've seen lots of great stuff and it's appreciated - I just have a terrible memory for keeping track of who is using which business model. I guess it comes from working for a university ;)


...
We are getting the word out, and besides the Gompute User Group
Meeting, in other conferences people mention Grid Scheduler as the
open source Grid Engine in their talks. There are other ways to make $
but since Oracle gave us the maintainership to maintain open source
Grid Engine we have the obligation to maintain it in an *Open Source*
way.
...

I'm very reluctant to end on this, but here goes; that paragraph should have at least one of:

  s/as the$/as an/

Or:

  s/^open source/commercial open source/

Otherwise, regardless of the other details surrounding the situation, it could be interpreted as being discourteous.

Sincere best wishes,

Mark
--
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.di...@leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------