On Thu, May 24, 2012 at 11:57 AM, Mark Dixon <m.c.di...@leeds.ac.uk> wrote:
> My concern about this mainly centres around the fact that "virtual memory
> size" already has a very specific meaning. It does not mean RAM usage + swap
> and so isn't the same as memsw.
>
> This does impact on things:
>
> 1) It's not a "drop in" replacement. If upgrading gridengine on an existing
> system, activating your cgroup code will cause an immediate change in
> behaviour of jobs, without the user altering their submission flags. People
> don't tend to like that sort of thing.

Can you let me know (based on your understanding of memsw) what the
difference is?


> 2) It's removing functionality. The old behaviour allows you to ensure that
> a job will fail if it mallocs something that's too big. Either it runs or it
> doesn't - and provides a decent error code you can handle rather than just
> die. That could be important to some people. In contrast, the new behaviour
> will permit the malloc, but then stop at some less predictable point in the
> future when you use more than the permitted amount of the memory.

When you allocate memory, the kernel only guarantees that the
allocation can be backed by something in the non-overcommitting case
(which is not the default). If the kernel is overcommitting the VM
and the system runs out of memory pages, the Linux OOM killer picks a
process and kills it.

So even if malloc() returns a non-NULL pointer, the kernel can still
kill the process later because of VM overcommit.
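
A minimal sketch of what I mean (the 64 GB figure is just a
hypothetical size assumed to exceed RAM + swap; the actual behaviour
depends on /proc/sys/vm/overcommit_memory):

/* overcommit_demo.c - illustrative only.
 * With the default heuristic overcommit (vm.overcommit_memory = 0 or 1),
 * the malloc() below can succeed even though the system cannot back it;
 * the process may later be killed by the OOM killer once the pages are
 * actually touched.  With vm.overcommit_memory = 2 (no overcommit),
 * malloc() fails up front and the error can be handled cleanly.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t sz = (size_t)64 * 1024 * 1024 * 1024;  /* 64 GB, assumed > RAM + swap */
    char *p = malloc(sz);

    if (p == NULL) {              /* guaranteed to trigger only without overcommit */
        fprintf(stderr, "malloc failed - handle the error and exit cleanly\n");
        return 1;
    }

    memset(p, 1, sz);             /* touching the pages is what can invoke the OOM killer */
    free(p);
    return 0;
}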


> 3) We've all got lots of users already using the old code. No matter what we
> say to them, when presented with a new system most will ignore any
> documentation written by admins and just copy their old job scripts across
> and keep using them. Since the memory usage as measured by the cgroup PDC is
> likely to be much lower (but more accurate) than that by the traditional
> PDC, we've not forced users to read the documentation and reassess their
> memory needs (e.g. a JSV rejecting the use of h_vmem unless you've also have
> a "-l yes_I_really_mean_h_vmem"). So we don't get an immediate big
> improvement in throughput.

Not trying to argue here - in some cases the traditional PDC actually
over-counts usage, i.e. when a process forks, the shared pages are not
accounted for properly.
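
As an illustration (made-up numbers, not a measurement): if a 2 GB
process forks four copies of itself, a PDC that sums per-process
virtual sizes reports roughly 10 GB, even though the copy-on-write
pages are shared and the real memory charge stays close to 2 GB; the
cgroup memory controller charges each shared page only once.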

In our cgroups code, we are setting the Unix rlimit as well, since I
have seen code that queries the current process limit and acts
slightly differently based on it. So maybe it is a matter of giving
some small headroom to handle the job shells, etc.?
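
Something along these lines is what I have in mind (the 5% headroom
and the function name are made up for the sketch, not copied from our
actual code):

/* Sketch only: set the per-process address-space rlimit slightly above the
 * requested job limit so that job shells and wrapper scripts have a little
 * headroom, while the cgroup memsw limit enforces the real bound for the
 * whole job.
 */
#include <stdio.h>
#include <sys/resource.h>

static int set_vmem_rlimit(rlim_t job_limit_bytes)
{
    struct rlimit rl;
    rlim_t headroom = job_limit_bytes / 20;   /* ~5% extra, purely illustrative */

    rl.rlim_cur = job_limit_bytes + headroom;
    rl.rlim_max = job_limit_bytes + headroom;

    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit(RLIMIT_AS)");
        return -1;
    }
    return 0;
}

int main(void)
{
    /* e.g. a 4 GB h_vmem request */
    return set_vmem_rlimit((rlim_t)4 * 1024 * 1024 * 1024) ? 1 : 0;
}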

For a normal single-process job, the memsw usage should be very close
to the h_vmem usage of that one process.
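
For anyone who wants to compare the two numbers themselves, this is
roughly how the memsw figure can be read under a cgroup-v1 memory
hierarchy (the mount point and the "sge/job_42" group name are
assumptions, and the memory.memsw.* files only exist when the kernel
has swap accounting enabled):

/* Sketch: read the combined RAM+swap usage of a (hypothetical) per-job cgroup. */
#include <stdio.h>

int main(void)
{
    const char *path =
        "/sys/fs/cgroup/memory/sge/job_42/memory.memsw.usage_in_bytes";
    unsigned long long usage = 0;
    FILE *f = fopen(path, "r");

    if (f == NULL || fscanf(f, "%llu", &usage) != 1) {
        perror(path);
        if (f != NULL)
            fclose(f);
        return 1;
    }
    fclose(f);

    printf("job memsw usage: %llu bytes\n", usage);
    return 0;
}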


> Personally, my concern is mainly centered around (3).
>
> In my view, using a new set of attributes (I don't care what they're
> called), rather than overloading old ones, avoids all of these issues.

Agreed. But we don't yet know what the new resource limits should be
called, or what their precise definitions should be.



>> http://www.scalablelogic.com/scalable-grid-engine-support
>
>
> That is fantastic news (sorry, I keep losing track of people's commercial
> models): my sincere thanks for this.

Thanks :-D


>> Let me start another thread to follow up with this specific topic.
>
> Sure :)

We have ISVs using our code, and an EC2 SaaS provider indirectly
using it, so it is not just us: other companies depend on Open Grid
Scheduler, and we need to be IP clean.

More on that later... (Today is a busy day)


> You clearly have a more complete and advanced implementation than what I've
> done and were intending to do. You therefore have priority and I presumably
> have little to offer you (believe it or not, this is great news for me!)

Yup, we have been planning for the cgroups integration and
architecting the structure since 2009 :)

That's why we left out the SSH integration code and only checked in a
workaround for the 4GB memory accounting issue - we did both of those
deliberately.

Note that the 4GB workaround would break when a single process uses
8,388,608 TB of memory - but by the time a single process consumes
that much, it will likely be well after the year 2050 even if Moore's
law were to continue. (IIRC, silicon-based VLSI circuits will not get
the Moore's-law free ride beyond 2020, and even if we move to other
materials the laws of physics will still stop this sort of scaling -
and I am not sure we will be running the same code when we upgrade to
a quantum computer or a smart-phone.) So we have a real deadline to
switch to cgroups - i.e. around or before the year 2100, I believe...
I am too lazy to do the math :)
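
(For what it's worth, and assuming the 8,388,608 TB figure comes from
a signed 64-bit byte counter with "TB" meaning 2^40 bytes: 8,388,608
TB = 2^23 x 2^40 bytes = 2^63 bytes, which is exactly where such a
counter overflows.)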


> At this stage about this specific feature, I'm hoping there can be a
> discussion about how it can be best presented to the end user (once I was
> sure I was capable of doing a cgroup feature, my next port of call was going
> to be this list to start that conversation).

That's very important - feedback on the user interface and so on is
really useful, including William's GID suggestion.


> PS Forgot to say in my previous emails (where are my manners?) -

Don't worry about it - I guess we did not advertise our business model
as we were just system integrators and consultants.

And I thought everyone here knew us pretty well - even when Grid
Engine was under Sun's control, we were contributing code and
suggestions, answering questions on the list, and even reviewing
Sun's design docs, code, and feature changes.

In terms of code enhancements, besides the AIX and HP-UX PDCs (and
the early-stage Darwin PDC), we also added the poll(2) code to Grid
Engine so that a single Grid Engine cluster can support over 1,000
nodes (before our contribution, one needed to hack a system include
file when compiling Grid Engine).
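
For those curious about the detail, this is the contrast I mean:
select() works on fd_set bitmaps sized at compile time by FD_SETSIZE
(commonly 1024), so watching more descriptors meant redefining
FD_SETSIZE in the system headers, while poll() takes a caller-sized
array (the 4096 below is just an arbitrary example, not our actual
sizing):

#include <poll.h>
#include <stdio.h>
#include <sys/select.h>   /* pulled in only to print FD_SETSIZE for the contrast */

#define NFDS 4096          /* well above FD_SETSIZE, no header editing required */

int main(void)
{
    static struct pollfd fds[NFDS];  /* real code would fill one slot per connection */

    fds[0].fd = 0;                   /* watch stdin as a trivial example */
    fds[0].events = POLLIN;

    printf("select() is capped at FD_SETSIZE = %d descriptors; "
           "this pollfd array has room for %d.\n", FD_SETSIZE, NFDS);

    /* Poll only the slots actually in use (one here), with a 0 ms timeout. */
    return poll(fds, 1, 0) < 0;
}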

Also, Grid Engine has dynamic spooling by default because we asked
Sun to enable support for it. Sun's original decision was to switch
to BerkeleyDB spooling, which meant that when a shadow master was
needed, the spooling directory had to be on NFSv4. (The dynamic
spooling code was written by Sun, but the decision was not to ship it
in the pre-compiled binaries.) Luckily Andy (who is still at Oracle)
listened - I could not count the number of sites using Sun's
pre-compiled binaries with classic spooling.

(We basically have a track record of contributing to Grid Engine even
when we did not get any $ out of it.)


We know how useful Grid Engine is, and that is why we keep *Open
Source* Grid Engine alive & active.


> congratulations to all on the imminent release and the development work that
> has gone into it, and thanks again: it's very pleasing to see this sort of
> thing done under an open source model.

We are getting the word out: besides the Gompute User Group Meeting,
people at other conferences mention Grid Scheduler as the open source
Grid Engine in their talks. There are other ways to make $, but since
Oracle gave us the maintainership of open source Grid Engine, we have
an obligation to maintain it in an *Open Source* way.

Rayson



>
>
> --
> -----------------------------------------------------------------
> Mark Dixon                       Email    : m.c.di...@leeds.ac.uk
> HPC/Grid Systems Support         Tel (int): 35429
> Information Systems Services     Tel (ext): +44(0)113 343 5429
> University of Leeds, LS2 9JT, UK
> -----------------------------------------------------------------
