On Fri, Jun 1, 2012 at 6:19 AM, Mark Dixon <m.c.di...@leeds.ac.uk> wrote:
> My underlying concern is that sometimes it is appropriate to set an address
> space limit and sometimes it isn't, for the reasons we both put forward
> previously in this thread. Users should therefore have some control over it.
>
> I hope we agree on this much?

Yes. It's good that we have your feedback (and William Hay's) *before* we
release this new feature!

Also, I brought it up a while ago that Grid Engine itself also kills the job
if the memory usage exceeds the limit. So the end result is likely very
similar, i.e. the user finds that the job gets killed.

> Ah-ha, that's the missing piece that's been confusing me!
>
> The options here seem to be:
>
> 1) Ask the kernel people nicely to give us a per-cgroup address space limit.
>

I don't think they will see much point in this. I am working with other
software that uses cgroups, and someone pointed me to this email yesterday:

http://www.spinics.net/lists/cgroups/msg02622.html

I don't think the per-cgroup AS limit is that hard - in the end the
"memory.memsw.usage_in_bytes" file already shows the current memory+swap
usage (and "memory.memsw.max_usage_in_bytes" the recorded maximum). I believe
the kernel does its own internal accounting when memory is allocated, so it
may be a matter of enforcing the limit in a different way.

http://www.kernel.org/doc/Documentation/cgroups/memory.txt

Of course, the difficult part is that the kernel overcommits memory, so it's
likely that memory is not actually allocated & accounted in
"memory.memsw.usage_in_bytes" until page faults occur - which is what we
discussed previously...

I was thinking about this issue again last night:

==================================

First of all, a job's "h_vmem" (as reported by Grid Engine's PDC) should
always be greater than or equal to "memory.memsw.usage_in_bytes" as reported
by the kernel. Maybe I should also add a test case for it. But can you think
of a case where:

( h_vmem >= memory.memsw.usage_in_bytes ) is FALSE?

It should work something like this:

- initially, when a job allocates memory, the h_vmem should go up
  immediately, but not the usage accounted in "memory.memsw.usage_in_bytes" -
  so in theory, if the pages are not faulted in at that time,
  "memory.memsw.usage_in_bytes" should be very close to zero.
- as more and more pages are used (thus causing page faults), the memory
  usage reported by "memory.memsw.usage_in_bytes" should get closer & closer
  to h_vmem.

- then in the extreme case, when every single page is used,
  h_vmem == memory.memsw.usage_in_bytes .

If my logic is correct, then there shouldn't be any issue with jobs getting
killed due to this change (which is more important than anything - killing
innocent jobs is like killing innocent people. While Jack Bauer kills a few
innocent bad guys in every season of 24, his first priority is always about
saving innocent people... and we should do the same as well!).

And the 2nd part is related to accounting - i.e. when the job's "real" h_vmem
is greater than the usage reported in the "memory.memsw.usage_in_bytes" file.
Would we get different behavior than with the procfs-based PDC??

IMO, if we still poll the /proc filesystem for the h_vmem (i.e. the sum of
h_vmem of all processes in a job) periodically but less frequently, then it
should not be a real issue. If a single process exceeds the h_vmem limit,
then it also exceeds the limit imposed by setrlimit(2), which is still set
even when OGS/GE is using cgroups. So with the procfs PDC or the cgroups PDC,
the process would get the same treatment from the kernel... And if the sum of
h_vmem of all processes of a job exceeds the h_vmem limit, then the periodic
procfs poll would still catch this case, and the action taken would be the
same in both cases.

I am less concerned about the 2nd part... and we should be more lenient. It
does not hurt the system if it is just the virtual size usage exceeding the
h_vmem temporarily. In the end, h_vmem is nothing but the upper bound on the
valid address space of the job, NOT physical memory pages and NOT even any
space in the swap (the Linux VM overcommits memory by default - and in the
non-overcommit case, the logic in the first part should handle it nicely).

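To make the comparison above concrete, here is a rough sketch of what such a
test case or reduced-frequency procfs poll could look like. It is NOT OGS/GE
code: the function names, the result dictionary and the idea of passing the
cgroup directory as a parameter are all assumptions for illustration. It also
assumes a cgroup v1 memory controller with swap accounting enabled (so that
"memory.memsw.usage_in_bytes" exists) and uses the VmSize field of
/proc/<pid>/status as the per-process address space figure. It only reports
whether the inequality holds; it does not presume that it always does.

```python
import os


def cgroup_pids(cgroup_dir):
    """PIDs currently in the job's cgroup (cgroup v1 'cgroup.procs' file)."""
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        return [int(tok) for tok in f.read().split()]


def job_vmem_bytes(pids, proc_root="/proc"):
    """Sum VmSize over the given PIDs, as a procfs-style poll would."""
    total = 0
    for pid in pids:
        try:
            with open(os.path.join(proc_root, str(pid), "status")) as f:
                for line in f:
                    if line.startswith("VmSize:"):
                        # The field looks like "VmSize:   2048 kB".
                        total += int(line.split()[1]) * 1024
                        break
        except OSError:
            # The process exited between listing the cgroup and reading it.
            continue
    return total


def cgroup_memsw_bytes(cgroup_dir):
    """Current memory+swap charge of the cgroup, as accounted by the kernel."""
    with open(os.path.join(cgroup_dir, "memory.memsw.usage_in_bytes")) as f:
        return int(f.read().strip())


def check_job(cgroup_dir, h_vmem_limit, proc_root="/proc"):
    """Compare the procfs view (address space) with the cgroup view (charged).

    Hypothetical helper: returns both numbers, whether
    vmem >= memsw holds, and whether the summed address space is over
    the job's h_vmem limit.
    """
    vmem = job_vmem_bytes(cgroup_pids(cgroup_dir), proc_root)
    memsw = cgroup_memsw_bytes(cgroup_dir)
    return {
        "vmem": vmem,
        "memsw": memsw,
        "invariant_holds": vmem >= memsw,
        "over_limit": vmem > h_vmem_limit,
    }
```

Calling check_job() with the job's cgroup directory (wherever the memory
controller is mounted on the execution host) and the job's h_vmem in bytes
gives one sample; a poll loop would simply repeat this at whatever interval
is chosen.
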
As long as innocent jobs don't get killed, and system performance does not
suffer due to the cgroups integration, then everyone is happy...

Not sure if I have already covered all cases... or am I still missing
something??

Rayson


>
> 2) As well as using setrlimit, enforce a per-cgroup address space limit by
> the PDC periodically polling just the processes in that cgroup. Does s_rss,
> s_stack, etc. do anything in gridengine these days - do you already have
> such a poll loop to deliver that functionality?
>
> 3) Bring the definitions of h_vmem / s_vmem into line with the likes of
> h_stack, h_rss, etc. - interpret them in terms of setrlimit only and make no
> attempt to enforce per-job limits.
>
> Even if successful, I agree that (1) sounds like a major headache. (2) gives
> the greatest backwards compatibility. If you don't already have a poll loop
> and want to avoid putting one in, (3) should be sufficient to avoid loss of
> functionality.
>
>
>
>> (May be I should have clarified the above point in my previous email -
>> but I was really busy these days, working on the GE2011.11u1 release,
>> handling outside of the mailing list user support, and talking to
>> hardware vendors, etc...)
>
>
> Thanks for continuing this conversation, I appreciate (and apologise for)
> the time you're putting into it. I've obviously not done a very good job at
> being clear and concise.
>
> All the best,
>
> Mark
> --
> -----------------------------------------------------------------
> Mark Dixon                       Email    : m.c.di...@leeds.ac.uk
> HPC/Grid Systems Support         Tel (int): 35429
> Information Systems Services     Tel (ext): +44(0)113 343 5429
> University of Leeds, LS2 9JT, UK
> -----------------------------------------------------------------
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users