Hi,

there is an ongoing discussion, initiated by Max, about whether 
Apache::SizeLimit does the right thing when it reports the amount of RAM that 
a process does not share with any other process as

  unshared_size = total_size - shared_size

Max suggests changing that definition to

  unshared_size = rss - shared_size

Apart from the fact that such a change should be announced very loudly, 
perhaps with a new major version, because it requires current installations to 
adjust their parameters, I am not sure whether it is the right way to go.

(I am talking about Linux here)

What does that mean?
====================

The total size of a process comprises its complete address space. Normally, 
only a part of this space is present in RAM. When the process accesses a part 
of its address space that is not present, the CPU generates an interrupt (a 
page fault) and the operating system reads that piece in from disk or 
allocates an empty page, thus making the accessed page present. Then the 
operation is repeated and this time it succeeds. The process is normally not 
aware of any of this.

The part of the process that is really present in RAM is the RSS (resident 
set size).

Now, Linux comes with the /proc/$PID/smaps file that reports sizes for the 
shared and private portions of the process's address space.
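
To make the numbers discussed here concrete, this is a minimal sketch of how 
the per-mapping fields of smaps could be summed up for one process. The 
function names sum_smaps and read_smaps are made up for this illustration; 
only the "Name:  value kB" line format is taken from the kernel's output.

```python
import re
from collections import Counter

# Matches value lines such as "Rss:                 128 kB".
# Mapping header lines and non-kB lines (e.g. "VmFlags: rd ex") do not match.
_FIELD = re.compile(r"^(\w+):\s+(\d+) kB$")

def sum_smaps(text):
    """Sum every kB-valued field over all mappings in smaps-formatted text."""
    totals = Counter()
    for line in text.splitlines():
        m = _FIELD.match(line.strip())
        if m:
            totals[m.group(1)] += int(m.group(2))
    return totals

def read_smaps(pid="self"):
    """Read /proc/<pid>/smaps (Linux only) and return the summed fields."""
    with open(f"/proc/{pid}/smaps") as fh:
        return sum_smaps(fh.read())
```

On a Linux box, read_smaps()["Rss"] would then give the resident size of the 
calling process in kB, and read_smaps(pid) that of another process, provided 
you are allowed to read its smaps file.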

How does that work?
===================

Linux organizes RAM and address spaces in fixed-size chunks, so-called pages 
(normally 4 kB). Now, a single page of RAM can belong to only one process or 
it can be used by multiple processes (for example, because they use the same 
read-only code of the C library). So, each page has a reference count. If that 
refcount is 1, the page is used by only one process and hence private to it. 
If the refcount is >1, the page is shared among multiple processes.

When /proc/$PID/smaps is read for a process, Linux walks all pages of the 
process and classifies them into 3 groups:

- the page is present in RAM and has a refcount==1
    ==> add it to the process total size and to the private portion

- the page is present in RAM and has a refcount>1
    ==> add it to the process total size and to the shared portion

- the page is not present in RAM
    ==> add it to the process total size

The point here is that for a page that is not present Linux cannot read the 
refcount, because that count is not present in RAM either. So, to decide 
whether such a page is shared or not, it would have to read in the page. That 
is too expensive an operation just to read the refcount.

So, while in theory a page is either used by only one process and hence 
private, or by multiple processes and hence shared, in practice we have

  total_size = private + shared + notpresent

where each notpresent page is either shared or private; we cannot know which.
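
The three-way classification can be sketched as a toy model. This is not 
kernel code: a page is represented here simply as a (present, refcount) pair, 
and PAGE_KB and classify are names invented for the example.

```python
PAGE_KB = 4  # assumed page size in kB

def classify(pages):
    """Account pages the way the three groups above describe.

    pages is an iterable of (present, refcount) pairs. For pages that are
    not present the refcount is unknown and is ignored.
    """
    sizes = {"total": 0, "private": 0, "shared": 0, "notpresent": 0}
    for present, refcount in pages:
        sizes["total"] += PAGE_KB          # every page counts towards total
        if not present:
            sizes["notpresent"] += PAGE_KB  # cannot tell shared from private
        elif refcount == 1:
            sizes["private"] += PAGE_KB
        else:
            sizes["shared"] += PAGE_KB
    return sizes
```

For example, classify([(True, 1), (True, 3), (False, None)]) accounts 12 kB 
in total, 4 kB each as private, shared and notpresent.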

How are processes created?
==========================

Under Linux a process is created by the fork() or clone() system calls. In 
theory, the operating system duplicates the complete address space of the 
process calling fork(). One copy belongs to the original process (the parent), 
the other is for the new process (the child).

But if we really had to copy the whole address space, fork() would be a really 
expensive operation. In fact, only a table holding pointers to the pages that 
comprise the parent's address space is duplicated. All pages are marked 
read-only and their reference counts are incremented.

Now, if one of the processes wants to write to a page, the CPU again generates 
an interrupt because the page is marked read-only. The operating system 
catches that interrupt, and only now is the actual page duplicated: one page 
for the writing process and one for the others. The refcount of the new page 
becomes 1, that of the old one is decremented. This pattern is called 
copy-on-write.
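
The bookkeeping behind copy-on-write can be illustrated with a small 
simulation. It is a toy model only: the Page and Process classes are invented 
for this sketch and the "interrupt" is just an if statement in write().

```python
class Page:
    """One page of simulated RAM with a reference count."""
    def __init__(self, data=b""):
        self.data = bytearray(data)
        self.refcount = 1

class Process:
    def __init__(self, pages):
        self.pages = list(pages)  # the page table: references to Page objects

    def fork(self):
        # Only the page table is copied; every page gains one more user.
        for page in self.pages:
            page.refcount += 1
        return Process(self.pages)

    def write(self, index, data):
        page = self.pages[index]
        if page.refcount > 1:
            # Shared page: duplicate it first, then write to the private copy.
            page.refcount -= 1
            page = Page(page.data)
            self.pages[index] = page
        page.data[:len(data)] = data

    def private_pages(self):
        return sum(1 for p in self.pages if p.refcount == 1)

    def shared_pages(self):
        return sum(1 for p in self.pages if p.refcount > 1)
```

Right after parent.fork() every page of the child counts as shared. A single 
child.write() turns one of them into a private page of the child and leaves 
the parent's data untouched.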

With the Apache web server we have one parent process that spawns many 
children. At first, almost all of a child's address space is shared with the 
parent due to copy-on-write. Over its lifetime the child's private address 
space grows in 2 ways:

* it allocates more memory
  ==> total_size grows, unshared grows but shared stays the same.
* it writes to portions that were initially shared with the parent
  ==> unshared grows, shared shrinks but total_size does not change.
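
Expressed as plain arithmetic on the three numbers, the two cases look like 
this. The function names and the kB figures are made up for illustration.

```python
def allocate(sizes, kb):
    """The child allocates (and touches) kb of new memory."""
    return {"total": sizes["total"] + kb,
            "shared": sizes["shared"],
            "unshared": sizes["unshared"] + kb}

def write_to_shared(sizes, kb):
    """The child writes to kb of memory so far shared with the parent."""
    kb = min(kb, sizes["shared"])  # cannot unshare more than is shared
    return {"total": sizes["total"],
            "shared": sizes["shared"] - kb,
            "unshared": sizes["unshared"] + kb}

child = {"total": 50000, "shared": 45000, "unshared": 5000}
child = allocate(child, 2000)         # total and unshared grow
child = write_to_shared(child, 3000)  # shared shrinks, unshared grows
```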

Now, what is the goal of Apache::SizeLimit?
===========================================

If the overall working set of all apache children becomes larger than the 
available RAM of the system, the operating system has to fetch code and/or 
data from disk for each request, and by doing so it has to evict pages that 
will be needed by the next request shortly after.

Apache::SizeLimit (ASL hereafter) tries to avoid this situation.

Note that there is no problem with large process sizes and heavy swap space 
usage as long as the data remains in swap and is normally not used.

ASL can monitor a few values per apache child: the total size of the process, 
the RSS portion, the portion that is reported as shared and the private part.

As for the "unshared" value above, it can be defined in one of these ways
(all values on the right-hand side are reported by /proc/$PID/smaps):

 1) unshared = size - shared

 2) unshared = rss - shared

 3) unshared = private
    this is just the same as 2) because rss = shared + private
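
As a sketch, the three definitions can be computed from per-process smaps 
totals like this. The field names Size, Rss, Shared_Clean, Shared_Dirty, 
Private_Clean and Private_Dirty are the ones smaps reports; the function name 
and the sample numbers are invented.

```python
def unshared_variants(totals):
    """Return the three candidate 'unshared' values in kB.

    totals maps smaps field names to their sums over all mappings.
    """
    shared = totals["Shared_Clean"] + totals["Shared_Dirty"]
    private = totals["Private_Clean"] + totals["Private_Dirty"]
    return {
        "size - shared": totals["Size"] - shared,  # definition 1
        "rss - shared": totals["Rss"] - shared,    # definition 2
        "private": private,                        # definition 3
    }

example = {"Size": 100000, "Rss": 60000,
           "Shared_Clean": 40000, "Shared_Dirty": 5000,
           "Private_Clean": 3000, "Private_Dirty": 12000}
variants = unshared_variants(example)
```

With these made-up numbers definition 1 yields 55000 kB while 2) and 3) both 
yield 15000 kB. The difference of 40000 kB is exactly the part of the address 
space that is not present.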

What are the implications?
==========================

1) Since size = shared + private + notpresent, unshared becomes

  unshared = private + notpresent

Of notpresent we know nothing for sure. It is a mix of shared and private 
pages.

Now, if an administrator turns off swapping (swapoff /dev/...), the part of 
notpresent that was swapped out becomes present. So, we can expect shared to 
grow considerably. unshared will shrink by the same amount.

As for absolute values, unshared in this case is quite a large number because 
it includes notpresent.

2) Here unshared lacks the notpresent part. So the actual number is much 
smaller than it would be in case 1.

But if an administrator turns off swapping now, a part of notpresent will be 
added to unshared. unshared may suddenly jump over the limit in all apache 
children.

Well, an administrator doing that on a busy web server should be converted 
into a httpd ...
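
To illustrate how the two definitions react when swapped-out pages become 
present again, here is a back-of-the-envelope sketch. All numbers are 
invented, and the split of the swapped-in pages into private and shared is an 
assumption that cannot be known in advance.

```python
def definition_1(s):
    """unshared = size - shared"""
    size = s["private"] + s["shared"] + s["notpresent"]
    return size - s["shared"]

def definition_2(s):
    """unshared = rss - shared"""
    rss = s["private"] + s["shared"]
    return rss - s["shared"]

def swap_in(s, private_kb, shared_kb):
    """Pages come back from swap and are now classified properly."""
    return {"private": s["private"] + private_kb,
            "shared": s["shared"] + shared_kb,
            "notpresent": s["notpresent"] - private_kb - shared_kb}

before = {"private": 15000, "shared": 45000, "notpresent": 40000}
after = swap_in(before, private_kb=10000, shared_kb=20000)
```

Here definition 1 drops from 55000 kB to 35000 kB because 20000 kB turned out 
to be shared, while definition 2 rises from 15000 kB to 25000 kB because 
10000 kB of private pages became present.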

So, I am quite undecided what to do.

Please comment!

See also
  http://foertsch.name/ModPerl-Tricks/Measuring-memory-consumption/index.shtml


What else could be done to hit ASL's goal?
==========================================

There are a few other status fields that could possibly be used:

- "Swap" in /proc/$PID/smaps
  I don't know for sure what it means, but it sounds good. I need to inspect
  the kernel code.

- "Referenced" in /proc/$PID/smaps
  can be used to find out how much RAM a process has accessed since the last
  reset of the counter. We could reset it in a PerlPostReadRequestHandler and
  read it in a $r->pool cleanup.

- "Pss" in /proc/$PID/smaps
  the proportional set size: each page's size divided by its refcount, that
  is, by the number of processes sharing it

- "VmSwap" in /proc/$PID/status
  for example: terminate if the process starts to use swap space

Certainly more.
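
As an example of the last idea, this is a rough sketch of how the VmSwap check 
might look. It is not part of ASL: the helper names status_kb and uses_swap 
are hypothetical, and the field is only available on kernels that report 
VmSwap in /proc/$PID/status.

```python
def status_kb(text, field):
    """Return the kB value of one field in /proc/<pid>/status text, or None."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name == field:
            return int(rest.split()[0])
    return None

def uses_swap(pid="self"):
    """True if the process currently has pages in swap (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        swapped = status_kb(fh.read(), "VmSwap")
    # Treat a missing field as "no swap in use".
    return bool(swapped)
```

A child could call uses_swap() in a cleanup handler after each request and 
exit once it returns true.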

Torsten Förtsch

-- 
Need professional modperl support? Hire me! (http://foertsch.name)

Like fantasy? http://kabatinte.net
