On 09/07/2015 19:36, Steve Litt wrote:
> I know what you mean. In the past 9 months I've seen a huge uptick in ambiguity in emails, to the point where many times you don't know who said what, and it looks like the person is arguing with himself, with temporal dislocations thrown in as people top-post with words like "it" instead of exactly what they mean, or "I agree" in a thread with twelve different assertions.
Blame the tool designers. Most users read far more than they write, so tools are optimized for reading, and not much work goes into UIs for writing. Users are lazy - that's nothing new - and simply don't put in the effort to properly format what they write; a good UI should make it easier for them, or even do it in their stead. Unfortunately, such UIs are few and far between.

GMail is a prime example of this sad state of the art. The GMail web UI is optimized for "conversation reading", i.e. it displays all the mails in a thread at once. But the way GMail achieves that is by automatically quoting the *whole* conversation when you reply, and forcing you to top-post, so the UI can hide the quoted part below your answer. This is great for readers who use the GMail web interface. And it is absolutely horrible for people who don't.

My own lists only accept plain text - I consider that if you want to communicate via a mailing-list, you should be able to handle plain text; if you want HTML, go to a web forum. But obviously, not everyone agrees.
> By the way, I have no personal knowledge of how many actor sockets a listener socket can spawn off, but if I had to guess, I'd imagine 50 would be way too low a number, if for no other reason than none of my current and former ISPs would have been able to serve httpd to the masses if 50 was the limit.
If you're interested in the "how many simultaneous clients can I handle?" question, a fundamental reference page is http://www.kegel.com/c10k.html . It was essentially written between 1999 and 2003, but parts of it have been maintained since, and most of it is still pretty accurate; the underlying APIs and algorithms have not changed that much. TL;DR: if you use the proper APIs, you can serve on the order of 10,000 simultaneous clients from a single listening socket, i.e. in one event-driven process. And that was already true in 1999.

That is for heavy network servers. For services where you don't expect 10k clients, you can use the fork/exec model just fine - that's what inetd and tcpserver do, and it works pretty well. I expect you could serve several hundred clients without a problem, and in certain cases you could probably reach one or two thousand before experiencing noticeable slowdowns. (Rough sketches of both models follow below.) The first problem you'll encounter when doing that will probably be the amount of resources, especially RAM, needed to keep several hundred concurrent servers running: most servers are not designed to be especially thrifty with RAM, and if every instance uses a few megabytes of private data, you're looking at a few gigabytes of RAM just to serve 1,000 clients.

Now, the original point was: what is the maximum number of processes you can run on a system? Well, for all practical intents and purposes, the answer really is "as many as you want". As I usually put it: processes are not a scarce resource. Let me repeat for emphasis: *processes are not a scarce resource.*

I don't know what the scheduler algorithm was before Linux 2.6, but the Linux 2.6 scheduler was O(1), meaning it scheduled your processes in constant time, no matter how many you had. How awesome is that? It was later replaced (by CFS, around 2.6.23), and the current scheduler is O(log n), which is still incredibly good: unless you have billions of billions of processes, you are not going to noticeably slow down the scheduler. Fact is, you're going to fill up the process table way before having scheduler trouble.

Go ahead and make your fork bomb. You *will* notice a system slowdown, but that's because all the processes in your fork bomb are perpetually runnable: you're hogging the CPU with a potential infinity of runnable processes, and nothing else gets a timeslice. You will also see, immediately, that your shell becomes unable to fork other commands - your fork bomb has filled up the process table. But the system is still running, as best it can with all CPUs pegged at 100% and a full process table.

Historically, pid_t was 16 bits, and 32k processes won't kill your scheduler. Nowadays, pid_t is 32 bits, and although there definitely are limits that prevent you from having 2G processes, the sheer number of processes isn't one of them. On a typical machine, the constrained resources are RAM and CPU; those are the ones you'll run out of first, and a process will consume more of one or the other depending on what it does and how it is used.

A Linux process takes some kernel memory (I'm not sure exactly how much - probably 8k or 12k) and about 16k of userspace memory, plus whatever the process itself uses. Virtual memory makes it hard to tell exactly how much is used, so let's say the absolute minimal amount of real memory used by a process is 64k total - a very, very generous overestimate.
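To make the two serving models mentioned above concrete, here is a minimal sketch (mine, not something taken from the c10k page) of the event-driven approach, using Linux epoll. The port number (8000), the backlog and the buffer size are arbitrary choices, and error handling is pared down to almost nothing; it illustrates the technique, it is not production code.

/* Single-process, event-driven echo server: one listening socket,
 * many simultaneous clients. Port 8000 and the 4096-byte buffer
 * are arbitrary. */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <stdio.h>

int main (void)
{
  int lfd = socket(AF_INET, SOCK_STREAM, 0) ;
  struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(8000),
                            .sin_addr = { .s_addr = htonl(INADDR_ANY) } } ;
  int one = 1 ;
  setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one) ;
  if (bind(lfd, (struct sockaddr *)&sa, sizeof sa) < 0 || listen(lfd, 128) < 0)
  { perror("bind/listen") ; return 1 ; }

  int epfd = epoll_create1(0) ;
  struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = lfd } } ;
  epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev) ;

  for (;;)
  {
    struct epoll_event e[64] ;
    int n = epoll_wait(epfd, e, 64, -1) ;
    for (int i = 0 ; i < n ; i++)
    {
      if (e[i].data.fd == lfd)
      {
        /* new client: register it; that is all a new connection costs */
        int cfd = accept(lfd, 0, 0) ;
        struct epoll_event cev = { .events = EPOLLIN, .data = { .fd = cfd } } ;
        epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &cev) ;
      }
      else
      {
        /* data from an existing client: echo it back, drop it on EOF */
        char buf[4096] ;
        ssize_t r = read(e[i].data.fd, buf, sizeof buf) ;
        if (r <= 0) { epoll_ctl(epfd, EPOLL_CTL_DEL, e[i].data.fd, 0) ; close(e[i].data.fd) ; }
        else write(e[i].data.fd, buf, r) ;
      }
    }
  }
}

The point is that an extra client only costs a file descriptor and an epoll registration, which is why a single process can juggle thousands of them.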
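And here is the skeleton of the fork-per-connection model that inetd and tcpserver embody - again just an illustrative sketch of the general shape, not their actual code. Same arbitrary port 8000; the child here simply echoes, whereas a real super-server would execve() the actual service with the connection on its stdin/stdout.

/* Fork-per-connection model: the parent only accepts and forks,
 * each child serves exactly one client and exits. */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <signal.h>
#include <stdio.h>

int main (void)
{
  int lfd = socket(AF_INET, SOCK_STREAM, 0) ;
  struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(8000),
                            .sin_addr = { .s_addr = htonl(INADDR_ANY) } } ;
  int one = 1 ;
  setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one) ;
  if (bind(lfd, (struct sockaddr *)&sa, sizeof sa) < 0 || listen(lfd, 128) < 0)
  { perror("bind/listen") ; return 1 ; }
  signal(SIGCHLD, SIG_IGN) ;  /* let the kernel reap the children for us */

  for (;;)
  {
    int cfd = accept(lfd, 0, 0) ;
    if (cfd < 0) continue ;
    pid_t pid = fork() ;
    if (pid == 0)
    {
      /* child: serve this one client (here, just echo), then exit */
      char buf[4096] ;
      ssize_t r ;
      close(lfd) ;
      while ((r = read(cfd, buf, sizeof buf)) > 0) write(cfd, buf, r) ;
      _exit(0) ;
    }
    close(cfd) ;  /* parent: the child owns the connection from now on */
  }
}

Every accepted connection costs a fork(), which is exactly the per-client price discussed further down.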
So if all your processes are small and use little more than that 64k estimate, you can have close to a million of them on a 64 GB machine (64 GB / 64 kB = 2^20, i.e. roughly a million). More realistically, my main server generally has about 300 processes running at all times; sometimes it goes up to 400. And it still has tons of free RAM, because most of those processes are very small.

A Linux process, on a 2 GHz x86_64 machine, takes about 1-2 milliseconds to fork() and about 1 millisecond to execve(). Then it can take a lot more time resolving dynamic symbols, if it is a dynamically linked executable (one of the reasons why I prefer static linking). Those numbers depend on a lot of factors, of course - the size of the process, whether the executable is in the disk cache, and so on - but to get very rough numbers, let's say a fork+exec takes about 2.5 milliseconds on average. Well, if you're using a super-server to serve 1k clients, you're already spending more than 2 seconds just creating your 1,000 processes. This is expensive. You probably don't want to go over 1k clients if you're going to spawn a process per connection; and the heavier a process is, the more expensive it is to spawn - after the execve() and the dynamic linking come the configuration parsing, etc. No wonder Apache wants to pre-fork its server processes.

On the other hand, once a process has been created, there's no upkeep for it aside from the kernel RAM it's using. If the process sleeps all day long, it's not going to hurt anything - its userspace memory can even be swapped out and the RAM reclaimed until it wakes up. The 300-ish processes on my server are all I/O-bound: they're waiting on some I/O that basically never comes, so they're sleeping all the time. They're just there, ready to react when something comes their way, and in the meantime they don't hurt. My load average, unless I'm performing a compilation or something, is rigorously 0.00. (The http://skarnet.org/ site definitely needs more visitors. XD)

Conclusion: the number of processes on a system, or even the number of processes used to perform a given task, is a meaningless metric. Processes are a tool in a Unix programmer's toolbox, and a pretty cheap (unless you fork millions of them all the time) and good tool at that; don't be afraid to see some task fork zillions of processes. It's really all about what those processes do, how they're written, and how they interact with the system. Better to have 50 well-behaved processes using exactly the resources they need to perform their job than one big memory hog or CPU hog.

--
Laurent