Distinguish real vs. virtual CPUs?

2005-03-21 Thread Dan Maas
Is there a canonical way for user-space software to determine how many
real CPUs are present in a system (as opposed to HyperThreaded or
otherwise virtual CPUs)?

We have an application that for performance reasons wants to run one
process per CPU. However, on a HyperThreaded system /proc/cpuinfo
lists two CPUs, and running two processes in this case is the wrong
thing to do. (Hyperthreading ends up degrading our performance,
perhaps due to cache or bus contention).
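
For what it's worth, the best hack we've found so far is to count distinct
"physical id" values in /proc/cpuinfo, roughly like the sketch below - but I
don't know whether that's the sanctioned method, hence the question:

/* Rough sketch, not necessarily the sanctioned way (hence this mail):
 * count distinct "physical id" values in /proc/cpuinfo.  Assumes an x86
 * kernel that exposes that field; falls back to counting "processor"
 * lines if it doesn't. */
#include <stdio.h>
#include <string.h>

int count_physical_cpus(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[256];
    int ids[64], nids = 0, nproc = 0;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        int id, i, seen = 0;
        if (strncmp(line, "processor", 9) == 0)
            nproc++;
        if (sscanf(line, "physical id : %d", &id) == 1) {
            for (i = 0; i < nids; i++)
                if (ids[i] == id)
                    seen = 1;
            if (!seen && nids < 64)
                ids[nids++] = id;
        }
    }
    fclose(f);
    return nids ? nids : nproc;   /* no "physical id" => assume no HT */
}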

Please CC replies.

Thanks,
Dan Maas


Re: VM Requirement Document - v0.0

2001-07-04 Thread Dan Maas

> Getting the user's "interactive" programs loaded back
> in afterwards is a separate, much more difficult problem
> IMHO, but no doubt still has a reasonable solution.

Possibly stupid suggestion... Maybe the interactive/GUI programs should wake
up once in a while and touch a couple of their pages? Go too far with this
and you'll just get in the way of performance, but I don't think it would
hurt to have processes waking up every couple of minutes and touching glibc,
libqt, libgtk, etc so they stay hot in memory... A very slow incremental
"caress" of the address space could eliminate the
"I-just-logged-in-this-morning-and-dammit-everything-has-been-paged-out"
problem.
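
Something along these lines is what I have in mind, purely as a sketch of the
idea - the region base and length would really come from /proc/self/maps or
the dynamic linker, not the made-up parameters here:

/* Sketch only: wake up occasionally and read one byte per page of a mapped
 * region so the VM sees it as recently used.  "region" and "length" are
 * placeholders - in practice you'd walk /proc/self/maps. */
#include <unistd.h>

static void caress_region(const volatile char *region, size_t length)
{
    size_t page = getpagesize();
    size_t off;
    volatile char sink;

    for (off = 0; off < length; off += page) {
        sink = region[off];      /* touch (possibly fault in) the page */
        usleep(10000);           /* spread the cost out over time */
    }
    (void)sink;
}

Call something like that every couple of minutes from a low-priority thread
or timer and the working set stays warm without any noticeable CPU cost.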

Regards,
Dan





Re: A signal fairy tale

2001-06-28 Thread Dan Maas

> Signals are a pretty dopey API anyway - so instead of trying to patch
> them up, why not think of something better for AIO?

I have to agree, in a way... At some point we need to swallow our pride,
admit that UNIX has a crappy event model, and implement something like Win32
GetMessage =)...

I've been having trouble finding situations where asynchronous signals are
really the most appropriate technique, aside from delivering
life-threatening things like SIGTERM, SIGKILL, and SIGSEGV. The mutation
into queued, information-carrying siginfo signals just shows how badly we
need a more robust event model... (what would truly kick butt is a unified
interface that could deliver everything from fd events to AIO completions to
semaphore/msgqueue events, etc, with explicit binding between event queues
and threads).

Regards,
Dan




Re: VM Requirement Document - v0.0

2001-06-26 Thread Dan Maas

> Windows NT/2000 has flags that can be for each CreateFile operation
> ("open" in Unix terms), for instance
>
>   FILE_ATTRIBUTE_TEMPORARY
>   FILE_FLAG_WRITE_THROUGH
>   FILE_FLAG_NO_BUFFERING
>   FILE_FLAG_RANDOM_ACCESS
>   FILE_FLAG_SEQUENTIAL_SCAN
>

There is a BSD-originated convention for this - madvise().

If you look in the Linux VM code there is a bit of explicit code for
different madvise access patterns, but I'm not sure if it's 100% supported.

Drop-behind would be really, really nice to have for my multimedia
applications. I routinely deal with very large video files (several times
larger than my RAM). When I sequentially read through such files a bit at a
time, I do NOT want the old pages sitting there in RAM while all of my other
running programs are rudely paged out...

(hrm, maybe I could hack up my own manual read-ahead/drop-behind with mmap()
and memory locking...)
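
For concreteness, the usage I have in mind looks something like this (how
much of it the current VM actually honors is exactly what I'm unsure about):

/* Sketch: stream through a huge file a window at a time, hinting the kernel
 * to read ahead and to drop pages we are done with. */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

void stream_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    off_t off;
    size_t window = 16 * 1024 * 1024;   /* page-aligned mapping window */

    if (fd < 0 || fstat(fd, &st) < 0)
        return;
    for (off = 0; off < st.st_size; off += window) {
        size_t len = (st.st_size - off < (off_t)window)
                         ? st.st_size - off : window;
        char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
        if (p == MAP_FAILED)
            break;
        madvise(p, len, MADV_SEQUENTIAL);   /* hint: read ahead aggressively */
        /* ... process p[0..len) ... */
        madvise(p, len, MADV_DONTNEED);     /* hint: drop-behind, we're done */
        munmap(p, len);
    }
    close(fd);
}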

Regards,
Dan





Re: threading question

2001-06-16 Thread Dan Maas

> Is there a user-space implementation (library?) for 
> coroutines that would work from C?

Here is another one:

http://oss.sgi.com/projects/state-threads/
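
And if pulling in a library is overkill, the POSIX ucontext calls
(getcontext/makecontext/swapcontext) are enough to roll a toy coroutine
switcher in plain C - a bare-bones sketch:

/* Minimal coroutine sketch on top of POSIX ucontext: two contexts yielding
 * back and forth.  No scheduler, no I/O integration - a real library (like
 * the SGI one above) adds all of that. */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, coro_ctx;
static char coro_stack[64 * 1024];

static void coro_body(void)
{
    int i;
    for (i = 0; i < 3; i++) {
        printf("coroutine step %d\n", i);
        swapcontext(&coro_ctx, &main_ctx);   /* yield back to main */
    }
}

int main(void)
{
    int i;
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp = coro_stack;
    coro_ctx.uc_stack.ss_size = sizeof(coro_stack);
    coro_ctx.uc_link = &main_ctx;            /* where to go when it returns */
    makecontext(&coro_ctx, coro_body, 0);

    for (i = 0; i < 3; i++)
        swapcontext(&main_ctx, &coro_ctx);   /* resume the coroutine */
    return 0;
}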


Regards,
Dan




Re: forcibly unmap pages in driver?

2001-06-06 Thread Dan Maas

Just an update to my situation... I've implemented my idea of clearing the
associated PTEs when I need to free the DMA buffer, then re-filling them in
nopage(). This seems to work fine; if the user process tries anything fishy,
it gets a SIGBUS instead of accessing the old mapping.

I encountered two difficulties with the implementation:

1) zap_page_range(), flush_cache_range(), and flush_tlb_range() are not
exported to drivers. I basically copied the guts of zap_page_range() into my
driver, which seems to work OK on x86, but I know it will have trouble with
architectures that require special treatment of PTE manipulation...

2) the state of mm->mmap_sem is unknown when my file_operations->release()
function is called. If release() is called when the last FD closes, then
mm->mmap_sem is not taken. But if release() is called from do_munmap, then
mmap_sem has already been taken. So, it is risky to mess with vmas inside
of release()...
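
For the curious, the nopage side has roughly this shape (paraphrased from
memory, so don't trust the details; struct my_dev is my driver's private
data, and my_vmalloc_to_page() stands in for the hand-rolled page-table walk,
since there's no exported helper for that either):

/* Rough shape only (2.4-ish, details paraphrased).  dev->buffer is the
 * vmalloc()ed DMA area, or NULL after it has been freed.  my_vmalloc_to_page()
 * is a placeholder for the page-table walk. */
#include <linux/mm.h>

static struct page *my_nopage(struct vm_area_struct *vma,
                              unsigned long address, int write_access)
{
    struct my_dev *dev = (struct my_dev *)vma->vm_private_data;
    unsigned long offset = address - vma->vm_start;
    struct page *page;

    if (!dev->buffer || offset >= dev->buffer_size)
        return NOPAGE_SIGBUS;        /* buffer gone or shrunk: kill access */

    page = my_vmalloc_to_page(dev->buffer + offset);
    get_page(page);                  /* take a reference for the mapping */
    return page;
}

static struct vm_operations_struct my_vm_ops = {
    nopage: my_nopage,               /* old-style gcc initializer */
};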

Regards,
Dan

> >> Later, the program calls the ioctl() again to set a smaller
> >> buffer size, or closes the file descriptor. At this point
> >> I'd like to shrink the buffer or free it completely. But I
> >> can't assume that the program will be nice and munmap() the
> >> region for me
>
> > Look at drivers/char/drm, for example.  At mmap time they allocate a
> > vm_ops to the address space.  With that you catch changes to the vma
> > structure initiated by a user mmap, munmap, etc.  You could also
> > dynamically map the pages in using the nopage method (optional).




Re: forcibly unmap pages in driver?

2001-06-05 Thread Dan Maas

>> Later, the program calls the ioctl() again to set a smaller
>> buffer size, or closes the file descriptor. At this point
>> I'd like to shrink the buffer or free it completely. But I
>> can't assume that the program will be nice and munmap() the
>> region for me

> Look at drivers/char/drm, for example.  At mmap time they allocate a
> vm_ops to the address space.  With that you catch changes to the vma
> structure initiated by a user mmap, munmap, etc.  You could also
> dynamically map the pages in using the nopage method (optional).

OK I think I have a solution... Whenever I need to re-allocate or free the
DMA buffer, I could set all of the user's corresponding page table entries
to deny all access. Then I'd get a page fault on the next access to the
buffer, and inside nopage() I could update the user's mapping or send a
SIGBUS as appropriate (hmm, just like restoring a file mapping that was
thrown away)... So I just have to figure out how to find the user's page
table entries that are pointing to the DMA buffer.

Regards,
Dan




Re: forcibly unmap pages in driver?

2001-06-05 Thread Dan Maas

> That seems a bit perverse.  How will the poor userspace program know
> not to access the pages you have yanked away from it?  If you plan
> to kill it, better to do that directly.  If you plan to signal it
> that the mapping is gone, it can just call munmap() itself.

Thanks Pete. I will explain the situation I am envisioning; perhaps there is a
better way to handle this --

My driver uses a variable-size DMA buffer that it shares with user-space; I
provide an ioctl() to choose the buffer size and allocate the buffer. Say
the user program chooses a large buffer size, and mmap()s the entire buffer.
Later, the program calls the ioctl() again to set a smaller buffer size, or
closes the file descriptor. At this point I'd like to shrink the buffer or
free it completely. But I can't assume that the program will be nice and
munmap() the region for me - it might still have the large buffer mapped.
What should I do here?

An easy solution would be to allocate the largest possible buffer as my driver
is loaded, even if not all of it will be exposed to user-space. I don't
really like this choice because the buffer needs to be pinned in memory, and
the largest useful buffer size is very big (several tens of MB). Maybe I
should disallow more than one buffer allocation per open() of the device...
But the memory mapping will stay around even after close(), correct? I'd
hate to have to keep the buffer around until my driver module is unloaded.

> However, do_munmap() will call zap_page_range() for you and take care of
> cache and TLB flushing if you're going to do this in the kernel.

I'm not sure if I could use do_munmap() -- how will I know if the user
program has called munmap() already, and then mmap()ed something else in the
same place? Then I'd be killing the wrong mapping...

Regards,
Dan




forcibly unmap pages in driver?

2001-06-04 Thread Dan Maas

I am writing a device driver that, like many others, exposes a shared memory
region to user-space via mmap(). The region is allocated with vmalloc(), the
pages are marked reserved, and the user-space mapping is implemented with
remap_page_range().

In my driver, I may have to free the underlying vmalloc() region while the
user-space program is still running. I need to remove the user-space
mapping -- otherwise the user process would still have access to the
now-freed pages. I need an inverse of remap_page_range().

Is zap_page_range() the function I am looking for? Unfortunately it's not
exported to modules =(. As a quick fix, I was thinking I could just remap
all of the user pages to point to a zeroed page or something...

Another question- in the mm.c sources, I see that many of the memory-mapping
functions are surrounded by calls to flush_cache_range() and
flush_tlb_range(). But I don't see these calls in many drivers. Is it
necessary to make them when my driver maps or unmaps the shared memory
region?

Regards,
Dan




Re: #define HZ 1024 -- negative effects?

2001-04-25 Thread Dan Maas

> Are there any negative effects of editing include/asm/param.h to change
> HZ from 100 to 1024? Or any other number? This has been suggested as a
> way to improve the responsiveness of the GUI on a Linux system.

I have also played around with HZ=1024 and wondered how it affects
interactivity. I don't quite understand why it could help - one thing I've
learned looking at kernel traces (LTT) is that interactive processes very,
very rarely eat up their whole timeslice (even hogs like X). So more
frequent timer interrupts shouldn't have much of an effect...

If you are burning CPU doing stuff like long compiles, then the increased HZ
might make the system appear more responsive because the CPU hog gets
pre-empted more often. However, you could get the same result just by
running the task 'nice'ly...

The only other possibility I can think of is a scheduler anomaly. A thread
arose on this list recently about strange scheduling behavior of processes
using local IPC - even though one process had readable data pending, the
kernel would still go idle until the next timer interrupt. If this is the
case, then HZ=1024 would kick the system back into action more quickly...

Of course, the appearance of better interactivity could just be a placebo
effect. Double-blind trials, anyone? =)

Regards,
Dan




Re: Asynchronous IO

2001-04-13 Thread Dan Maas

IIRC the problem with implementing asynchronous *disk* I/O in Linux today is
that the filesystem code assumes synchronous I/O operations that block the
whole process/thread. So implementing "real" asynch I/O (without the
overhead of creating a process context for each operation) would require
re-writing the filesystems as non-blocking state machines. Last I heard this
was a long-term goal, but nobody's done the work yet (aside from maybe the
SGI folks with XFS?). Or maybe I don't know what I'm talking about...

Bart, glad to hear you are working on an event interface, sounds cool! One
feature that I really, really, *really* want to see implemented is the
ability to block on a set of any "waitable kernel objects" with one
syscall - not just file descriptors, but also SysV semaphores and message
queues, UNIX signals and child processes, file locks, pthreads condition
variables, asynch disk I/O completions, etc. I am dying for a clean way to
accomplish this that doesn't require more than one thread... (Win32 and
FreeBSD kick our butts here with MsgWaitForMultipleObjects() and
kevent()...) IMHO cleaning up this API deficiency is just as important as
optimizing the extreme case of socket I/O with zillions of file
descriptors...
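
For comparison, the kevent() style lets one queue carry heterogeneous
sources - roughly like this on FreeBSD (written from memory, so don't hold me
to the details):

/* FreeBSD kqueue sketch: wait on a socket, a signal, and a child process
 * exit through one queue and one syscall. */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <signal.h>

void wait_on_everything(int sock, pid_t child)
{
    int kq = kqueue();
    struct kevent changes[3], events[3];
    int i, n;

    EV_SET(&changes[0], sock,    EVFILT_READ,   EV_ADD, 0,         0, NULL);
    EV_SET(&changes[1], SIGUSR1, EVFILT_SIGNAL, EV_ADD, 0,         0, NULL);
    EV_SET(&changes[2], child,   EVFILT_PROC,   EV_ADD, NOTE_EXIT, 0, NULL);

    n = kevent(kq, changes, 3, events, 3, NULL);   /* block for any of them */
    for (i = 0; i < n; i++) {
        /* events[i].filter says which kind of source fired,
           events[i].ident says which fd/signal/pid it was */
    }
}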

Regards,
Dan




Re: Using IPCSysV in a device driver

2001-03-03 Thread Dan Maas

> I am wondering if it is permitted to use message queues between a user
> application and a device driver module...
> Can anyone help me?

It may be theoretically possible, but an easier and much more common
approach to this type of thing is for the driver to export an mmap()
interface. You could synchronize using poll() I think...

Regards,
Dan




Re: mapping physical memory

2001-01-26 Thread Dan Maas

> I need to be able to obtain and pin approximately 8 MB of
> contiguous physical memory in user space.  How would I go
> about doing that under Linux if it is at all possible?

The only way to allocate that much *physically* contiguous memory is by
writing a driver that grabs it at boot-time (I think the "bootmem" API is
used for this). This is an extreme measure and should rarely be necessary,
except in special cases such as primitive PCI cards that lack support for
scatter/gather DMA.

You can easily implement a mmap() interface to give user-space programs
access to the memory; there are plenty of examples of how to do this in
various character device drivers.
(well OK, if all you need is a one-off hack, you can use the method
developed by the Utah GLX people -- tell the kernel that you have 8MB *less*
RAM than is actually present using a "mem=" directive at boot, then grab
that last piece of memory by mmap'ing /dev/mem -- see
http://utah-glx.sourceforge.net/memory-usage.html)
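
The /dev/mem trick boils down to something like this - a sketch only;
PHYS_BASE and PHYS_SIZE are made-up names, and the base has to match whatever
you chopped off with "mem=":

/* Sketch of the Utah GLX style hack: with "mem=120M" on a 128MB box, the
 * kernel never touches physical memory above 120MB, so map it straight out
 * of /dev/mem (requires root). */
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

#define PHYS_BASE (120UL * 1024 * 1024)   /* first byte the kernel ignores */
#define PHYS_SIZE (8UL * 1024 * 1024)

void *grab_hidden_ram(void)
{
    int fd = open("/dev/mem", O_RDWR);
    void *p;

    if (fd < 0)
        return NULL;
    p = mmap(NULL, PHYS_SIZE, PROT_READ | PROT_WRITE,
             MAP_SHARED, fd, PHYS_BASE);
    close(fd);                  /* the mapping stays valid after close */
    return (p == MAP_FAILED) ? NULL : p;
}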



Dan




Re: PROBLEM: select() on TCP socket sleeps for 1 tick even if data available

2001-01-20 Thread Dan Maas

> It's not the select that waits. It's a delay in the tcp send
> path waiting for more data.  Try disabling it:
>
> int f=1;
> setsockopt(s, SOL_TCP, TCP_NODELAY, &f, sizeof(f));

Bingo! With this fix, 2.2.18 performance becomes almost identical to 2.4.0
performance. I assume 2.4.0 disables Nagle by default on local
connections...

Dan




Re: PROBLEM: select() on TCP socket sleeps for 1 tick even if data available

2001-01-20 Thread Dan Maas

What kernel have you been using? I have reproduced your problem on a
standard 2.2.18 kernel (elapsed time ~10sec). However, using a 2.4.0 kernel
with HZ=1000, I see a 100x improvement (elapsed time ~0.1 sec; note that
increasing HZ alone should only give a 10x improvement). Perhaps the
scheduler was fixed in 2.4.0?

2.2.18 very definitely has some scheduling anomalies. In your benchmark,
select() or poll() takes 10ms, as can be observed with strace -T. Skipping
the select() and blocking in read() gives the same behavior. This leads me
to believe the scheduler is at fault, and not select(), poll(), or read().

When run without strace, 2.4.0 appears to have no problems with your
benchmark. Elapsed time is 0.1 sec -- this may be the full potential of my
machine (PII/450). Removing select() and blocking in read() results in a
further improvement, to 0.07 sec.

Strace disturbs the behavior of 2.4.0 in strange ways. Running the benchmark
under strace with 2.4.0 causes the scheduler delays to return -- ~1ms delays
appear in select() or write(). This is confusing - it appears that context
switches can happen inside write() as well as select(), a result I don't
understand at all (the socket buffers never completely fill since you only
write 1000 bytes to each one).

Other notes: poll() behaves the same as select(). Using the SCHED_FIFO class and
mlockall() has no effect on this benchmark. Setting the sockets non-blocking
also has no effect.

I wish I had the Linux Trace Toolkit handy; it would give a much better idea
of what's going on than strace...

Dan




Re: Subtle MM bug (really 830MB barrier question)

2001-01-09 Thread Dan Maas

> 08048000-08b5c000 r-xp 00000000 03:05 1130923   /tmp/newmagma/magma.exe.dyn
> 08b5c000-08cc9000 rw-p 00b13000 03:05 1130923   /tmp/newmagma/magma.exe.dyn
> 08cc9000-0bd00000 rwxp 00000000 00:00 0

> Now, subsequent to each memory allocation, only the second number in the
> third line changes.  It becomes 23a78000, then 3b7f0000, and finally
> 3b808000 (after the failed allocation).

OK it's fairly obvious what's happening here. Your program is using its own
allocator, which relies solely on brk() to obtain more memory. On x86 Linux,
brk()-allocated memory (the heap) begins right above the executable and
grows upward - the increasing number you noted above is the top of the heap,
which grows with every brk(). Problem is, the heap can't keep growing
forever - as you discovered, on x86 Linux the upper bound is just below
0x40000000. That boundary is where shared libraries and other memory-mapped
files start to appear.

Note that there is still plenty (~2GB) of address space left, in the region
between the shared libraries and the top of user address space (just under
0xBFFF). How do you use that space? You need an allocation scheme based
on mmap'ing /dev/zero. As others pointed out, glibc's allocator does just
that.

Here's your short answer: ask the authors of your program to either 1)
replace their custom allocator with regular malloc() or 2) enhance their
custom allocator to use mmap. (or, buy some 64-bit hardware =)...)
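
i.e. the custom allocator should grab its arenas with something like this
instead of (or in addition to) brk():

/* Sketch: grab a big arena via an anonymous mapping (mmap of /dev/zero, the
 * traditional spelling; MAP_ANON would avoid the open on Linux).  These
 * mappings land in the ~2GB region between the shared libraries and the
 * stack instead of being limited by the heap. */
#include <sys/mman.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

void *arena_alloc(size_t size)
{
    int fd = open("/dev/zero", O_RDWR);
    void *p;

    if (fd < 0)
        return NULL;
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping survives the close */
    return (p == MAP_FAILED) ? NULL : p;
}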

Dan





Re: Journaling: Surviving or allowing unclean shutdown?

2001-01-04 Thread Dan Maas

> > Being able to shut down by hitting the power switch is a little luxury
> > for which I've been willing to invest more than a year of my life to
> > attain.  Clueless newbies don't know why it should be any other way, and
> > it's essential for embedded devices.

Just some food for thought - hitting the power switch on my old Indy
actually performs the equivalent of "shutdown -r now"; the system only cuts
the power when it's done cleaning up (sometimes several minutes later). I
suspect most workstation-class systems do similar things.

Of course this creates a confusing distinction between "pulling the plug"
and "hitting the power switch." Uninformed users might even be more
bewildered by the flurry of disk activity after performing the latter; heck,
I wouldn't blame someone who freaks out and pulls the plug to make it stop
=).

Also, such a system obviously has little benefit in the event of an AC power
failure.

Dan





Re: LMbench 2.4.0-test10pre-SMP vs. 2.2.18pre-SMP

2000-10-24 Thread Dan Maas

> The pipe bandwidth is intimately related to pipe latency.  Linux pipes
> are fairly small (only 4kB worth of data buffer), so they need good
> latency for good performance.
...
> The pipe bandwidth could be fairly easily improved by just doubling the
> buffer size (or by using VM tricks), but it's not been something that
> anybody has felt was all that important in real life.

A while ago I hacked 2.2.17 to use larger pipe buffers. On my own pure
throughput benchmark (two processes ping-ponging one buffer's worth of data
on a single-CPU system), buffers larger than 4KB hardly gave any advantage.
64KB buffers were marginally (10-20%) faster, but performance dropped quite
considerably after that (cache effects, maybe...).
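
The benchmark was essentially this shape - parent and child bouncing a
fixed-size buffer over a pair of pipes; vary BUFSZ (and the kernel's pipe
buffer) and time the whole run externally:

/* Sketch of the ping-pong throughput test: parent and child bounce BUFSZ
 * bytes back and forth over two pipes as fast as they can.  Time the run
 * with something like /usr/bin/time. */
#include <stdlib.h>
#include <unistd.h>

#define BUFSZ  4096
#define ROUNDS 100000

static void xfer(int fd, char *buf, int len, int writing)
{
    int done = 0, n;
    while (done < len) {                  /* handle short reads/writes */
        n = writing ? write(fd, buf + done, len - done)
                    : read(fd, buf + done, len - done);
        if (n <= 0)
            exit(1);
        done += n;
    }
}

int main(void)
{
    int to_child[2], to_parent[2], i;
    static char buf[BUFSZ];

    if (pipe(to_child) || pipe(to_parent))
        return 1;
    if (fork() == 0) {                    /* child: echo each buffer back */
        for (i = 0; i < ROUNDS; i++) {
            xfer(to_child[0], buf, BUFSZ, 0);
            xfer(to_parent[1], buf, BUFSZ, 1);
        }
        return 0;
    }
    for (i = 0; i < ROUNDS; i++) {        /* parent: send, then await echo */
        xfer(to_child[1], buf, BUFSZ, 1);
        xfer(to_parent[0], buf, BUFSZ, 0);
    }
    return 0;
}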

After seeing these results I simply assumed that 4KB had been deliberately
chosen as the optimal buffer size, rather than by luck =).

Now, Dave Miller's kiobuf pipes may change the picture somewhat...

Dan




Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Dan Maas

> Shouldn't there also be a way to add non-filedescriptor based events
> into this, such as "child exited" or "signal caught" or shm things?

Waiting on pthreads condition variables, POSIX message queues, and
semaphores (as well as fd's) at the same time would *rock*...

Unifying all these "waitable objects" would be tremendously helpful to fully
exploit the "library transparency" advantage that Linus brought up. Some
libraries might want to wait on things that are not file descriptors...

Regards,
Dan




Re: about time-slice

2000-10-19 Thread Dan Maas

> I have a question about the time-slice of linux, how do I know it, or how
> can I test it?

First look for the (platform-specific) definition of HZ in
include/asm/param.h. This is how many timer interrupts you get per second (e.g.
on i386 it's 100). Then look at include/linux/sched.h for the definition of
DEF_COUNTER. This is the number of timer interrupts between mandatory
schedules. By default it's HZ/10, meaning that the time-slice is 100ms (10
schedules/sec). (of course the interval could be longer if kernel code is
hogging the CPU; the scheduler won't run until the process leaves the kernel
or sleeps explicitly...)
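
To see it empirically rather than from the source, run a couple of copies of
something like this on an otherwise idle machine; the printed gaps cluster
around the timeslice (rough sketch):

/* Rough sketch: run this alongside another pure CPU hog (e.g. a second copy
 * of itself).  Whenever we get scheduled out, the gettimeofday() delta jumps
 * by roughly one timeslice. */
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval prev, now;
    gettimeofday(&prev, NULL);
    for (;;) {
        long usec;
        gettimeofday(&now, NULL);
        usec = (now.tv_sec - prev.tv_sec) * 1000000L
             + (now.tv_usec - prev.tv_usec);
        if (usec > 5000)                   /* gap >5ms: we were descheduled */
            printf("gap of %ld ms\n", usec / 1000);
        prev = now;
    }
}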

Experts, please correct me if I'm wrong.

Regards,
Dan




Re: large memory support for x86

2000-10-12 Thread Dan Maas

The memory map of a user process on x86 looks like this:

------------------------------------
KERNEL (always present here)
0xC0000000
------------------------------------
0xBFFFFFFF
STACK
------------------------------------
MAPPED FILES (incl. shared libs)
0x40000000
------------------------------------
HEAP (brk()/malloc())
EXECUTABLE CODE
0x08048000
------------------------------------

Try examining /proc/*/maps, and also watch your programs call brk() using
strace; you'll see all this in action...

> So why does the process space start at such a high virtual
> address (why not closer to 0x00000000)? Seems we're wasting ~128 megs of
> RAM. Not a huge amount compared to 4G, but signifigant.

I don't know; anyone care to comment?

> Another question: how (and where in the code) do we translate virtual
> user-addresses to physical addresses?

In hardware - the TLB first, then a page-table walk if the TLB misses.

> Does the MMU do it, or does it call a
> kernel handler function?

Only when an attempt is made to access an unmapped or protected page; then
you get an interrupt (page fault), which the kernel code handles.

> Why is the kernel allowed to reference physical
> addresses, while user processes go through the translation step?

Not even the kernel accesses physical memory directly. It can, however,
choose to map the physical memory into its own address space contiguously.
Linux puts it at 0xC0000000 and up. (question for the gurus - what happens on
machines with >1GB of RAM?)

> Can kernel
> pages be swapped out / faulted in just like user process pages?

Linux does not swap kernel memory; the kernel is so small it's not worth the
trouble (are there other reasons?). For example, my Linux boxes run 1-2MB of kernel
code; my NT machine is running >6MB at the moment...

Dan




Re: thread rant [semi-OT]

2000-09-01 Thread Dan Maas

> All portability issues aside, if one is writing an application in
> Linux that one would be tempted to make multithreaded for
> whatever reason, what would be the better Linux way of doing
> things?

Let's go back to basics. Look inside your computer. See what's there:

1) one (or more) CPUs
2) some RAM
3) a PCI bus, containing:
4)   -- a SCSI/IDE controller
5)   -- a network card
6)   -- a graphics card

These are all the parts of your computer that are smart enough to accomplish
some amount of work on their own. The SCSI or IDE controller can read data
from disk without bothering any other components. The network card can send
and receive packets fairly autonomously. Each CPU in an SMP system operates
nearly independently. An ideal application could have all of these devices
doing useful work at the same time.

When people think of "multithreading," often they are just looking for a way
to extract more concurrency from their machine. You want all these
independent parts to be working on your task simultaneously. There are many
different mechanisms for achieving this. Here we go...

A naively-written "server" program (eg a web server) might be coded like so:

* Read configuration file - all other work stops while data is fetched from
disk
* Parse configuration file - all other work stops while CPU/RAM work on
parsing the file
* Wait for a network connection - all other work stops while waiting for
incoming packets
* Read request from client - all other work stops while waiting for incoming
packets
* Process request - all other work stops while CPU/RAM figure out what to do
  - all other work stops while disk fetches requested file
* Write reply to client - all other work stops until final buffer
transmitted

I've phrased the descriptions to emphasize that only one resource is being
used at once - the rest of the system sits twiddling its thumbs until the
one device in question finishes its task.


Can we do better? Yes, thanks to various programming techniques that allow
us to keep more of the system busy. The most important bottleneck is
probably the network - it makes no sense for our server to wait while a slow
client takes its time acknowledging our packets. By using standard UNIX
multiplexed I/O (select()/poll()), we can send buffers of data to the kernel
just when space becomes available in the outgoing queue; we can also accept
client requests piecemeal, as the individual packets flow in. And while
we're waiting for packets from one client, we can be processing another
client's request.

The improved program performs better since it keeps the CPU and network busy
at the same time. However, it will be more difficult to write, since we have
to maintain the connection state manually, rather than implicitly on the
call stack.
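
The skeleton of such a multiplexed server ends up looking something like
this - just a sketch, with all the actual I/O and parsing elided:

/* Skeleton of a select()-based server loop: per-client state lives in an
 * explicit structure instead of on a blocked thread's call stack.
 * accept()/read()/write()/parsing details are elided. */
#include <sys/types.h>
#include <sys/time.h>
#include <unistd.h>

enum { READING_REQUEST, SENDING_REPLY };

struct client {
    int fd;                       /* -1 if this slot is free */
    int state;
    char buf[4096];
    int buflen, bufoff;
};

void serve(int listen_fd, struct client *c, int nclients)
{
    for (;;) {
        fd_set rd, wr;
        int i, maxfd = listen_fd;

        FD_ZERO(&rd);
        FD_ZERO(&wr);
        FD_SET(listen_fd, &rd);
        for (i = 0; i < nclients; i++) {
            if (c[i].fd < 0)
                continue;
            FD_SET(c[i].fd, c[i].state == READING_REQUEST ? &rd : &wr);
            if (c[i].fd > maxfd)
                maxfd = c[i].fd;
        }
        if (select(maxfd + 1, &rd, &wr, NULL, NULL) <= 0)
            continue;

        if (FD_ISSET(listen_fd, &rd)) {
            /* accept() the new connection, find a free slot in c[] */
        }
        for (i = 0; i < nclients; i++) {
            if (c[i].fd < 0)
                continue;
            if (FD_ISSET(c[i].fd, &rd)) {
                /* read() whatever arrived, advance the request parser */
            }
            if (FD_ISSET(c[i].fd, &wr)) {
                /* write() the next chunk of the reply from buf+bufoff */
            }
        }
    }
}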


So now the server handles many clients at once, and it gracefully handles
slow clients. Can we do even better? Yes, let's look at the next
bottleneck - disk I/O. If a client asks for a file that's not in memory, the
whole server will come to a halt while it read()s the data in. But the
SCSI/IDE controller is smart enough to handle this alone; why not let the
CPU and network take care of other clients while the disk does its work?

How do we go about doing this? Well, it's UNIX, right? We talk to disk files
the same way we talk to network sockets, so let's just select()/poll() on
the disk files too, and everything will be dandy... (Unfortunately we can't
do that - the designers of UNIX made a huge mistake and decided against
implementing non-blocking disk I/O as they had with network I/O. Big booboo.
For that reason, it was impossible to do concurrent disk I/O until the POSIX
Asynchronous I/O standard came along. So we go learn this whole bloated API,
in the process finding out that we can no longer use select()/poll(), and
must switch to POSIX RT signals - sigwaitinfo() - to control our server***).
After the dust has settled, we can now keep the CPU, network card, and the
disk busy all the time -- so our server is even faster.
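
For the record, the POSIX AIO + RT-signal dance looks roughly like this (and
as far as I know glibc currently services it with hidden threads behind your
back, which rather undercuts the point - but the interface is what matters
here):

/* Sketch of POSIX AIO with completion delivered as a queued RT signal, picked
 * up via sigwaitinfo().  Error handling mostly omitted; link with -lrt. */
#include <aio.h>
#include <signal.h>
#include <string.h>

int read_async(int fd, void *buf, size_t len, off_t off)
{
    struct aiocb cb;
    sigset_t set;
    siginfo_t info;

    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);
    sigprocmask(SIG_BLOCK, &set, NULL);      /* deliver via sigwaitinfo only */

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;
    cb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
    cb.aio_sigevent.sigev_signo = SIGRTMIN;
    cb.aio_sigevent.sigev_value.sival_ptr = &cb;   /* carried in si_value */

    if (aio_read(&cb) < 0)
        return -1;

    /* ... the CPU and network could be doing other work here ... */

    sigwaitinfo(&set, &info);                /* block until the completion */
    if (aio_error(&cb) != 0)
        return -1;
    return aio_return(&cb);                  /* bytes actually read */
}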


Notice that our program has been made heavily concurrent, and I haven't even
used the word "thread" yet!


Let's take it one step further. Packets and buffers are now coming in and
out so quickly that the CPU is sweating just handling all the I/O. But say
we have one or three more CPU's sitting there idle - how can we get them
going, too? We need to run multiple request handlers at once.

Conventional multithreading is *one* possible way to accomplish this; it's
rather brute-force, since the threads share all their memory, sockets, etc.
(and full VM sharing doesn't scale optimally, since interrupts must be sent
to all the CPUs when the memory layout changes).

Lots of UNIX servers run multiple *processes* - the "sub-servers" might not
share anything, or they might share a file cache or request queue. If we were brave,
we'd think carefully about what resources really should be shared between
the sub-servers, and then implement it manually using