On 02/27/2015 12:58 AM, Steven D'Aprano wrote:
> Dave Angel wrote:
>
>> (Although I believe Seymour Cray was quoted as saying that virtual
>> memory is a crock, because "you can't fake what you ain't got.")
>
> If I recall correctly, disk access is about 10000 times slower than RAM, so
> virtual memory is *at least* that much slower than real memory.


It's so much more complicated than that, that I hardly know where to start. I'll describe a generic processor/OS/memory/disk architecture; there will be huge differences between processor models even from a single manufacturer.

First, as soon as you add swapping logic to your processor/memory-system, you theoretically slow it down. And in the days of that quote, Cray's memory was maybe 50 times as fast as the memory used by us mortals. So adding swapping logic would have slowed it down quite substantially, even when it was not swapping. But that logic is inside the CPU chip these days, and presumably thoroughly optimized.

Next, statistically, a program uses a small subset of its total program & data space in its working set, and the working set should reside in real memory. But when the program greatly increases that working set and it approaches the amount of physical memory, swapping becomes more frenzied, and we say the program is thrashing. Simple example: try sorting an array that's about the size of available physical memory.
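
Here's a rough sketch of that experiment, if you really want to try it. TARGET_BYTES is a placeholder you'd have to set yourself, at or a bit above your machine's physical RAM; be warned that actually running it will drive the whole machine into swapping.

    # Rough sketch of the "sort an array about the size of physical RAM" experiment.
    # WARNING: with TARGET_BYTES near or above your physical memory, this will
    # make the machine thrash.  TARGET_BYTES is a placeholder; set it yourself.
    import random
    import time

    TARGET_BYTES = 8 * 1024**3       # placeholder: roughly your RAM size
    BYTES_PER_ITEM = 32              # very rough cost of a list slot + float object in CPython
    N = TARGET_BYTES // BYTES_PER_ITEM

    data = [random.random() for _ in range(N)]   # working set ~ physical memory

    start = time.time()
    data.sort()                      # the sort ranges over the whole list, over and over
    print("sorted %d items in %.1f seconds" % (N, time.time() - start))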

Next, even physical memory is divided into a few levels of caching, some on-chip and some off. And the caching is done in what I call strips (usually called cache lines), where accessing just one byte causes the whole strip to be loaded from non-cached memory. I forget the current size for that, but it's maybe 64 to 256 bytes or so; 64 bytes is typical on current x86 processors.
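
As a toy illustration of those strips (assuming the common 64-byte line size), here's how byte addresses map onto them; touching any one byte drags the whole line in.

    # Toy illustration: which "strip" (cache line) a byte address falls in,
    # assuming a 64-byte line.  Accessing any one of these addresses causes
    # the whole line to be loaded from the next level down.
    LINE_SIZE = 64

    def cache_line(addr):
        """Return (line_number, first_byte, last_byte) for a byte address."""
        line = addr // LINE_SIZE
        first = line * LINE_SIZE
        return line, first, first + LINE_SIZE - 1

    for addr in (0, 1, 63, 64, 1000):
        print(addr, cache_line(addr))
    # 0, 1 and 63 all share line 0; 64 starts line 1; byte 1000 lives in line 15 (960..1023).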

If there are multiple processors (not multicore, but actual separate processors), then each one has such internal caches, and a write on one processor may have to trigger invalidations (or flushes) in the caches of all the other processors that happen to have the same strip loaded.
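
Here's a deliberately simplified model of that bookkeeping; nothing like real coherence hardware (MESI and friends are far more involved), just enough to show that a write by one processor costs the others their copy.

    # Very simplified model of cache-line invalidation between processors.
    LINE_SIZE = 64

    class ToyCPU:
        def __init__(self, name, all_cpus):
            self.name = name
            self.lines = set()          # line numbers currently held in this cache
            self.all_cpus = all_cpus

        def read(self, addr):
            self.lines.add(addr // LINE_SIZE)

        def write(self, addr):
            line = addr // LINE_SIZE
            self.lines.add(line)
            # every other CPU holding this line must drop (or flush) its copy
            for other in self.all_cpus:
                if other is not self and line in other.lines:
                    other.lines.discard(line)
                    print("%s's copy of line %d invalidated by %s"
                          % (other.name, line, self.name))

    cpus = []
    a, b = ToyCPU("cpu0", cpus), ToyCPU("cpu1", cpus)
    cpus.extend([a, b])
    a.read(100); b.read(100)    # both cache the line holding byte 100
    a.write(101)                # a write anywhere in that line invalidates cpu1's copy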

The processor not only prefetches the next few instructions, but decodes and tentatively executes them, subject to the results being discarded if a conditional branch doesn't go the way the processor predicted. So some instructions execute in zero time, some of the time.
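
The standard way to see that prediction at work is to run the same branchy loop over sorted and unsorted data. In compiled code the difference is dramatic; in CPython the interpreter's own overhead hides most of it, so treat this as a sketch of the experiment rather than a promise about the numbers.

    # Classic branch-prediction experiment: same data, same branch, but sorted
    # input makes the branch predictable.  CPython overhead may swamp the effect;
    # the numbers are illustrative only.
    import random
    import timeit

    data = [random.randrange(256) for _ in range(10**6)]

    def count_big(values):
        total = 0
        for v in values:
            if v >= 128:        # this branch is what the predictor bets on
                total += 1
        return total

    unsorted_t = timeit.timeit(lambda: count_big(data), number=5)
    sorted_data = sorted(data)
    sorted_t = timeit.timeit(lambda: count_big(sorted_data), number=5)
    print("unsorted: %.2fs   sorted: %.2fs" % (unsorted_t, sorted_t))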

Every address of instruction fetch, or of data fetch or store, goes through a couple of layers of translation. Segment register plus offset gives the linear address. That gets looked up in page tables to get the physical address, and if the table entry happens not to be in an on-chip cache, it has to be fetched (or even swapped in). If the physical address isn't valid, a processor exception (a page fault) causes the OS to potentially swap something out, and something else in.
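
The arithmetic of the paging part is simple enough to show directly. This assumes the common 4 KiB page size; real x86-64 hardware splits the page number further into several table indexes, but the page-number / offset split is the essential idea.

    # How a linear (virtual) address is split for paging, assuming 4 KiB pages.
    PAGE_SIZE = 4096          # 2**12

    def split_address(linear_addr):
        page_number = linear_addr // PAGE_SIZE   # looked up in the page tables
        offset = linear_addr % PAGE_SIZE         # byte within the 4 KiB page
        return page_number, offset

    addr = 0x7f3a12345678
    page, off = split_address(addr)
    print("address %#x -> page %#x, offset %#x" % (addr, page, off))
    # If that page's table entry is marked "not present", the CPU raises a
    # page fault and the OS has to page something in (and maybe something out).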

Once we're paging from the swapfile, the size of the read is perhaps 4k, and that read happens regardless of whether we're going to use only one byte of it or all of it.

The ratio between an access which was in the L1 cache and one which required a page to be swapped in from disk? Much bigger than your 10,000 figure. But hopefully it doesn't happen a big percentage of the time.
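
Rough, commonly quoted ballpark figures make the point; these are order-of-magnitude numbers, not measurements of any particular machine.

    # Order-of-magnitude latencies, in nanoseconds (ballpark figures only).
    L1_HIT     = 1            # ~1 ns
    RAM_ACCESS = 100          # ~100 ns
    SSD_READ   = 100000       # ~100 microseconds
    DISK_SEEK  = 10000000     # ~10 ms for a spinning disk

    print("RAM  vs L1 :", RAM_ACCESS // L1_HIT)      # ~100x
    print("disk vs RAM:", DISK_SEEK // RAM_ACCESS)   # ~100,000x
    print("disk vs L1 :", DISK_SEEK // L1_HIT)       # ~10,000,000x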

There are many, many other variables, like the fact that RAM chips are not directly addressed by bytes, but are instead organized in rows and columns. So if you access many bytes in the same row, it can be much quicker than random access. So simple access-time specifications don't mean as much as they would seem; the controller has to balance the RAM spec with the various cache requirements.
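
A crude way to feel that locality from Python: sum the same elements in sequential order and in random order. Whatever gap you measure mixes CPU cache, TLB and DRAM row effects together, and CPython overhead shrinks it, so it's only suggestive.

    # Crude locality demonstration: same elements, sequential vs random order.
    import array
    import random
    import time

    N = 4 * 1000 * 1000                  # increase N if the gap is too small to see
    data = array.array('d', (float(i) for i in range(N)))

    seq_index = list(range(N))
    rand_index = seq_index[:]
    random.shuffle(rand_index)

    def timed_sum(indexes):
        start = time.time()
        total = 0.0
        for i in indexes:
            total += data[i]
        return time.time() - start

    print("sequential: %.2fs" % timed_sum(seq_index))
    print("random    : %.2fs" % timed_sum(rand_index))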
--
DaveA
--
https://mail.python.org/mailman/listinfo/python-list
