Re: read vs. mmap (or io vs. page faults)
On Tuesday 22 June 2004 11:27 pm, Peter Wemm wrote:

= mmap is more valuable as a programmer convenience these days. Don't
= make the mistake of assuming its faster, especially since the cost of
= a copy has gone way down.

Actually, let me back off from agreeing with you here :-) On IO-bound machines (such as my laptop), there is no discernible difference in either the CPU or the elapsed time -- md5-ing a file with mmap or read is (curiously) slightly faster than just cat-ing it into /dev/null.

On a dual P2 450MHz, the single process always wins the CPU time and sometimes the elapsed time. Sometimes it wins handsomely:

	mmap: 35.271u 4.004s 1:06.08 59.4% 10+190k 0+0io 4185pf+0w
	read: 32.134u 15.797s 1:58.72 40.3% 408+302k 11228+0io 12pf+0w

or

	mmap: 35.039u 4.558s 1:10.27 56.3% 10+190k 5+0io 5028pf+0w
	read: 29.931u 27.848s 2:07.17 45.4% 10+187k 11219+0io 5pf+0w

Mind you, both of the two processors are Xeons with _2Mb of cache on each_, so memory copying should be even cheaper on them than usual. And yet mmap manages to win...

On a single P2 400MHz (standard 512Kb cache) mmap always wins the CPU time and, thanks to that, can win the elapsed time on a busy system. For example, running two of these processes in parallel (on two separate copies of the same huge file residing on distinct disks) yields (same 1462726660-byte file as in the dual Xeon stats above):

	mmap: 66.989u 7.584s 3:01.76 41.0% 5+238k 90+0io 22456pf+0w
	      65.474u 7.729s 2:38.59 46.1% 5+241k 90+0io 22401pf+0w
	read: 60.724u 42.394s 3:37.01 47.5% 5+241k 22541+0io 0pf+0w
	      61.778u 41.987s 3:35.36 48.1% 5+239k 11256+0io 0pf+0w

That's 182 vs. 215 seconds, or a 15% elapsed-time win for mmap. Evidently, mmap runs through that "nasty nasty code" faster than read runs through its own. mmap loses on an idle system, I presume, because page-faulting is not smart enough to page-fault ahead as efficiently as read pre-reads ahead. Why am I complaining then?
Because I want the "nasty nasty code" improved so that using mmap is beneficial for the single process too. Thank you very much! Yours, -mi ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-questions To unsubscribe, send any mail to "[EMAIL PROTECTED]"
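For readers following along, here is a minimal sketch of the two file-consumption styles being benchmarked in this thread. The byte-sum stands in for MD5, and the helper names are mine, not from the thread: the read() version copies through a small fixed buffer (the copy Matt argues is cheap), while the mmap() version touches each page in place and pays for it in page faults instead.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sum the bytes of a file with a plain read() loop.  The small fixed
 * buffer stays hot in the L1/L2 cache across iterations. */
uint64_t sum_read(const char *path)
{
    unsigned char buf[32 * 1024];
    uint64_t sum = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return 0;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        for (ssize_t i = 0; i < n; i++)
            sum += buf[i];
    close(fd);
    return sum;
}

/* Sum the bytes of the same file through a read-only mapping.  Each
 * previously untouched page costs a fault on first access. */
uint64_t sum_mmap(const char *path)
{
    uint64_t sum = 0;
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0) {
        if (fd >= 0)
            close(fd);
        return 0;
    }
    if (st.st_size > 0) {
        unsigned char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                                MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            for (off_t i = 0; i < st.st_size; i++)
                sum += p[i];
            munmap(p, (size_t)st.st_size);
        }
    }
    close(fd);
    return sum;
}
```

Both produce identical results; the whole argument of the thread is about which path the kernel executes more cheaply.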
Re: read vs. mmap (or io vs. page faults)
On Tuesday 22 June 2004 11:27 pm, Peter Wemm wrote:

= On Monday 21 June 2004 10:08 pm, Mikhail Teterin wrote:
= The amount of "work" for the kernel to do a read() and a high-speed
= memory copy is much less than the cost of taking a page fault, running
= a whole bunch of really really nasty code in the vm system, repairing
= the damage from the page fault, updating the process paging state and
= restarting the instruction.

Does the code _have_ to be "really really nasty", or did it just _happen_ to be that way for historical reasons -- like this being a very complex issue, and, once it worked, no one really wanting to mess with it?

= The numbers you're posting are a simple reflection of the fact that
= the read syscall path has fewer (and less expensive) instructions to
= execute compared to the mmap fault paths.

Why, then, is the total number of CPU seconds (kernel+user) favorable towards mmap on CPU-bound machines and about the same on IO-bound ones? Maybe because all that CPU work you are describing is also much faster on modern CPUs?

= Some operating systems implemented read(2) as an internal in-kernel
= mmap/fault/copy/unmap. Naturally, that made mmap look fast compared to
= read, at the time. But that isn't how it is implemented in FreeBSD.
= mmap is more valuable as a programmer convenience these days.

I figured :-( It is very convenient. As such, it should be more widely used (because it leads to cleaner code), but that won't happen until it also offers performance comparable to the less clean method(s)...

Yours,

	-mi
Re: read vs. mmap (or io vs. page faults)
On Monday 21 June 2004 10:08 pm, Mikhail Teterin wrote:
> On Monday 21 June 2004 08:15 pm, Matthew Dillon wrote:
> = :The mmap interface is supposed to be more efficient -- theoretically
> = :-- because it requires one less buffer-copying, and because it
> = :(together with the possible madvise()) provides the kernel with
> = :more information thus enabling it to make better (at least -- no
> = :worse) decisions.
>
> = Well, I think you forgot my earlier explanation regarding buffer
> = copying. Buffer copying is a very cheap operation if it occurs
> = within the L1 or L2 cache, and that is precisely what is happening
> = when you read() int
>
> This could explain why using mmap is not faster than read, but it
> does not explain why it is slower.
>
> I'm afraid your vast knowledge of the internals of the kernel
> workings obscures your vision. I, on the other hand, "enjoy" an almost
> total ignorance of it, and can see that the mmap interface _allows_ for
> a more (certainly, no _less_) efficient handling of the IO than
> read. That the kernel is not using all the information passed to it,
> I can only explain by deficiencies/simplicity of the implementation.

At the risk of propagating the thread, take a step back for a minute. 10-15 years ago, when mmap was first on the drawing boards as a concept for unix, the cost of a kernel trap and entering the vm system for fault recovery versus memory bandwidth was very, very different compared to today. Back then, getting into the kernel was relatively painless and memory was proportionally very slow and expensive to use. However, these days the memory subsystem is proportionally much, much faster relative to the cost of kernel traps and vm processing and recovery.
The amount of "work" for the kernel to do a read() and a high-speed memory copy is much less than the cost of taking a page fault, running a whole bunch of really really nasty code in the vm system, repairing the damage from the page fault, updating the process paging state and restarting the instruction. The numbers you're posting are a simple reflection of the fact that the read syscall path has fewer (and less expensive) instructions to execute compared to the mmap fault paths.

Some operating systems implemented read(2) as an internal in-kernel mmap/fault/copy/unmap. Naturally, that made mmap look fast compared to read, at the time. But that isn't how it is implemented in FreeBSD.

mmap is more valuable as a programmer convenience these days. Don't make the mistake of assuming it's faster, especially since the cost of a copy has gone way down. Also, don't assume that read() is faster for cases where you're reading a file that has been mmapped (and possibly even dirtied) in another process.

-- 
Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
"All of this is for nothing if we don't go to the stars" - JMS/B5
Re: read vs. mmap (or io vs. page faults)
(current removed, but I'm leaving this on question@ since it contains some useful information).

:This is, sort of, self-perpetuating -- as long as mmap is slower/less
:reliable, applications will be hesitant to use it, thus there will be
:little incentive to improve it. :-(

Well, again, this is an incorrect perception. Your use of mmap() to process huge linear data sets is not what mmap() is best at doing, on *any* operating system, and not what people use mmap() for most of the time.

There are major hardware-related overheads to the use of mmap(), on *ANY* operating system, that cannot be circumvented. You have no choice but to allocate the pages for a page table and to populate the pages with pte's; you must invalidate the pages in the tlb whenever you modify a page table entry (e.g. the invlpg instruction for IA32, which on a P2 is extremely expensive); and if you are processing huge data sets you also have to remove the page table entry from the page table when the underlying data page is reused, due to the dataset being larger than main memory.

There are overheads related to each of these issues, overheads related to the algorithms the operating system *MUST* use to figure out which pages to remove (on the fly) when the data set does not fit in main memory, and overheads related to the heuristics the operating system employs to try to predict the memory usage pattern to perform some read-ahead. These are hardware and software issues that cannot simply be wished away. No matter how much you want the concept of memory mapping to be 'free', it isn't. Memory mapping and management are complex operations for any operating system, always have been, and always will be.

:I'd rather call attention to my slower -- CPU-bound boxes. On them, the
:total CPU time spent computing md5 of a file is less with mmap -- by a
:noticeable margin. But because the CPU is underutilized, the elapsed "wall
:clock" time is higher.
:As far as the cache-using statistics, having to do a cache-cache copy
:doubles the cache used, stealing it from other processes/kernel tasks.

But it is also not relevant for this case, because the L2 cache is typically much larger (128K-2MB) than the 8-32K you might use for your local buffer. What you are complaining about here is going to wind up being mere microseconds over a multi-minute run.

It's really important, and I can't stress this enough, not to simply assume what the performance impact of a particular operation will be by the way it feels to you. Your assumptions are all skewed... you are assuming that copying is always bad (it isn't), that copying is always horrendously expensive (it isn't), that memory mapping is always cheap (it isn't), and that a small bit of cache pollution will have a huge penalty in time (it doesn't necessarily, certainly not for a reasonably sized user buffer).

I've already told you how to measure these things. Do me a favor and just run this dd on all of your FreeBSD boxes:

	dd if=/dev/zero of=/dev/null bs=32k count=8192

The resulting bytes/sec that it reports is a good guesstimate of the cost of a memory copy (the actual copy rate will be faster, since the times include the read and write system calls, but it's still a reasonable basis). So in the case of my absolute fastest machine (an AMD64 3200+ tweaked up a bit):

	268435456 bytes transferred in 0.058354 secs (4600128729 bytes/sec)

That means, basically, that it costs 1 second of cpu to copy 4.6 GBytes of data. On my slowest box, a C3 VIA Samuel 2 cpu (roughly equivalent to a P2/400MHz):

	268435456 bytes transferred in 0.394222 secs (680924559 bytes/sec)

So the cost is 1 second to copy 680 MBytes of data on my slowest box.
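The same measurement can be approximated in userland. Here is a hedged sketch (the function name and buffer sizes are my own, not from the thread) that times a plain memcpy() loop the way the dd test times the in-kernel copy; varying `bufsize` shows the L1/L2-cache effects Matt describes:

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Rough userland analogue of the dd test: time repeated memcpy()s of
 * a fixed buffer and report the copy rate in bytes/sec. */
double copy_rate_bytes_per_sec(size_t bufsize, int iterations)
{
    char *src = malloc(bufsize);
    char *dst = malloc(bufsize);
    double rate = 0.0;

    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return 0.0;
    }
    memset(src, 0xAB, bufsize);

    clock_t start = clock();
    for (int i = 0; i < iterations; i++)
        memcpy(dst, src, bufsize);
    clock_t end = clock();

    /* Compare the buffers so the copies cannot be optimized away. */
    if (memcmp(src, dst, bufsize) == 0) {
        double secs = (double)(end - start) / CLOCKS_PER_SEC;
        if (secs > 0)
            rate = (double)bufsize * iterations / secs;
    }
    free(src);
    free(dst);
    return rate;
}
```

A 32K buffer should report something close to L1/L2 bandwidth; a 16MB buffer falls out of cache and drops toward raw memory bandwidth, mirroring the dd numbers above.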
:Here, again, is from my first comparison on the P2 400MHz:
:
:	stdio: 56.837u 34.115s 2:06.61 71.8% 66+193k 11253+0io 3pf+0w
:	mmap: 72.463u 7.534s 2:34.62 51.7% 5+186k 105+0io 22328pf+0w

Well, the cpu utilization is only 71.8% for the read case, so the box is obviously I/O bound already. The real question you should be asking is not why mmap is only using 51.7% of the cpu, but why stdio is only using 71.8% of the cpu. If you want to make your processing program more efficient, 'fix' stdio first. You need to:

(1) Figure out the rate at which your processing program reads data in the best case. You can do this by timing it on a data set that fits in memory (so no disk I/O is done). Note that it might be bursty, so the average rate alone does not precisely quantify the amount of buffering that will be needed.

(2) If your hard drive is faster than the datarate, then determine if the overhead of doing double-buffering is worth keeping the processing program populat
Re: read vs. mmap (or io vs. page faults)
On Monday 21 June 2004 08:15 pm, Matthew Dillon wrote:

= :The mmap interface is supposed to be more efficient -- theoretically
= :-- because it requires one less buffer-copying, and because it
= :(together with the possible madvise()) provides the kernel with more
= :information thus enabling it to make better (at least -- no worse)
= :decisions.

= Well, I think you forgot my earlier explanation regarding buffer
= copying. Buffer copying is a very cheap operation if it occurs
= within the L1 or L2 cache, and that is precisely what is happening
= when you read() int

This could explain why using mmap is not faster than read, but it does not explain why it is slower.

I'm afraid your vast knowledge of the internals of the kernel workings obscures your vision. I, on the other hand, "enjoy" an almost total ignorance of it, and can see that the mmap interface _allows_ for a more (certainly, no _less_) efficient handling of the IO than read. That the kernel is not using all the information passed to it, I can only explain by deficiencies/simplicity of the implementation. This is, sort of, self-perpetuating -- as long as mmap is slower/less reliable, applications will be hesitant to use it, thus there will be little incentive to improve it. :-(

= As you can see by your timing results, even on your fastest box,
= processing a file around that size is only going to incur 1-2
= seconds of real time overhead to do the extra buffer copy. 2
= seconds is a hard number to beat.

I'd rather call attention to my slower -- CPU-bound -- boxes. On them, the total CPU time spent computing the md5 of a file is less with mmap -- by a noticeable margin. But because the CPU is underutilized, the elapsed "wall clock" time is higher.

As far as the cache-usage statistics go, having to do a cache-to-cache copy doubles the cache used, stealing it from other processes/kernel tasks.
Here, again, is from my first comparison on the P2 400MHz:

	stdio: 56.837u 34.115s 2:06.61 71.8% 66+193k 11253+0io 3pf+0w
	mmap: 72.463u 7.534s 2:34.62 51.7% 5+186k 105+0io 22328pf+0w

91 vs. 78 seconds of CPU time (a 15% win for mmap), but 126 vs. 154 elapsed (a 22% loss)? Why is the CPU so underutilized in the mmap case? There was nothing else running at the time. The CPU was, indeed, at about 88% utilization, according to top. This alone seems to invalidate some of what you are saying below about the immediate disadvantages of mmap on a modern CPU. Or is a P2 400MHz not modern? Maybe, but the very modern Sparcs, on which FreeBSD intends to run, are not much faster.

= The mmap interface is not supposed to be more efficient, per se.
= Why would it be?

Puzzling question. Because the kernel is supplied with more information -- it knows that I only plan to _read_ from the memory (PROT_READ), the total size of what I plan to read (mmap's len, optionally madvise's len), and (optionally) that I plan to read sequentially (MADV_SEQUENTIAL). With that information, the kernel should be able to decide how many pages to pre-fault in, and what and when to drop.

mmap also needs no CPU data cache to read. If the device is capable of writing to memory directly (DMA), the CPU does not need to be involved at all, while with read the data still has to go from the DMA-filled kernel buffer to the application buffer -- there being two copies of it in cache instead of none for just storing, or one copy for processing. Also, in case of a RAM shortage, mmap-ed pages can be just dropped, while the too-large buffer needs to be written into swap. And mmap requires no application buffers -- win, win, and win. Is there an inherent "lose" somewhere I don't see? Like:

= On a modern cpu, where an L1 cache copy is a two cycle streaming
= operation, the several hundred (or more) cycles it takes to process
= a page fault or even just populate the page table is equivalent to a
= lot of copied bytes.
But each call to read also takes cycles -- in user space (the read() function) and in the kernel (the syscall). And there are a lot of them too...

= mmap() is not designed to streamline large demand-page reads of
= data sets much larger than main memory.

Then it was not designed to take advantage of all the possibilities of the interface, I say.

= mmap() works best for data that is already cached in the kernel,
= and even then it still has a fairly large hurdle to overcome vs a
= streaming read(). This is a HARDWARE limitation.

Wait, HARDWARE? Which hardware issues are we talking about? You suggested I pre-fault in the pages, and Julian explained how best to do it. If that is, indeed, the solution, why is the kernel not doing it for me, pre-faulting in the same number of bytes that read pre-reads?

= 15% is nothing anyone cares about except perhaps gamers. I
= certainly couldn't care less about 15%. 50%, on the other hand,
= is something that I would care about.

Well, here we have a server dedicated t
Re: read vs. mmap (or io vs. page faults)
Matthew Dillon wrote:
Mikhail Teterin wrote:

=Both read and mmap have a read-ahead heuristic. The heuristic
=works. In fact, the mmap heuristic is so smart it can read-behind
=as well as read-ahead if it detects a backwards scan.

Evidently, read's heuristics are better. At least, for this task. I'm, actually, surprised they are _different_ at all.

It might be interesting to retry your tests under a Mach kernel. BSD has multiple codepaths for IPC functionality that are unified under Mach.

The mmap interface is supposed to be more efficient -- theoretically -- because it requires one less buffer-copying, and because it (together with the possible madvise()) provides the kernel with more information, thus enabling it to make better (at least -- no worse) decisions.

I've heard people repeat the same notion, that is to say "that mmap()ing a file is supposed to be faster than read()ing it" [1], but the two operations are not quite the same thing, and there is more work being done to mmap a file (and thus gain random access to any byte of the file by dereferencing memory) than to read and process small blocks of data at a time.

Matt's right that processing a small block that fits into the L1/L2 cache (and probably already is resident) is very fast. The extra copy doesn't matter as much as it once did on slower machines, and he's provided some good analysis of L1/L2 caching issues and buffer copying speeds. However, I tend to think the issue of buffer copying speed is likely to be moot when you are reading from disk and are thus I/O bound [2], rather than the manner in which the file's contents are represented to the program being that significant.
[1]: Actually, while it is intuitive to think that mmap() and madvise() tell the system "hey, I want all of that file read into RAM now, as quickly as you can", what happens on systems which use demand-paging VM (like FreeBSD, Linux, and most others) is far more lazy: in reality, your process gets nothing but a promise from mmap() that if you access the right chunk of memory, your program will unblock once that data has been read and faulted into the local address space. That level of urgency doesn't seem to correspond to what you asked for :-), although it still works pretty well in practice.

[2]: We're talking about maybe 20 to 60 or so MB/s for disk, versus 10x to 100x that for RAM-to-RAM copying, much less the L2 copying speeds Matt mentions below:

	Well, I think you forgot my earlier explanation regarding buffer
	copying. Buffer copying is a very cheap operation if it occurs
	within the L1 or L2 cache, and that is precisely what is happening
	when you read() into a fixed buffer in a loop in a C program... your
	buffer is fixed in memory and is almost guaranteed to be in the
	L1/L2 cache, which means that the extra copy operation is very fast
	on a modern processor. It's something like 12-16 GBytes/sec to the
	L1 cache on an Athlon 64, for example, and 3 GBytes/sec uncached to
	main memory.

This has been an interesting discussion, BTW, thanks.

-- 
-Chuck
Re: read vs. mmap (or io vs. page faults)
:The mmap interface is supposed to be more efficient -- theoretically --
:because it requires one less buffer-copying, and because it (together
:with the possible madvise()) provides the kernel with more information
:thus enabling it to make better (at least -- no worse) decisions.

Well, I think you forgot my earlier explanation regarding buffer copying. Buffer copying is a very cheap operation if it occurs within the L1 or L2 cache, and that is precisely what is happening when you read() into a fixed buffer in a loop in a C program... your buffer is fixed in memory and is almost guaranteed to be in the L1/L2 cache, which means that the extra copy operation is very fast on a modern processor. It's something like 12-16 GBytes/sec to the L1 cache on an Athlon 64, for example, and 3 GBytes/sec uncached to main memory.

Consider the cpu time cost, then, of the local copy on a 2GB file... the cpu time cost on an AMD64 is about 2/12 of one second. This is the number mmap would have to beat. As you can see by your timing results, even on your fastest box, processing a file around that size is only going to incur 1-2 seconds of real time overhead to do the extra buffer copy. 2 seconds is a hard number to beat.

This is something you can calculate yourself. Time a dd from /dev/zero to /dev/null:

	crater# dd if=/dev/zero of=/dev/null bs=32k count=8192
	268435456 bytes transferred in 0.244561 secs (1097620804 bytes/sec)

	amd64# dd if=/dev/zero of=/dev/null bs=32k count=8192
	268435456 bytes transferred in 0.066994 secs (4006846790 bytes/sec)

	amd64# dd if=/dev/zero of=/dev/null bs=16m count=32
	536870912 bytes transferred in 0.431774 secs (1243407512 bytes/sec)

Try it for different buffer sizes (16K through 16MB) and you will get a feel for how the L1 and L2 caches affect copying bandwidth. These numbers are reasonably close to the raw memory bandwidth available to the cpu (and will be different depending on whether the buffer fits in the L1 or L2 caches, or doesn't fit at all).
The mmap interface is not supposed to be more efficient, per se. Why would it be? There are overheads involved with mapping the page table entries and taking faults to map more. Even if you pre-mapped everything, there are still overheads involved in populating the page table and performing invlpg operations on the TLB to reload the entry, and for large data sets there is overhead involved with removing page table entries and invalidating the pte. On a modern cpu, where an L1 cache copy is a two-cycle streaming operation, the several hundred (or more) cycles it takes to process a page fault or even just populate the page table is equivalent to a lot of copied bytes.

This immediately puts mmap() at a disadvantage on a modern cpu, but of course it also depends on what the data processing loop itself is doing. If the data processing loop is sensitive to the L1 cache, then processing larger chunks of data is going to make it more efficient, and mmap() can certainly provide that where read() might require buffers too large to fit comfortably in the L1/L2 cache. On the other hand, if the processing loop is relatively insensitive to the L1 cache (i.e., it's small), then you can afford to process the data in smaller chunks, like 16K, without any significant penalty.

mmap() is not designed to streamline large demand-page reads of data sets much larger than main memory. mmap() works best for data that is already cached in the kernel, and even then it still has a fairly large hurdle to overcome vs a streaming read(). This is a HARDWARE limitation. Drastic action would have to be taken in software to get rid of this overhead (we'd have to use 4MB page table entries, which come with their own problems). The overhead required to manage a large mmap'd data set can skyrocket.
FreeBSD (and DragonFly) have heuristics that attempt to detect sequential operations like this with mmap'd data and to depress the page priority behind the read (so: read-ahead and depress-behind), and this works, but it only mitigates the additional overhead some; it doesn't get rid of it. For linear processing of large data sets you almost universally want to use a read() loop. There's no good reason to use mmap().

:=: read: 10.619u 23.814s 1:17.67 44.3% 62+274k 11255+0io 0pf+0w
:=
:Well, now we are venturing into the domain of humans' subjective
:perception... I'd say, 12% is plenty, actually. This is what some people
:achieve by rewriting stuff in assembler -- and are proud, when it works
::-)

Nobody is going to stare at their screen for one minute and 17 seconds and really care that something might take one minute and 27 seconds instead of one minute and 17 seconds. That's
Re: read vs. mmap (or io vs. page faults)
:ask for 8k, but the system will fetch the next 64k of data. Problem is
:the system does nothing until you read the next 8k past the 64k
:already read in, then it jumps up and grabs the next 64k. You're
:still waiting on I/O every 8th read. Ideally it would do an async
:..
:--
:	Dan Nelson
:	[EMAIL PROTECTED]

No, this isn't true. The system places a marker 8K or 16K before the last read block and initiates the next read-ahead before you exhaust the first one. For mapped data, the system intentionally does not map the page table entry for a page or two before the end of the read-ahead, in order to force a page fault so it can initiate the next read-ahead. For read data, the system marks a buffer near the end of the read-ahead, so that when read encounters it the system knows to queue the next read-ahead.

Also, for that matter, remember that the hard drives themselves generally cache whole tracks and do their own read-ahead. This is why dd'ing a large file usually results in the maximum transfer rate the hard drive can do.

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
Re: read vs. mmap (or io vs. page faults)
In the last episode (Jun 21), Mikhail Teterin said:
> > Both read and mmap have a read-ahead heuristic. The heuristic
> > works. In fact, the mmap heuristic is so smart it can read-behind
> > as well as read-ahead if it detects a backwards scan.
>
> Evidently, read's heuristics are better. At least, for this task.
> I'm, actually, surprised, they are _different_ at all.
[...]
> That other OSes have similar shortcomings simply gives us some
> breathing room from an advocacy point of view. I hope, my rhetoric
> will burn an itch in someone capable of addressing it technically :-)
>
> > The heuristic does not try to read megabytes and megabytes ahead,
> > however...
>
> Neither does the read-handling.

I think part of the problem is that it's just clustering reads instead of making sure the next N blocks of data are prefetched. So you may ask for 8k, but the system will fetch the next 64k of data. The problem is that the system does nothing until you read the next 8k past the 64k already read in; then it jumps up and grabs the next 64k. You're still waiting on I/O every 8th read. Ideally it would do an async fetch of an 8k block (64k ahead of the current read) every time you read a block.

It should be a lot easier for read to do this, since the kernel is getting a steady stream of syscalls. Once a 64k chunk of mmapped address space is pulled in, the system isn't notified until the next page fault. (Or am I misunderstanding how readahead is implemented on mmapped data?)

-- 
	Dan Nelson
	[EMAIL PROTECTED]
Re: read vs. mmap (or io vs. page faults)
=Both read and mmap have a read-ahead heuristic. The heuristic
=works. In fact, the mmap heuristic is so smart it can read-behind
=as well as read-ahead if it detects a backwards scan.

Evidently, read's heuristics are better. At least, for this task. I'm, actually, surprised they are _different_ at all. The mmap interface is supposed to be more efficient -- theoretically -- because it requires one less buffer-copying, and because it (together with the possible madvise()) provides the kernel with more information, thus enabling it to make better (at least -- no worse) decisions.

That these theoretical advantages -- small or not -- are eaten by what seem like practical implementation deficiencies, to the point that using mmap is not only not faster but frequently slower -- wallclock-wise -- is, in itself, a serious shortcoming that stands between an OS and perfection. That other OSes have similar shortcomings simply gives us some breathing room from an advocacy point of view. I hope my rhetoric will burn an itch in someone capable of addressing it technically :-)

=The heuristic does not try to read megabytes and megabytes ahead,
=however...

Neither does the read-handling.

=that might speed up this particular application a little, but it
=would destroy performance for many other types of applications,
=especially in a loaded environment.

I'm not asking mmap (page fault handling) to cache any more aggressively than read-handling does.

=Well now hold a second... the best you can do here is compare relative
=differences between mmap and read.

This is all I am doing, actually. :-)

=If you really want to compare operating systems, you have to run the
=OS's and the tests on the same hardware.

I am comparing relative differences between read and mmap on different OSes.

=: 4.8-stable on Pentium2-400MHz
=: mmap: 21.507u 11.472s 1:27.53 37.6% 62+276k 99+0io 44736pf+0w
=: read: 10.619u 23.814s 1:17.67 44.3% 62+274k 11255+0io 0pf+0w
=
=mmap 12% slower than read.
=12% isn't much.

Well, now we are venturing into the domain of humans' subjective perception... I'd say 12% is plenty, actually. This is what some people achieve by rewriting stuff in assembler -- and are proud when it works :-)

=: recent -current on dual P2 Xeon-450MHz (mmap WINS -- SMP?)
=: mmap: 12.482u 12.872s 2:28.70 17.0% 74+298k 23+0io 46522pf+0w
=: read: 7.255u 16.366s 3:27.07 11.4% 70+283k 44437+0io 7pf+0w
=
=mmap 39% faster. That's a significant difference.
=
=It kinda smells funny, actually... are you sure that you compiled
=your FreeBSD-5 system with Witness turned off?

There are no "WITNESS" options in the kernel's config file (unlike in NOTES). So, unless there has to be some sort of explicit "NOWITNESS", I am sure.

=: recent -current on a Centrino-laptop P4-1GHz (NO win at all)
=: mmap: 4.197u 3.920s 2:07.57 6.3% 65+284k 63+0io 45568pf+0w
=: read: 3.965u 4.265s 1:50.26 7.4% 67+291k 13131+0io 17pf+0w
=
=mmap 15% slower.

=: Linux 2.4.20-30.9bigmem dual P4-3GHz (with a different file)
=: mmap: 2.280u 4.800s 1:13.39 9.6% 0+0k 0+0io 512434pf+0w
=: read: 1.630u 2.820s 0:08.89 50.0% 0+0k 0+0io 396pf+0w
=
=mmap 821% slower on Linux? With a different file? So these numbers
=can't be compared to anything else (over and above the fact that this
=machine is three times faster than any of the others).

No, the file is different (as is the processor) -- relative performance difference only. I was quite surprised myself. My fmd5 program does not show such a dramatic difference, but `fgrep --mmap' is vastly slower on Linux than the regular `fgrep'. Here are the results of the two new fgrep runs:

	mmap1: 1.450u 3.000s 0:46.00 9.6% 0+0k 0+0io 512439pf+0w
	read1: 1.830u 2.620s 0:09.51 46.7% 0+0k 0+0io 393pf+0w
	mmap2: 1.700u 4.040s 1:02.31 9.2% 0+0k 0+0io 512427pf+0w
	read2: 1.330u 3.150s 0:09.38 47.7% 0+0k 0+0io 396pf+0w

=I'm not sure why you are complaining about FreeBSD.
Because I have much higher expectations for it :-) I thought I would be able to use the powerful technique of presenting Linux's superiority in some area to fire up rapid improvements in the same area in FreeBSD. Now I'm back to fighting the "12% gain is not worth the effort" mentality.

=:Once mmap-handling is improved, all sorts of whole-file operations
=:(bzip2, gzip, md5, sha1) can be made faster...
=
=Well, your numbers don't really say that. It looks like you might
=eke out a 10-15% improvement, and while this is faster, it really
=isn't all that much faster. It certainly isn't something to write
=home about, and certainly not significant enough to warrant major
=codework.

Put it into perspective -- 10-15% is usually the difference between the latest processor and the previous one. Pe
Re: read vs. mmap (or io vs. page faults)
: := pre-faulting is best done by a worker thread or child process, or it
: := will just slow you down..
:
: Read is also used for large files sometimes, and never tries to prefetch
: the whole file at once. Why can't the same smarts/heuristics be employed
: by the page-fault handling code -- especially, if we are so proud of our
: unified caching?

Both read and mmap have a read-ahead heuristic. The heuristic works. In fact, the mmap heuristic is so smart it can read-behind as well as read-ahead if it detects a backwards scan. The heuristic does not try to read megabytes and megabytes ahead, however... that might speed up this particular application a little, but it would destroy performance for many other types of applications, especially in a loaded environment.

: If anything mmap/madvise provide the kernel with _more_ information than
: read -- the kernel just does not use it, it seems.
:
: According to my tests (`fgrep string /huge/file' vs. `fgrep --mmap
: string /huge/file') the total CPU time is much less with mmap. But
: sometimes the total "wall clock" time is longer with it, because the CPU
: is underutilized when using the mmap method.

Well now hold a second... the best you can do here is compare relative differences between mmap and read. All of these machines are different, with different CPUs and different configurations. For example, a dual P2 is going to be horrendously bad at SMP things, because the P2's locked-bus-cycle instruction overhead is horrendous. That is going to seriously skew the results.

There are major architectural differences between these CPUs... cache size, memory bandwidth, MP operations overhead, not to mention raw megahertz. Disk transfer rate and the disk bus interface and driver will also make a big difference here, as well as the contents of the file you are fgrep'ing. If you really want to compare operating systems, you have to run the OSes and the tests on the same hardware.
: 4.8-stable on Pentium2-400MHz
: mmap: 21.507u 11.472s 1:27.53 37.6% 62+276k 99+0io 44736pf+0w
: read: 10.619u 23.814s 1:17.67 44.3% 62+274k 11255+0io 0pf+0w

mmap 12% slower than read. 12% isn't much.

: recent -current on dual P2 Xeon-450MHz (mmap WINS -- SMP?)
: mmap: 12.482u 12.872s 2:28.70 17.0% 74+298k 23+0io 46522pf+0w
: read: 7.255u 16.366s 3:27.07 11.4% 70+283k 44437+0io 7pf+0w

mmap 39% faster. That's a significant difference. It kinda smells funny, actually... are you sure that you compiled your FreeBSD-5 system with Witness turned off?

: recent -current on a Centrino-laptop P4-1GHz (NO win at all)
: mmap: 4.197u 3.920s 2:07.57 6.3% 65+284k 63+0io 45568pf+0w
: read: 3.965u 4.265s 1:50.26 7.4% 67+291k 13131+0io 17pf+0w

mmap 15% slower.

: Linux 2.4.20-30.9bigmem dual P4-3GHz (with a different file)
: mmap: 2.280u 4.800s 1:13.39 9.6% 0+0k 0+0io 512434pf+0w
: read: 1.630u 2.820s 0:08.89 50.0% 0+0k 0+0io 396pf+0w

mmap 821% slower on Linux? With a different file? So these numbers can't be compared to anything else (over and above the fact that this machine is three times faster than any of the others). It kinda looks like either you wrote the Linux numbers down wrong, or Linux's mmap is much, much worse than FreeBSD's. I'm not sure why you are complaining about FreeBSD. If I were to assume 1:08.89 instead of 1:13.39 the difference would be 6.5%, which is narrower than 15%, but not by all that much... a few seconds is nothing to quibble over.

: The attached md5-computing program is more CPU consuming than fgrep. It
: wins with mmap even on the "sceptical" Centrino-laptop -- presumably,
: because MD5_Update is not interrupted as much and remains in the
: instruction cache:
:
: read: 22.024u 8.418s 1:28.44 34.4% 5+166k 10498+0io 4pf+0w
: mmap: 21.428u 3.086s 1:23.88 29.2% 5+170k 40+0io 19649pf+0w

read is 6% faster than mmap here.

: Once mmap-handling is improved, all sorts of whole-file operations
: (bzip2, gzip, md5, sha1) can be made faster...
:
: -mi

Well, your numbers don't really say that. It looks like you might eke out a 10-15% improvement, and while this is faster it really isn't all that much faster. It certainly isn't something to write home about, and certainly not significant enough to warrant major codework.

Though I personally have major issues with FreeBSD-5's performance in general, I don't really see that anything stands out in these tests, except perhaps for FreeBSD-5's horrible MP performance with read() vs mmap() on the dual P2 (but I suspect that might be due to some other issue, such as perhaps Witness being turned on). If you really want to get comparative results you have to run all of these tests on the same hardware.
Re: read vs. mmap (or io vs. page faults)
On Sunday 20 June 2004 08:16 pm, Julian Elischer wrote:
= On Sun, 20 Jun 2004, Matthew Dillon wrote:
[...]
= > It is usually a bad idea to try to populate the page table with
= > all resident pages associated with a memory mapping, because
= > mmap() is often used to map huge files...
[...]
= pre-faulting is best done by a worker thread or child process, or it
= will just slow you down..

Read is also used for large files sometimes, and never tries to prefetch the whole file at once. Why can't the same smarts/heuristics be employed by the page-fault handling code -- especially, if we are so proud of our unified caching?

If anything mmap/madvise provide the kernel with _more_ information than read -- the kernel just does not use it, it seems.

According to my tests (`fgrep string /huge/file' vs. `fgrep --mmap string /huge/file') the total CPU time is much less with mmap. But sometimes the total "wall clock" time is longer with it, because the CPU is underutilized when using the mmap method.

4.8-stable on Pentium2-400MHz
mmap: 21.507u 11.472s 1:27.53 37.6% 62+276k 99+0io 44736pf+0w
read: 10.619u 23.814s 1:17.67 44.3% 62+274k 11255+0io 0pf+0w

recent -current on dual P2 Xeon-450MHz (mmap WINS -- SMP?)
mmap: 12.482u 12.872s 2:28.70 17.0% 74+298k 23+0io 46522pf+0w
read: 7.255u 16.366s 3:27.07 11.4% 70+283k 44437+0io 7pf+0w

recent -current on a Centrino-laptop P4-1GHz (NO win at all)
mmap: 4.197u 3.920s 2:07.57 6.3% 65+284k 63+0io 45568pf+0w
read: 3.965u 4.265s 1:50.26 7.4% 67+291k 13131+0io 17pf+0w

Linux 2.4.20-30.9bigmem dual P4-3GHz (with a different file)
mmap: 2.280u 4.800s 1:13.39 9.6% 0+0k 0+0io 512434pf+0w
read: 1.630u 2.820s 0:08.89 50.0% 0+0k 0+0io 396pf+0w

The attached md5-computing program is more CPU consuming than fgrep.
It wins with mmap even on the "sceptical" Centrino-laptop -- presumably, because MD5_Update is not interrupted as much and remains in the instruction cache:

read: 22.024u 8.418s 1:28.44 34.4% 5+166k 10498+0io 4pf+0w
mmap: 21.428u 3.086s 1:23.88 29.2% 5+170k 40+0io 19649pf+0w

Once mmap-handling is improved, all sorts of whole-file operations (bzip2, gzip, md5, sha1) can be made faster...

-mi
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: read vs. mmap (or io vs. page faults)
On Sun, 20 Jun 2004, Matthew Dillon wrote:

> Hmm. Well, you can try calling madvise(... MADV_WILLNEED), that's what
> it is for.
>
> It is usually a bad idea to try to populate the page table with all
> resident pages associated with a memory mapping, because mmap()
> is often used to map huge files... hundreds of megabytes or even
> dozens of gigabytes (on 64 bit architectures). The last thing you want
> to do is to populate the page table for the entire file. It might
> work for your particular program, but it is a bad idea for the OS to
> assume that for every mmap().
>
> What it comes down to, really, is whether you feel you actually need the
> additional performance, because it kinda sounds to me that whatever
> processing you are doing to the data is either going to be I/O bound,
> or it isn't going to run long enough for the additional overhead to
> matter versus the processing overhead of the program itself.
>
> If you are really worried you could pre-fault the mmap before you do
> any processing at all and measure the time it takes to pre-fault the
> pages vs the time it takes to process the memory image. (You pre-fault
> simply by accessing one byte of data in each page across the mmap(),
> before you begin any processing.)

pre-faulting is best done by a worker thread or child process, or it will just slow you down..

> -Matt
> Matthew Dillon
> <[EMAIL PROTECTED]>
>
> := It's hard to say. mmap() could certainly be made more efficient, e.g.
> := by faulting in more pages at a time to reduce the actual fault rate.
> := But it's fairly difficult to beat a read copy into a small buffer.
> :
> :Well, that's the thing -- by mmap-ing the whole file at once (and by
> :madvise-ing with MADV_SEQUENTIAL), I thought I told the kernel
> :everything it needed to know to make the best decision. Why can't
> :page-faulting code do a better job using all this knowledge than the
> :poor read, which only knows about the partial read in question?
> :
> :I find it so disappointing that it can, probably, be considered a bug.
> :I'll try this code on Linux and Solaris. If mmap is better there (as it
> :really ought to be), we have a problem, IMHO. Thanks!
> :
> : -mi
Re: read vs. mmap (or io vs. page faults)
Hmm. Well, you can try calling madvise(... MADV_WILLNEED), that's what it is for.

It is usually a bad idea to try to populate the page table with all resident pages associated with a memory mapping, because mmap() is often used to map huge files... hundreds of megabytes or even dozens of gigabytes (on 64 bit architectures). The last thing you want to do is to populate the page table for the entire file. It might work for your particular program, but it is a bad idea for the OS to assume that for every mmap().

What it comes down to, really, is whether you feel you actually need the additional performance, because it kinda sounds to me that whatever processing you are doing to the data is either going to be I/O bound, or it isn't going to run long enough for the additional overhead to matter versus the processing overhead of the program itself.

If you are really worried you could pre-fault the mmap before you do any processing at all and measure the time it takes to pre-fault the pages vs the time it takes to process the memory image. (You pre-fault simply by accessing one byte of data in each page across the mmap(), before you begin any processing.)

-Matt
Matthew Dillon
<[EMAIL PROTECTED]>

:= It's hard to say. mmap() could certainly be made more efficient, e.g.
:= by faulting in more pages at a time to reduce the actual fault rate.
:= But it's fairly difficult to beat a read copy into a small buffer.
:
:Well, that's the thing -- by mmap-ing the whole file at once (and by
:madvise-ing with MADV_SEQUENTIAL), I thought I told the kernel
:everything it needed to know to make the best decision. Why can't
:page-faulting code do a better job using all this knowledge than the
:poor read, which only knows about the partial read in question?
:
:I find it so disappointing that it can, probably, be considered a bug.
:I'll try this code on Linux and Solaris. If mmap is better there (as it
:really ought to be), we have a problem, IMHO. Thanks!
:
: -mi
Re: read vs. mmap (or io vs. page faults)
On Sunday 20 June 2004 02:35 pm, you wrote:

= :Is this how things are supposed to be, or will mmap() become more
= :efficient eventually? Thanks!
= :
= : -mi
=
= It's hard to say. mmap() could certainly be made more efficient, e.g.
= by faulting in more pages at a time to reduce the actual fault rate.
= But it's fairly difficult to beat a read copy into a small buffer.

Well, that's the thing -- by mmap-ing the whole file at once (and by madvise-ing with MADV_SEQUENTIAL), I thought I told the kernel everything it needed to know to make the best decision. Why can't the page-faulting code do a better job using all this knowledge than the poor read, which only knows about the partial read in question?

I find it so disappointing that it can, probably, be considered a bug. I'll try this code on Linux and Solaris. If mmap is better there (as it really ought to be), we have a problem, IMHO. Thanks!

-mi
Re: read vs. mmap (or io vs. page faults)
:Hello!
:
:I'm writing a message-digest utility, which operates on a file and
:can use either stdio:
:
: while (not eof) {
: char buffer[BUFSIZE];
: size = read( buffer ...);
: process(buffer, size);
: }
:
:or mmap:
:
: buffer = mmap(... file_size, PROT_READ ...);
: process(buffer, file_size);
:
:I expected the second way to be faster, as it is supposed to avoid
:one memory copy (no user-space buffer). But in reality, on a
:CPU-bound (rather than IO-bound) machine, using mmap() is considerably
:slower. Here are tcsh's time results:

read() is likely going to be faster because it does not involve any page fault overhead. The VM system only faults 16 or so pages ahead, which is only 64KB, so the fault overhead is very high for the data rate.

Why does the extra copy not matter? Well, it's fairly simple, actually. It's because your buffer is smaller than the L1 cache, and/or also simply because the VM fault overhead is higher than it would take to copy an extra 64KB.

read() loops typically use buffer sizes in the 8K-64K range. L1 caches are typically 16K (for Celeron-class CPUs) through 64K, or more for higher-end CPUs. L2 caches are typically 256K-1MB, or more. The copy bandwidth from or to the L1 cache is usually around 10x faster than main memory, and the copy bandwidth from or to the L2 cache is usually around 4x faster. (Note that I'm talking copy bandwidth here, not random access. The L1 cache is ~50x faster or more for random access.)

So the cost of the extra copy in a read() loop using a reasonable buffer size (~8K-64K) (L1 or L2 access) is virtually nil compared to the cost of accessing the kernel's buffer cache (which involves main-memory accesses for files > L2 cache).

:On the IO-bound machine, using mmap is only marginally faster:
:
: Single Pentium4M (Centrino 1GHz) running recent -current:
:
:stdio: 27.195u 8.280s 1:33.02 38.1% 10+169k 11221+0io 1pf+0w
:mmap: 26.619u 3.004s 1:23.59 35.4% 10+169k 47+0io 19463pf+0w

Yes, because it's I/O bound.
As long as the kernel queues some readahead to the device, it can burn those CPU cycles on whatever it wants without really affecting the transfer rate.

:Is this how things are supposed to be, or will mmap() become more
:efficient eventually? Thanks!
:
: -mi

It's hard to say. mmap() could certainly be made more efficient, e.g. by faulting in more pages at a time to reduce the actual fault rate. But it's fairly difficult to beat a read copy into a small buffer.

-Matt
Matthew Dillon
<[EMAIL PROTECTED]>
Re: read vs. mmap (or io vs. page faults)
On Sunday 20 June 2004 11:41 am, Dan Nelson wrote:

= In the last episode (Jun 20), Mikhail Teterin said:
= > I expected the second way to be faster, as it is supposed to avoid
= > one memory copy (no user-space buffer). But in reality, on a
= > CPU-bound (rather than IO-bound) machine, using mmap() is
= > considerably slower. Here are tcsh's time results:
= MADV_SEQUENTIAL just lets the system expire already-read blocks from
= its cache faster, so it won't help much here.

That may be what it _does_, but from the manual page one gets the impression it should tell the VM that, once a page is requested (and had to be page-faulted in), the one after it will be requested soon and may as well be prefetched (and the ones before can be dropped if memory is in short supply). Anyway, using MADV_SEQUENTIAL is consistently making mmap behave slightly worse, rather than having no effect.

But let's not get distracted with madvise(). Why is mmap() slower? So much so that the machine, which is CPU-bound using read(), only uses 90% of the CPU when using mmap -- while, at the same time, the disk bandwidth is also less than that of the read(). It looks to me like a lot of thought went into optimizing read(), but much less into mmap, which is supposed to be faster -- less memory shuffling. Is that true, or is there something inherent in the mmap style of reading that I don't see?

= read() should cause some prefetching to occur, but it obviously
= doesn't work all the time or else inblock wouldn't have been as high
= as 11000. For sequential access I would have expected read() to have
= been able to prefetch almost every block before the userland process
= needed it.

Thanks!

-mi
Re: read vs. mmap (or io vs. page faults)
In the last episode (Jun 20), Mikhail Teterin said:
> I expected the second way to be faster, as it is supposed to avoid
> one memory copy (no user-space buffer). But in reality, on a
> CPU-bound (rather than IO-bound) machine, using mmap() is
> considerably slower. Here are tcsh's time results:
>
> Single Pentium2-400MHz running 4.8-stable:
> --
> stdio: 56.837u 34.115s 2:06.61 71.8% 66+193k 11253+0io 3pf+0w
> mmap: 72.463u 7.534s 2:34.62 51.7% 5+186k 105+0io 22328pf+0w
>
> Dual Pentium2 Xeon 450MHz running recent -current:
> --
> stdio: 36.557u 29.395s 3:09.88 34.7% 10+165k 32646+0io 0pf+0w
> mmap: 42.052u 7.545s 2:02.25 40.5% 10+169k 16+0io 15232pf+0w
>
> On the IO-bound machine, using mmap is only marginally faster:
>
> Single Pentium4M (Centrino 1GHz) running recent -current:
>
> stdio: 27.195u 8.280s 1:33.02 38.1% 10+169k 11221+0io 1pf+0w
> mmap: 26.619u 3.004s 1:23.59 35.4% 10+169k 47+0io 19463pf+0w
>
> Notice the last two columns in time's output -- why is page-faulting a
> page in -- on demand -- so much slower than read()-ing it? I even tried
> inserting ``madvise(buffer, file_size, MADV_SEQUENTIAL)'' between the
> mmap() and the process() -- it made no difference at all (or made the
> mmap() take slightly longer)...

MADV_SEQUENTIAL just lets the system expire already-read blocks from its cache faster, so it won't help much here. read() should cause some prefetching to occur, but it obviously doesn't work all the time, or else inblock wouldn't have been as high as 11000. For sequential access I would have expected read() to have been able to prefetch almost every block before the userland process needed it.

--
Dan Nelson
[EMAIL PROTECTED]
read vs. mmap (or io vs. page faults)
Hello!

I'm writing a message-digest utility, which operates on a file and can use either stdio:

	while (not eof) {
		char buffer[BUFSIZE];
		size = read( buffer ...);
		process(buffer, size);
	}

or mmap:

	buffer = mmap(... file_size, PROT_READ ...);
	process(buffer, file_size);

I expected the second way to be faster, as it is supposed to avoid one memory copy (no user-space buffer). But in reality, on a CPU-bound (rather than IO-bound) machine, using mmap() is considerably slower. Here are tcsh's time results:

Single Pentium2-400MHz running 4.8-stable:
--
stdio: 56.837u 34.115s 2:06.61 71.8% 66+193k 11253+0io 3pf+0w
mmap: 72.463u 7.534s 2:34.62 51.7% 5+186k 105+0io 22328pf+0w

Dual Pentium2 Xeon 450MHz running recent -current:
--
stdio: 36.557u 29.395s 3:09.88 34.7% 10+165k 32646+0io 0pf+0w
mmap: 42.052u 7.545s 2:02.25 40.5% 10+169k 16+0io 15232pf+0w

On the IO-bound machine, using mmap is only marginally faster:

Single Pentium4M (Centrino 1GHz) running recent -current:

stdio: 27.195u 8.280s 1:33.02 38.1% 10+169k 11221+0io 1pf+0w
mmap: 26.619u 3.004s 1:23.59 35.4% 10+169k 47+0io 19463pf+0w

Notice the last two columns in time's output -- why is page-faulting a page in -- on demand -- so much slower than read()-ing it? I even tried inserting ``madvise(buffer, file_size, MADV_SEQUENTIAL)'' between the mmap() and the process() -- it made no difference at all (or made the mmap() take slightly longer)...

Is this how things are supposed to be, or will mmap() become more efficient eventually? Thanks!

-mi