Re: read vs. mmap (or io vs. page faults)

2004-06-22 Thread Mikhail Teterin
On Tuesday 22 June 2004 11:27 pm, Peter Wemm wrote:

= mmap is more valuable as a programmer convenience these days. Don't
= make the mistake of assuming its faster, especially since the cost of
= a copy has gone way down.

Actually, let me back off from agreeing with you here :-) On io-bound
machines (such as my laptop), there is no discernible difference in
either the CPU or the elapsed time -- md5-ing a file with mmap or read
is (curiously) slightly faster than just cat-ing it into /dev/null.

On a dual P2 450MHz, mmap (running as a single process) always wins the
CPU time and sometimes the elapsed time. Sometimes it wins handsomely:

mmap: 35.271u 4.004s 1:06.08 59.4%   10+190k 0+0io 4185pf+0w
read: 32.134u 15.797s 1:58.72 40.3%  408+302k 11228+0io 12pf+0w

or

mmap: 35.039u 4.558s 1:10.27 56.3%   10+190k 5+0io 5028pf+0w
read: 29.931u 27.848s 2:07.17 45.4%   10+187k 11219+0io 5pf+0w

Mind you, both processors are Xeons with _2MB of cache on each_, so
memory copying should be even cheaper on them than usual. And yet mmap
manages to win...

On a single P2 400MHz (standard 512KB cache) mmap always wins the CPU
time, and, thanks to that, can win the elapsed time on a busy system.
For example, running two of these processes in parallel (on two separate
copies of the same huge file residing on distinct disks) yields (same
1462726660-byte file as in the dual Xeon stats above):

mmap: 66.989u 7.584s 3:01.76 41.0%   5+238k 90+0io 22456pf+0w
      65.474u 7.729s 2:38.59 46.1%   5+241k 90+0io 22401pf+0w
read: 60.724u 42.394s 3:37.01 47.5%   5+241k 22541+0io 0pf+0w
      61.778u 41.987s 3:35.36 48.1%   5+239k 11256+0io 0pf+0w

That's 182 vs. 215 seconds, or a 15% elapsed-time win for mmap. Evidently,
mmap runs through that "nasty nasty code" faster than read runs through
its own. mmap loses on an idle system, I presume, because page-faulting is
not smart enough to fault pages ahead as efficiently as read reads
ahead.

Why am I complaining then? Because I want the "nasty nasty code"
improved so that using mmap is beneficial for the single process too.

Thank you very much! Yours,

-mi



Re: read vs. mmap (or io vs. page faults)

2004-06-22 Thread Mikhail T.
On Tuesday, 22 June 2004 23:27, Peter Wemm wrote:
= On Monday 21 June 2004 10:08 pm, Mikhail Teterin wrote:

= The amount of "work" for the kernel to do a read() and a high-speed
= memory copy is much less than the cost of taking a page fault, running
= a whole bunch of really really nasty code in the vm system, repairing
= the damage from the page fault, updating the process paging state and
= restarting the instruction.

Does the code _have_ to be "really really nasty", or did it just _happen_
to be that way for historical reasons -- like this being a very complex
issue, and, once it worked, no one really wanted to mess with it?

= The numbers you're posting are a simple reflection of the fact that
= the read syscall path has fewer (and less expensive) instructions to
= execute compared to the mmap fault paths.

Why, then, is the total number of CPU seconds (kernel+user) favorable
towards mmap on CPU-bound machines and about the same on IO-bound ones?
Maybe because all that CPU work you are describing is also much faster
on modern CPUs?

= Some operating systems implemented read(2) as an internal in-kernel
= mmap/fault/copy/unmap. Naturally, that made mmap look fast compared to
= read, at the time. But that isn't how it is implemented in FreeBSD.

= mmap is more valuable as a programmer convenience these days.

I figured :-( It is very convenient. As such, it should be more widely
used (because it leads to cleaner code), but that won't happen until it
also offers performance comparable to the less clean method(s)... Yours,

-mi



Re: read vs. mmap (or io vs. page faults)

2004-06-22 Thread Peter Wemm
On Monday 21 June 2004 10:08 pm, Mikhail Teterin wrote:
> On Monday 21 June 2004 08:15 pm, Matthew Dillon wrote:
>
> = :The mmap interface is supposed to be more efficient -- theoreticly
> = :-- because it requires one less buffer-copying, and because it
> = :(together with the possible madvise()) provides the kernel with more
> = :information thus enabling it to make better (at least -- no worse)
> = :decisions.
>
> = Well, I think you forgot my earlier explanation regarding buffer
> = copying. Buffer copying is a very cheap operation if it occurs
> = within the L1 or L2 cache, and that is precisely what is happening
> = when you read() int
>
> This could explain, why using mmap is not faster than read, but it
> does not explain, why it is slower.
>
> I'm afraid, your vast knowledge of the internals of the kernel
> workings obscure your vision. I, on the other hand, "enjoy" an almost
> total ignorance of it, and can see, that mmap interface _allows_ for
> a more (certainly, no _less_) efficient handling of the IO, than
> read. That the kernel is not using all the information passed to it,
> I can only explain by deficiencies/simplicity the implementation.

At the risk of propagating the thread, take a step back for a minute.

10-15 years ago, when mmap was first on the drawing boards as a concept 
for unix, the cost of a kernel trap and entering the vm system for 
fault recovery versus memory bandwidth was very different compared 
to today.  Back then, getting into the kernel was relatively painless 
and memory was proportionally very slow and expensive to use.

However, these days, the memory subsystem is proportionally much much 
faster relative to the cost of kernel traps and vm processing and 
recovery.

The amount of  "work"  for the kernel to do a read() and a high-speed 
memory copy is much less than the cost of taking a page fault, running 
a whole bunch of really really nasty code in the vm system, repairing 
the damage from the page fault, updating the process paging state and 
restarting the instruction.

The numbers you're posting are a simple reflection of the fact that the 
read syscall path has fewer (and less expensive) instructions to 
execute compared to the mmap fault paths.

Some operating systems implemented read(2) as an internal in-kernel 
mmap/fault/copy/unmap.  Naturally, that made mmap look fast compared to 
read, at the time.  But that isn't how it is implemented in FreeBSD. 

mmap is more valuable as a programmer convenience these days.  Don't 
make the mistake of assuming it's faster, especially since the cost of a 
copy has gone way down.  Also, don't assume that read() is faster for 
cases where you're reading a file that has been mmapped (and possibly 
even dirtied) in another process.

-- 
Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
"All of this is for nothing if we don't go to the stars" - JMS/B5


Re: read vs. mmap (or io vs. page faults)

2004-06-22 Thread Matthew Dillon
(current removed, but I'm leaving this on question@ since it contains
some useful information).

:This is, sort of, self-perpetuating -- as long as mmap is slower/less
:reliable, applications will be hesitant to use it, thus there will be
:little insentive to improve it. :-(

Well, again, this is an incorrect perception.  Your use of mmap() to
process huge linear data sets is not what mmap() is best at doing, on
*any* operating system, and not what people use mmap() for most of the
time.  There are major hardware-related overheads to the use of mmap(),
on *ANY* operating system, that cannot be circumvented.  You have no
choice but to allocate the pages for a page table, to populate the pages
with pte's, you must invalidate the TLB entries whenever you modify
a page table entry (e.g. the invlpg instruction for IA32, which on a P2 is
extremely expensive), and if you are processing huge data sets you also
have to remove the page table entry from the page table when the
underlying data page is reused due to the dataset being larger than
main memory.  There are overheads related to each of these issues, and
overheads related to the algorithms the operating system *MUST* use to
figure out which pages to remove (on the fly) when the data set does
not fit in main memory, and there are overheads related to the heuristics
the operating system employs to try to predict the memory usage pattern
to perform some read-ahead.

These are hardware and software issues that cannot simply be wished away. 
No matter how much you want the concept of memory mapping to be 'free',
it isn't.  Memory mapping and management are complex operations for
any operating system, always have been, and always will be.

:I'd rather call attention to my slower -- CPU-bound boxes. On them, the
:total CPU time spent computing md5 of a file is less with mmap -- by a
:noticable margin. But because the CPU is underutilized, the elapsed "wall
:clock" time is higher.
:
:As far as the cache-using statistics, having to do a cache-cache copy
:doubles the cache used, stealing it from other processes/kernel tasks.

But it is also not relevant for this case because the L2 cache is
typically much larger (128K-2MB) than the 8-32K you might use for
your local buffer.  What you are complaining about here is going
to wind up being mere microseconds over a multi-minute run.

It's really important, and I can't stress this enough, to not simply 
assume what the performance impact of a particular operation will be
by the way it feels to you.  Your assumptions are all skewed... you
are assuming that copying is always bad (it isn't), that copying is
always horrendously expensive (it isn't), that memory mapping is always
cheap (it isn't cheap), and that a small bit of cache pollution will have
a huge penalty in time (it doesn't necessarily, certainly not for a 
reasonably sized user buffer). 

I've already told you how to measure these things.  Do me a favor and just
run this dd on all of your FreeBSD boxes:

dd if=/dev/zero of=/dev/null bs=32k count=8192

The resulting bytes/sec that it reports is a good guesstimate of the
cost of a memory copy (the actual copy rate will be faster since the
times include the read and write system calls, but it's still a reasonable
basis).  So in the case of my absolute fastest machine
(an AMD64 3200+ tweaked up a bit):

268435456 bytes transferred in 0.058354 secs (4600128729 bytes/sec)

That means, basically, that it costs 1 second of cpu to copy 4.6 GBytes
of data.  On my slowest box, a C3 VIA Samuel 2 cpu (roughly equivalent
to a P2/400Mhz):

268435456 bytes transferred in 0.394222 secs (680924559 bytes/sec)

So the cost is 1 second to copy 680 MBytes of data on my slowest box.
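
(To put those figures against the file discussed in this thread: for the
1462726660-byte file mentioned elsewhere, the dd-derived rates would
suggest roughly 1462726660 / 680924559 ~= 2.1 seconds of copy overhead on
the slow box and about 0.3 seconds on the AMD64 -- assuming the dd number
is representative of the read() copy path.)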

:Here, again, is from my first comparision on the P2 400MHz:
:
:   stdio: 56.837u 34.115s 2:06.61 71.8%   66+193k 11253+0io 3pf+0w
:   mmap:  72.463u  7.534s 2:34.62 51.7%   5+186k  105+0io   22328pf+0w

Well, the cpu utilization is only 71.8% for the read case, so the box
is obviously I/O bound already.

The real question you should be asking is not why mmap is only using
51.7% of the cpu, but why stdio is only using 71.8% of the cpu.  If
you want to make your processing program more efficient, 'fix' stdio
first.  You need to:

(1) Figure out the rate at which your processing program reads data in
the best case.  You can do this by timing it on a data set that fits
in memory (so no disk I/O is done).  Note that it might be bursty,
so the average rate alone does not precisely quantify the amount of
buffering that will be needed.

(2) If your hard drive is faster than the data rate, then determine if
the overhead of doing double-buffering is worth keeping the
processing program populat

Re: read vs. mmap (or io vs. page faults)

2004-06-21 Thread Mikhail Teterin
On Monday 21 June 2004 08:15 pm, Matthew Dillon wrote:

= :The mmap interface is supposed to be more efficient -- theoreticly
= :-- because it requires one less buffer-copying, and because it
= :(together with the possible madvise()) provides the kernel with more
= :information thus enabling it to make better (at least -- no worse)
= :decisions.

= Well, I think you forgot my earlier explanation regarding buffer
= copying. Buffer copying is a very cheap operation if it occurs
= within the L1 or L2 cache, and that is precisely what is happening
= when you read() int

This could explain why using mmap is not faster than read, but it does
not explain why it is slower.

I'm afraid your vast knowledge of the kernel's internal workings
obscures your vision. I, on the other hand, "enjoy" an almost total
ignorance of it, and can see that the mmap interface _allows_ for a more
(certainly, no _less_) efficient handling of the IO than read. That the
kernel is not using all the information passed to it, I can only explain
by the deficiencies/simplicity of the implementation.

This is, sort of, self-perpetuating -- as long as mmap is slower/less
reliable, applications will be hesitant to use it, thus there will be
little incentive to improve it. :-(

= As you can see by your timing results, even on your fastest box,
= processing a file around that size is only going to incur 1-2
= seconds of real time overhead to do the extra buffer copy. 2
= seconds is a hard number to beat.

I'd rather call attention to my slower -- CPU-bound -- boxes. On them, the
total CPU time spent computing the md5 of a file is less with mmap -- by a
noticeable margin. But because the CPU is underutilized, the elapsed "wall
clock" time is higher.

As far as the cache-using statistics, having to do a cache-cache copy
doubles the cache used, stealing it from other processes/kernel tasks.

Here, again, is from my first comparison on the P2 400MHz:

stdio: 56.837u 34.115s 2:06.61 71.8%   66+193k 11253+0io 3pf+0w
mmap:  72.463u  7.534s 2:34.62 51.7%   5+186k  105+0io   22328pf+0w

91 vs. 78 seconds CPU time (15% win for mmap), but 126 vs. 154 elapsed
(22% loss)? Why is the CPU so underutilized in the mmap case? There was
nothing else running at the time. The CPU was, indeed, at about 88%
utilization, according to top. This alone seems to invalidate some of
what you are saying below about the immediate disadvantages of mmap on a
modern CPU.

Or is a P2 400MHz not modern? Maybe, but the very modern Sparcs on which
FreeBSD intends to run are not much faster.

= The mmap interface is not supposed to be more efficient, per say.
= Why would it be?

Puzzling question. Because the kernel is supplied with more information
-- it knows that I only plan to _read_ from the memory (PROT_READ),
the total size of what I plan to read (mmap's len and, optionally,
madvise's len), and (optionally) that I plan to read sequentially
(MADV_SEQUENTIAL).

With that information, the kernel should be able to decide how many
pages to pre-fault in, and what and when to drop.
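
(Concretely, the kind of setup being talked about is roughly the
following sketch -- the file name, the process() consumer and the
omitted error checks are placeholders, not the actual program:)

	#include <sys/types.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <unistd.h>

	/* Map a whole file read-only, hint sequential access, hand it to
	 * the (placeholder) consumer, then unmap. */
	static void
	digest_mapped(const char *path)
	{
		struct stat st;
		int fd = open(path, O_RDONLY);          /* read-only access    */

		fstat(fd, &st);                         /* total size is known */
		char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
		madvise(buf, st.st_size, MADV_SEQUENTIAL); /* sequential hint  */
		process(buf, st.st_size);               /* placeholder consumer */
		munmap(buf, st.st_size);
		close(fd);
	}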

Mmap also needs no CPU data cache just to read the data. If the device is
capable of writing to memory directly (DMA?), the CPU does not need to be
involved at all, while with read the data still has to go from the
DMA-filled kernel buffer to the application buffer -- leaving two copies
of it in the cache instead of none for mere storing, or one for processing.

Also, in case of a RAM shortage, mmap-ed pages can simply be dropped, while
an overly large read buffer would have to be written out to swap.

And mmap requires no application buffers -- win, win, and win. Is there
an inherent "lose" somewhere that I don't see? Like:

=   On a modern cpu, where an L1 cache copy is a two cycle streaming
=   operation, the several hundred (or more) cycles it takes to process
=   a page fault or even just populate the page table is equivalent to a
=   lot of copied bytes.

But each call to read also takes cycles -- in the user space (read()
function) and in the kernel (the syscall). And there are a lot of them
too...

= mmap() is not designed to streamline large demand-page reads of
= data sets much larger then main memory.

Then it was not designed to take advantage of all the possibilities of
the interface, I say.

= mmap() works best for data that is already cached in the kernel,
= and even then it still has a fairly large hurdle to overcome vs a
= streaming read(). This is a HARDWARE limitation.

Wait, HARDWARE? Which hardware issues are we talking about? You
suggested I pre-fault in the pages, and Julian explained how best to do
it. If that is, indeed, the solution, why is the kernel not doing it for
me, pre-faulting in the same number of bytes that read pre-reads?

= 15% is nothing anyone cares about except perhaps gamers. I
= certainly couldn't care less about 15%. 50%, on the otherhand,
= is something that I would care about.

Well, here we have a server dedicated t

Re: read vs. mmap (or io vs. page faults)

2004-06-21 Thread Chuck Swiger
Matthew Dillon wrote:
Mikhail Teterin wrote:
=Both read and mmap have a read-ahead heuristic. The heuristic
=works. In fact, the mmap heuristic is so smart it can read-behind
=as well as read-ahead if it detects a backwards scan.
Evidently, read's heuristics are better. At least, for this task. I'm,
actually, surprised, they are _different_ at all.
It might be interesting to retry your tests under a Mach kernel.  BSD has 
multiple codepaths for IPC functionality that are unified under Mach.

The mmap interface is supposed to be more efficient -- theoreticly --
because it requires one less buffer-copying, and because it (together
with the possible madvise()) provides the kernel with more information
thus enabling it to make better (at least -- no worse) decisions.
I've heard people repeat the same notion, that is to say "that mmap()ing a 
file is supposed to be faster than read()ing it" [1], but the two operations 
are not quite the same thing, and there is more work being done to mmap a file 
(and thus gain random access to any byte of the file by dereferencing memory) 
than to read and process small blocks of data at a time.

Matt's right that processing a small block that fits into L1/L2 cache (and 
probably already is resident) is very fast.  The extra copy doesn't matter as 
much as it once did on slower machines, and he's provided some good analysis 
of L1/L2 caching issues and buffer copying speeds.

However, I tend to think the issue of buffer copying speed is likely to be 
moot when you are reading from disk and are thus I/O bound [2], rather than 
the manner in which the file's contents are presented to the program 
being that significant.

-
[1]: Actually, while it is intuitive to think you are telling the system, "hey, 
I want all of that file read into RAM now, as quickly as you can" when using 
mmap() and madvise(), what happens on systems which use demand-paging VM (like 
FreeBSD, Linux, and most others) is far lazier:

In reality, your process gets nothing but a promise from mmap() that if you 
access the right chunk of memory, your program will unblock once that data has 
been read and faulted into the local address space.  That level of urgency 
doesn't seem to correspond to what you asked for :-), although it still works 
pretty well in practice.

[2]: We're talking about maybe 20 to 60 or so MB/s for disk, versus 10x to 
100x that for RAM to RAM copying, much less the L2 copying speeds Matt 
mentions below:

Well, I think you forgot my earlier explanation regarding buffer copying.
Buffer copying is a very cheap operation if it occurs within the L1 or
L2 cache, and that is precisely what is happening when you read() into
a fixed buffer in a loop in a C program... your buffer is fixed in
memory and is almost guaranteed to be in the L1/L2 cache, which means
that the extra copy operation is very fast on a modern processor.  It's
something like 12-16 GBytes/sec to the L1 cache on an Athlon 64, for
example, and 3 GBytes/sec uncached to main memory.
This has been an interesting discussion, BTW, thanks.
--
-Chuck


Re: read vs. mmap (or io vs. page faults)

2004-06-21 Thread Matthew Dillon

:The mmap interface is supposed to be more efficient -- theoreticly --
:because it requires one less buffer-copying, and because it (together
:with the possible madvise()) provides the kernel with more information
:thus enabling it to make better (at least -- no worse) decisions.

Well, I think you forgot my earlier explanation regarding buffer copying.
Buffer copying is a very cheap operation if it occurs within the L1 or
L2 cache, and that is precisely what is happening when you read() into
a fixed buffer in a loop in a C program... your buffer is fixed in
memory and is almost guaranteed to be in the L1/L2 cache, which means
that the extra copy operation is very fast on a modern processor.  It's
something like 12-16 GBytes/sec to the L1 cache on an Athlon 64, for
example, and 3 GBytes/sec uncached to main memory.

Consider the cpu time cost, then, of the local copy on a 2GB file...
the cpu time cost on an AMD64 is about 2/12 of one second.  This is
the number mmap would have to beat. 

As you can see by your timing results, even on your fastest box,
processing a file around that size is only going to incur 1-2 seconds
of real time overhead to do the extra buffer copy.  2 seconds is a hard
number to beat.

This is something you can calculate yourself.  Time a dd from /dev/zero
to /dev/null.

crater# dd if=/dev/zero of=/dev/null bs=32k count=8192
268435456 bytes transferred in 0.244561 secs (1097620804 bytes/sec)

amd64# dd if=/dev/zero of=/dev/null bs=32k count=8192
268435456 bytes transferred in 0.066994 secs (4006846790 bytes/sec)

amd64# dd if=/dev/zero of=/dev/null bs=16m count=32
536870912 bytes transferred in 0.431774 secs (1243407512 bytes/sec)

Try it for different buffer sizes (16K through 16MB) and you will get
a feel for how the L1 and L2 caches affect copying bandwidth.  These
numbers are reasonably close to the raw memory bandwidth available to
the cpu (and will be different depending on whether the buffer fits in
the L1 or L2 caches, or doesn't fit at all).
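
(The same sweep can be approximated from C with a repeated memcpy()
instead of dd -- a rough sketch only: the buffer sizes, the clock()-based
timing and the "sink" variable used to keep the copies from being
optimized away are all illustrative choices, not part of anything above:)

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>

	int
	main(void)
	{
		size_t sizes[] = { 16*1024, 256*1024, 16*1024*1024 };
		size_t total = 256 * 1024 * 1024;   /* copy ~256MB per size */
		int i, r, sink = 0;

		for (i = 0; i < 3; i++) {
			size_t n = sizes[i];
			char *src = malloc(n), *dst = malloc(n);
			int reps = (int)(total / n);
			clock_t t0;
			double secs;

			memset(src, 1, n);
			t0 = clock();
			for (r = 0; r < reps; r++)
				memcpy(dst, src, n);   /* the copy being timed */
			secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
			sink += dst[n - 1];            /* keep the copies live */
			printf("%8lu-byte buffer: %.0f MB/sec\n", (unsigned long)n,
			    secs > 0 ? (total / (1024.0 * 1024.0)) / secs : 0.0);
			free(src);
			free(dst);
		}
		fprintf(stderr, "(sink=%d)\n", sink);
		return (0);
	}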

The mmap interface is not supposed to be more efficient, per se.  Why
would it be?  There are overheads involved with mapping the page table
entries and taking faults to map more.  Even if you pre-mapped everything,
there are still overheads involved in populating the page table and
performing invlpg operations on the TLB to reload the entry, and for
large data sets there is overhead involved with removing page table
entries and invalidating the pte.  On a modern cpu, where an L1 cache 
copy is a two cycle streaming operation, the several hundred (or more)
cycles it takes to process a page fault or even just populate the
page table is equivalent to a lot of copied bytes.

This immediately puts mmap() at a disadvantage on a modern cpu, but of
course it also depends on what the data processing loop itself is
doing.  If the data processing loop is sensitive to the L1 cache then
processing larger chunks of data is going to make it more efficient,
and mmap() can certainly provide that where read() might require buffers
too large to fit comfortably in the L1/L2 cache.  On the otherhand, if
the processing loop is relatively insensitive to the L1 cache (i.e. it's
small), then you can afford to process the data in smaller chunks, like
16K, without any significant penalty.

mmap() is not designed to streamline large demand-page reads of data
sets much larger than main memory.  mmap() works best for data that
is already cached in the kernel, and even then it still has a fairly
large hurdle to overcome vs a streaming read().  This is a HARDWARE
limitation.  Drastic action would have to be taken in software to get
rid of this overhead (we'd have to use 4MB page table entries, which
come with their own problems).

The overhead required to manage a large mmap'd data set can skyrocket.
FreeBSD (and DragonFly) have heuristics that attempt to detect
sequential operations like this with mmap'd data and to depress the
page priority behind the read (so: read-ahead and depress-behind), and
this works, but it only mitigates the additional overhead somewhat; it 
doesn't get rid of it.

For linear processing of large data sets you almost universally want
to use a read() loop.  There's no good reason to use mmap().

:=: read: 10.619u 23.814s 1:17.67 44.3%   62+274k 11255+0io 0pf+0w
:=
:Well, now we are venturing into the domain of humans' subjective
:perception... I'd say, 12% is plenty, actually. This is what some people
:achieve by rewriting stuff in assembler -- and are proud, when it works
::-)

Nobody is going to stare at their screen for one minute and 17 seconds
and really care that something might take one minute and 27 seconds instead
of one minute and 17 seconds.  That's

Re: read vs. mmap (or io vs. page faults)

2004-06-21 Thread Matthew Dillon

:
:ask for 8k, but the system will fetch the next 64k of data.  Problem is
:the system does nothing until you read the next 8k past the 64k
:alreqady read in, then it jumps up and grabs the next 64k.  You're
:still waiting on I/O every 8th read.  Ideally it would do an async
:..
:-- 
:   Dan Nelson
:   [EMAIL PROTECTED]


No, this isn't true.  The system places a marker 8K or 16K before the
last read block and initiates the next read-ahead before you exhaust the
first one.

For mapped data the system intentionally does not map the page table entry
for a page or two before the end of the read ahead in order to force a
page fault so it can initiate the next read ahead.

For read data the system marks a buffer near the end of the read-ahead
so when read encounters it the system knows to queue the next read-ahead.

Also, for that matter, remember that the hard drives themselves generally
cache whole tracks and do their own read-ahead.  This is why dd'ing a 
large file usually results in the maximum transfer rate the hard drive
can do.

-Matt
Matthew Dillon 
<[EMAIL PROTECTED]>


Re: read vs. mmap (or io vs. page faults)

2004-06-21 Thread Dan Nelson
In the last episode (Jun 21), Mikhail Teterin said:
> > Both read and mmap have a read-ahead heuristic. The heuristic
> > works. In fact, the mmap heuristic is so smart it can read-behind
> > as well as read-ahead if it detects a backwards scan.
> 
> Evidently, read's heuristics are better. At least, for this task.
> I'm, actually, surprised, they are _different_ at all.
[...] 
> That other OSes have similar shortcomings simply gives us some
> breathing room from an advocacy point of view. I hope, my rhetoric
> will burn an itch in someone capable of addressing it technically :-)
> 
> > The heuristic does not try to read megabytes and megabytes ahead,
> > however...
> 
> Neither does the read-handling.

I think part of the problem is that it's just clustering reads instead
of making sure the next N blocks of data are prefetched.  So you may
ask for 8k, but the system will fetch the next 64k of data.  Problem is
the system does nothing until you read the next 8k past the 64k
already read in, then it jumps up and grabs the next 64k.  You're
still waiting on I/O every 8th read.  Ideally it would do an async
fetch of an 8k block (64k ahead of the current read) every time you read
a block.  It should be a lot easier for read to do this, since the
kernel is getting a steady stream of syscalls.  Once a 64k chunk of
mmapped address space is pulled in, the system isn't notified until the
next page fault.  (or am I misunderstanding how readahead is
implemented on mmapped data?)
 
-- 
Dan Nelson
[EMAIL PROTECTED]


Re: read vs. mmap (or io vs. page faults)

2004-06-21 Thread Mikhail Teterin
=Both read and mmap have a read-ahead heuristic. The heuristic
=works. In fact, the mmap heuristic is so smart it can read-behind
=as well as read-ahead if it detects a backwards scan.

Evidently, read's heuristics are better -- at least for this task. I'm
actually surprised they are _different_ at all.

The mmap interface is supposed to be more efficient -- theoretically --
because it requires one less buffer copy, and because it (together
with the possible madvise()) provides the kernel with more information,
thus enabling it to make better (at least -- no worse) decisions.

That these theoretical advantages -- small or not -- are eaten by what
seem like practical implementation deficiencies, to the point that using
mmap is not only not faster, but frequently slower -- wallclock-wise --
is, in itself, a serious shortcoming that stands between an OS and
perfection.

That other OSes have similar shortcomings simply gives us some breathing
room from an advocacy point of view. I hope my rhetoric will burn an
itch in someone capable of addressing it technically :-)

=The heuristic does not try to read megabytes and megabytes ahead,
=however...

Neither does the read-handling.

=that might speed up this particular application a little, but it
=would destroy performance for many other types of applications,
=especially in a loaded environment.

I'm not asking mmap (page-fault handling) to cache any more aggressively
than read-handling does.

=Well now hold a second... the best you can do here is compare relative
=differences between mmap and read.

This is all I am doing, actually. :-)

=If you really want to compare operating systems, you have to run the
=OS's and the tests on the same hardware.

I am comparing the relative differences between read and mmap on
different OSes.

=:  4.8-stable on Pentium2-400MHz
=:  mmap: 21.507u 11.472s 1:27.53 37.6%   62+276k 99+0io 44736pf+0w
=:  read: 10.619u 23.814s 1:17.67 44.3%   62+274k 11255+0io 0pf+0w
=
=mmap 12% slower then read.  12% isn't much.

Well, now we are venturing into the domain of humans' subjective
perception... I'd say 12% is plenty, actually. This is what some people
achieve by rewriting stuff in assembler -- and are proud when it works
:-)

=:  recent -current on dual P2 Xeon-450MHz (mmap WINS -- SMP?)
=:  mmap: 12.482u 12.872s 2:28.70 17.0%   74+298k 23+0io 46522pf+0w
=:  read: 7.255u 16.366s 3:27.07 11.4%70+283k 44437+0io 7pf+0w
=
=mmap 39% faster.  That's a significant difference.
=
=It kinda smells funny, actually... are you sure that you compiled
=your FreeBSD-5 system with Witness turned off?

There are no "WITNESS" options in the kernel's config file (unlike in
NOTES). So, unless there has to be some sort of explicit "NOWITNESS", I
am sure.

=:  recent -current on a Centrino-laptop P4-1GHz (NO win at all)
=:  mmap: 4.197u 3.920s 2:07.57 6.3%  65+284k 63+0io 45568pf+0w
=:  read: 3.965u 4.265s 1:50.26 7.4%  67+291k 13131+0io 17pf+0w
=
=mmap 15% slower.

=:  Linux 2.4.20-30.9bigmem dual P4-3GHz (with a different file)
=:  mmap: 2.280u 4.800s 1:13.39 9.6%  0+0k 0+0io 512434pf+0w
=:  read: 1.630u 2.820s 0:08.89 50.0% 0+0k 0+0io 396pf+0w
=
=mmap 821% slower on Linux?  With a different file?  So these numbers
=can't be compared to anything else (over and above the fact that this
=machine is three times faster then any of the others).

No, the file is different (as is the processor) -- relative performance
difference only. I was quite surprised myself. My fmd5 program does not
show such a dramatic difference, but `fgrep --mmap' is vastly slower on
Linux than the regular `fgrep'. Here are the results of the two new
fgrep runs:

mmap1: 1.450u 3.000s 0:46.00 9.6%  0+0k 0+0io 512439pf+0w
read1: 1.830u 2.620s 0:09.51 46.7% 0+0k 0+0io 393pf+0w
mmap2: 1.700u 4.040s 1:02.31 9.2%  0+0k 0+0io 512427pf+0w
read2: 1.330u 3.150s 0:09.38 47.7% 0+0k 0+0io 396pf+0w

=I'm not sure why you are complaining about FreeBSD.

Because I have much higher expectations for it :-) I thought I'd be
able to use the powerful technique of presenting Linux's superiority in
some area to fire up rapid improvements in the same area in FreeBSD. Now
I'm back to fighting the "12% gain is not worth the effort" mentality.

=:Once mmap-handling is improved, all sorts of whole-file operations
=:(bzip2, gzip, md5, sha1) can be made faster...

=Well, your numbers don't really say that. It looks like you might
=eeek out a 10-15% improvement, and while this is faster it really
=isn't all that much faster. It certainly isn't something to write
=home about, and certainly not significant enough to warrant major
=codework.

Put it into perspective -- 10-15% is usually the difference between
the latest processor and the previous one. Pe

Re: read vs. mmap (or io vs. page faults)

2004-06-21 Thread Matthew Dillon

:
:= pre-faulting is best done by a worker thread or child process, or it
:= will just slow you down..
:
:Read is also used for large files sometimes, and never tries to prefetch
:the whole file at once. Why can't the same smarts/heuristics be employed
:by the page-fault handling code -- especially, if we are so proud of our
:unified caching?

Both read and mmap have a read-ahead heuristic.  The heuristic works.
In fact, the mmap heuristic is so smart it can read-behind as well as
read-ahead if it detects a backwards scan.  The heuristic does not try 
to read megabytes and megabytes ahead, however... that might speed up
this particular application a little, but it would destroy performance
for many other types of applications, especially in a loaded environment.

:If anything mmap/madvise provide the kernel with _more_ information than
:read -- kernel just does not use it, it seems.
:
:According to my tests (`fgrep string /huge/file' vs. `fgrep --mmap
:string /huge/file') the total CPU time is much less with mmap. But
:sometimes the total "wall clock" time is longer with itj because the CPU
:is underutilized, when using the mmap method.

Well now hold a second... the best you can do here is compare relative
differences between mmap and read.  All of these machines are different,
with different cpus and different configurations.  For example, a 
dual-P2 is going to be horrendously bad doing SMP things because the P2's
locked bus cycle instruction overhead is horrendous.  That is going to
seriously skew the results.  There are major architectural differences
between these cpus... cache size, memory bandwidth, MP operations 
overhead, not to mention raw megaherz.  Disk transfer rate and the
disk bus interface and driver will also make a big difference here,
as well as the contents of the file you are fgrep'ing.

If you really want to compare operating systems, you have to run the
OS's and the tests on the same hardware.

:   4.8-stable on Pentium2-400MHz
:   mmap: 21.507u 11.472s 1:27.53 37.6%   62+276k 99+0io 44736pf+0w
:   read: 10.619u 23.814s 1:17.67 44.3%   62+274k 11255+0io 0pf+0w

mmap 12% slower than read.  12% isn't much.

:   recent -current on dual P2 Xeon-450MHz (mmap WINS -- SMP?)
:   mmap: 12.482u 12.872s 2:28.70 17.0%   74+298k 23+0io 46522pf+0w
:   read: 7.255u 16.366s 3:27.07 11.4%70+283k 44437+0io 7pf+0w

mmap 39% faster.  That's a significant difference.

It kinda smells funny, actually... are you sure that you compiled
your FreeBSD-5 system with Witness turned off?

:   recent -current on a Centrino-laptop P4-1GHz (NO win at all)
:   mmap: 4.197u 3.920s 2:07.57 6.3%  65+284k 63+0io 45568pf+0w
:   read: 3.965u 4.265s 1:50.26 7.4%  67+291k 13131+0io 17pf+0w

mmap 15% slower.

:   Linux 2.4.20-30.9bigmem dual P4-3GHz (with a different file)
:   mmap: 2.280u 4.800s 1:13.39 9.6%  0+0k 0+0io 512434pf+0w
:   read: 1.630u 2.820s 0:08.89 50.0% 0+0k 0+0io 396pf+0w

mmap 821% slower on Linux?  With a different file?  So these numbers
can't be compared to anything else (over and above the fact that this
machine is three times faster than any of the others).

It kinda looks like either you wrote the Linux numbers down wrong,
or Linux's mmap is much, much worse than FreeBSD's.  I'm not sure why
you are complaining about FreeBSD.  If I were to assume 1:08.89 instead
of 1:13.39 the difference would be 6.5%, which is narrower than 15%
but not by all that much... a few seconds is nothing to quibble over.

:The attached md5-computing program is more CPU consuming than fgrep. It
:wins with mmap even on the "sceptical" Centrino-laptop -- presumably,
:because MD5_Update is not interrupted as much and remains in the
:instruction cache:
:
:   read: 22.024u 8.418s 1:28.44 34.4%5+166k 10498+0io 4pf+0w
:   mmap: 21.428u 3.086s 1:23.88 29.2%5+170k 40+0io 19649pf+0w

read is 6% faster than mmap here.

:Once mmap-handling is improved, all sorts of whole-file operations
:(bzip2, gzip, md5, sha1) can be made faster...
:
:   -mi

Well, your numbers don't really say that.  It looks like you might
eke out a 10-15% improvement, and while this is faster it really isn't
all that much faster.  It certainly isn't something to write home about,
and certainly not significant enough to warrant major codework.

Though I personally have major issues with FreeBSD-5's performance
in general, I don't really see that anything stands out in these tests
except perhaps for FreeBSD-5's horrible MP performance with read() vs
mmap() on the dual P2 (but I suspect that might be due to some other 
issue such as perhaps Witness being turned on).

If you really want to get comparative results you have to run all of
these tests on the same hardware with the

Re: read vs. mmap (or io vs. page faults)

2004-06-21 Thread Mikhail Teterin
On Sunday 20 June 2004 08:16 pm, Julian Elischer wrote:
= On Sun, 20 Jun 2004, Matthew Dillon wrote:
[...]
= > It is usually a bad idea to try to populate the page table with
= > all resident pages associated with the a memory mapping, because
= > mmap() is often used to map huge files...
[...]

= pre-faulting is best done by a worker thread or child process, or it
= will just slow you down..

Read is also used for large files sometimes, and it never tries to prefetch
the whole file at once. Why can't the same smarts/heuristics be employed
by the page-fault handling code -- especially if we are so proud of our
unified caching?

If anything, mmap/madvise provide the kernel with _more_ information than
read -- the kernel just does not use it, it seems.

According to my tests (`fgrep string /huge/file' vs. `fgrep --mmap
string /huge/file') the total CPU time is much less with mmap. But
sometimes the total "wall clock" time is longer with it, because the CPU
is underutilized when using the mmap method.

4.8-stable on Pentium2-400MHz
mmap: 21.507u 11.472s 1:27.53 37.6%   62+276k 99+0io 44736pf+0w
read: 10.619u 23.814s 1:17.67 44.3%   62+274k 11255+0io 0pf+0w

recent -current on dual P2 Xeon-450MHz (mmap WINS -- SMP?)
mmap: 12.482u 12.872s 2:28.70 17.0%   74+298k 23+0io 46522pf+0w
read: 7.255u 16.366s 3:27.07 11.4%70+283k 44437+0io 7pf+0w

recent -current on a Centrino-laptop P4-1GHz (NO win at all)
mmap: 4.197u 3.920s 2:07.57 6.3%  65+284k 63+0io 45568pf+0w
read: 3.965u 4.265s 1:50.26 7.4%  67+291k 13131+0io 17pf+0w

Linux 2.4.20-30.9bigmem dual P4-3GHz (with a different file)
mmap: 2.280u 4.800s 1:13.39 9.6%  0+0k 0+0io 512434pf+0w
read: 1.630u 2.820s 0:08.89 50.0% 0+0k 0+0io 396pf+0w

The attached md5-computing program is more CPU-consuming than fgrep. It
wins with mmap even on the "sceptical" Centrino laptop -- presumably
because MD5_Update is not interrupted as much and remains in the
instruction cache:

read: 22.024u 8.418s 1:28.44 34.4%5+166k 10498+0io 4pf+0w
mmap: 21.428u 3.086s 1:23.88 29.2%5+170k 40+0io 19649pf+0w
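
(The attachment itself is not reproduced in this archive; the mmap
variant presumably looks something like the following sketch, using
OpenSSL's MD5 routines -- the structure, names and missing error checks
here are illustrative, not the actual attached program:)

	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <openssl/md5.h>

	int
	main(int argc, char **argv)
	{
		unsigned char digest[MD5_DIGEST_LENGTH];
		struct stat st;
		MD5_CTX ctx;
		int i, fd;

		fd = open(argv[1], O_RDONLY);
		fstat(fd, &st);
		unsigned char *p = mmap(NULL, st.st_size, PROT_READ,
		    MAP_SHARED, fd, 0);

		MD5_Init(&ctx);
		MD5_Update(&ctx, p, st.st_size);  /* one pass over the mapping */
		MD5_Final(digest, &ctx);

		for (i = 0; i < MD5_DIGEST_LENGTH; i++)
			printf("%02x", digest[i]);
		printf("  %s\n", argv[1]);

		munmap(p, st.st_size);
		close(fd);
		return (0);
	}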

Once mmap-handling is improved, all sorts of whole-file operations
(bzip2, gzip, md5, sha1) can be made faster...

-mi



Re: read vs. mmap (or io vs. page faults)

2004-06-20 Thread Julian Elischer


On Sun, 20 Jun 2004, Matthew Dillon wrote:

> Hmm.  Well, you can try calling madvise(... MADV_WILLNEED), that's what
> it is for.  
> 
> It is usually a bad idea to try to populate the page table with all
> resident pages associated with the a memory mapping, because mmap()
> is often used to map huge files... hundreds of megabytes or even 
> dozens of gigabytes (on 64 bit architectures).  The last thing you want
> to do is to populate the page table for the entire file.  It might
> work for your particular program, but it is a bad idea for the OS to
> assume that for every mmap().
> 
> What it comes down to, really, is whether you feel you actually need the
> additional performance, because it kinda sounds to me that whatever 
> processing you are doing to the data is either going to be I/O bound,
> or it isn't going to run long enough for the additional overhead to matter
> verses the processing overhead of the program itself.
> 
> If you are really worried you could pre-fault the mmap before you do
> any processing at all and measure the time it takes to pre-fault the
> pages vs the time it takes to process the memory image.  (You pre-fault
> simply by accessing one byte of data in each page across the mmap(),
> before you begin any processing).

pre-faulting is best done by a worker thread or child process, or it
will just slow you down..


> 
>   -Matt
>   Matthew Dillon 
>   <[EMAIL PROTECTED]>
> 
> := It's hard to say.  mmap() could certainly be made more efficient, e.g.
> := by faulting in more pages at a time to reduce the actual fault rate.
> := But it's fairly difficult to beat a read copy into a small buffer.
> :
> :Well, that's the thing -- by mmap-ing the whole file at once (and by
> :madvise-ing with MADV_SEQUENTIONAL), I thought, I told, the kernel
> :everything it needed to know to make the best decision. Why can't
> :page-faulting code do a better job using all this knowledge, than the
> :poor read, which only knows about the partial read in question?
> :
> :I find it so disappointing, that it can, probably, be considered a bug.
> :I'll try this code on Linux and Solaris. If mmap is better there (as it
> :really ought to be), we have a problem, IMHO. Thanks!
> :
> : -mi
> :



Re: read vs. mmap (or io vs. page faults)

2004-06-20 Thread Matthew Dillon
Hmm.  Well, you can try calling madvise(... MADV_WILLNEED), that's what
it is for.  

It is usually a bad idea to try to populate the page table with all
resident pages associated with a memory mapping, because mmap()
is often used to map huge files... hundreds of megabytes or even 
dozens of gigabytes (on 64 bit architectures).  The last thing you want
to do is to populate the page table for the entire file.  It might
work for your particular program, but it is a bad idea for the OS to
assume that for every mmap().

What it comes down to, really, is whether you feel you actually need the
additional performance, because it kinda sounds to me that whatever 
processing you are doing to the data is either going to be I/O bound,
or it isn't going to run long enough for the additional overhead to matter
versus the processing overhead of the program itself.

If you are really worried you could pre-fault the mmap before you do
any processing at all and measure the time it takes to pre-fault the
pages vs the time it takes to process the memory image.  (You pre-fault
simply by accessing one byte of data in each page across the mmap(),
before you begin any processing).
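
(A minimal sketch of that pre-faulting pass, for concreteness -- the
function name is made up here, and the volatile sink is just one way to
keep the compiler from dropping the otherwise-unused reads:)

	#include <stddef.h>
	#include <unistd.h>

	/* Touch one byte in every page of the mapping so each page is
	 * faulted in up front, before any real processing begins. */
	static void
	prefault(const char *base, size_t len)
	{
		long pagesize = sysconf(_SC_PAGESIZE);
		volatile char sink = 0;
		size_t off;

		for (off = 0; off < len; off += (size_t)pagesize)
			sink = base[off];
		(void)sink;
	}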

-Matt
Matthew Dillon 
<[EMAIL PROTECTED]>

:= It's hard to say.  mmap() could certainly be made more efficient, e.g.
:= by faulting in more pages at a time to reduce the actual fault rate.
:= But it's fairly difficult to beat a read copy into a small buffer.
:
:Well, that's the thing -- by mmap-ing the whole file at once (and by
:madvise-ing with MADV_SEQUENTIONAL), I thought, I told, the kernel
:everything it needed to know to make the best decision. Why can't
:page-faulting code do a better job using all this knowledge, than the
:poor read, which only knows about the partial read in question?
:
:I find it so disappointing, that it can, probably, be considered a bug.
:I'll try this code on Linux and Solaris. If mmap is better there (as it
:really ought to be), we have a problem, IMHO. Thanks!
:
:   -mi
:


Re: read vs. mmap (or io vs. page faults)

2004-06-20 Thread Mikhail Teterin
On Sunday 20 June 2004 02:35 pm, you wrote:
=
= :I this how things are supposed to be, or will mmap() become more
= :efficient eventually? Thanks!
= :
= : -mi

= It's hard to say.  mmap() could certainly be made more efficient, e.g.
= by faulting in more pages at a time to reduce the actual fault rate.
= But it's fairly difficult to beat a read copy into a small buffer.

Well, that's the thing -- by mmap-ing the whole file at once (and by
madvise-ing with MADV_SEQUENTIAL), I thought I told the kernel
everything it needed to know to make the best decision. Why can't the
page-faulting code do a better job using all this knowledge than the
poor read, which only knows about the partial read in question?

I find it so disappointing that it can probably be considered a bug.
I'll try this code on Linux and Solaris. If mmap is better there (as it
really ought to be), we have a problem, IMHO. Thanks!

-mi



Re: read vs. mmap (or io vs. page faults)

2004-06-20 Thread Matthew Dillon
:Hello!
:
:I'm writing a message-digest utility, which operates on file and
:can use either stdio:
:
:   while (not eof) {
:   char buffer[BUFSIZE];
:   size = read( buffer ...);
:   process(buffer, size);
:   }
:
:or mmap:
:
:   buffer = mmap(... file_size, PROT_READ ...);
:   process(buffer, file_size);
:
:I expected the second way to be faster, as it is supposed to avoid
:one memory copying (no user-space buffer). But in reality, on a
:CPU-bound (rather than IO-bound) machine, using mmap() is considerably
:slower. Here are the tcsh's time results:

read() is likely going to be faster because it does not involve any
page fault overhead.  The VM system only faults 16 or so pages ahead 
which is only 64KB, so the fault overhead is very high for the data rate.
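
(A quick sanity check against the numbers in this thread: if the file in
these runs is the same 1462726660-byte file mentioned elsewhere, then
1462726660 / 65536 ~= 22,300 read-ahead clusters, which lines up with the
~19,000-22,000 page faults time(1) reports for the mmap runs.)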

Why does the extra copy not matter?  Well, it's fairly simple, actually.
It's because your buffer is smaller than the L1 cache, and/or also simply
because the VM fault overhead is higher than it would take to copy
an extra 64KB.

read() loops typically use buffer sizes in the 8K-64K range.  L1 caches
are typically 16K (for celeron class cpus) through 64K, or more for
higher end cpus.  L2 caches are typically 256K-1MB, or more.  The copy
bandwidth from or to the L1 cache is usually around 10x faster than main
memory and the copy bandwidth from or to the L2 cache is usually
around 4x faster.  (Note that I'm talking copy bandwidth here, not random
access.  The L1 cache is ~50x faster or more for random access).

So the cost of the extra copy in a read() loop using a reasonable buffer
size (~8K-64K) (L1 or L2 access) is virtually nil compared to the cost
of accessing the kernel's buffer cache (which involves main memory
accesses for files > L2 cache).

:On the IO-bound machine, using mmap is only marginally faster:
:
:   Single Pentium4M (Centrino 1GHz) runing recent -current:
:   
:stdio: 27.195u 8.280s 1:33.02 38.1%10+169k 11221+0io 1pf+0w
:mmap:  26.619u 3.004s 1:23.59 35.4%10+169k 47+0io 19463pf+0w

Yes, because it's I/O bound.  As long as the kernel queues some readahead
to the device it can burn those cpu cycles on whatever it wants without
really affecting the transfer rate.

:I this how things are supposed to be, or will mmap() become more
:efficient eventually? Thanks!
:
:   -mi

It's hard to say.  mmap() could certainly be made more efficient, e.g.
by faulting in more pages at a time to reduce the actual fault rate.
But it's fairly difficult to beat a read copy into a small buffer.

-Matt
Matthew Dillon 
<[EMAIL PROTECTED]>


Re: read vs. mmap (or io vs. page faults)

2004-06-20 Thread Mikhail Teterin
On Sunday 20 June 2004 11:41 am, Dan Nelson wrote:
= In the last episode (Jun 20), Mikhail Teterin said:
= > I expected the second way to be faster, as it is supposed to avoid
= > one memory copying (no user-space buffer). But in reality, on a
= > CPU-bound (rather than IO-bound) machine, using mmap() is
= > considerably slower. Here are the tcsh's time results:

= MADV_SEQUENTIAL just lets the system expire already-read blocks from
= its cache faster, so it won't help much here.

That may be what it _does_, but from the manual page one gets the
impression that it should tell the VM that once a page is requested (and
had to be page-faulted in), the one after it will be requested soon and
may as well be prefetched (and the ones before can be dropped if memory
is in short supply). Anyway, using MADV_SEQUENTIAL is consistently
making mmap behave slightly worse, rather than having no effect.

But let's not get distracted with madvise(). Why is mmap() slower? So
much so that the machine, which is CPU-bound using read(), only uses 90%
of the CPU when using mmap -- while, at the same time, the disk
bandwidth is also less than that of read(). It looks to me like a lot of
thought went into optimizing read(), but much less into mmap, which is
supposed to be faster -- less memory shuffling. Is that true? Is there
something inherent in the mmap style of reading that I don't see?

= read() should cause some prefetching to occur, but it obviously
= doesn't work all the time or else inblock wouldn't have been as high
= as 11000. For sequential access I would have expected read() to have
= been able to prefetch almost every block before the userland process
= needed it.

Thanks!

-mi



Re: read vs. mmap (or io vs. page faults)

2004-06-20 Thread Dan Nelson
In the last episode (Jun 20), Mikhail Teterin said:
> I expected the second way to be faster, as it is supposed to avoid
> one memory copying (no user-space buffer). But in reality, on a
> CPU-bound (rather than IO-bound) machine, using mmap() is
> considerably slower. Here are the tcsh's time results:
> 
>   Single Pentium2-400MHz running 4.8-stable:
>   --
> stdio: 56.837u 34.115s 2:06.61 71.8%   66+193k 11253+0io 3pf+0w
> mmap:  72.463u  7.534s 2:34.62 51.7%   5+186k  105+0io   22328pf+0w
> 
>   Dual Pentium2 Xeon 450MHz running recent -current:
>   --
> stdio: 36.557u 29.395s 3:09.88 34.7%   10+165k 32646+0io 0pf+0w
> mmap:  42.052u  7.545s 2:02.25 40.5%   10+169k 16+0io15232pf+0w
> 
> On the IO-bound machine, using mmap is only marginally faster:
> 
>   Single Pentium4M (Centrino 1GHz) runing recent -current:
>   
> stdio: 27.195u 8.280s 1:33.02 38.1%10+169k 11221+0io 1pf+0w
> mmap:  26.619u 3.004s 1:23.59 35.4%10+169k 47+0io19463pf+0w
> 
> Notice the last two columns in time's output -- why is page-faulting a
> page in -- on-demand -- so much slower then read()-ing it? I even tried
> inserting ``madvise(buffer, file_size, MADV_SEQUENTIAL)'' between the
> mmap() and the process() -- made difference at all (or made the mmap()
> take slightly longer)...

MADV_SEQUENTIAL just lets the system expire already-read blocks from
its cache faster, so it won't help much here.  read() should cause some
prefetching to occur, but it obviously doesn't work all the time or
else inblock wouldn't have been as high as 11000.  For sequential
access I would have expected read() to have been able to prefetch
almost every block before the userland process needed it.

-- 
Dan Nelson
[EMAIL PROTECTED]


read vs. mmap (or io vs. page faults)

2004-06-20 Thread Mikhail Teterin
Hello!

I'm writing a message-digest utility, which operates on a file and
can use either stdio:

	while (not eof) {
		char buffer[BUFSIZE];
		size = read( buffer ...);
		process(buffer, size);
	}

or mmap:

buffer = mmap(... file_size, PROT_READ ...);
process(buffer, file_size);
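
(Spelled out slightly more completely, the two variants look roughly like
this -- still a sketch: BUFSIZE, process(), the error handling and the
function names are placeholders, and <unistd.h>/<sys/mman.h> are assumed:)

	/* Variant 1: read() loop -- a fixed user-space buffer, with the
	 * data copied out of the kernel's buffer cache on every call. */
	void
	digest_read(int fd)
	{
		char buffer[BUFSIZE];
		ssize_t size;

		while ((size = read(fd, buffer, sizeof(buffer))) > 0)
			process(buffer, size);
	}

	/* Variant 2: mmap() -- the whole file mapped once and paged in
	 * on demand; no user-space buffer and no extra copy. */
	void
	digest_mmap(int fd, size_t file_size)
	{
		char *buffer = mmap(NULL, file_size, PROT_READ, MAP_SHARED,
		    fd, 0);

		process(buffer, file_size);
		munmap(buffer, file_size);
	}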

I expected the second way to be faster, as it is supposed to avoid
one memory copy (no user-space buffer). But in reality, on a
CPU-bound (rather than IO-bound) machine, using mmap() is considerably
slower. Here are tcsh's time results:

Single Pentium2-400MHz running 4.8-stable:
--
stdio:  56.837u 34.115s 2:06.61 71.8%   66+193k 11253+0io 3pf+0w
mmap:   72.463u 7.534s 2:34.62 51.7%5+186k 105+0io 22328pf+0w

Dual Pentium2 Xeon 450MHz running recent -current:
--
stdio:  36.557u 29.395s 3:09.88 34.7%   10+165k 32646+0io 0pf+0w
mmap:   42.052u 7.545s 2:02.25 40.5%10+169k 16+0io 15232pf+0w

On the IO-bound machine, using mmap is only marginally faster:

Single Pentium4M (Centrino 1GHz) runing recent -current:

stdio:  27.195u 8.280s 1:33.02 38.1%10+169k 11221+0io 1pf+0w
mmap:   26.619u 3.004s 1:23.59 35.4%10+169k 47+0io 19463pf+0w

Notice the last two columns in time's output -- why is page-faulting a
page in -- on demand -- so much slower than read()-ing it? I even tried
inserting ``madvise(buffer, file_size, MADV_SEQUENTIAL)'' between the
mmap() and the process() -- it made no difference at all (or made the
mmap() take slightly longer)...

Is this how things are supposed to be, or will mmap() become more
efficient eventually? Thanks!

-mi


