On Wed, 24 Apr 2002, Bruce Momjian wrote:

> >     1. Not all systems do readahead.
>
> If they don't, that isn't our problem.  We expect it to be there, and if
> it isn't, the vendor/kernel is at fault.

It is your problem when another database kicks Postgres' ass
performance-wise.

And at that point, *you're* at fault. You're the one who's knowingly
decided to do things inefficiently.

Sorry if this sounds harsh, but this, "Oh, someone else is to blame"
attitude gets me steamed. It's one thing to say, "We don't support
this." That's fine; there are often good reasons for that. It's a
completely different thing to say, "It's an unrelated entity's fault we
don't support this."

At any rate, relying on the kernel to guess how to optimise for
the workload will never work as well as having the software that
knows the workload do the optimisation itself.

The lack of support thing is no joke. Sure, lots of systems nowadays
support a unified buffer cache and read-ahead. But how many, besides
Solaris, support free-behind, which is also very important if you
want to avoid blowing out your buffer cache when doing sequential
reads? And who supports read-ahead for reverse scans at all? (Or
does Postgres not do those anyway? I can see that the support is
there.)
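
Just to make the free-behind point concrete, here's a rough sketch
of how a sequential scanner could do the hinting itself with
posix_fadvise(). Take it as illustration only: posix_fadvise() is in
the newer standards but far from universally implemented, and the
chunk size and everything else here are my own made-up details, not
anything Postgres actually does.

/* Sketch: sequential read with explicit read-ahead hint and
 * free-behind, on systems that honour posix_fadvise(). */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK (64 * 1024)

int
main(int argc, char **argv)
{
    char    buf[CHUNK];
    off_t   done = 0;
    ssize_t n;
    int     fd;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
    {
        perror("open");
        return 1;
    }

    /* Hint that we'll read the whole file sequentially, so the
     * kernel can read ahead aggressively. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    while ((n = read(fd, buf, sizeof(buf))) > 0)
    {
        /* ... process the chunk here ... */

        /* Free-behind: tell the kernel we're done with these pages
         * so they don't push more useful data out of the cache. */
        posix_fadvise(fd, done, n, POSIX_FADV_DONTNEED);
        done += n;
    }
    close(fd);
    return 0;
}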

And even when the facilities are there, you create problems by
using them.  Look at the OS buffer cache, for example. Not only do
we lose efficiency by using two layers of caching, but (as people
have pointed out recently on the lists) the optimizer can't even
know how much or what is being cached, and thus can't make decisions
based on that.
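
For what it's worth, the usual way out of the double-caching problem
is to keep the data out of the OS cache entirely. Here's a rough
sketch; note that O_DIRECT is nonstandard (Linux has it, Solaris
spells the idea directio(), others have nothing), the alignment
rules vary by system, and the constants are just my illustration:

/* Sketch: bypass the OS buffer cache so the application's own
 * cache is the only caching layer.  Nonportable; illustrative. */
#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLKSZ 8192              /* transfer size and alignment */

int
main(int argc, char **argv)
{
    void   *buf;
    int     fd;

    /* O_DIRECT requires the buffer (and usually the offset and
     * length) to be suitably aligned, hence posix_memalign(). */
    if (argc < 2 || posix_memalign(&buf, BLKSZ, BLKSZ) != 0)
        return 1;

    /* With O_DIRECT the transfer goes straight between the device
     * and our buffer; nothing is left behind in the OS page cache. */
    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    if (read(fd, buf, BLKSZ) < 0)
        perror("read");

    close(fd);
    free(buf);
    return 0;
}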

> Yes, seek() in file will turn off read-ahead.  Grabbing bigger chunks
> would help here, but if you have two people already reading from the
> same file, grabbing bigger chunks of the file may not be optimal.

Grabbing bigger chunks is always optimal, AFAICT, as long as they're
not *too* big and you actually use the data. A single 64K read takes
very little longer than a single 8K read.
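
Here's the sort of trivial test I have in mind. The program is just
my own illustration; for a fair comparison the file needs to be
bigger than RAM (or the cache flushed between passes), and the exact
numbers will of course vary by system:

/* Sketch: read the same file in 8K and then 64K chunks, counting
 * read() calls and elapsed time for each pass. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

static double
scan(const char *path, size_t chunk)
{
    struct timeval t0, t1;
    char       *buf = malloc(chunk);
    ssize_t     n;
    long        calls = 0;
    int         fd = open(path, O_RDONLY);

    if (fd < 0 || buf == NULL)
    {
        perror("open/malloc");
        exit(1);
    }

    gettimeofday(&t0, NULL);
    while ((n = read(fd, buf, chunk)) > 0)
        calls++;
    gettimeofday(&t1, NULL);

    printf("%7lu-byte reads: %ld read() calls\n",
           (unsigned long) chunk, calls);

    close(fd);
    free(buf);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
}

int
main(int argc, char **argv)
{
    if (argc < 2)
        return 1;
    printf("8K pass:  %.3f s\n", scan(argv[1], 8 * 1024));
    printf("64K pass: %.3f s\n", scan(argv[1], 64 * 1024));
    return 0;
}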

> >     3. Even when the read-ahead does occur, you're still doing more
> >     syscalls, and thus more expensive kernel/userland transitions, than
> >     you have to.
>
> I would guess the performance impact is minimal.

If it were minimal, people wouldn't work so hard to build multi-level
thread systems, where multiple userland threads are scheduled on
top of kernel threads.

However, it does depend on how much CPU your particular application
is using. You may have it to spare.

>       http://candle.pha.pa.us/mhonarc/todo.detail/performance/msg00009.html

Well, this message has some points in it that I feel are just incorrect.

    1. It is *not* true that you have no idea where data is when
    using a storage array or other similar system. While you
    certainly ought not worry about things such as head positions
    and so on, it's been a given for a long, long time that two
    blocks that have close index numbers are going to be close
    together in physical storage.

    2. Raw devices are quite standard across Unix systems (except
    in the unfortunate case of Linux, which I think has been
    remedied, hasn't it?). They're very portable, and their write
    semantics are just as well defined as a filesystem's, if not
    better.

    3. My observations of OS performance tuning over the past six
    or eight years contradict the statement, "There's a considerable
    cost in complexity and code in using "raw" storage too, and
    it's not a one off cost: as the technologies change, the "fast"
    way to do things will change and the code will have to be
    updated to match." While some optimizations have been removed
    over the years, the basic ones (order reads by block number,
    do larger reads rather than smaller, cache the data) have
    remained unchanged for a long, long time. (There's a small
    sketch of the block-ordering idea after this list.)

    4. "Better to leave this to the OS vendor where possible, and
    take advantage of the tuning they do." Well, sorry guys, but
    have a look at the tuning they do. It hasn't changed in years,
    except to remove now-unnecessary complexity related to really,
    really old and slow disk devices, and to add a few things that
    guess at the workload but still do a worse job than if the
    workload generator just did its own optimisations in the first
    place.
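
And here, roughly, is the block-ordering sketch I promised above.
The ReadRequest structure, read_batch() and the BLCKSZ constant are
my own illustration, not the Postgres buffer manager:

/* Sketch: sort a batch of block reads by block number before issuing
 * them, so a random batch becomes a mostly sequential pass. */
#include <stdio.h>
#include <stdlib.h>

#define BLCKSZ 8192             /* illustrative block size */

typedef struct
{
    long    blockno;            /* which block to read */
    char   *dest;               /* where the caller wants it */
} ReadRequest;

static int
by_blockno(const void *a, const void *b)
{
    long    x = ((const ReadRequest *) a)->blockno;
    long    y = ((const ReadRequest *) b)->blockno;

    return (x > y) - (x < y);
}

/* Issue a batch of block reads in ascending block-number order.
 * Since nearby block numbers are nearby on disk (point 1 above),
 * the device ends up seeing something close to a sequential scan. */
static void
read_batch(FILE *fp, ReadRequest *reqs, int n)
{
    int     i;

    qsort(reqs, n, sizeof(ReadRequest), by_blockno);
    for (i = 0; i < n; i++)
    {
        fseek(fp, reqs[i].blockno * (long) BLCKSZ, SEEK_SET);
        fread(reqs[i].dest, 1, BLCKSZ, fp);
    }
}

int
main(int argc, char **argv)
{
    static char pages[3][BLCKSZ];
    ReadRequest reqs[3] = {
        {7, pages[0]}, {2, pages[1]}, {5, pages[2]}
    };
    FILE   *fp;

    if (argc < 2 || (fp = fopen(argv[1], "rb")) == NULL)
        return 1;
    read_batch(fp, reqs, 3);
    fclose(fp);
    return 0;
}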

>       http://candle.pha.pa.us/mhonarc/todo.detail/optimizer/msg00011.html

Well, this one, with statements like "Postgres does have control
over its buffer cache," I don't know what to say. You can interpret
the statement however you like, but in the end Postgres has very
little control at all over how data is moved between memory and
disk.

BTW, please don't take me as saying that all control over physical
IO should be done by Postgres. I just think that Postgres could do
a better job of managing data transfer between disk and memory than
the OS can. The rest of the things (using raw partitions, read-ahead,
free-behind, etc.) just drop out of that one idea.

cjs
-- 
Curt Sampson  <[EMAIL PROTECTED]>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC

