On 4/9/06, Tom Lane <[EMAIL PROTECTED]> wrote:
> "Gregory Maxwell" <[EMAIL PROTECTED]> writes:
> > For example, one case made in this thread involved bursty performance
> > with seqscans presumably because the I/O was stalling while processing
> > was being performed.
>
> Actually, the question that that raised in my mind is "why isn't the
> kernel doing read-ahead properly?"  When we're doing nonsequential
> access like an indexscan, it's unsurprising that the kernel can't guess
> which block we need next, but in a plain seqscan you'd certainly expect
> the read-ahead algorithm to kick in and ensure that the next block is
> fetched before we need it.
>
> So before we go inventing complicated bits of code with lots of added
> overhead, we should first find out exactly why the system doesn't
> already work the way it's supposed to.

But is that really the behavior we should expect?

How much memory can we expect the OS to spend on opportunistic
read-in? How much disk access should be spent on a guess? There is an
intrinsic tradeoff here: applications tend to be bursty, so just
because you're reading a lot now doesn't mean you'll continue... and
the filesystem will have fragmentation, so a failed guess can
translate into a lot of pointless seeking.

As I recall, Linux 2.6 caps readahead at something like 128KB. Given
that and a disk subsystem that reads at 200MB/sec, you can't spend
more than about 600us processing before requesting enough additional
blocks to put the disk back into readahead mode, or you will stall the
disk.  Stalling the disk costs more than you'd expect: due to FS
fragmentation there can be terrific gains from allowing the OS and
disk to issue reads out of order from a large request queue.
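
For concreteness, here's the back-of-envelope version of that number;
the 128KB window and 200MB/sec rate are just the illustrative figures
above, not measurements:

    #include <stdio.h>

    /* Stall budget: the time the disk needs to deliver one readahead
     * window is all the CPU time a backend can spend per window
     * before the drive goes idle. */
    int main(void)
    {
        double window_bytes = 128.0 * 1024;          /* readahead window */
        double disk_rate    = 200.0 * 1024 * 1024;   /* bytes per second */
        double budget_us    = window_bytes / disk_rate * 1e6;

        printf("processing budget per window: ~%.0f us\n", budget_us);
        return 0;
    }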

It would probably be reasonable to say that the OS should be using
much larger readahead buffers, especially on systems with fast disk
subsystems... But that doesn't come for free and can slaughter
performance for many workloads (consider what would happen if it
triggered 5MB of file-oriented read-ahead for every index scan seek we
did?). There is an adaptive readahead patch for Linux which should
improve things (http://lwn.net/Articles/176279/; if you google around
there are some benchmarks), but I doubt even that would be able to
keep a 200MB/sec+ disk subsystem saturated with the sort of access
patterns PG has...

Addressing this in a cross-platform way will be a challenge. I doubt
Linux is alone in having skimpy readahead (because big readahead
translates into huge losses when the guess is wrong).

Given this, even a stupid 'fault-in' process should give huge gains
for seqscans... but I think the extra work required to find a solution
that is also useful for index operations is worth it as well.
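
To make the fault-in idea concrete, here's a minimal sketch of the
kind of helper I have in mind, assuming a platform with
posix_fadvise(); the function name, the 1MB prefetch distance, and
reusing PG's 8KB block size are just assumptions for illustration, not
anything that exists today:

    #define _XOPEN_SOURCE 600   /* for posix_fadvise on glibc */
    #include <fcntl.h>

    #define PREFETCH_DISTANCE (1024 * 1024)  /* assumed: stay ~1MB ahead */
    #define BLCKSZ 8192                      /* PG's default block size */

    /*
     * Hypothetical fault-in hint: as the seqscan finishes block
     * 'blocknum' of the relation file 'fd', tell the kernel we will
     * soon want the next PREFETCH_DISTANCE bytes, so the disk keeps
     * streaming while the backend is busy processing tuples.
     */
    static void
    faultin_ahead(int fd, long blocknum)
    {
        off_t next = ((off_t) blocknum + 1) * BLCKSZ;

        /* Purely advisory; failure is harmless and can be ignored. */
        (void) posix_fadvise(fd, next, PREFETCH_DISTANCE,
                             POSIX_FADV_WILLNEED);
    }

Something along these lines, issued from the scan loop every few
blocks, would also be a cheap way to prototype the index-scan case,
where the upcoming block numbers are known a little in advance.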
