On 4/9/06, Tom Lane <[EMAIL PROTECTED]> wrote:
> "Gregory Maxwell" <[EMAIL PROTECTED]> writes:
> > For example, one case made in this thread involved bursty performance
> > with seqscans presumably because the I/O was stalling while processing
> > was being performed.
>
> Actually, the question that that raised in my mind is "why isn't the
> kernel doing read-ahead properly?"  When we're doing nonsequential
> access like an indexscan, it's unsurprising that the kernel can't guess
> which block we need next, but in a plain seqscan you'd certainly expect
> the read-ahead algorithm to kick in and ensure that the next block is
> fetched before we need it.
>
> So before we go inventing complicated bits of code with lots of added
> overhead, we should first find out exactly why the system doesn't
> already work the way it's supposed to.
But is that really the behavior we should expect?

How much memory can we expect the OS to spend on opportunistic read-in? How much disk access should be spent on a guess? There is an intrinsic tradeoff here: applications tend to be bursty, so just because you're reading a lot now doesn't mean you'll continue... and the filesystem will have fragmentation, so a failed guess can translate into a lot of pointless seeking.

As I recall, in Linux 2.6 you have something like a max of 128KB of readahead. Given that and a disk subsystem that reads at 200MB/sec, you can't spend more than about 600us processing before requesting enough additional blocks to put the disk back into readahead, or you will stall the disk. Stalling the disk costs more than you'd expect: because of FS fragmentation, there can be terrific gains from allowing the OS and disk to issue reads out of order from a large request queue.

It would probably be reasonable to say that the OS should be using much larger readahead buffers, especially on systems with fast disk subsystems... But that doesn't come for free and can slaughter performance for many workloads (consider, what if it was triggering 5MB of file-oriented read-ahead for every index scan seek we did?).

There is an adaptive readahead patch for Linux which should improve things (http://lwn.net/Articles/176279/ and if you google around there are some benchmarks) but I doubt that even that would be able to keep a 200MB/sec+ disk subsystem saturated with the sort of access patterns PG has...

To address this in a cross-platform way will be a challenge. I doubt Linux is alone in having skimpy readahead (because big readahead translates into huge losses if you get it wrong).

Given this information, a stupid 'fault-in' process should probably give huge gains for seqscans... but I think the extra work required to find a solution which is also useful for index operations is probably worth it as well.
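
For anyone who wants to check the arithmetic behind the roughly 600us figure, here is a minimal sketch. The function name stall_budget_us is made up for illustration, the units are assumed to be binary (KiB/MiB), and the 8KB page size is an assumption about a default PostgreSQL build, not something measured here.

```python
KIB = 1024
MIB = 1024 * KIB


def stall_budget_us(readahead_bytes: int, throughput_bytes_per_s: float) -> float:
    """Time the disk takes to stream one full readahead window, in microseconds.

    If the reader spends longer than this before asking for more data,
    the disk runs out of queued readahead and sits idle.
    """
    return readahead_bytes / throughput_bytes_per_s * 1_000_000


# Figures from the text: ~128KB max readahead, 200MB/sec disk subsystem.
budget = stall_budget_us(128 * KIB, 200 * MIB)

# Assumption: 8KB pages, so the window holds 16 pages.
PAGE_SIZE = 8 * KIB
pages_per_window = (128 * KIB) // PAGE_SIZE
per_page_budget = budget / pages_per_window

print(f"window budget: {budget:.0f} us")
print(f"pages per window: {pages_per_window}")
print(f"per-page budget: {per_page_budget:.1f} us")
```

With binary units the window budget comes out at about 625us, in line with the rough number above, and under the 8KB page assumption that leaves only about 39us of processing per page before the disk starts to stall.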