Greg Stark <[EMAIL PROTECTED]> writes:

> Well my theory was sort of half right. It has nothing to do with fooling
> Linux into thinking it's a sequential read. Apparently this filesystem was
> created with 32k blocks. I don't remember if that was intentional or if
> ext2/3 did it automatically based on the size of the filesystem.
>
> So it doesn't have wide-ranging implications for Postgres's default 8k
> block size. But it is a good lesson about the importance of not using a
> larger filesystem block than Postgres's block size. The net effect is that
> if the filesystem block is N*8k then your random_page_cost goes up by a
> factor of N. That could be devastating for OLTP performance.
Hm, apparently I spoke too soon. tune2fs says the block size is in fact 4k.
Yet the performance of the block-reading test program with 4k or 8k blocks
behaves as if Linux is reading 32k blocks. And in fact when I run it with
32k blocks I get the kind of behaviour we were expecting, where the
breakeven point is around 20%.

So it's not the 8k block reading that's fooling Linux into reading ahead
32k. It seems 32k readahead is the default for Linux, or perhaps it's the
sequential access pattern that's triggering it.

I'm trying to test it with posix_fadvise() set to random access, but I'm
having trouble compiling. Do I need a special #define to get posix_fadvise
from glibc?

--
greg

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not match