Hi Stephen, hackers,

>> > With all those 'readahead' calls it certainly makes one wonder if the
>> > Linux kernel is reading more than just the block we're looking for
>> > because it thinks we're doing a sequential read and will therefore want
>> > the next few blocks when, in reality, we're going to skip past them,
>> > meaning that any readahead the kernel is doing is likely just wasted
>> > I/O.
>> I've done some quick&dirty tests with blockdev --setra/setfra 0 after
>> spending time looking at the smgr/md/fd API changes required to find
>> shortcut, but I'm getting actually a little bit worse timings at least on
>> "laptop DB tests". One thing that I've noticed is that needs to be only for
>> automatic-analyze, but not for automatic-vacuum where apparently there is
>> some boost due to readahead.

>Interesting that you weren't seeing any benefit to disabling readahead.

I've got some free minutes and I have repeated the exercise in a more
realistic and strict environment than the previous one, and I conclude that
the current situation is preferable:

The analyzed table was 171GB (as reported by \dt+) + indexes: 35GB, 147GB,
35GB, 65GB (as reported by \di+).
Linux kernel 4.14.x, 2x NVMe under dm-0 (it might matter, as /dev/dm-0 might
be a different layer and might have different storage settings), VG on top of
dm-0, LV with stripe-size 8kB, ext4.
s_b=128MB, RAM=128GB (- ~30GB which were reserved for HugePages), typical
output of PgSQL12:
INFO:  "x": scanned 1500000 of 22395442 pages, containing 112410444 live rows
and 0 dead rows; 1500000 rows in sample, 1678321053 estimated total rows

Hot VFS cache:
Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is
spent in heapam_scan_analyze_next_block() -> .. -> pread(), which causes
~70..80MB/s as reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat
with average request size per IO=25 sectors (*512/1024 = ~12kB), readahead on,
hot caches, total elapsed ~3m
Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s
Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s
Run3: Defaults, similar as above (hot VFS cache), total elapsed 2m:40s

No VFS cache:
Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s
read, ondemand_readahead visible in perf, total elapsed 3m30s
Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed
3m29s
Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed
3m28s

No VFS cache, readahead off:
Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0
/dev/dm-0: reads at 33MB/s, ~13% CPU, 8.7k read IOPS @ avgrq-sz = 11 sectors
(*512=5.5kB), total elapsed 5m59s
Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0
/dev/dm-0: as above, double-confirmed no readaheads [
pread()->generic_file_read_iter()->ext4_mpage_readpages()-> bio.. ], total
elapsed 5m56s
Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0
/dev/dm-0: as above, total elapsed 5m55s

One thing that is not clear here, and may be worth measuring in the future,
is how striped LVs are affected by readahead.

>Were you able to see where the time in the kernel was going when
>readahead was turned off for the ANALYZE?

Yes, my interpretation is that the time goes directly into waiting in the
block I/O layer.

54.67%  1.33%  postgres  postgres  [.] FileRead
---FileRead
  --53.33%--__pread_nocancel
    --50.67%--entry_SYSCALL_64_after_hwframe
      do_syscall_64
      sys_pread64
      |--49.33%--vfs_read
      |  --48.00%--__vfs_read
      |   |--45.33%--generic_file_read_iter
      |   | |--42.67%--ondemand_readahead
      |   | | __do_page_cache_readahead
      |   | | |--25.33%--ext4_mpage_readpages
      |   | | | |--10.67%--submit_bio
      |   | | | | generic_make_request
      |   | | | | |--8.00%--blk_mq_make_request
      |   | | | | | |--4.00%--blk_mq_get_request
      |   | | | | | | |--1.33%--blk_mq_get_tag
      |   | | | | | |  --1.33%--sched_clock
      |   | | | | | |   xen_sched_clock
      |   | | | | | |   pvclock_clocksource_read
      |   | | | | | |--1.33%--bio_integrity_prep
      |   | | | | |  --1.33%--blk_account_io_start
      |   | | | | |   part_round_stats
      |   | | | | |   blk_mq_in_flight
      |   | | | | |   blk_mq_queue_tag_busy_iter
      |   | | | |  --2.67%--dm_make_request
      |   | | | |   __split_and_process_bio
      |   | | | |   __split_and_process_non_flush
      |   | | | |   |--1.33%--__map_bio
      |   | | | |   | generic_make_request
      |   | | | |   | generic_make_request_checks
      |   | | | |   | percpu_counter_add_batch
      |   | | | |    --1.33%--bio_alloc_bioset
      |   | | | |     mempool_alloc
      |   | | | |     kmem_cache_alloc
      |   | | | |--6.67%--ext4_map_blocks
      |   | | | | |--4.00%--ext4_es_lookup_extent
      |   | | | | |  --2.67%--_raw_read_lock
      |   | | | |  --2.67%--__check_block_validity.constprop.81
      |   | | | |   ext4_data_block_valid
      |   | | |  --6.67%--add_to_page_cache_lru
      |   | | |   |--4.00%--__add_to_page_cache_locked
      |   | | |   |  --1.33%--mem_cgroup_try_charge
      |   | | |   |   get_mem_cgroup_from_mm
      |   | | |    --2.67%--__lru_cache_add
      |   | | |     pagevec_lru_move_fn
      |   | | |     __lock_text_start
      |   | | |--12.00%--blk_finish_plug
      |   | | | blk_flush_plug_list
      |   | | | blk_mq_flush_plug_list
      |   | | | |--10.67%--__blk_mq_delay_run_hw_queue
      |   | | | | __blk_mq_run_hw_queue
      |   | | | | blk_mq_sched_dispatch_requests
      |   | | | |  --9.33%--blk_mq_dispatch_rq_list
      |   | | | |   nvme_queue_rq
      |   | | | |    --1.33%--blk_mq_start_request

>The VACUUM case is going to be complicated by what's in the visibility
>map. (..)

After observing the benefit of readahead for ANALYZE I've abandoned testing
the much more advanced VACUUM processing; clearly Linux readahead is
beneficial even in simple cases.

>> My only idea would be that a lot of those blocks could be read
>> asynchronously in batches (AIO) with POSIX_FADV_RANDOM issued on block-range
>> before, so maybe the the optimization is possible, but not until we'll have
>> AIO ;)
>
> (..)AIO is a whole other animal that's been discussed off and on
>around here but it's a much larger and more invasive change than just
>calling posix_fadvise().

Yes, I'm aware, and I'm keeping my fingers crossed that maybe some day....

ANALYZE just seems to be a natural candidate to use it. The only easy way to
accelerate stats gathering, at least from my and an end-user's point of view,
is to have more parallel autoanalyze workers running to drive more I/O
concurrency (e.g. by partitioning the data), both in readahead and
non-readahead scenarios. Which is a pity, because such a process sits idle
70-80% of the time. The readahead might read 10x more unnecessary data, but
pread() doesn't have to wait. <speculation>Once AIO is available, it could
probably issue thousands of requests without readahead and achieve much
better efficiency</speculation>

I hope the previous simple patch goes into master and helps other people
understand the picture more easily.

-J.
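For readers who want to try the per-file posix_fadvise() idea discussed above instead of changing readahead for the whole device with blockdev, here is a minimal sketch. It is not PostgreSQL code and was not used for the measurements in this mail. The function name sample_blocks, the BLCKSZ constant (set to PostgreSQL's default 8kB block size) and the seed parameter are assumptions made purely for illustration. It assumes Linux, where POSIX_FADV_RANDOM disables readahead for the given file descriptor only.

```python
import os
import random

BLCKSZ = 8192  # assumed block size (PostgreSQL's default), illustration only


def sample_blocks(path, nblocks_to_sample, seed=0):
    """Read a random sample of BLCKSZ-sized blocks from `path`.

    Hypothetical helper: it hints random access on the descriptor first,
    then reads the chosen blocks in ascending order with pread().
    Returns (sorted block numbers, total bytes read).
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        total_blocks = os.fstat(fd).st_size // BLCKSZ
        # offset=0, len=0 applies the advice to the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)

        rng = random.Random(seed)
        count = min(nblocks_to_sample, total_blocks)
        chosen = sorted(rng.sample(range(total_blocks), count))

        bytes_read = 0
        for blkno in chosen:
            # pread() takes an explicit offset, so no seek is needed.
            bytes_read += len(os.pread(fd, BLCKSZ, blkno * BLCKSZ))
        return chosen, bytes_read
    finally:
        os.close(fd)
```

Calling sample_blocks("/path/to/some/large/file", 1500000) under pidstat/iostat, with and without the posix_fadvise() line, would be one way to compare against the blockdev --setra 0 runs; the path here is a placeholder.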