On Tue, Dec 19, 2023 at 8:41 PM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
> Whatever the right abstraction is, it probably needs to do these VM
> checks only once.
Makes sense.

> Yeah, after you pointed out the "leaky" abstraction, I also started to
> think about customizing the behavior using a callback. Not sure what
> exactly you mean by "fully transparent" but as I explained above I think
> we need to allow passing some information between the prefetcher and the
> executor - for example results of the visibility map checks in IOS.

Agreed.

> I have imagined something like this:
>
> nodeIndexscan / index_getnext_slot()
> -> no callback, all TIDs are prefetched
>
> nodeIndexonlyscan / index_getnext_tid()
> -> callback checks VM for the TID, prefetches if not all-visible
> -> the VM check result is stored in the queue with the TID (but in an
>    extensible way, so that other callbacks can store other stuff)
> -> index_getnext_tid() also returns this extra information
>
> So not that different from the WIP patch, but in a "generic" and
> extensible way. Instead of hard-coding the all-visible flag, there'd
> be some custom information. A bit like qsort_r() has a void* arg to
> pass custom context.
>
> Or if you envisioned something different, could you elaborate a bit?

I can't totally follow the sketch you give above, but I think we're
thinking along similar lines, at least.

> I think if the code stays in indexam.c, it's sensible to keep the index_
> prefix, but then also have a more appropriate rest of the name. For
> example it might be index_prefetch_heap_pages() or something like that.

Yeah, that's not a bad idea.

> > index_prefetch_is_sequential() makes me really nervous
> > because it seems to depend an awful lot on whether the OS is doing
> > prefetching, and how the OS is doing prefetching, and I think those
> > might not be consistent across all systems and kernel versions.
>
> If the OS does not have read-ahead, or it's not configured properly,
> then the patch does not perform worse than what we have now. I'm far
> more concerned about the opposite issue, i.e. causing regressions with
> OS-level read-ahead.
And the check handles that well, I think. I'm just not sure how much I
believe that it's going to work well everywhere. I mean, I have no
evidence that it doesn't, it just kind of looks like guesswork to me.
For instance, the behavior of the algorithm depends heavily on
PREFETCH_QUEUE_HISTORY and PREFETCH_SEQ_PATTERN_BLOCKS, but those are
just magic numbers. Who is to say that on some system or workload you
didn't test, the required values aren't entirely different, or that the
whole algorithm doesn't need rethinking? Maybe we can't really answer
that question perfectly, but the patch doesn't really explain the
reasoning behind this choice of algorithm.

> > Similarly with index_prefetch(). There's a lot of "magical"
> > assumptions here. Even index_prefetch_add_cache() has this problem --
> > the function assumes that it's OK if we sometimes fail to detect a
> > duplicate prefetch request, which makes sense, but under what
> > circumstances is it necessary to detect duplicates and in what cases
> > is it optional? The function comments are silent about that, which
> > makes it hard to assess whether the algorithm is good enough.
>
> I don't quite understand what problem with duplicates you envision here.
> Strictly speaking, we don't need to detect/prevent duplicates - it's
> just that if you do posix_fadvise() for a block that's already in
> memory, it's overhead / wasted time. The whole point is to not do that
> very often. In this sense it's entirely optional, but desirable.

Right ... but the patch sets up some data structure that will eliminate
duplicates in some circumstances and fail to eliminate them in others.
So it's making a judgement that the things it catches are the cases
that are important enough that we need to catch them, and the things
that it doesn't catch are cases that aren't particularly important to
catch. Here again, PREFETCH_LRU_SIZE and PREFETCH_LRU_COUNT seem like
they will have a big impact, but why these values?
The comments suggest that it's because we want to cover ~8MB of data,
but it's not clear why that should be the right amount of data to
cover. My naive thought is that we'd want to avoid re-prefetching a
block during the window between when we first prefetched it and when
we later read it, in which case the magic 8MB value here should really
be replaced by the operative prefetch distance.

> I really don't want to have multiple knobs. At this point we have three
> GUCs, each tuning prefetching for a fairly large part of the system:
>
>    effective_io_concurrency = regular queries
>    maintenance_io_concurrency = utility commands
>    recovery_prefetch = recovery / PITR
>
> This seems sensible, but I really don't want many more GUCs tuning
> prefetching for different executor nodes or something like that.
>
> If we have issues with how effective_io_concurrency works (and I'm not
> sure that's actually true), then perhaps we should fix that rather than
> inventing new GUCs.

Well, that might very possibly be a good idea, but I still think using
the same GUC for two different purposes is likely to cause trouble. I
think what effective_io_concurrency currently controls is basically
the heap prefetch distance for bitmap scans, and what you want to
control here is the heap prefetch distance for index scans. If those
are necessarily related in some understandable way (e.g. always the
same, one twice the other, one the square of the other) then it's fine
to use the same parameter for both, but it's not clear to me that this
is the case. I fear someone will find that if they crank up
effective_io_concurrency high enough to get the amount of prefetching
they want for bitmap scans, it will be too much for index scans, or
the other way around.

--
Robert Haas
EDB: http://www.enterprisedb.com