On 3/29/24 23:03, Thomas Munro wrote: > On Sat, Mar 30, 2024 at 10:39 AM Thomas Munro <thomas.mu...@gmail.com> wrote: >> On Sat, Mar 30, 2024 at 4:53 AM Tomas Vondra >> <tomas.von...@enterprisedb.com> wrote: >>> ... Maybe there should be some flag to force >>> issuing fadvise even for sequential patterns, perhaps at the tablespace >>> level? ... >> >> Yeah, I've wondered about trying harder to "second guess" the Linux >> RA. At the moment, read_stream.c detects *exactly* sequential reads >> (see seq_blocknum) to suppress advice, but if we knew/guessed the RA >> window size, we could (1) detect it with the same window that Linux >> will use to detect it, and (2) [new realisation from yesterday's >> testing] we could even "tickle" it to wake it up in certain cases >> where it otherwise wouldn't, by temporarily using a smaller >> io_combine_limit if certain patterns come along. I think that sounds >> like madness (I suspect that any place where the latter would help is >> a place where you could turn RA up a bit higher for the same effect >> without weird kludges), or another way to put it would be to call it >> "overfitting" to the pre-existing quirks; but maybe it's a future >> research idea... >
I don't know if I'd call this overfitting - yes, we certainly don't want to tailor this code to only work with the linux RA, but OTOH it's the RA is what most systems do. And if we plan to rely on that, we probably have to "respect" how it works ... Moving to a "clean" approach that however triggers regressions does not seem like a great thing for users. I'm not saying the goal has to be "no regressions", that would be rather impossible. At this point I still try to understand what's causing this. BTW are you suggesting that increasing the RA distance could maybe fix the regressions? I can give it a try, but I was assuming that 128kB readahead would be enough for combine_limit=8kB. > I guess I missed a step when responding that suggestion: I don't think > we could have an "issue advice always" flag, because it doesn't seem > to work out as well as letting the kernel do it, and a global flag > like that would affect everything else including sequential scans > (once the streaming seq scan patch goes in). But suppose we could do > that, maybe even just for BHS. In my little test yesterday had to > issue a lot of them, patched eic=4, to beat the kernel's RA with > unpatched eic=0: > > eic unpatched patched > 0 4172 9572 > 1 30846 10376 > 2 18435 5562 > 4 18980 3503 > > So if we forced fadvise to be issued with a GUC, it still wouldn't be > good enough in this case. So we might need to try to understand what > exactly is waking the RA up for unpatched but not patched, and try to > tickle it by doing a little less I/O combining (for example just > setting io_combine_limit=1 gives the same number for eic=0, a major > clue), but that seems to be going down a weird path, and tuning such a > copying algorithm seems too hard. Hmmm. I admit I didn't think about the "always prefetch" flag too much, but I did imagine it'd only affect some places (e.g. BHS, but not for sequential scans). If it could be done by lowering the combine limit, that could work too - in fact, I was wondering if we should have combine limit as a tablespace parameter too. But I think adding such knobs should be only the last resort - I myself don't know how to set these parameters, how could we expect users to pick good values? Better to have something that "just works". I admit I never 100% understood when exactly the kernel RA kicks in, but I always thought it's enough for the patterns to be only "close enough" to sequential. Isn't the problem that this only skips fadvise for 100% sequential patterns, but keeps prefetching for cases the RA would deal on it's own? So maybe we should either relax the conditions when to skip fadvise, or combine even pages that are not perfectly sequential (I'm not sure if that's possible only for fadvise), though. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company