On Thu, Nov 18, 2021 at 3:14 PM Tom Lane <t...@sss.pgh.pa.us> wrote: > Robert Haas <robertmh...@gmail.com> writes: > > There's a second place where the patch needs to wait for something > > also, and that one I've crudely kludged with sleep(10). If anybody > > around here who is good at figuring out how to write clever TAP tests > > can tell me how to fix this test to be non-stupid, I will happily do > > so. > > As far as that goes, if you conceptualize it as "wait for this text > to appear in the log file", there's prior art in existing TAP tests. > Basically, sleep for some reasonable short period and check the > log file; if not there, repeat until timeout.
Yeah, there's something like that in the form of find_in_log in 019_replslot_init.pl. I thought about copying that, but that didn't seem great, and I also thought about trying to move into a common module, which seems maybe better but also more work, and thus not worth doing unless we have agreement that it's what we should do. > I'm a little dubious that this test case is valuable enough to > mess around with a nonstandard postmaster startup protocol, though. > The main reason I dislike that idea is that any fixes we apply to > the TAP tests' normal postmaster-startup code would almost inevitably > miss fixing this test. IIRC there have been security-related fixes in > that area (e.g. where do we put the postmaster's socket), so I find > that prospect pretty scary. The problem that I have with the present situation is that the test coverage of xlog.c is pretty abysmal. It actually doesn't look bad if you just run a coverage report, but there are a shazillion flag variables in that file and bugs like this make it quite clear that we don't come close to testing all the possible combinations. It's really borderline unmaintainable. I don't know whether there's a specific individual who wrote most of this code and didn't get the memo that global variables are best avoided, or whether this is sort of case where we started over 1 or 2 and then it just gradually ballooned into the giant mess that it now is, but the present situation is pretty outrageous. It's taking me weeks of time to figure out how to make changes that would normally take days, or maybe hours. We clearly need to try both to get the number of cases under control by eliminating stupid variables that are almost but not quite the same as something else, and also get proper test coverage for the things that remain so that it's possible to modify the code without excesive risk of shooting ourselves in the foot. That said, I'm not wedded to this particular test case, either. It's an extremely specific bug that is unlikely to reappear once squashed, and making the test case robust enough to avoid having the buildfarm complain seems fraught. -- Robert Haas EDB: http://www.enterprisedb.com