Hi,

On 2023-12-08 10:05:09 -0500, Tom Lane wrote:
> Peter Eisentraut <pe...@eisentraut.org> writes:
> > One possible question for discussion is whether the default for this
> > should be off, on, or possibly something like on-in-assert-builds.
> > (Personally, I'm happy to turn it on myself at run time, but everyone
> > has different workflows.)
>
> ... there was already opinion upthread that this should be on by
> default, which I agree with.  You shouldn't be hitting cases like
> this commonly (if so, they're bugs to fix or the errcode should be
> rethought), and the failure might be pretty hard to reproduce.

FWIW, I did some analysis on aggregated logs on a larger number of machines,
and it does look like that'd be a measurable increase in log volume. There are
a few voluminous internal errors in core, but the bigger issue is
extensions. They are typically much less disciplined about assigning error
codes than core PG is.

I've been wondering about doing some macro hackery to inform elog.c about
whether a log message is from core or an extension. It might even be possible
to identify the concrete extension, e.g. by updating the contents of
PG_MODULE_MAGIC during module loading, and referencing that.

Based on the aforementioned data, the most common, in-core, log messages
without assigned error codes are:

could not accept SSL connection: %m - with zero errno
archive command was terminated by signal %d: %s
could not send data to client: %m - with zero errno
cache lookup failed for type %u
archive command failed with exit code %d
tuple concurrently updated
could not restore file "%s" from archive: %s
archive command was terminated by signal %d: %s
%s at file "%s" line %u
invalid memory alloc request size %zu
could not send data to client: %m
could not open directory "%s": %m - errno indicating ENOMEM
could not write init file
out of relcache_callback_list slots
online backup was canceled, recovery cannot continue
requested timeline %u does not contain minimum recovery point %X/%X on timeline %u

There were a lot more in older PG versions, I tried to filter those out.

I'm a bit confused about the huge number of "could not accept SSL connection:
%m" with a zero errno. I guess we must be clearing errno somehow, but I don't
immediately see where. Or perhaps we need to actually look at what
SSL_get_error() returns?

Greetings,

Andres Freund
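
For illustration only, here is a minimal standalone sketch of the "macro hackery" idea for tagging log messages as coming from core or from an extension. It is not PostgreSQL code. Every name in it (SKETCH_LOG_ORIGIN, sketch_elog, sketch_log_finish, SketchLogRecord) is hypothetical, and it assumes an extension's build would define the origin macro at compile time, which is a simpler mechanism than the PG_MODULE_MAGIC approach mentioned above.

```c
#include <stdio.h>

/*
 * Assumption: an extension build would pass something like
 *   -DSKETCH_LOG_ORIGIN="\"my_ext\""
 * while a core build leaves the macro undefined.
 */
#ifndef SKETCH_LOG_ORIGIN
#define SKETCH_LOG_ORIGIN NULL		/* NULL stands for "core" */
#endif

/* Hypothetical stand-in for the error data the logging code would keep. */
typedef struct SketchLogRecord
{
	const char *origin;			/* extension name, or NULL for core */
	const char *message;
} SketchLogRecord;

/* Most recent record, kept so that callers can inspect the result. */
static SketchLogRecord sketch_last_record;

/* Receives the origin captured at the call site by the macro below. */
static void
sketch_log_finish(const char *origin, const char *message)
{
	sketch_last_record.origin = origin;
	sketch_last_record.message = message;
	fprintf(stderr, "[%s] %s\n", origin ? origin : "core", message);
}

/*
 * The macro expands in the caller's translation unit, so it picks up that
 * unit's SKETCH_LOG_ORIGIN without the caller having to pass anything.
 */
#define sketch_elog(message) \
	sketch_log_finish(SKETCH_LOG_ORIGIN, (message))

/* The logging code could use this to decide, e.g., whether to emit a backtrace. */
static int
sketch_is_from_extension(const SketchLogRecord *rec)
{
	return rec->origin != NULL;
}
```

With the macro left undefined, sketch_elog("tuple concurrently updated") records a NULL origin, so sketch_is_from_extension() returns 0. A file compiled with the macro set to an extension name would record that name instead. A compile-time define only distinguishes translation units. Identifying the concrete loaded module at run time would need something along the lines of the PG_MODULE_MAGIC idea.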