Re: RFC: Additional Directory for Extensions
On Wed, 28 Aug 2024 at 03:26, David E. Wheeler wrote:

> Right. ISTM it could complicate PGXS quite a bit. If we set, say,
>
>     SET extension_search_path = $extsdir, /mnt/extensions/pg16, /mnt/extensions/pg16/gosuperfast/extensions;
>
> What should be the output of `pg_config --sharedir`?

`pg_config` only cares about compile-time settings, so I would not expect its output to change.

I suspect we'd have to add PGXS extension-path awareness if going for per-extension subdir support. I'm not sure it makes sense to teach `pg_config` about this, since it'd need a different mode like

    pg_config --extension myextname --extension-sharedir

since the extension's "sharedir" is $basedir/extensions/myextension/share or whatever.

Supporting this looks to be a bit intrusive in the makefiles, requiring them to differentiate between "share dir for extensions" and "share dir for !extensions", etc. I'm not immediately sure whether it can be done in a way that transparently converts unmodified extension PGXS makefiles to target the new paths; it might require an additional define, or the use of new variables and an ifdef block to add backwards compatibility to the extension makefile for building on older postgres.

> But that leaves the issue of directory organization. The current patch is just a prefix for various PGXS/pg_config directories; the longer-term proposal I've made here is not a prefix for sharedir, mandir, etc., but a directory that contains directories named for extensions. So even if we were to take this approach, the directory structure would vary.

Right. The proposed structure is rather a bigger change than I was thinking of when I suggested supporting an extension search path, not just a single extra path. But it's also a cleaner proposal; the per-extension directory would make it easier to ensure that the extension control file, SQL scripts, and dynamic library all match the same extension and version if multiple ones are on the path. Which is desirable when doing things like testing a new version of an in-core extension.

> OTOH, we have this patch now, and this other stuff is just a proposal. Actual code trumps ideas in my mind.

Right. And I've been on the receiving end of having a small, focused patch derailed by others jumping in and scope-exploding it into something completely different to solve a much wider but related problem. I'm definitely not trying to stand in the way of progress here; I just want to make sure that it doesn't paint us into a corner that prevents a more general solution from being adopted later. That's why I'm suggesting making the config a multi-value string (a list of paths) and raising a runtime "ERROR: searching multiple paths for extensions not yet supported" or something if more than one path is configured. If that doesn't work, no problem.

> I think we should get some clarity on the proposal, and then consensus, as you say. I say "get some clarity" because my proposal doesn't require recursing, and I'm not sure why it'd be needed.

From what you and Gabriele are discussing (which I agree with), it wouldn't.
Re: RFC: Additional Directory for Extensions
On Tue, 27 Aug 2024 at 02:07, David E. Wheeler wrote:

> On Aug 21, 2024, at 19:07, Craig Ringer wrote:
>
> But even if it's just one or two, the only proper way an extension directory would work, IME, is to define a directory-based structure for extensions, where every file for an extension is in a directory named for the extension, and subdirectories are defined for each of the above requisite file types.
> [...]
> I think this would be a much nicer layout for packaging, installing, and managing extensions versus the current method of strewing files around to a slew of different directories.

This looks like a good suggestion to me; it would make the packaging, distribution, and integration of 3rd party extensions significantly easier without any obvious large or long-term cost.

> But it would come at some cost, in terms of backward compatibility with the existing layout (or migration to it), significant modification of the server to use the new layout (and extension_search_path), and other annoyances like PATH and MANPATH management.

Also PGXS, the Windows extension build support, 3rd party cmake builds, etc. But by the looks of it, not a drastic change.

> Long term I think it would be worthwhile, but the current patch feels like a decent interim step we could live with, solving most of the integration problems (immutable servers, packaging testing, etc.) at the cost of a slightly unexpected directory layout. What I mean by that is that the current patch is pretty much just using extension_destdir as a prefix to all of those directories from pg_config, so they never have to change, but it does mean that you end up installing extensions in something like:
>
>     /mnt/extensions/pg16/usr/share/postgresql/16
>     /mnt/extensions/pg16/usr/include/postgresql

My only real concern with the current patch is that it limits searching for extensions to one additional configurable location, which is inconsistent with how things like dynamic_library_path work.
Once it's in, it'll be difficult to change or extend without breaking backwards compatibility, and if someone wants to add a search-path capability later it'll break existing configurations.

Would it be feasible to define its configuration syntax as accepting a list of paths, but only implement the semantics for single-entry lists and ERROR on multiple paths? That way it could be extended later without breaking existing configurations.

With that said, I'm not the one doing the work at the moment, and the functionality would definitely be helpful. If there's agreement on supporting a search path or recursing into subdirectories I'd be willing to have a go at it, but I'm a bit stale on Pg's codebase now, so I'd want to be fairly confident the work wouldn't just be thrown out.

-- Craig Ringer
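The accept-list-syntax-but-reject-multiple-entries idea could be sketched roughly like this; this is an illustrative standalone sketch only, not the actual patch, and the function names (`count_path_entries`, `check_extension_search_path`) are hypothetical:

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative sketch only: count comma-separated entries in a path list. */
static int
count_path_entries(const char *value)
{
    int nentries;

    if (value == NULL || *value == '\0')
        return 0;
    nentries = 1;
    for (const char *p = value; *p; p++)
        if (*p == ',')
            nentries++;
    return nentries;
}

/*
 * Hypothetical GUC check hook: accept list syntax now, but reject more
 * than one entry until multi-path search is actually implemented. Later
 * versions could drop the restriction without any configuration change.
 */
static bool
check_extension_search_path(const char *value)
{
    if (count_path_entries(value) > 1)
    {
        fprintf(stderr,
                "ERROR: searching multiple paths for extensions not yet supported\n");
        return false;
    }
    return true;
}
```

The point of the sketch is that the *syntax* (a list) is fixed from day one, so the single-path restriction lives purely in the check hook and can be relaxed later.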
Re: RFC: Additional Directory for Extensions
On Fri, 23 Aug 2024 at 10:14, Craig Ringer wrote:

> On Thu, 22 Aug 2024 at 21:00, Gabriele Bartolini wrote:
>
> > On Thu, 22 Aug 2024 at 09:32, Jelte Fennema-Nio wrote:
> >
> > >     SET extension_search_path = /mnt/extensions/pg16/*
> >
> > That'd be great. +1.
>
> Agreed, that'd be handy, but not worth blocking the underlying capability for.
>
> Except possibly to the degree that the feature should reserve wildcard characters and require them to be escaped if they appear in a path, so there's no BC break if it's added later.

... though on second thoughts, it might make more sense to just recursively search directories found under each path entry. A rule like "search subdirectories only if a redundant trailing / is present" could make it opt-in. That way there's no icky path escaping needed for normal configuration.
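The trailing-slash opt-in rule mentioned above is trivial to express; a minimal sketch, with a hypothetical helper name:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Illustrative rule only: a path entry opts in to recursive search of its
 * subdirectories by ending in a redundant '/'. No wildcard characters are
 * involved, so no escaping is ever needed in ordinary configuration.
 * A bare "/" is not treated as opting in.
 */
static bool
entry_wants_recursion(const char *entry)
{
    size_t len = strlen(entry);

    return len > 1 && entry[len - 1] == '/';
}
```

So `/mnt/extensions/pg16/` would mean "scan each subdirectory of this path", while `/mnt/extensions/pg16` names a single directory.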
Re: RFC: Additional Directory for Extensions
On Thu, 22 Aug 2024 at 21:00, Gabriele Bartolini wrote:

> Hi Jelte,
>
> On Thu, 22 Aug 2024 at 09:32, Jelte Fennema-Nio wrote:
>
> > It looks like you want one directory per extension, so that list would get pretty long if you have multiple extensions. Maybe (as a follow-up change), we should start to support a * as a wildcard in both of these GUCs. So you could say:
> >
> >     SET extension_search_path = /mnt/extensions/pg16/*
> >
> > To mean effectively the same as you're describing above.
>
> That'd be great. +1.
>
> --
> Gabriele Bartolini
> Vice President, Cloud Native at EDB
> enterprisedb.com

Agreed, that'd be handy, but not worth blocking the underlying capability for.

Except possibly to the degree that the feature should reserve wildcard characters and require them to be escaped if they appear in a path, so there's no BC break if it's added later.
Re: RFC: Additional Directory for Extensions
On Thu, 22 Aug 2024 at 08:00, Gabriele Bartolini wrote:

> Hi everyone,
>
> Apologies for only starting to look into this now. Thanks, David, for pushing this forward.

100%. I've wanted this for some time but never had time to cook up a patch.

> I want to emphasize the importance of this patch for the broader adoption of extensions in immutable container environments, such as those used by the CloudNativePG operator in Kubernetes.

It's also very relevant for local development and testing. Right now postgres makes it impossible to locally compile and install an extension for a distro-packaged postgres (whether from upstream repos or PGDG repos) without dirtying the distro-managed filesystem subtrees with local changes under /usr etc., because it cannot be configured to look for locally installed extensions on non-default paths.

> To provide some context, one of the key principles of CloudNativePG is that containers, once started, cannot be modified—this includes the installation of Postgres extensions and their libraries. This restriction prevents us from adding extensions on the fly, requiring them to be included in the main PostgreSQL operand image. As a result, users who need specific extensions must build custom images through automated pipelines (see: https://cloudnative-pg.io/blog/creating-container-images/).

It may be possible to weaken this restriction somewhat thanks to the upcoming https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-image-volume-source/ feature, which permits additional OCI images to be mounted as read-only volumes on a workload. This would still only permit mounting at Pod-creation time, not runtime mounting and unmounting, but it means the base postgres image could be supplemented by mounting additional images for extensions.

For example, one might mount image "postgis-vX.Y.Z" onto base image "postgresql-16" if support for PostGIS is desired, without then having to bake every possible extension anyone might ever want into the base image. This solves all sorts of messy issues with upgrades and new version releases. But for it to work, it must be possible to tell postgres to look in _multiple places_ for extension .sql scripts and control files. This is presently possible for modules (dynamic libraries: .so / .dylib / .dll), but without a way to also configure the path for extensions it's of very limited utility.

> We've been considering ways to improve this process for some time. The direction we're exploring involves mounting an ephemeral volume that contains the necessary extensions (namely $sharedir and $pkglibdir from pg_config). These volumes would be created and populated with the required extensions when the container starts and destroyed when it shuts down. To make this work, each extension must be independently packaged as a container image containing the appropriate files for a specific extension version, tailored to the architecture, distribution, OS version, and Postgres version.

Right. And there might be more than one of them, so IMO this should be a _path_ to search for extension control files and SQL scripts.

If the current built-in default extension dir was exposed as a var $extsdir, like we do for $libdir, this might look something like the following for local development and testing while working with a packaged postgres build:

    SET extension_search_path = $extsdir, /opt/myapp/extensions, /usr/local/postgres/my-custom-extension/extensions;
    SET dynamic_library_path = $libdir, /opt/myapp/lib, /usr/local/postgres/my-custom-extension/lib;

or in the container extensions case, something like:

    SET extension_search_path = $extsdir, /mnt/extensions/pg16/postgis-vX.Y/extensions, /mnt/extensions/pg16/gosuperfast/extensions;
    SET dynamic_library_path = $libdir, /mnt/extensions/pg16/postgis-vX.Y/lib, /mnt/extensions/pg16/gosuperfast/lib;

For safety, it might make sense to impose the restriction that if an extension control file is found in a given directory, SQL scripts will also only be looked for in that same directory. That way there's no chance of accidentally mixing and matching SQL scripts from different versions of an extension if it appears twice on the extension search path in different places. The rule for loading SQL scripts would be:

* locate the first directory on the path containing a matching extension control file
* use this directory as the extension directory for all subsequent SQL script loading and running actions

-- Craig Ringer
EnterpriseDB
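The proposed first-match rule could look something like the following sketch; the function names are hypothetical and the `exists` callback stands in for a real filesystem check, so this is a standalone illustration rather than patch code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch of the proposed rule: walk the search path and return the first
 * directory containing <extname>.control. All subsequent SQL script
 * loading for the extension would use only that directory, so scripts
 * and control file can never come from different path entries.
 */
static const char *
find_extension_dir(const char *const *paths, int npaths, const char *extname,
                   bool (*exists)(const char *dir, const char *file))
{
    char control[256];

    snprintf(control, sizeof(control), "%s.control", extname);
    for (int i = 0; i < npaths; i++)
        if (exists(paths[i], control))
            return paths[i];
    return NULL;                /* extension not found anywhere on the path */
}

/* Stub for demonstration: pretend only one directory has postgis.control. */
static bool
stub_exists(const char *dir, const char *file)
{
    return strcmp(dir, "/mnt/extensions/pg16/postgis/extensions") == 0 &&
           strcmp(file, "postgis.control") == 0;
}
```

With this shape, an extension that appears twice on the path is simply shadowed by its first occurrence, much like `dynamic_library_path` resolution.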
Re: POC: Extension for adding distributed tracing - pg_tracing
On Tue, 12 Mar 2024, 23:45 Anthonin Bonnefoy, <anthonin.bonne...@datadoghq.com> wrote:

> > - I don't think it's a good idea to do memory allocations in the middle of a PG_CATCH. If the error was due to out-of-memory, you'll throw another error.
>
> Good point. I was wondering what were the risks of generating spans for errors. I will try to find a better way to handle this.

The usual approach is to have pre-allocated memory. This must actually be written to (zeroed, usually) or it might be lazily allocated and only backed on page fault. And it can't be copy-on-write memory allocated once in postmaster startup then inherited by fork.

That imposes an overhead for every single postgres backend, so maybe there's a better solution.
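The pre-allocated-memory approach described above might look like this minimal sketch (names and sizes are illustrative, not from the patch): a per-backend scratch buffer touched up front so the error path never calls an allocator:

```c
#include <stddef.h>
#include <string.h>

#define ERROR_SCRATCH_SIZE 8192

/*
 * Per-backend scratch space for building spans inside PG_CATCH. It is
 * written to at init time so the pages are actually backed, not lazily
 * mapped; note it would still need re-touching after fork() if it were
 * shared copy-on-write from the postmaster.
 */
static char   error_scratch[ERROR_SCRATCH_SIZE];
static size_t error_scratch_used;

static void
error_scratch_init(void)
{
    /* Touch every byte so no page fault can happen on the error path. */
    memset(error_scratch, 0, sizeof(error_scratch));
    error_scratch_used = 0;
}

/* Allocation-free bump allocator: returns NULL instead of throwing. */
static void *
error_scratch_alloc(size_t size)
{
    void *p;

    if (size > sizeof(error_scratch) - error_scratch_used)
        return NULL;            /* caller must degrade gracefully */
    p = error_scratch + error_scratch_used;
    error_scratch_used += size;
    return p;
}
```

The key property is that `error_scratch_alloc` can fail but can never throw, which is what makes it usable from a PG_CATCH block after an out-of-memory error.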
Re: POC: Extension for adding distributed tracing - pg_tracing
Hi all

Just saw this thread. I cooked up a PoC of distributed tracing support in Pg years ago as part of the 2ndQuadrant BDR project.

I used GUCs to set the `opentelemetry_trace_id` and `opentelemetry_span_id`. These can be sent as protocol-level messages by the client driver if the client driver has native trace integration enabled, in which case the user doesn't need to see or care. And for other drivers, the application can use `SET` or `SET LOCAL` to assign them. This approach avoids the need to rewrite SQL or give special meaning to SQL comments.

Within the server, IIRC I used the existing postgres resource owner and memory context stacks to maintain some internal nested traces.

My implementation used the C++ opentelemetry client library to emit traces, so it was never going to be able to land in core. But IIRC the current BDR/PGD folks are now using an OpenTelemetry sidecar process instead. I think they send it UDP packets to emit metrics and events. Petr or Markus might be able to tell you more about how they're doing that.

I'd love to see OpenTelemetry integration in Pg.
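For a driver without native trace integration, the application-level usage described above would look something like this (the GUC names follow the PoC described here; the table name and ID values are of course just examples):

```sql
-- Scope the trace context to a single transaction; a server-side tracing
-- extension would read these GUCs when emitting spans for the statements.
BEGIN;
SET LOCAL opentelemetry_trace_id = '4bf92f3577b34da6a3ce929d0e0e4736';
SET LOCAL opentelemetry_span_id = '00f067aa0ba902b7';
SELECT count(*) FROM my_table;   -- attributed to the client's active span
COMMIT;
```

Because the context rides in GUCs, no SQL rewriting or comment-sniffing is needed, and `SET LOCAL` ensures the context can't leak past the transaction on a pooled connection.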
Re: POC: Better infrastructure for automated testing of concurrency issues
On Tue, 23 Feb 2021 at 08:09, Peter Geoghegan wrote:

> On Tue, Dec 8, 2020 at 2:42 AM Alexander Korotkov wrote:
>
> > Thank you for your feedback!
>
> It would be nice to use this patch to test things that are important but untested inside vacuumlazy.c, such as the rare HEAPTUPLE_DEAD/tupgone case (grep for "Ordinarily, DEAD tuples would have been removed by..."). Same is true of the closely related heap_prepare_freeze_tuple()/heap_tuple_needs_freeze() code.

That's what the PROBE_POINT()s functionality I referenced is for, too.

The proposed stop events feature has finer-grained control over when the events fire than PROBE_POINT()s do. That's probably the main limitation in the PROBE_POINT()s functionality right now - controlling it at runtime is a bit crude unless you opt for using a C test extension or a systemtap script, and both of those have other downsides.

On the other hand, PROBE_POINT()s are lighter weight when not actively turned on, to the point where they can be included in production builds to facilitate support and runtime diagnostics. They interoperate very nicely with static tracepoint markers (SDTs), the TRACE_POSTGRESQL_FOO(...) stuff, so there's no need for yet another separate set of debug markers scattered through the code. And they can perform a wider set of actions useful for testing and diagnostics - PANIC the current backend, self-deliver an arbitrary signal, force a LOG message, introduce an interruptible or uninterruptible sleep, send a message to the client if any (handy for regress tests), or fire an extension-defined callback function.

I'd like to find a way to get the best of both worlds if possible. Rather than completely sidetrack the thread on this patch, I posted the PROBE_POINT()s patch on a separate thread here.
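The "lighter weight when not turned on" property comes down to a single branch on a flag; a minimal standalone sketch of the idea (names are illustrative, not the actual PROBE_POINT() patch):

```c
#include <stdbool.h>

/*
 * Sketch of a lightweight probe-point marker: when disabled, each marker
 * costs one predictable branch on a global flag, so markers can stay in
 * production builds.
 */
typedef enum
{
    PROBE_BEFORE_FREEZE,
    PROBE_AFTER_PRUNE,
    NUM_PROBE_POINTS
} ProbePoint;

static bool probe_enabled[NUM_PROBE_POINTS];
static int  probe_hits[NUM_PROBE_POINTS];

/*
 * 'action' stands in for the configurable behaviours described above:
 * PANIC, self-deliver a signal, LOG, sleep, or call an extension hook.
 */
#define PROBE_POINT(id, action) \
    do { \
        if (probe_enabled[(id)]) \
        { \
            probe_hits[(id)]++; \
            action; \
        } \
    } while (0)

/* Hypothetical instrumented code path; returns hit count for inspection. */
static int
vacuum_step(void)
{
    PROBE_POINT(PROBE_BEFORE_FREEZE, (void) 0);
    return probe_hits[PROBE_BEFORE_FREEZE];
}
```

A real implementation would pair each marker with an SDT tracepoint so systemtap/perf see the same site, which is what makes the two mechanisms interoperate.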
Re: [PATCH] ProcessInterrupts_hook
On Sat, 2 Oct 2021 at 01:24, Jaime Casanova wrote:

> On Tue, Jun 29, 2021 at 01:32:26PM +0800, Craig Ringer wrote:
>
> > On Sat, 20 Mar 2021 at 03:46, Tom Lane wrote:
> >
> > > Robert Haas writes:
> > >
> > > > On Fri, Mar 19, 2021 at 3:25 PM Tom Lane wrote:
> > > >
> > > > > I'm not very comfortable about the idea of having the postmaster set child processes' latches ... that doesn't sound terribly safe from the standpoint of not allowing the postmaster to mess with shared memory state that could cause it to block or crash. If we already do that elsewhere, then OK, but I don't think we do.
> > > >
> > > > It should be unnecessary anyway. We changed it a while back to make any SIGUSR1 set the latch
> > >
> > > Hmm, so the postmaster could send SIGUSR1 without setting any particular pmsignal reason? Yeah, I suppose that could work. Or we could recast this as being a new pmsignal reason.
> >
> > I'd be fine with either way.
> >
> > I don't expect to be able to get to working on a concrete patch for this any time soon, so I'll be leaving it here unless someone else needs to pick it up for their extension work. The in-principle agreement is there for future work anyway.
>
> Hi Craig,
>
> There is still a CF entry for this. Should we close it as withdrawn? Or maybe RwF?

I'm not going to get time for it now, so I think marking it withdrawn is reasonable. I think it's well worth doing and Tom seems to think it's not a crazy idea, but I'm no longer working on the software that needed it, and I don't see a lot of other people calling for it, so it can wait until someone else needs it.
Re: Windows crash / abort handling
On Wed, 6 Oct 2021, 03:30 Andres Freund, wrote:

> Hi,
>
> My first attempt was to try to use the existing crashdump stuff in pgwin32_install_crashdump_handler(). That's not really quite what I want, because it only handles postmaster rather than any binary, but I thought it'd be a good start. But outside of toy situations it didn't work for me.

Odd. It usually has for me, and it's definitely not limited to the postmaster. But it will fall down for OOM, a smashed stack, and other states where in-process self-debugging is likely to fail.

> A bunch of debugging later I figured out that the reason neither the SetUnhandledExceptionFilter() nor JIT debugging works is that the SEM_NOGPFAULTERRORBOX in the
>
>     SetErrorMode(SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX);
>
> we do in startup_hacks() prevents the paths dealing with crashes from being reached.

Right. I patch this out when working on Windows because it's a real pain.

I keep meaning to propose that we remove this functionality entirely. It's obsolete. It was introduced back in the days when DrWatson.exe (the "Windows error reporting" of its day) used to launch an interactive prompt asking the user what to do when a process crashed. This would block the crashed process from exiting, making everything grind to a halt until the user interacted with the UI - even for a service process. Not fun on a headless or remote server.

These days Windows handles all this a lot more sensibly, and blocking crash reporting is quite obsolete and unhelpful. I'd like to just remove it. If we can't do that, I'd like to at least make it optional.

> Alternatively we can generate "minidumps" [6], but that doesn't appear to be more helpful for CI purposes at least - all we'd do is to create a backtrace using the same tool. But it might be helpful for local development, to e.g. analyze crashes in more detail.

They're immensely helpful when a bt isn't enough, but BTs are certainly the first step for CI use.
Re: [PATCH] More docs on what to do and not do in extension code
On Tue, 29 Jun 2021 at 13:30, Craig Ringer wrote:

> Laurenz,
>
> Thanks for your comments. Sorry it's taken me so long to get back to you. Commenting inline below on anything I think needs comment; other proposed changes look good.

I'm not going to get back to this anytime soon. If anybody wants to pick it up, that's great. Otherwise, at least it's there in the mailing lists to search.
Re: Mark all GUC variable as PGDLLIMPORT
On Fri, 27 Aug 2021 at 08:59, Julien Rouhaud wrote:

> On Fri, Aug 27, 2021 at 3:42 AM Magnus Hagander wrote:
>
> > Yeah, but that does move the problem to the other side, doesn't it? So if you (as a pure test of course) were to remove the variable completely from the included header and just declare it manually with PGDLLSPEC in your file, it should work?
> >
> > Ugly as it is, I wonder if there's a chance we could just process all the headers at install time and inject the PGDLLIMPORT. We know which symbols it is on account of what we're getting in the DEF file.
> >
> > Not saying that's not a very ugly solution, but it might work?
>
> It's apparently not enough. I tried with the autovacuum_max_workers GUC, and it still errors out. If I add a PGDLLIMPORT, there's a link error when trying to access the variable:
>
>     error LNK2019: unresolved external symbol __imp_autovacuum_max_workers referenced in function...
>
> If I use PGDLLEXPORT I simply get:
>
>     error LNK2001: unresolved external symbol autovacuum_max_workers

It works, but you can't use PGDLLIMPORT; you have to implement the import yourself without the help of __declspec(dllimport). Where you want autovacuum_max_workers you must instead write

    *((int*)__imp_autovacuum_max_workers)

Here's the comment I wrote on the topic in something I was working on:

/*
 * On Windows, a symbol is not accessible outside the executable or shared
 * library (PE object) that it is defined in unless explicitly exported in
 * the DLL interface.
 *
 * It must then be explicitly imported by objects that use it; Windows doesn't
 * do ELF-style fixups.
 *
 * The export part is usually accomplished by a __declspec(dllexport)
 * annotation on the symbol or a .DEF file supplied as linker input. Postgres
 * uses the .DEF approach, auto-exporting all symbols using
 * src\tools\msvc\gendef.pl. Internally this hides "symname" from the DLL
 * interface and instead generates an export symbol "__imp_symname" that is a
 * pointer to the value of "symname".
 *
 * The import part is usually done with the __declspec(dllimport) annotation on
 * the symbol. The PGDLLIMPORT macro expands to __declspec(dllimport) when
 * postgres headers are included during extension compilation. But not all the
 * symbols that pglogical needs are annotated with PGDLLIMPORT. Attempting to
 * access a symbol that is not so-annotated will fail at link time with an
 * error like
 *
 *     error LNK2001: unresolved external symbol criticalSharedRelcachesBuilt
 *
 * Because of gendef.pl, postgres still exports the symbol even if it isn't
 * annotated PGDLLIMPORT. So we can just do the shorthand that
 * __declspec(dllimport) does for us in the preprocessor instead: replace each
 * symbol with its __imp_symbol indirection and dereference it.
 *
 * There's one wrinkle in this though. MSVC doesn't generate a definition for a
 * global data symbol that is neither initialized nor explicitly marked
 * __declspec(dllexport). gendef.pl will think such symbols are references to
 * a symbol defined in another object file and will skip them without emitting
 * a DATA entry for them in the DEF file, so no __imp_ stub is generated in the
 * DLL interface. We can't use (*__imp_symbolname) if there's no import stub.
 *
 * If they're GUCs, we can round-trip them through their text values to read
 * them. Nothing should ever be assigning to GUC storage and there's no reason
 * to take the address of GUC storage, so this should work fine, albeit
 * slower. If we find any that aren't GUCs we're in trouble, but so far there
 * haven't been any.
 *
 * See also:
 *
 * - https://docs.microsoft.com/en-us/cpp/build/importing-data-using-declspec-dllimport
 * - https://docs.microsoft.com/en-us/cpp/build/importing-using-def-files
 * - https://docs.microsoft.com/en-us/cpp/build/exporting-from-a-dll-using-def-files
 * - https://docs.microsoft.com/en-us/cpp/build/determining-which-exporting-method-to-use
 */

This is gruesome and I hadn't planned to mention it, but now that someone has noticed the .DEF file exports the symbols I guess it does no harm.

So can we just fix PGDLLIMPORT now?
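To make the `__imp_` trick concrete outside of MSVC, here is a portable simulation of the pattern described in that comment; the `__imp_` pointer is faked by hand here (on Windows the linker generates it from the .DEF file), and the storage variable is a stand-in, not the real GUC:

```c
/*
 * Portable simulation of the Windows data-import workaround. On Windows,
 * __imp_<name> is a linker-generated pointer to the exported symbol; here
 * we define one ourselves so the dereference pattern can be demonstrated.
 */
static int autovacuum_max_workers_storage = 3;  /* stand-in for the GUC */

/* What the linker would synthesize from the DEF file's DATA entry. */
static int *__imp_autovacuum_max_workers = &autovacuum_max_workers_storage;

/* Hand-rolled equivalent of what __declspec(dllimport) does for us. */
#define autovacuum_max_workers (*((int *) __imp_autovacuum_max_workers))

static int
read_autovacuum_max_workers(void)
{
    return autovacuum_max_workers;   /* expands to the indirection above */
}
```

Reads go through one extra pointer dereference, which is exactly what `__declspec(dllimport)` would have the compiler emit automatically.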
Re: Mark all GUC variable as PGDLLIMPORT
On Thu, 26 Aug 2021 at 01:51, Alvaro Herrera wrote:

> On 2021-Aug-25, Magnus Hagander wrote:
>
> > The thing we need the PGDLLIMPORT definition for is to *import* them on the other end?
>
> Oh ... so modules that are willing to cheat can include their own declarations of the variables they need, and mark them __declspec(dllimport)?
>
> Damn.

I was hoping nobody would notice that. I do exactly that in some extensions to work around some of this mess, but it is quite awkward and has its limitations.
Re: Mark all GUC variable as PGDLLIMPORT
On Wed, 25 Aug 2021 at 22:36, Magnus Hagander wrote:

> On Wed, Aug 25, 2021 at 4:06 PM Robert Haas wrote:
>
> > On Tue, Aug 24, 2021 at 5:06 PM Chapman Flack wrote:
> >
> > > The thing is, I think I have somewhere a list of all the threads on this topic that I've read through since the first time I had to come with my own hat in hand asking for a PGDLLIMPORT on something, years ago now, and I don't think I have ever seen one where it was as uncontroversial as you suggest.
> >
> > It does tend to be controversial, but I think that's basically only because Tom Lane has reservations about it. I think if Tom dropped his opposition to this, nobody else would really care. And I think that would be a good thing for the project.
>
> I have only one consideration about it, and that's a technical one :)
>
> Does this in some way have an effect on the size of the binary and/or the time it takes to load it?

On *nix, no. On Windows, very, very minimally.

We *should* be looking into giving hidden visibility at link time to the private symbols we can't make non-static, i.e. making them DSO-private. That can have a huge impact on link-time optimisation and inlining, but doing so is quite orthogonal to the matter of fixing a linkage issue on Windows. By making select symbols hidden we'd be *reducing* the exposed set of functions and data symbols in a disciplined and progressive way on all platforms. Useful, but different.
Re: Mark all GUC variable as PGDLLIMPORT
On Wed, 25 Aug 2021 at 03:13, Robert Haas wrote:

> On Tue, Aug 24, 2021 at 2:52 PM Chapman Flack wrote:
>
> > I don't think that's true of the second proposal in [0]. I don't foresee a noticeable runtime cost unless there is a plausible workload that involves very frequent updates to GUC settings that are also of interest to a bunch of extensions. Maybe I'll take a stab at a POC.
>
> I'm not sure I fully understand that proposal, but I find it hard to believe that we would seriously consider replacing every direct GUC reference in the backend with something that goes through an API. Even if it didn't hurt performance, I think it would uglify the code a whole lot.

It'd probably have to be something that resolves the GUC storage addresses at compile time, or once at runtime, if it's going to be used by core code. While some code doesn't hit a lot of GUCs, some code *really* hammers some common GUCs. There are various issues with cache lines and pointer chasing that are beyond my low-level fu at work here; adding a level of pointer indirection can be very expensive in the wrong situations.

So you're probably looking at some kind of mess with token pasting, macros, and static inlines. Ew.
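For the sake of illustration, the token-pasting/static-inline mess alluded to above could look like this sketch; every name here (`DEFINE_GUC_GETTER`, the `_storage` suffix, `work_mem_storage`) is hypothetical, and the point is that the accessor resolves to a direct load with no name lookup or pointer chasing:

```c
/*
 * Hypothetical accessor-generating macro: a static inline getter per GUC,
 * so hot reads compile down to a plain load of the storage variable while
 * the storage itself is no longer touched directly by callers.
 */
#define DEFINE_GUC_GETTER(type, name) \
    static inline type guc_get_##name(void) { return name##_storage; }

static int work_mem_storage = 4096;     /* stand-in for real GUC storage */

DEFINE_GUC_GETTER(int, work_mem)
```

Since the compiler inlines `guc_get_work_mem()` to a single load, this avoids the runtime name-based lookup cost Robert objects to, at the price of exactly the macro ugliness predicted above.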
Re: Mark all GUC variable as PGDLLIMPORT
On Mon, 23 Aug 2021 at 22:45, Julien Rouhaud wrote:

> On Mon, Aug 23, 2021 at 10:22 PM Tom Lane wrote:
>
> > Bruce Momjian writes:
> >
> > > On Mon, Aug 23, 2021 at 10:15:04AM -0400, Tom Lane wrote:
> > >
> > > > By that argument, *every* globally-visible variable should be marked PGDLLIMPORT. But the mere fact that two backend .c files need to access
> > >
> > > No, Julien says 99% need only the GUCs, so that is not the argument I am making.
> >
> > That's a claim unbacked by any evidence that I've seen. More to the point, we already have a mechanism that extensions can/should use to read and write GUC settings, and it's not direct access.
>
> I clearly didn't try every single extension available out there. It's excessively annoying to compile extensions on Windows, and also I don't have a lot of dependencies installed, so there are some that I wasn't able to test since I'm lacking some other lib; given my absolute lack of knowledge of that platform I didn't spend time trying to install those.

Plenty of them are closed too. While that's not the Pg community's problem, as such, it'd be nice not to arbitrarily break them all for no actual benefit.
Re: Mark all GUC variable as PGDLLIMPORT
On Tue, 24 Aug 2021 at 05:08, Chapman Flack wrote:

> On 08/23/21 14:30, Robert Haas wrote:
>
> > it seems likely that this proposed API would have the exact same problem, because it would let people do exactly the same thing. And, going through this proposed API would still be significantly more expensive than just accessing the bare variables, because you'd at least have to do some kind of lookup based on the GUC name
>
> I think the API ideas in [0] would not let people do exactly the same thing.
>
> They would avoid exposing the bare variables to overwrite. Not that there has been any plague of extensions going and overwriting GUCs, but I think in some messages on this thread I detected a sense that in principle it's better if an API precludes it, and that makes sense to me.
>
> The second idea also avoids the expense of name-based lookup (except once at extension initialization), and in fact minimizes the cost of obtaining the current value when needed, by slightly increasing the infrequent cost of updating values.

I'd be generally in favour of something that reduced our reliance on the current chaotic and inconsistent jumble of globals, which are a semi-random combination of compilation-unit-scoped, globally-scoped-except-on-Windows, and globally scoped. Tackling GUCs would be a good start, especially given the number of GUCs where the actual GUC value isn't the state anyone should be interacting with directly, since a hook maintains the "real" state derived from the GUC storage.

And preventing direct writes to GUCs seems like a clearly correct thing to do. Some consideration of performance would be important for some of the hotter GUCs, of course, but most extensions won't be hammering GUC access much.
Re: Mark all GUC variable as PGDLLIMPORT
On Tue, 24 Aug 2021 at 13:21, Craig Ringer wrote: > > There is not even the thinnest pretense of Pg having a dedicated extension > API or any sort of internal vs public API separation. > Oh, and if we do want such a separation, we'll need to introduce a MUCH lower-pain-and-overhead way to get related patches in. Otherwise it'll take decades to add any necessary function wrappers for currently exported data symbols, add necessary hooks, wrap or hide internal symbols and state, etc. But ... what is the actual goal and expected outcome of such a hypothetical public/private API separation? It won't help meaningfully with server maintenance: We already break extensions freely in major releases, and sometimes even minor releases. We don't make any stable API promise at all. So any argument that it would make maintenance of the core server easier is weak at best. It won't help protect server runtime stability: The server is written in C, and makes heavy use of non-opaque / non-hidden types. Many of which would not be practical to hide without enormous refactoring if at all. Writing any sort of "safe" C API is very difficult even when the exposed functionality is very narrow and well defined. Even then such an API can only help prevent inadvertent mistakes, since C programs are free to grovel around in memory. Look at the mess with EXPORT_SYMBOL_GPL in the kernel for just how ugly that can get. So I don't think there's any realistic way to claim that narrowing the exposed API surface would make it safer to load and run extensions that the user has not separately checked and reviewed or obtained from a trusted source with robust testing practices. Certainly it offers no benefit at all against a bad actor. It won't make it safer to use untrusted extensions. What will it do? Not much, in the short term, except cripple existing extensions or add a pile of questionably useful code annotations. 
The only real benefits I see are some improvement in link-time optimization and export symbol table size. Both are nice, but IMO not worth the pain by themselves for a pure C program. In C++, with its enormous symbol tables, it's absolutely worth it; but not really for Pg. To be clear, I actually love the idea of starting to define a solid public API, with API, ABI, and semantic promises and associated tests. But to say it's a nontrivial amount of work is an enormous understatement, and unless it's done by an existing committer who is trusted to largely define a provisional API without bike-shedding and arguing the merits of every single part of it, it's nearly impossible with the way Pg is currently developed. It's completely beyond me why it's OK to export all function symbols on Windows but not all data symbols, or why it's OK to export all data symbols on *nix but not on Windows.
Re: Mark all GUC variable as PGDLLIMPORT
On Tue, 24 Aug 2021 at 02:31, Robert Haas wrote: > On Mon, Aug 23, 2021 at 11:40 AM Alvaro Herrera > wrote: > > In that case, why not improve the API with functions that return the > > values in some native datatype? For scalars with native C types (int, > > floats, Boolean etc) this is easy enough; I bet it'll solve 99% of the > > problems or more. > > Sure, but ... why bother? > > The entire argument rests on the presumption that there is some harm > being done by people accessing the values directly, but I don't think > that's true. And, if it were true, it seems likely that this proposed > API would have the exact same problem, because it would let people do > exactly the same thing. And, going through this proposed API would > still be significantly more expensive than just accessing the bare > variables, because you'd at least have to do some kind of lookup based > on the GUC name to find the corresponding variable. It's just a > solution in search of a problem. > > Nothing bad happens when people write extensions that access GUC > variables directly. It works totally, completely fine. Except on > Windows. > Not only that, but postgres already exports every non-static function symbol on both *nix and Windows, and every data symbol on *nix. A lot of those function symbols are very internal and give anything that can call them the ability to wreak absolute havoc on the server's state. There is not even the thinnest pretense of Pg having a dedicated extension API or any sort of internal vs public API separation. This ongoing pain with PGDLLIMPORT on Windows is hard to see as anything but an excuse to make working with and supporting Windows harder because some of us don't like it. I happen to rather dislike working with Windows myself, but I get to do it anyway, and I'd be very happy to remove this particular source of pain.
Re: Mark all GUC variable as PGDLLIMPORT
On Mon, 23 Aug 2021 at 22:15, Tom Lane wrote: > Bruce Momjian writes: > > On Sun, Aug 22, 2021 at 09:29:01PM +0800, Julien Rouhaud wrote: > >> Then shouldn't we try to prevent direct access on all platforms rather > than > >> only one? > > > Agreed. If Julien says 99% of the non-export problems are GUCs, and we > > can just export them all, why not do it? We already export every global > > variable on Unix-like systems, and we have seen no downsides. > > By that argument, *every* globally-visible variable should be marked > PGDLLIMPORT. But the mere fact that two backend .c files need to access > some variable doesn't mean that we want any random bit of code doing so. > > And yes, I absolutely would prohibit extensions from accessing many > of them, if there were a reasonable way to do it. It would be a good > start towards establishing a defined API for extensions. > There is: -fvisibility=hidden and __attribute__((visibility("default"))) . Or, if you prefer to explicitly mark private symbols, use __attribute__((visibility("hidden"))) . In addition to API surface control, this also gives you a smaller export symbol table for faster dynamic linking, and it improves link-time optimization. I could've sworn I proposed its use in the past, but the only relevant list thread I can find is [1]. But that is where we should start: switch from a linker script for libpq to using PGDLLIMPORT (actually it'd be LIBPQDLLIMPORT) in libpq. When -DBUILDING_LIBPQ, this would expand to __declspec(dllexport) on Windows and __attribute__((visibility("default"))) on gcc/clang. Otherwise it expands to __declspec(dllimport) on Windows and empty on other targets. This would also be a good time to rename the confusingly named BUILDING_DLL macro to BUILDING_POSTGRES . The next step would be to have PGDLLIMPORT expand to __attribute__((visibility("default"))) on gcc/clang when building the server itself.
This won't do anything by itself, since all symbols currently have default visibility. But once the "public API" of both function and data symbols is so annotated, we could switch to building Pg with -fvisibility=hidden by default, and on Windows we'd disable the linker script that exports all functions using a generated .DEF file. [1] https://www.postgresql.org/message-id/CAMsr%2BYHa3TfA4rKtnZuzurwCSmxxXNQHFE3UE29BoDEQcwfuxA%40mail.gmail.com
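The expansion rules described above can be sketched as a header fragment. To be clear, LIBPQDLLIMPORT and the exact conditionals below are names from this proposal, not existing PostgreSQL code, so treat this as illustrative only:

```c
#include <assert.h>

/* Pretend this translation unit is part of libpq itself. */
#define BUILDING_LIBPQ

/* Hedged sketch of the proposed LIBPQDLLIMPORT macro: export when
 * building libpq, import (or nothing) when building client code. */
#if defined(_WIN32)
#ifdef BUILDING_LIBPQ
#define LIBPQDLLIMPORT __declspec(dllexport)
#else
#define LIBPQDLLIMPORT __declspec(dllimport)
#endif
#elif defined(__GNUC__) || defined(__clang__)
#ifdef BUILDING_LIBPQ
#define LIBPQDLLIMPORT __attribute__((visibility("default")))
#else
#define LIBPQDLLIMPORT
#endif
#else
#define LIBPQDLLIMPORT
#endif

/* With BUILDING_LIBPQ set, this symbol is annotated for export even
 * under -fvisibility=hidden; client code compiled without the define
 * would see the import (or plain extern) form instead. */
LIBPQDLLIMPORT int libpq_example_symbol = 42;
```

The same pattern, with PGDLLIMPORT and BUILDING_POSTGRES, would cover the server-side step described above.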
Re: Mark all GUC variable as PGDLLIMPORT
On Sun, 22 Aug 2021 at 21:29, Julien Rouhaud wrote: > On Sun, Aug 22, 2021 at 09:19:42AM -0400, Tom Lane wrote: > > > > Uh, no, it's exactly *not* clear. There are a lot of GUCs that are only > > of interest to particular subsystems. I do not see why being a GUC makes > > something automatically more interesting than any other global variable. > > Usually, the fact that one is global is only so the GUC machinery itself > > can get at it, otherwise it'd be static in the owning module. > > > > As for "extensions should be able to get at the values", the GUC > machinery > > already provides uniform mechanisms for doing that safely. Direct access > > to the variable's internal value would be unsafe in many cases. > > Then shouldn't we try to prevent direct access on all platforms rather than > only one? > Yes. That's what I think we should be doing if we're not going to PGDLLIMPORT them all. The current API is painful because it round-trips via a text representation. We'd at least want some kind of GetConfigOptionBool(...), GetConfigOptionEnum(...), etc. I don't understand the objection to marking them all PGDLLIMPORT anyway, though. Any pretense that Pg has any sort of public/private API divide is pure fantasy. Whether things are static or not, in public headers or "internal" headers, etc., is nearly random. We have absolutely no API surface control such as __attribute__((visibility("hidden"))) annotations, and proposals to add them have been soundly rejected in the past. If we have a defined API, where is it defined and annotated? If we don't, then Windows being different and incompatible is a bug, and that bug should be fixed.
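To make the typed-accessor idea concrete, here is a toy sketch of what a GetConfigOptionBool-style call could look like. The registry, option names, and signature are invented for illustration; PostgreSQL's real GUC tables and lookup machinery differ, and the point is only that a native-typed getter avoids the text round-trip:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the GUC machinery's table of bool-typed options. */
typedef struct GucBoolEntry
{
    const char *name;
    bool        value;
} GucBoolEntry;

static const GucBoolEntry bool_gucs[] = {
    {"enable_seqscan", true},
    {"standard_conforming_strings", true},
};

/* Hypothetical GetConfigOptionBool(): one name-based lookup returning
 * the value in its native C type, no text parsing.  Reports an unknown
 * name via *found (if non-NULL) instead of erroring out. */
static bool
GetConfigOptionBool(const char *name, bool *found)
{
    for (size_t i = 0; i < sizeof(bool_gucs) / sizeof(bool_gucs[0]); i++)
    {
        if (strcmp(bool_gucs[i].name, name) == 0)
        {
            if (found)
                *found = true;
            return bool_gucs[i].value;
        }
    }
    if (found)
        *found = false;
    return false;
}
```

An extension that caches the result (or, per Chapman's idea upthread, does the lookup once at initialization) would avoid even this per-call name scan.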
Quick tip on building pg with Visual Studio Build Tools 2019 + small patches
Hi all If you're trying to build postgres with Visual Studio Build Tools 16 2019 using the optional v140 toolset, which installs the Visual Studio 14 (2015) C/C++ toolchain, to get binaries that're fully compatible with the EDB postgres builds, you may run into some confusing issues. Use this incantation in cmd.exe (not a powershell.exe or pwsh.exe session) to select the VS 16 msbuild with the VS 14 compiler: "%PROGRAMFILES(x86)%\Microsoft Visual Studio\2019\BuildTools\VC\Auxiliary\Build\vcvarsall.bat" amd64 -vcvars_ver=14.0 all on one line, then run src\tools\msvc\build.bat as normal. If you instead attempt to use the vcvarsall.bat from the v140 toolchain that VS Build Tools 2019 installed for you, it'll appear to work, but compilation will then fail by spamming: some.vcxproj(17,3): error MSB4019: The imported project "C:\Microsoft.Cpp.Default.props" was not found. Confirm that the path in the <Import> declaration is correct, and that the file exists on disk. This is because the v140 toolset does not include the v140 msbuild; you're expected to use the v160 msbuild and configure it to use the v140 toolset instead. Similar issues occur when you try to use the CMake generator "Visual Studio 14 2015" with a VS Build Tools 2019-installed version of the v140 toolchain; you have to instead use -G "Visual Studio 16 2019" -T "v140" to select the VS 16 msbuild and tell it to use the v140 toolchain. Crazy stuff. If you instead just run: "%PROGRAMFILES(x86)%\Microsoft Visual Studio\2019\BuildTools\VC\Auxiliary\Build\vcvarsall.bat" amd64 then compilation will run fine, but the resulting binary will use the version 16 MSVC compilers, runtime library, redist, etc. Note that all these builds will target the default Windows 10 SDK. That should be fine; we're very conservative in postgres about new Windows features and functions, and we do dynamic lookups for a few symbols when we're not sure if they'll be available.
But you can force it to compile for Windows 7 and higher by editing Mk.pm and adding the definitions WINVER=0x0601 _WIN32_WINNT=0x0601 to your project. I didn't find a way to add custom preprocessor definitions in config.pl, so for testing purposes I hacked it into MSBuildProject.pm in the <PreprocessorDefinitions> clause as ;WINVER=0x0601;_WIN32_WINNT=0x0601 I've attached a patch that teaches config.pl about a new 'definitions' option to make this more graceful. See https://docs.microsoft.com/en-us/cpp/porting/modifying-winver-and-win32-winnt?view=msvc-160 If you don't have the toolchain installed, you can install Chocolatey (there's a one-liner on their website) then: choco install -y visualstudio2019buildtools choco install -y visualstudio2019-vc++ --packageparameters "--add Microsoft.VisualStudio.Component.VC.140" You may also want choco install -y winflexbison (I've attached a patch that teaches pgflex.pl and pgbison.pl to use win_flex.exe and win_bison.exe if they're found, and to accept full paths for these tools in buildenv.pl). From ad9625026bcb7768c636fdf3a37a3403db195ae2 Mon Sep 17 00:00:00 2001 From: Craig Ringer Date: Tue, 13 Jul 2021 14:18:44 +1000 Subject: [PATCH v1 1/2] Teach pgflex.pl and pgbison.pl to read buildenv.pl for tool names Some distributions of flex and bison on Windows use alternate executable names such as win_flex.exe and win_bison.exe. Teach our pgflex.pl and pgbison.pl wrappers how to handle them by reading the executables to use from the new $flex and $bison variables in src/tools/msvc/buildenv.pl . These may be bare names of commands on the PATH or they may be fully qualified paths to the target executables. While we're at it, notice when the test execution of flex or bison to check the version fails, and complain with a more informative error.
--- src/tools/msvc/build.pl| 4 src/tools/msvc/buildenv_default.pl | 14 ++ src/tools/msvc/pgbison.pl | 27 +++ src/tools/msvc/pgflex.pl | 25 + 4 files changed, 62 insertions(+), 8 deletions(-) create mode 100644 src/tools/msvc/buildenv_default.pl diff --git a/src/tools/msvc/build.pl b/src/tools/msvc/build.pl index de50554e7e..dedd515307 100644 --- a/src/tools/msvc/build.pl +++ b/src/tools/msvc/build.pl @@ -28,6 +28,10 @@ elsif (-e "./buildenv.pl") { do "./buildenv.pl"; } +elsif (-e "src/tools/msvc/buildenv_default.pl") +{ + do "src/tools/msvc/buildenv_default.pl"; +} # set up the project our $config; diff --git a/src/tools/msvc/buildenv_default.pl b/src/tools/msvc/buildenv_default.pl new file mode 100644 index 00..b3868f2145 --- /dev/null +++ b/src/tools/msvc/buildenv_default.pl @@ -0,0 +1,14 @@ +# Copy this file to src\test\msvc\buildenv.pl and modify it as required +# for your build toolchai
Re: Detecting File Damage & Inconsistencies
On Fri, 2 Jul 2021 at 00:19, Simon Riggs wrote: > > So yeah. I think it'd be better to log the info you want at start-of-txn > unless there's a compelling reason not to, and I don't see one yet. > > AFAIK, XLOG_XACT_ASSIGNMENT does not occur for normal top-level > transactions, only for subxids. > > I don't really want to add an extra record just for this because it > will slow down applications and it won't get turned on as often. > OK, that makes sense - I was indeed operating on an incorrect assumption. I wouldn't want to add a new record either. I thought we could piggyback on XLOG_XACT_ASSIGNMENT with a new chunk that's only added when the feature is required, much as we do for replication origin info on commit records. Is it worth considering forcing XLOG_XACT_ASSIGNMENT to be logged if this feature is enabled? If you don't think the sorts of use cases I presented are worth the trouble, that's fair enough. I'm not against adding it on the commit record. It's just that with logical decoding moving toward a streaming model, I suspect only having it at commit time may cause us some pain later.
Re: RFC: Detailed reorder buffer stats dumps
On Thu, 6 May 2021 at 19:40, Amit Kapila wrote: > On Thu, May 6, 2021 at 9:54 AM Craig Ringer wrote: > > > At the least it'd be helpful to have pg_stat_replication (or a new > related auxiliary view like pg_stat_logical_decoding) show > > > > - walsender total bytes sent this session > > - number of txns processed this session > > > > You might be able to derive some of the above sorts of stats from the > newly added pg_stat_replication_slots [1]. > > That's a huge improvement that I managed to totally miss the discussion of and work on. Thanks. It'll be a significant help. > > - number of txns filtered out by output plugin this session > > - oldest xid in reorder buffer > > - reorder buffer number of txns > > - reorder buffer total size (in-memory and total inc spilled) > > - ReorderBufferCommit current xid, last change lsn, total buffered size > of current txn, total bytes of buffer processed so far within the current > txn, and commit lsn if known, only when currently streaming a txn from > ReorderBufferCommit > These are less statistical in nature, and more about the current activity of the walsender and logical decoding state. I'm not sure it'd make much sense to tack them on to pg_stat_replication_slots, especially as that'd also mean they were quite delayed. But it probably isn't worth the effort of exposing this info in a new view. With the higher-level info now available in pg_stat_replication_slots, I think I might look into exposing these finer details via trace markers for use with perf / systemtap / etc. instead. A digression, but: it's a real shame that such tools don't give us a way to read specific tagged regions of memory with the same ease they let us probe function calls. You generally need gdb to read the value of a global, or a moderately funky systemtap script. There's no convenient equivalent to SDT markers (TRACE_FOO) for tagging variables.
Wouldn't it be nice if we could perf watch postgres:walsender_reorderbuffer_oldest_xid or something like that?
Re: How to expose session vs txn lock info in pg_locks view?
On Mon, 1 Feb 2021 at 18:42, Craig Ringer wrote: > On Sun, 24 Jan 2021 at 09:12, Andres Freund wrote: > >> Hi, >> >> On 2021-01-19 14:16:07 +0800, Craig Ringer wrote: >> > AFAICS it'd be necessary to expand PROCLOCK to expose this in shmem. >> > Probably by adding a small bitfield where bit 0 is set if there's a txn >> > level lock and bit 1 is set if there's a session level lock. But I'm not >> > convinced that expanding PROCLOCK is justifiable for this. >> sizeof(PROCLOCK) >> > is 64 on a typical x64 machine. Adding anything to it increases it to 72 >> > bytes. >> >> Indeed - I really don't want to increase the size, it's already a >> problem. >> >> >> > It's frustrating to be unable to tell the difference between >> session-level >> > and txn-level locks in diagnostic output. >> >> It'd be useful, I agree. >> >> >> > And the deadlock detector has no way to tell the difference when >> > selecting a victim for a deadlock abort - it'd probably make sense to >> > prefer to send a deadlock abort for txn-only lockers. >> >> I'm doubtful this is worth going for. >> >> >> > But I'm not sure I see a sensible way to add the info - PROCLOCK is >> > already free of any padding, and I wouldn't want to use hacks like >> > pointer-tagging. >> >> I think there's an easy way to squeeze out space: make groupLeader be an >> integer index into allProcs instead. That requires only 4 bytes... >> >> Alternatively, I think it'd be reasonably easy to add the scope as a bit >> in LOCKMASK - there's plenty space. >> > > I was wondering about that, but concerned that there would be impacts I > did not understand. > > I'm happy to pursue that angle. > Just so this thread isn't left dangling: I'm just not going to get time to follow up on this work with a concrete patch and test suite change. If anyone later wants to differentiate between session and txn locks, they could start with the approach proposed here.
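Andres's second suggestion (recording the scope as a spare bit in LOCKMASK) could look roughly like the sketch below. The bit position, macro names, and helper are invented for illustration and are not PostgreSQL code:

```c
#include <assert.h>
#include <stdint.h>

/* LOCKMASK is a bitmask of held lock modes.  The defined lock modes
 * occupy the low bits, so a high bit is free to carry scope info. */
typedef uint32_t LOCKMASK;

#define LOCKBIT_ON(lockmode) (1u << (lockmode))

/* Hypothetical spare bit: set when any of the held locks is a
 * session-level lock rather than a transaction-level one. */
#define LOCKMASK_SESSION_SCOPE (1u << 31)

/* The deadlock detector could use this to prefer txn-only lockers as
 * cheaper deadlock-abort victims, per the idea in the thread above. */
static int
is_txn_only_locker(LOCKMASK holdMask)
{
    return (holdMask & LOCKMASK_SESSION_SCOPE) == 0;
}
```

This costs no extra PROCLOCK bytes, which is the point: the scope rides along in an existing field instead of widening the 64-byte struct to 72.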
Re: [PATCH] ProcessInterrupts_hook
On Sat, 20 Mar 2021 at 03:46, Tom Lane wrote: > Robert Haas writes: > > On Fri, Mar 19, 2021 at 3:25 PM Tom Lane wrote: > >> I'm not very comfortable about the idea of having the postmaster set > >> child processes' latches ... that doesn't sound terribly safe from the > >> standpoint of not allowing the postmaster to mess with shared memory > >> state that could cause it to block or crash. If we already do that > >> elsewhere, then OK, but I don't think we do. > > > It should be unnecessary anyway. We changed it a while back to make > > any SIGUSR1 set the latch > > Hmm, so the postmaster could send SIGUSR1 without setting any particular > pmsignal reason? Yeah, I suppose that could work. Or we could recast > this as being a new pmsignal reason. > I'd be fine with either approach. I don't expect to be able to get to working on a concrete patch for this any time soon, so I'll be leaving it here unless someone else needs to pick it up for their extension work. The in-principle agreement is there for future work, anyway.
Re: [PATCH] More docs on what to do and not do in extension code
Laurenz, Thanks for your comments. Sorry it's taken me so long to get back to you. Commenting inline below on anything I think needs comment; the other proposed changes look good. On Sun, 30 May 2021 at 19:20, Laurenz Albe wrote: > + always use <link linkend="xfunc-sleeping-interrupts-blocking">PostgreSQL's > + interrupt-aware APIs</link> for the purpose. Do not > usleep(), > + system(), make blocking system calls, etc. > + > > "system" is not a verb. > When it's a function call it is, but I'm fine with your revision: Do not use usleep(), system() > or other blocking system calls. > > +and should only use the main thread. PostgreSQL generally uses > subprocesses > > Hm. If the extension does not start threads, it automatically uses the > main thread. > I think that should be removed for clarity. > IIRC I intended that to apply to the section that tries to say how to survive running your own threads in the backend if you really must do so. > +primitives like WaitEventSetWait where > necessary. Any > +potentially long-running loop should periodically call > +CHECK_FOR_INTERRUPTS() to give PostgreSQL a chance to > interrupt > +the function in case of a shutdown request, query cancel, etc. > This means > > Are there other causes than shutdown or query cancellation? > At any rate, I am not fond of enumerations ending with "etc". > I guess. I wanted to emphasise that if you get this wrong, postgres might fail to shut down, or your backend might fail to respond to SIGTERM / pg_terminate_backend, as those are the most commonly reported symptoms when such issues are encountered. > +by PostgreSQL if any function that it calls makes a > +CHECK_FOR_INTERRUPTS() check. It may not > > "makes" sounds kind of clumsy in my ears. > Yeah. I didn't come up with anything better right away, but will look when I get the chance to return to this patch. > Attached is a new version that has my suggested changes, plus a few from > Bharath Rupireddy (I do not agree with many of his suggestions). > Thanks very much.
I will try to return to this soon, review the diff, then rebase and update the patch. I have a large backlog to get through, and I've recently had the pleasure of having to work on Windows/PowerShell build system stuff, so it may still take me a while.
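The CHECK_FOR_INTERRUPTS() pattern the doc patch above describes reduces to the following toy model, where PostgreSQL's real macro and InterruptPending flag are simulated with setjmp/longjmp stand-ins (the real macro raises an ERROR through PostgreSQL's own longjmp-based error machinery):

```c
#include <assert.h>
#include <setjmp.h>
#include <signal.h>

/* Toy stand-ins; not PostgreSQL's actual declarations. */
static volatile sig_atomic_t InterruptPending = 0;
static jmp_buf interrupt_jmp;

#define CHECK_FOR_INTERRUPTS() \
    do { if (InterruptPending) longjmp(interrupt_jmp, 1); } while (0)

/* A long-running loop that stays responsive to cancellation because it
 * checks for pending interrupts on every iteration. */
static long
interruptible_work(long iterations)
{
    long done = 0;

    for (long i = 0; i < iterations; i++)
    {
        CHECK_FOR_INTERRUPTS();   /* yield point */
        done++;
        if (done == 1000)
            InterruptPending = 1; /* simulate a query cancel arriving */
    }
    return done;
}

/* Returns -1 if the work was interrupted, else the iteration count. */
static long
run_cancellable_work(void)
{
    if (setjmp(interrupt_jmp) == 0)
        return interruptible_work(1000000);
    return -1;
}
```

A loop that omitted the check would run all million iterations regardless of the pending cancel, which is exactly the unresponsive-to-SIGTERM symptom described above.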
Adding more PGDLLIMPORTs
Hi all As a result of some recent work on Windows I have a list of PGDLLIMPORTs I'd like to add to existing exported globals. All affected variables are already extern, so this doesn't expose any new API not already available to non-Windows extensions. I've split the patch up for clarity: * v1-0001: PGDLLIMPORTs for xlog.c's XactLastRecEnd, ProcLastRecPtr and reachedConsistency . I only really need XactLastRecEnd but think it's sensible to expose all of them. * v1-0002: PGDLLIMPORT for WalRcv . Allows extensions to observe WAL receiver state and behaviour. * v1-0003: PGDLLIMPORT criticalSharedRelcachesBuilt and criticalRelcachesBuilt . I only really need criticalSharedRelcachesBuilt but it seems sensible to export both. Useful when extension code may run during early startup (_PG_init in shared_preload_libraries, shmem init, etc) and later during normal running. * v1-0004: PGDLLIMPORT a set of useful GUCs and vars containing GUC-derived state in xlog.h and walreceiver.h I will follow up soon with a patch that marks every GUC as PGDLLIMPORT, including any vars derived from the GUC by hooks. I don't see much point doing this piecemeal since they're all externs anyway. That patch will replace patch 4 above, but not patches 1-3. I'd love to see these PGDLLIMPORTs backported to pg13. From f4c1abfbfc33ad95b5c216f1fcf132938c6377bc Mon Sep 17 00:00:00 2001 From: Craig Ringer Date: Thu, 17 Jun 2021 11:46:17 +0800 Subject: [PATCH v1 1/5] Make xlog.c vars PGDLLIMPORT Allow extensions to see XactLastRecEnd, ProcLastRecPtr and reachedConsistency.
--- src/include/access/xlog.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 77187c12be..e0b3a75d4d 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -97,11 +97,11 @@ typedef enum RECOVERY_TARGET_TIMELINE_NUMERIC } RecoveryTargetTimeLineGoal; -extern XLogRecPtr ProcLastRecPtr; -extern XLogRecPtr XactLastRecEnd; +extern PGDLLIMPORT XLogRecPtr ProcLastRecPtr; +extern PGDLLIMPORT XLogRecPtr XactLastRecEnd; extern PGDLLIMPORT XLogRecPtr XactLastCommitEnd; -extern bool reachedConsistency; +extern PGDLLIMPORT bool reachedConsistency; /* these variables are GUC parameters related to XLOG */ extern int wal_segment_size; -- 2.31.1 From 9284627abdcee2b851531a55f82288a73285c9fd Mon Sep 17 00:00:00 2001 From: Craig Ringer Date: Thu, 17 Jun 2021 11:47:18 +0800 Subject: [PATCH v1 2/5] Make struct WalRecv PGDLLIMPORT for extensions This allows extensions on a standby to observe the state of the wal receiver. --- src/include/replication/walreceiver.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h index 4fd7c25ea7..016814658e 100644 --- a/src/include/replication/walreceiver.h +++ b/src/include/replication/walreceiver.h @@ -160,7 +160,7 @@ typedef struct sig_atomic_t force_reply; /* used as a bool */ } WalRcvData; -extern WalRcvData *WalRcv; +extern PGDLLIMPORT WalRcvData *WalRcv; typedef struct { -- 2.31.1 From 6ef9e486e829b9e96a622bf60d2fe6a7044a6321 Mon Sep 17 00:00:00 2001 From: Craig Ringer Date: Thu, 17 Jun 2021 11:50:11 +0800 Subject: [PATCH v1 3/5] PGDLLIMPORT criticalSharedRelcachesBuilt and criticalRelcachesBuilt Allow extensions to see the state of relcache init so they can tell if relcache access is safe. Useful for extensions that register at shared_preload_libraries time. 
--- src/include/utils/relcache.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h index f772855ac6..202f2b10d9 100644 --- a/src/include/utils/relcache.h +++ b/src/include/utils/relcache.h @@ -145,9 +145,9 @@ extern void RelationCacheInitFilePostInvalidate(void); extern void RelationCacheInitFileRemove(void); /* should be used only by relcache.c and catcache.c */ -extern bool criticalRelcachesBuilt; +extern PGDLLIMPORT bool criticalRelcachesBuilt; /* should be used only by relcache.c and postinit.c */ -extern bool criticalSharedRelcachesBuilt; +extern PGDLLIMPORT bool criticalSharedRelcachesBuilt; #endif /* RELCACHE_H */ -- 2.31.1 From f5bc0d2fc74eb2855196f7aa02295d29059f742f Mon Sep 17 00:00:00 2001 From: Craig Ringer Date: Thu, 17 Jun 2021 11:53:21 +0800 Subject: [PATCH v1 4/5] PGDLLIMPORT a set of useful GUCs and derived variables --- src/include/access/xlog.h | 76 +-- src/include/replication/walreceiver.h | 6 +-- 2 files changed, 41 insertions(+), 41 deletions(-) diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index e0b3a75d4d..3a6dbbaef4 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -104,47 +104,47 @@ extern PGDLLIMPORT XLogRecPtr XactLastCommitEnd; extern PGDLLIMPORT bool reachedConsistency; /* these variables are GUC p
Re: Detecting File Damage & Inconsistencies
On Tue, 22 Jun 2021 at 00:24, Simon Riggs wrote: > On Thu, Mar 18, 2021 at 6:20 AM Craig Ringer > wrote: > > > > On Mon, 15 Mar 2021 at 21:01, David Steele wrote: > >> > >> On 11/18/20 5:23 AM, Simon Riggs wrote: > >> > On Wed, 18 Nov 2020 at 06:42, Craig Ringer > >> > wrote: > >> >> > >> >> On Fri, Nov 13, 2020 at 7:24 PM Simon Riggs > wrote: > >> >>> > >> >>> > >> >>> What I'm proposing is an option to add 16 bytes onto each COMMIT > >> >>> record > >> >> > >> >> > >> >> Would it make sense to write this at the time we write a topxid > assignment to WAL instead? > >> >> > >> >> Otherwise it won't be accessible to streaming-mode logical decoding. > >> > > >> > Do you mean extend the xl_xact_assignment record? My understanding is > >> > that is not sent in all cases, so not sure what you mean by "instead". > >> > >> Craig, can you clarify? > > > > > > Right. Or write a separate WAL record when the feature is enabled. But > it's probably sufficient to write it as an optional chunk on > xl_xact_assignment records. We often defer writing them so we can optimise > away xacts that never actually wrote anything, but IIRC we still write one > before we write any WAL that references the xid. That'd be fine, since we > don't need the info any sooner than that during decoding. I'd have to > double check that we write it in all cases and won't get to that too soon, > but I'm pretty sure we do... > > The commit record is optimized away if no xid is assigned, though is > still present if we didn't write any WAL records. > > But if a commit record exists in the WAL stream, we want to know where > it came from. > > A later patch will add PITR capability based on this information so > attaching it directly to the commit record is fairly important, IMHO. > Why? All of the proposed info (8-byte session start time from MyStartTime, 2-byte pid from MyProcPid, 4-byte user oid) is available at topxid assignment time.
If we defer writing them until commit, we lose the ability to use this information during streaming logical decoding. That's something I believe you've wanted for other functionality in the past, such as logical-decoding-based audit functionality. IIRC the restart_lsn horizon already ensures that we can't miss the xl_xact_assignment at the start of a txn. We would ensure that the desired info is available throughout decoding of the txn, including at commit record processing time, by adding it to the toplevel ReorderBufferTxn. The only advantage I can see to annotating the commit record instead is that we don't have to spend a few bytes per reorder-buffered topxid to track this info between the start of decoding for the txn and processing of the commit record. I don't think that's worth caring about. The advantages that having it earlier would give us are much more significant. A few examples: * Skip reorder buffering of non-target transactions early, so we can decode the WAL stream to find the target transactions much faster using less memory and I/O; * Read the database change stream and use the session info to stream info into an intrusion detection system and/or audit engine in real time, using txn streaming to avoid the need to create huge reorder buffers; * Re-decode the WAL stream to identify a target txn you know was aborted, and commit it instead, so you can recover data from aborted txns from the WAL stream using logical decoding (only possible if the catalog_xmin hasn't advanced past that point, though). So yeah. I think it'd be better to log the info you want at start-of-txn unless there's a compelling reason not to, and I don't see one yet.
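For concreteness, the 16 bytes quoted at the top of the thread (8-byte session start time, 2-byte pid, 4-byte user oid) might be laid out as below. The struct and field names are invented here, not actual PostgreSQL code, and alignment padding is what brings the 14 bytes of payload up to 16:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical optional chunk carrying session-origin info, whether
 * attached to the commit record or to xl_xact_assignment. */
typedef struct xl_xact_origin_info
{
    int64_t  session_start_time; /* 8 bytes, from MyStartTime */
    uint32_t user_oid;           /* 4 bytes, role OID of the session user */
    uint16_t pid_low;            /* 2 bytes, truncated MyProcPid */
    /* 2 bytes of trailing padding on typical LP64 ABIs */
} xl_xact_origin_info;
```

On a typical LP64 ABI the 8-byte alignment of the first field makes sizeof() come to 16, matching the "16 bytes onto each COMMIT record" figure quoted above.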
Re: RFC: Detailed reorder buffer stats dumps
On Thu, 6 May 2021 at 02:28, Andres Freund wrote: > Hi, > > On 2021-05-05 18:33:27 +0800, Craig Ringer wrote: > > I'm thinking of piggy-backing on the approach used in the "Get memory > > contexts of an arbitrary backend process" patch in order to provide > access > > to detailed reorder buffer content statistics from walsenders on request. > > > > Right now the reorder buffer is mostly a black-box. I mostly rely on gdb > or > > on dynamic probes (perf, systemtap) to see what it's doing. I intend a > > patch soon to add a couple of fields to struct WalSnd to report some very > > coarse reorder buffer stats - at least oldest buffered xid, number of > > buffered txns, total bytes of buffered txns in memory, total bytes of > > buffered txns spilled to disk. > > > > But sometimes what I really want is details on the txns that're in the > > reorder buffer, and that's not feasible to export via always-enabled > > reporting like struct WalSnd. So I'm thinking that the same approach used > > for the memory context stats patch might work well for asking the > walsender > > for a detailed dump of reorder buffer contents. Something like a > > per-buffered-txn table of txn topxid, start-lsn, most recent change lsn, > > number of changes, number of subxids, number of invalidations, number of > > catalog changes, buffer size in memory, buffer size spilled to disk. > > > > Anyone drastically opposed to the idea? > > I am doubtful. The likelihood of ending with effectively unused code > seems very substantial here. > I can't rule that out, but the motivation for this proposal isn't development convenience. It's production support and operations. The reorder buffer is a black box right now, and when you're trying to answer the questions "what is the walsender doing," "is meaningful progress being made," and "what is slowing down replication" it's ... not easy. I currently rely on some fairly hairy gdb scripts, which I'm not keen on running on production systems if I can avoid it. 
I'm far from set on the approach of asking the walsender to dump a reorder buffer state summary to a file. But I don't think the current state of affairs is much fun for production use. Especially since we prevent the pg_stat_replication sent_lsn from going backwards, reorder buffering can cause replication to appear to completely cease to progress for long periods unless you identify the socket and monitor traffic on it, or you intrude on the process with gdb. At the least it'd be helpful to have pg_stat_replication (or a new related auxiliary view like pg_stat_logical_decoding) show - walsender total bytes sent this session - number of txns processed this session - number of txns filtered out by output plugin this session - oldest xid in reorder buffer - reorder buffer number of txns - reorder buffer total size (in-memory and total inc spilled) - ReorderBufferCommit current xid, last change lsn, total buffered size of current txn, total bytes of buffer processed so far within the current txn, and commit lsn if known, only when currently streaming a txn from ReorderBufferCommit That way it'd be possible to tell if a logical walsender is currently processing a commit and get a much better sense of its progress within the commit. Perhaps output plugins could do some of this and expose their own custom views. But then each plugin would have to add its own, plus they don't get a particularly good view into the reorder buffer state; they'd have a hard time maintaining good running stats. Some basic monitoring exposed for logical decoding and reorder buffering would help a lot. Does that sound more palatable? If so, I'd probably still want to be able to hook a few places in logical decoding to allow an extension to instrument it when greater insight into the inner workings is required. -- Craig Ringer http://www.2ndQuadrant.com/ 2ndQuadrant - PostgreSQL Solutions for the Enterprise
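The per-walsender counters proposed above could be a small always-updated struct, in the spirit of (but not taken from) the fields on struct WalSnd mentioned earlier in the thread. All names here are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical coarse logical-decoding stats a walsender might keep in
 * shared memory for a pg_stat_logical_decoding view to read. */
typedef struct LogicalDecodingStats
{
    uint64_t bytes_sent;           /* total bytes sent this session */
    uint64_t txns_processed;       /* txns decoded and sent this session */
    uint64_t txns_filtered;        /* txns filtered out by the output plugin */
    uint32_t oldest_buffered_xid;  /* oldest xid in the reorder buffer */
    uint32_t buffered_txns;        /* txns currently in the reorder buffer */
    uint64_t buffered_bytes_mem;   /* in-memory reorder buffer size */
    uint64_t buffered_bytes_total; /* including bytes spilled to disk */
} LogicalDecodingStats;

/* Updating a counter is a plain increment on the hot path; readers
 * tolerate slightly stale values, as pg_stat_* views generally do. */
static void
stats_record_txn_sent(LogicalDecodingStats *stats, uint64_t txn_bytes)
{
    stats->bytes_sent += txn_bytes;
    stats->txns_processed += 1;
}
```

Keeping the fields to plain fixed-width counters is what would make this cheap enough to maintain unconditionally, unlike the detailed per-txn dump discussed above.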
Re: Is txid_status() actually safe? / What is 011_crash_recovery.pl testing?
On Wed, 5 May 2021 at 23:15, Craig Ringer wrote:

> Which was fine as far as it went, but I failed to account for the xid
> assignment not necessarily being durable when the client calls
> txid_status().

Ahem, I meant "when the client calls txid_current()"

--
Craig Ringer
http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
Re: Is txid_status() actually safe? / What is 011_crash_recovery.pl testing?
On Tue, 9 Feb 2021 at 05:52, Andres Freund wrote:

> Craig, it kind of looks to me like you assumed it'd be guaranteed that
> the xid at this point would show in-progress?

At the time I wrote that code, I don't think I understood that xid assignment wasn't necessarily durable until either (a) the next checkpoint; or (b) commit of some txn with a greater xid.

IIRC I expected that after crash and recovery the tx would always be treated as aborted, because the xid had been assigned but no corresponding commit was found before end-of-recovery. No explicit abort records are written to WAL for such txns since we crashed, but the server's oldest in-progress txn threshold is used to determine that they must be aborted rather than in-progress even though their clog entries aren't set to aborted.

Which was fine as far as it went, but I failed to account for the xid assignment not necessarily being durable when the client calls txid_status().

> I don't think the use of txid_status() described in the docs added in
> the commit is actually ever safe?

I agree. The client can query for its xid with txid_current() but as you note there's no guarantee that the assigned xid is durable. The client would have to ensure that an xid was assigned, then ensure that the WAL was durably flushed past the point of the xid assignment before relying on the xid.

If we do a txn that performs a small write, calls txid_current(), and sends a commit that the server crashes before completing, we can't know for sure that the xid we recorded client-side before the server crash is the same txn we check the status of after crash recovery. Some other txn could've re-used the xid after crash so long as no other txn with a greater xid durably committed before the crash. That scenario isn't hugely likely, but it's definitely possible on systems that don't do a lot of concurrent txns or do mostly long, heavyweight txns.
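To make the reuse scenario above concrete, here's a toy model (not postgres code, all names invented) of the hazard: xid assignment is only durable once WAL containing it is flushed, and after a crash the server resumes xid allocation from the highest durably-implied point, so an xid the client saw from txid_current() can be handed out again to an unrelated transaction.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of non-durable xid assignment. Names are hypothetical. */
static uint32_t next_xid = 100;     /* in-memory xid allocator */
static uint32_t durable_xid = 100;  /* restart point implied by flushed WAL */

static uint32_t
assign_xid(int wal_flushed_past_assignment)
{
    uint32_t xid = next_xid++;

    if (wal_flushed_past_assignment)
        durable_xid = next_xid;     /* recovery must resume past this xid */
    return xid;
}

static void
crash_and_recover(void)
{
    /* Assignments that never reached durable WAL are simply forgotten. */
    next_xid = durable_xid;
}
```

Run the model with an assignment that is never flushed: the same xid comes back after "recovery", which is exactly why a client-side record of txid_current() output is not safe input to txid_status() across a crash.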
The txid_status() function was originally intended to be paired with a way to report topxid assignment to the client automatically, NOTIFY or GUC_REPORT-style. But that would not make this usage safe either, unless we delayed the report until WAL was flushed past the LSN of the xid assignment *or* some other txn with a greater xid committed.

This could be made safe with a variant of txid_current() that forced the xid assignment to be logged immediately if it was not already, and did not return until WAL flushed past the point of the assignment. If the client did most of the txn's work before requesting a guaranteed-durable xid, it would in practice not end up having to wait for a flush. But we'd have to keep track of when we assigned the xid in every single topxact in order to be able to promise we'd flushed it without having to immediately force a flush. That's pointless overhead all the rest of the time, just in case someone wants to get an xid for later use with txid_status().

The simplest option with no overhead on anything that doesn't care about txid_status() is to expose a function to force flush of WAL up to the current insert LSN. Then update the docs to say you have to call it after txid_current(), and before sending your commit. But at that point you might as well use 2PC, since you're paying the same double-flush and double round-trip costs. The main point of txid_status() was to avoid the cost of that double-flush.

--
Craig Ringer
http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
Re: Is txid_status() actually safe? / What is 011_crash_recovery.pl testing?
On Wed, 10 Feb 2021 at 04:28, Robert Haas wrote:

> On Mon, Feb 8, 2021 at 4:52 PM Andres Freund wrote:
> > The 011_crash_recovery.pl test starts a transaction, creates a
> > table, fetches the transaction's xid. Then shuts down the server in
> > immediate mode. It then asserts that after crash recovery the
> > previously assigned xid is shown as aborted, and that new xids are
> > assigned after the xid.
> >
> > But as far as I can tell there's no guarantee that that is the case.
>
> I think you are right.

Andres, I missed this mail initially. I'll look into it shortly.

--
Craig Ringer
http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
RFC: Detailed reorder buffer stats dumps
Hi all

I'm thinking of piggy-backing on the approach used in the "Get memory contexts of an arbitrary backend process" patch in order to provide access to detailed reorder buffer content statistics from walsenders on request.

Right now the reorder buffer is mostly a black-box. I mostly rely on gdb or on dynamic probes (perf, systemtap) to see what it's doing. I intend a patch soon to add a couple of fields to struct WalSnd to report some very coarse reorder buffer stats - at least oldest buffered xid, number of buffered txns, total bytes of buffered txns in memory, total bytes of buffered txns spilled to disk.

But sometimes what I really want is details on the txns that're in the reorder buffer, and that's not feasible to export via always-enabled reporting like struct WalSnd. So I'm thinking that the same approach used for the memory context stats patch might work well for asking the walsender for a detailed dump of reorder buffer contents. Something like a per-buffered-txn table of txn topxid, start-lsn, most recent change lsn, number of changes, number of subxids, number of invalidations, number of catalog changes, buffer size in memory, buffer size spilled to disk.

Anyone drastically opposed to the idea?

(I know I have to finish up with the LWLock tracepoint patchset first, this is an RFC at this stage).

--
Craig Ringer
http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
[PATCH] Faster, easier valgrind runs with make USE_VALGRIND=1 check
Hi all

The attached patch adds support for running any temp-install based tests (check, isolationcheck, src/test/recovery, etc) under the control of valgrind with a simple

    make USE_VALGRIND=1 check

It's based on a script I've been using for some time to run faster, simpler Valgrind checks on the codebase with much less irrelevant noise than the usual approaches. There are no C code changes at all in this patch; it only touches Makefile.global and adds a new src/tools/valgrind_wrapper tool.

When you specify USE_VALGRIND=1, the PATH used by $(with_temp_install) is prefixed with a tmp_install/bin_valgrind_wrapper/ directory. Each binary in $(bindir) gets a corresponding wrapper script in bin_valgrind_wrapper in the temp install. The wrappers intercept execution of every command in the bindir and exec them under the control of valgrind (or skip valgrind and exec the target directly, if desired).

This has many advantages over the usual approaches of an installcheck-based valgrind run or "valgrind make check":

* There's no custom setup, it works out of the box.
* It produces much less irrelevant log output and runs a lot faster because it only runs postgres-related binaries under valgrind, not irrelevant noise like perl interpreters, make, shell scripts, etc.
* It's much more targeted and selective - if you're only interested in some extension or new backend feature, you can trivially set it to target just the backend, skip checking of initdb, and skip checking of psql, so you get more relevant log output and faster runs.

I'll follow up to this post with some timing and log output numbers but wanted to share what I had first.

-DUSE_VALGRIND is also added to CFLAGS at compile time when USE_VALGRIND=1 is passed to make. This gets rid of the need to hack pg_config_manual.h or fiddle with configure re-runs when you want to build with valgrind support. Arguably it'd be better to add a --enable-valgrind option to configure. LMK if that's preferable.
Note that there's a bit of a hack in the wrapper script to work around Valgrind's inability to set the argv[0] of a process run under valgrind to anything other than the exact command-name to be executed. I have a patch for valgrind pending to add that capability (like "exec -a" in bash) but a workaround is necessary for now.

It's made a bit more complicated by PostgreSQL's determination to canonicalize paths and follow symlinks in find_my_exec(). The script's hardlink based workarounds for this could be removed if we could agree to support a debug env-var or command-line option that could be used to supply an override path to be returned by find_my_exec() instead of performing normal discovery. If you'd prefer that approach to the current workaround in the script let me know.

I'm also willing to add valgrind-support-detection logic that will cause valgrind launched via "make USE_VALGRIND=1" to refuse to run if it detects that the target postgres was not built with -DUSE_VALGRIND for proper instrumentation. This can be done with the valgrind --require-text-symbol option and a dummy export symbol added to the symbol table only when compiled with -DUSE_VALGRIND. If that's desirable let me know, it should be quick to add.

You can find more detail in the patch commit message (attached) and in the src/test/valgrind_wrapper script it adds. If you're wondering why the valgrind options --trace-children=yes --trace-children-skip=pattern --trace-children-skip-by-arg=pattern don't solve this problem, read the script's comments.
--
Craig Ringer
http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise

From e730e862a0731dc3d2786c5004a146aff7dd6bf7 Mon Sep 17 00:00:00 2001
From: Craig Ringer
Date: Tue, 4 May 2021 10:34:01 +0800
Subject: [PATCH v3] Make Valgrind runs simpler with make USE_VALGRIND=1

To run Valgrind based tests on postgres, it's generally been necessary to either:

* Initdb and start your own cluster under the control of valgrind then use "make installcheck". This won't work with TAP tests, and it's cumbersome to set up and run.

-or-

* Run "make check" under the control of valgrind. This approach runs all the uninteresting processes under valgrind, with all the associated overhead. Every make, shell, utility command, perl interpreter, etc gets run under valgrind, which slows everything down a lot and produces a great deal of uninteresting valgrind output. There's no practical way to target just the server backend this way.

Provide a faster, simpler and more targeted way to run valgrind on postgres by adding support for a "USE_VALGRIND=1" parameter to the makefiles. This has two main effects:

* It adds -DUSE_VALGRIND to CFLAGS, so enhanced valgrind runtime support is compiled into postgres; and
* It interposes
Re: [PATCH] Identify LWLocks in tracepoints
On Wed, 5 May 2021 at 09:15, Craig Ringer wrote:

>> warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
>> 1322 | TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
>>      | ^
>
> Odd that I didn't get that.

This compiler complaint is not due to the _ENABLED() test as such. It's due to the fact that we completely define out the TRACE_POSTGRESQL_ macros with src/backend/utils/Gen_dummy_probes.sed. While explicit braces could be added around each test, I suggest fixing Gen_dummy_probes.sed to emit the usual dummy statement instead. Patch attached.

From a1ca7d752b56717854cc47b31cd299e651de0430 Mon Sep 17 00:00:00 2001
From: Craig Ringer
Date: Wed, 5 May 2021 12:06:10 +0800
Subject: [PATCH v1] Emit dummy statements for probes.d probes when disabled
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When building without --enable-dtrace, emit dummy do {} while (0) statements for the stubbed-out TRACE_POSTGRESQL_foo() macros instead of empty macros that totally elide the original probe statement.

This fixes the warning

  suggest braces around empty body in an ‘if’ statement [-Wempty-body]

introduced by b94409a02f.
---
 src/backend/utils/Gen_dummy_probes.sed | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/utils/Gen_dummy_probes.sed b/src/backend/utils/Gen_dummy_probes.sed
index aa3db59cce..6e29d86afa 100644
--- a/src/backend/utils/Gen_dummy_probes.sed
+++ b/src/backend/utils/Gen_dummy_probes.sed
@@ -19,5 +19,6 @@ s/([^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\})/(INT1, INT2,
 s/([^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\})/(INT1, INT2, INT3, INT4, INT5, INT6)/
 s/([^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\})/(INT1, INT2, INT3, INT4, INT5, INT6, INT7)/
 s/([^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\}, [^,)]\{1,\})/(INT1, INT2, INT3, INT4, INT5, INT6, INT7, INT8)/
+s/$/ do {} while (0)/
 P
 s/(.*$/_ENABLED() (0)/
--
2.30.2
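For readers following along, a minimal self-contained illustration of what the patch above fixes. When dtrace is disabled the probe macro must still expand to a statement; if it expands to nothing, "if (cond) TRACE(...);" leaves an empty if-body (tripping -Wempty-body) and a trailing else can bind surprisingly. The macro name here is a hypothetical stand-in for the Gen_dummy_probes.sed output:

```c
#include <assert.h>

/* Stand-in for a stubbed-out probe macro: expands to a dummy statement,
 * as the patched Gen_dummy_probes.sed emits, rather than to nothing. */
#define TRACE_EXAMPLE_PROBE(arg) do {} while (0)

static int else_taken;

static int
maybe_trace(int cond)
{
    if (cond)
        TRACE_EXAMPLE_PROBE(cond);  /* a complete, empty statement */
    else
        else_taken++;               /* the else still binds correctly */
    return else_taken;
}
```

With the old empty expansion this same code draws -Wempty-body; with the do {} while (0) form it compiles cleanly and behaves identically at runtime.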
Re: [PATCH] Identify LWLocks in tracepoints
On Wed, 5 May 2021, 06:15 Andres Freund, wrote: > Hi, > > warning: suggest braces around empty body in an ‘if’ statement > [-Wempty-body] > 1322 |TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode); > | ^ > Odd that I didn't get that. I'll send a patch to revise shortly.
Re: [PATCH] Identify LWLocks in tracepoints
On Wed, 14 Apr 2021, 22:29 Robert Haas, wrote:

> On Tue, Apr 13, 2021 at 10:42 PM Craig Ringer wrote:
> > I'd really love it if a committer could add an explanatory comment or
> > two in the area though. I'm happy to draft a comment patch if anyone
> > agrees my suggestion is sensible. The key things I needed to know when
> > studying the code were:
> > [...]
>
> I'm willing to review a comment patch along those lines.

Cool. I'll draft soon.

I since noticed that some of the info is present, but it's in lwlock.h, whereas in Pg comment detail is more often than not in the .c file. I prefer it in headers myself anyway, since it's more available to tools like doxygen. I'll add a few "see lwlock.h" hints, a short para about appropriate lwlock use in the .c file's intro comment, etc., and post on a separate thread soon.

> > I'm actually inclined to revise the patch I sent in order to *remove*
> > the LWLock name from the tracepoint argument.
>
> Reducing the overheads is good, but I have no opinion on what's
> important for people doing tracing, because I am not one of those
> people.

Truthfully I'm not convinced anyone is "those people" right now. I don't think anyone is likely to be making serious use of them due to their limitations. Certainly that'll be the case for the txn ones, which are almost totally useless. They only track the local xid lifecycle; they don't track real txid allocation, WAL writing, commit (wal or shmem), etc.
Re: [PATCH] Identify LWLocks in tracepoints
On Thu, 29 Apr 2021 at 15:31, Peter Eisentraut wrote: > > So if you could produce a separate patch that adds the > > _ENABLED guards targeting PG14 (and PG13), that would be helpful. > > Here is a proposed patch for this. LGTM. Applies and builds fine on master and (with default fuzz) on REL_13_STABLE. Works as expected. This does increase the size of LWLockAcquire() etc slightly but since it skips these function calls, and the semaphores are easily predicted, I don't have any doubt it's a net win. +1 for merge.
Re: [PATCH] Identify LWLocks in tracepoints
On Wed, 14 Apr 2021 at 10:41, Craig Ringer wrote:

> On Wed, 14 Apr 2021 at 02:25, Robert Haas wrote:
> > You could try to identify locks by pointer addresses, but that's got
> > security hazards and the addreses aren't portable across all the
> > backends involved in the parallel query because of how DSM works, so
> > it's not really that helpful in terms of matching stuff up.
>
> What I'm doing now is identifying them by LWLock* across backends. I
> keep track of DSM segment mappings in each backend inside the trace
> script and I relocate LWLock* pointers known to be inside DSM segments
> relative to a dummy base address so they're equal across backends.

BTW, one of the reasons I did this was to try to identify BDR and pglogical code that blocks or sleeps while holding a LWLock. I got stuck on that for other reasons, so it didn't go anywhere, but those issues are now resolved so I should probably return to it at some point. It'd be a nice thing to be able to run on postgres itself too.
Re: [PATCH] Identify LWLocks in tracepoints
On Wed, 14 Apr 2021 at 02:25, Robert Haas wrote:

> So before the commit in question --
> 3761fe3c20bb040b15f0e8da58d824631da00caa -- T_ID() used to compute an
> offset for a lock within the tranche that was supposed to uniquely
> identify the lock. However, the whole idea of an array per tranche
> turns out to be broken by design.

Yeah, I understand that.

I'd really love it if a committer could add an explanatory comment or two in the area though. I'm happy to draft a comment patch if anyone agrees my suggestion is sensible. The key things I needed to know when studying the code were:

* A LWLock* is always part of a tranche, but locks within a given tranche are not necessarily co-located in memory, cross referenced or associated in any way.

* A LWLock tranche does not track how many LWLocks are in the tranche or where they are in memory. It only groups up LWLocks into categories and maps the tranche ID to a name.

* Not all LWLocks are part of the main LWLock array; others can be embedded in shmem structs elsewhere, including in DSM segments.

* LWLocks in DSM segments may not have the same address between different backends, because the DSM base address can vary, so a LWLock* cannot be reliably compared between backends unless you know it's in the main LWLock array or in static shmem.

Having that info in lwlock.c near the tranche management code or the tranche and main lwlock arrays would've been very handy.

> You could try to identify locks by pointer addresses, but that's got
> security hazards and the addreses aren't portable across all the
> backends involved in the parallel query because of how DSM works, so
> it's not really that helpful in terms of matching stuff up.

What I'm doing now is identifying them by LWLock* across backends. I keep track of DSM segment mappings in each backend inside the trace script and I relocate LWLock* pointers known to be inside DSM segments relative to a dummy base address so they're equal across backends.
It's a real pain, but it works. The main downside is that the trace script has to observe the DSM attach; if it's started once a backend already has the DSM segment attached, it has no idea the LWLock is in a DSM segment or how to remap it. But that's not a serious issue.

> On a broader level, I agree that being able to find out what the
> system is doing is really important. But I'm also not entirely
> convinced that having really fine-grained information here to
> distinguish between one lock and another is the way to get there.

At the start of this thread I would've disagreed with you. But yeah, you and Andres are right, because the costs outweigh the benefits here.

I'm actually inclined to revise the patch I sent in order to *remove* the LWLock name from the tracepoint argument. At least for the fast-path tracepoints on start-of-acquire and end-of-acquire. I think it's probably OK to report it in the lock wait tracepoints, which is where it's most important to have anyway.

So the tracepoint will always report the LWLock* and tranche ID, but it won't report the tranche name for the fast-path. I'll add trace events for tranche ID registration, so trace tools can either remember the tranche ID->name mappings from there, or capture them from lock wait events and remember them.

Reasonable? That way we retain the most important trace functionality, but we reduce the overheads.
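The relocation trick described above is simple enough to sketch. This is roughly what a trace tool could do (all names hypothetical, not postgres code): given the DSM mapping each backend reported at attach time, a LWLock address inside the segment is rebased against a dummy base address so the same lock compares equal across backends that mapped the segment at different addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One backend's view of a DSM segment, as observed at attach time. */
typedef struct
{
    uintptr_t base;  /* where this backend mapped the segment */
    size_t    size;  /* segment length */
} dsm_mapping;

/* Arbitrary dummy base used as the canonical segment address. */
#define DUMMY_BASE ((uintptr_t) 0x700000000000ULL)

/* Rebase a DSM-resident lock address; leave other addresses alone,
 * since main-array and static-shmem LWLocks already compare equal. */
static uintptr_t
canonical_lwlock_addr(uintptr_t lock, const dsm_mapping *m)
{
    if (lock >= m->base && lock < m->base + m->size)
        return DUMMY_BASE + (lock - m->base);
    return lock;
}
```

Two backends mapping the same segment at different bases then yield the same canonical address for the same lock offset, which is all the cross-backend matching needs.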
Re: [PATCH] Identify LWLocks in tracepoints
On Wed, 14 Apr 2021 at 04:46, Andres Freund wrote:

> On 2021-04-13 14:25:23 -0400, Robert Haas wrote:
> > On Mon, Apr 12, 2021 at 11:06 PM Andres Freund wrote:
> > You could identify every lock by a tranche ID + an array offset + a
> > "tranche instance ID". But where would you store the tranche instance
> > ID to make it readily accessible, other than in the lock itself?
> > Andres wasn't thrilled about using even 2 bytes to identify the
> > LWLock, so he'll probably like having more bytes in there for that
> > kind of thing even less.
>
> I still don't like the two bytes, fwiw ;). Especially because it's 4
> bytes due to padding right now.

Aha, did I hear you say "there are two free bytes for me to shove something marginally useful and irrelevant into"? (*grin*)

> I'd like to move the LWLock->waiters list to outside of the lwlock
> itself - at most TotalProcs LWLocks can be waited for, so we don't need
> millions of empty proclist_heads. That way we could also remove the
> proclist indirection - which shows up a fair bit in contended workloads.
>
> And if we had a separate "lwlocks being waited for" structure, we could
> also add more information to it if we wanted to...

Having the ability to observe LWLock waiters would be nice. But you're right to constantly emphasise that LWLocks need to be very slim. We don't want to turn them into near-heavyweight locks by saddling them with overhead that's not strictly necessary.

A simple pg_regress run (with cassert, admittedly) takes 25,000,000 LWLocks and does 24,000 LWLock waits and 130,000 condvar waits. All in less than a minute. OTOH, once someone's waiting we don't care nearly as much about bookkeeping cost, it's the un-contested fast paths that're most critical.

> - We could make it work if we restricted MAX_BACKENDS to be 2**14 - but
> while I personally think that's a sane upper limit, I already had a
> hard time getting consensus to lower the limit to 2^18-1.
16384 backends is totally fine in sane real world deployments. But it'll probably upset marketing people when OtherDatabaseVendor starts shouting that they support 14 million connections, and postgres only has 16k. Sigh.

The real answer here in the long term probably needs to be decoupling of executors from connection state inside postgres. But while we're on that topic, how about we convert the entire codebase to Rust while riding on a flying rainbow unicorn? We're stuck with the 1:1 connection to executor mapping for the foreseeable future.

> - Use a 64bit integer. Then we can easily fit MAX_BACKENDS lockers, as
> well as an offset to one of MAX_BACKENDS "wait lists" into LWLock.

You know much more than me about the possible impacts of that on layout and caching, but I gather that it's probably undesirable to make LWLocks any bigger.

> I think it's quite useful for relatively simple things like analyzing
> the total amount of time spent in individual locks, without incuring
> much overhead when not doing so (for which you need to identify
> individual locks, otherwise your end - start time is going to be
> meaningless).

Yep. That's why the removal of the lock offset is a bit frustrating, because you can't identify an LWLock instance-wide by LWLock* due to the possibility of different per-backend DSM base-address mappings. Well, and ASLR on EXEC_BACKEND systems, but who cares about those? The removal was for good reasons though. And it only affects LWLocks in DSM; for everything else the LWLock* is good enough.

If identifying LWLocks in DSM ever matters enough to bother to solve that problem, we can emit trace events on DSM mapping attach in each backend, and trace tools can do the work to track which LWLocks are in DSM and convert their addresses to a reference base address. Pg shouldn't have to pay the price for that unless it's something a lot of people need.
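Andres's "use a 64bit integer" idea quoted above can be sketched in a few lines. This is only an illustration with hypothetical field widths (nothing here is the real LWLock layout): a locker count and a wait-list offset, each bounded by a MAX_BACKENDS that fits in 18 bits (matching the 2^18-1 limit mentioned earlier), packed into one 64-bit lock word with room left over for flag bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field layout: bits 0-17 locker count, bits 18-35
 * wait-list offset, upper bits free for state flags. */
#define BACKEND_BITS 18
#define BACKEND_MASK ((UINT64_C(1) << BACKEND_BITS) - 1)

static uint64_t
lockword_pack(uint32_t lockers, uint32_t waitlist_off)
{
    return ((uint64_t) lockers) | ((uint64_t) waitlist_off << BACKEND_BITS);
}

static uint32_t
lockword_lockers(uint64_t word)
{
    return (uint32_t) (word & BACKEND_MASK);
}

static uint32_t
lockword_waitlist(uint64_t word)
{
    return (uint32_t) ((word >> BACKEND_BITS) & BACKEND_MASK);
}
```

The attraction is that both fields can then be updated together with a single 64-bit atomic compare-and-swap, rather than growing the struct; whether the layout-and-caching cost is acceptable is exactly the open question in the discussion.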
> And, slightly more advanced, for analyzing what the stack
> was when the lock was released - which then allows you to see what work
> you're blocked on, something I found hard to figure out otherwise.
>
> I found that that's mostly quite doable with dynamic probes though.

Yeah, it is. That's part of why my patchset here doesn't try to do a lot to LWLock tracepoints - I didn't think it was necessary to add a lot. The LWLock code is fairly stable, not usually something you have to worry about in production unless you're debugging badly behaved extensions, and usually somewhat probe-able with DWARF based dynamic probes.

However, the way the wait-loop and fast-path are in the same function is a serious pain for dynamic probing; you can't tell the difference between a fast-path acquire and an acquire after a wait without using probes on function+offset or probing by source line. Both those are fine for dev work but useless in tool/library scripts.

I almost wonder if we should test out moving the LWLock wait-loops out of LWLockAcquire(), LWLockAcquireOrWait() and LWLockWaitForVar() anyway, so the hot parts of the fun
Re: [PATCH] Identify LWLocks in tracepoints
On Tue, 13 Apr 2021 at 21:40, Craig Ringer wrote:

> Findings:
>
> * A probe without arguments or with simple arguments is just a 'nop'
>   instruction
> * Probes that require function calls, pointer chasing, other
>   expression evaluation etc may impose a fixed cost to collect up
>   arguments even if the probe is disabled.
> * SDT semaphores can avoid that cost but add a branch, so should
>   probably be avoided unless preparing probe arguments is likely to be
>   expensive.

Back to the topic directly at hand.

Attached a disassembly of what LWLockAcquire looks like now on my current build of git master @ 5fe83adad9efd5e3929f0465b44e786dc23c7b55. This is an --enable-debug --enable-cassert --enable-dtrace build with -Og -ggdb3.

The three tracepoints in it are all of the form:

    movzwl 0x0(%r13),%edi
    call   0x801c49
    nop

so it's clear we're doing redundant calls to GetLWTrancheName(), as expected. Not ideal.

Now if I patch it to add the _ENABLED() guards on all the tracepoints, the probes look like this:

    0x00803176 <+200>: cmpw   $0x0,0x462da8(%rip)   # 0xc65f26
    0x0080317e <+208>: jne    0x80328b
    ... other interleaved code ...
    0x0080328b <+477>: movzwl 0x0(%r13),%edi
    0x00803290 <+482>: call   0x801c49
    0x00803295 <+487>: nop
    0x00803296 <+488>: jmp    0x803184

so we avoid the GetLWTrancheName() call at the cost of a test and possible branch, and a small expansion in total function size.

Without the semaphores, LWLockAcquire is 463 bytes. With them, it's 524 bytes, which is nothing to sneeze at for code that doesn't do anything 99.999% of the time, but we avoid a bunch of GetLWTrancheName() calls.

If I instead replace T_NAME() with NULL for all tracepoints in LWLockAcquire, the disassembly shows that the tracepoints now become a simple

    0x00803176 <+200>: nop

which is pretty hard to be concerned about.

So at the very least we should be calling GetLWTrancheName() once at the start of the function if built with dtrace support and remembering the value, instead of calling it for each tracepoint.
And omitting the tranche name looks like it might be sensible for the LWLock code. In most places it won't matter, but LWLocks are hot enough that it possibly might. A simple pg_regress run hits LWLockAcquire 25 million times, so that's 75 million calls to GetLWTrancheName().
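The control flow behind the disassembly above can be simulated in plain C. Here the SDT semaphore is modelled as an ordinary global and the probe as a no-op macro (all names are stand-ins, not the real probes.h macros); the point is that the tranche-name lookup, the cost being measured above, only runs when a tracing tool has bumped the semaphore:

```c
#include <assert.h>

static unsigned short lwlock_acquire_semaphore;  /* SDT semaphore stand-in */
static int tranche_name_lookups;                 /* counts the avoidable cost */

/* Stand-in for GetLWTrancheName(): the call we only want to pay for
 * when a trace tool is actually attached. */
static const char *
get_tranche_name_sim(int tranche_id)
{
    (void) tranche_id;
    tranche_name_lookups++;
    return "buffer_mapping";
}

/* Guard and probe macros, modelled on the _ENABLED() pattern. */
#define TRACE_LWLOCK_ACQUIRE_ENABLED() (lwlock_acquire_semaphore != 0)
#define TRACE_LWLOCK_ACQUIRE(name, mode) \
    do { (void) (name); (void) (mode); } while (0)

static void
lwlock_acquire_sim(int tranche_id, int mode)
{
    /* Argument marshalling is skipped entirely while the probe is off. */
    if (TRACE_LWLOCK_ACQUIRE_ENABLED())
        TRACE_LWLOCK_ACQUIRE(get_tranche_name_sim(tranche_id), mode);
}
```

With the semaphore at zero the lookup never happens, which corresponds to the jne-over-the-probe-body sequence in the guarded disassembly; the real guard costs a compare and predicted branch rather than a function call.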
Re: [PATCH] Identify LWLocks in tracepoints
On Tue, 13 Apr 2021 at 21:05, Craig Ringer wrote:

> On Tue, 13 Apr 2021 at 11:06, Andres Freund wrote:
> > IIRC those aren't really comparable - the kernel actually does modify
> > the executable code to replace the tracepoints with nops.
>
> Same with userspace static trace markers (USDTs).
>
> A followup mail will contain a testcase and samples to demonstrate this.

Demo follows, with source attached too. gcc 10.2, compiling with -O2, using the dtrace script from systemtap 4.4.

Trivial empty function definition:

    __attribute__((noinline))
    void no_args(void)
    {
        SDT_NOOP_NO_ARGS();
    }

Disassembly when SDT_NOOP_NO_ARGS is defined as

    #define SDT_NOOP_NO_ARGS()

is:

    retq

When built with a probes.d definition processed by the dtrace script instead, the disassembly becomes:

    nop
    retq

So ... yup, it's a nop.

Now, if we introduce semaphores that changes.

    __attribute__((noinline))
    void no_args(void)
    {
        if (SDT_NOOP_NO_ARGS_ENABLED())
            SDT_NOOP_NO_ARGS();
    }

disassembles to:

    cmpw   $0x0,0x2ec4(%rip)
    jne
    retq
    nopl   0x0(%rax,%rax,1)
    nop
    retq

so the semaphore test is actually quite harmful and wasteful in this case. That's not surprising since this SDT is a simple marker point. But what if we supply arguments to it?

It turns out that the disassembly is the same if args are passed, whether locals or globals, including globals assigned based on program input that can't be determined at compile time. Still just a nop.

If I pass a function call as an argument expression to a probe, e.g.

    __attribute__((noinline))
    static int compute_probe_argument(void)
    {
        return 100;
    }

    void with_computed_arg(void)
    {
        SDT_NOOP_WITH_COMPUTED_ARG(compute_probe_argument());
    }

then the disassembly with SDTs is:

    callq
    nop
    retq

so the function call isn't elided even if it's unused. That's somewhat expected. The same will be true if the arguments to a probe require pointer chasing or non-trivial marshalling.
If a semaphore guard is added this becomes:

    cmpw   $0x0,0x2e2e(%rip)
    jne
    retq
    nopl   0x0(%rax,%rax,1)
    callq
    nop
    retq

so now the call to compute_probe_argument() is skipped unless the probe is enabled, but the function is longer and requires a test and jump.

If I dummy up a function that does some pointer chasing, without semaphores I get:

    mov    (%rdi),%rax
    mov    (%rax),%rax
    mov    (%rax),%rax
    nop
    retq

so the arguments are marshalled then ignored.

With semaphores I get:

    cmpw   $0x0,0x2d90(%rip)
    jne
    retq
    nopl   0x0(%rax,%rax,1)
    mov    (%rdi),%rax
    mov    (%rax),%rax
    mov    (%rax),%rax
    nop
    retq

so again the probe's argument marshalling is inline in the function body, but at the end, and skipped over.

Findings:

* A probe without arguments or with simple arguments is just a 'nop' instruction.
* Probes that require function calls, pointer chasing, other expression evaluation etc may impose a fixed cost to collect up arguments even if the probe is disabled.
* SDT semaphores can avoid that cost but add a branch, so should probably be avoided unless preparing probe arguments is likely to be expensive.

Hideous but effective demo code attached.
provider sdt_noop {
    probe no_args();
    probe with_args(int arg1, int arg2, int arg3);
    probe with_global_arg(int arg1);
    probe with_volatile_arg(int arg1);
    probe with_many_args(int arg1, int arg2, int arg3, int64_t arg4, int64_t arg5, int64_t arg6, int64_t arg7, int64_t arg8);
    probe with_computed_arg(int arg1);
    probe with_pointer_chasing(int arg1);
};

Makefile Description: Binary data

#include
#ifdef USE_SDT
#include "sdt_noop_probes_enabled.h"
#else
#include "sdt_noop_probes_disabled.h"
#endif

void no_args(void);
int with_args(void);
int with_global_arg(void);
int with_volatile_arg(void);
void with_many_args(void);
void with_computed_arg(void);
void with_pointer_chasing(int arg);

__attribute__((noinline))
void no_args(void)
{
#ifdef USE_SDT_SEMAPHORES
    if (SDT_NOOP_NO_ARGS_ENABLED())
#endif
        SDT_NOOP_NO_ARGS();
}

__attribute__((noinline))
int with_args(void)
{
    int arg1 = 0;
    int arg2 = 1;
    int arg3 = 2;
#ifdef USE_SDT_SEMAPHORES
    if (SDT_NOOP_WITH_ARGS_ENABLED())
#endif
        SDT_NOOP_WITH_ARGS(arg1, arg2, arg3);
    return arg1 + arg2 + arg3;
}

int some_global;

__attribute__((noinline))
int with_global_arg(void)
{
#ifdef USE_SDT_SEMAPHORES
    if (SDT_NOOP_WITH_GLOBAL_ARG_ENABLED())
#endif
        SDT_NOOP_WITH_GLOBAL_ARG(some_global);
    return some_global;
}

__attribute__((noinline))
int with_volatile_arg(void)
{
    volatile int arg1;
    arg1 = 42;
#ifdef USE_SDT_SEMAPHORES
    i
Re: [PATCH] Identify LWLocks in tracepoints
On Tue, 13 Apr 2021 at 11:06, Andres Freund wrote:

> > Each backend can have different tranche IDs (right?)
>
> No, they have to be the same in each. Note how the tranche ID is part of
> struct LWLock. Which is why LWLockNewTrancheId() has to acquire a lock
> etc.

Ah. I misunderstood that at some point.

That makes it potentially more sensible to skip reporting tranche names. Not great, because it's much less convenient to work with trace data full of internal ordinals that must be re-mapped in post-processing. But I'm generally OK with deferring runtime costs to tooling rather than the db itself so long as doing so is moderately practical.

In this case, I think we could likely get away with removing the tranche names from the tracepoints if we instead emit a trace event on each dynamic tranche registration that reports the tranche id -> name mapping. It still sucks for tools, since they have to scrape up the static tranche registrations from somewhere else, but ... it'd be tolerable.

> > The kernel is packed with extremely useful trace events, and for very
> > good reasons. Some on very hot paths.
>
> IIRC those aren't really comparable - the kernel actually does modify
> the executable code to replace the tracepoints with nops.

Same with userspace static trace markers (USDTs). A followup mail will contain a testcase and samples to demonstrate this.

> Although I still don't really buy that static tracepoints are the best
> way to measure this kind of thing, given the delay introducing them and
> the cost of having them around. I think I pointed out
> https://postgr.es/m/20200813004233.hdsdfvufqrbdwzgr%40alap3.anarazel.de
> before.

Yeah. Semaphores are something hot enough that I'd hesitate to touch them.

> > LWLock lock-ordering deadlocks
>
> This seems unrelated to tracepoints to me.
If I can observe which locks are acquired in which order by each proc, I can then detect excessive waits and report the stack of held locks of both procs and their order of acquisition. Since the LWLock shmem state doesn't, AFAICS, track any information about the lock holder(s), I don't see a way to do this in-process. It's not vital, it's just one of the use cases I have in mind. I suspect that any case where such deadlocks are possible represents a misuse of LWLocks anyway.

> > and there's no way to know what a given non-built-in tranche ID means
> > for any given backend without accessing backend-specific in-memory
> > state. Including for non-user-accessible backends like bgworkers and
> > auxprocs, where it's not possible to just query the state from a view
> > directly.
>
> The only per-backend part is that some backends might not know the
> tranche name for dynamically registered tranches where the
> LWLockRegisterTranche() hasn't been executed in a backend. Which should
> pretty much never be an aux process or such. And even for bgworkers it
> seems like a pretty rare thing, because those need to be started by
> something...
>
> It might be worth proposing a shared hashtable with tranche names and
> just reserving enough space for ~a hundred entries...

Yeah, that'd probably work and be cheap enough not to really matter. Might even save us a chunk of memory by not turning CoW pages into private mappings for each backend during registration.

> > And you can always build without `--enable-dtrace` and ... just not care.
>
> Practically speaking, distributions enable it, which then incurs the
> cost for everyone.

Yep. That's part of why I was so surprised to notice the GetLWTrancheName() function call in LWLock tracepoints. Nearly anywhere else it wouldn't matter at all, but LWLocks are hot enough that it just might matter for the no-wait fastpath.
Re: [PATCH] Identify LWLocks in tracepoints
On Tue, 13 Apr 2021 at 02:23, Andres Freund wrote:

[I've changed the order of the quoted sections a little to prioritize the key stuff]

> On 2021-04-12 14:31:32 +0800, Craig Ringer wrote:
> > It's annoying that we have to pay the cost of computing the tranche name
> > though. It never used to matter, but now that T_NAME() expands to
> > GetLWTrancheName() calls as of 29c3e2dd5a6 it's going to cost a little more
> > on such a hot path. I might see if I can do a little comparison and see how
> > much. I could add TRACE_POSTGRESQL_<>_ENABLED() guards since we
> > do in fact build with SDT semaphore support. That adds a branch for each
> > tracepoint, but they're already marshalling arguments and making a function
> > call that does lots more than a single branch, so that seems pretty
> > sensible.
>
> I am against adding any overhead for this feature. I honestly think the
> probes we have right now in postgres do not provide a commensurate
> benefit.

I agree that the probes we have now are nearly useless, if not entirely useless. The transaction management ones are misplaced and utterly worthless. The LWLock ones don't carry enough info to be much use and are incomplete. I doubt anybody uses any of them at all, or would even notice their absence.

In terms of overhead, what is in place right now is not free. It used to be very cheap, but since 29c3e2dd5a6 it's not. I'd like to reduce the current cost and improve functionality at the same time, so it's actually useful.

> > * There is no easy way to look up the tranche name by ID from outside the
> > backend
>
> But it's near trivial to add that.

Really? We can expose a pg_catalog.lwlock_tranches view that lets you observe the current mappings for any given user backend, I guess. But if I'm looking for performance issues caused by excessive LWLock contention or waits, LWLocks held too long, LWLock lock-ordering deadlocks, or the like, it's something I want to capture across the whole postgres instance.
Each backend can have different tranche IDs (right?) and there's no way to know what a given non-built-in tranche ID means for any given backend without accessing backend-specific in-memory state. Including for non-user-accessible backends like bgworkers and auxprocs, where it's not possible to just query the state from a view directly.

So we'd be looking at some kind of shm based monstrosity. That doesn't sound appealing. Worse, there's no way to solve races with it - is a given tranche ID already allocated when you see it? If not, can you look it up from the backend before the backend exits/dies? For that matter, how do you do that, since the connection to the backend is likely under the control of an application, not your monitoring and diagnostic tooling. Some trace tools can poke backend memory directly, but that generally requires debuginfo, is fragile and Pg-version-specific, slow, and a real pain to use. If we don't attach the LWLock names to the tracepoints in some way, they're pretty worthless.

Again, I don't plan to add new costs here. I'm actually proposing to reduce an existing cost. And you can always build without `--enable-dtrace` and ... just not care.

Anyway - I'll do some `perf` runs shortly to quantify this:

* With/without tracepoints at all
* With/without names in tracepoints
* With/without tracepoint refcounting (_ENABLED() semaphores)

so as to rely less on handwaving.

> > (Those can also be used with systemtap guru mode scripts to do things
> > like turn a particular elog(DEBUG) into a PANIC at runtime for
> > diagnostic purposes).
>
> Yikes.

Well, it's not like it can happen by accident. You have to deliberately write a script that twiddles process memory, using a tool that requires special privileges.

I recently had to prepare a custom build for a customer that converted an elog(DEBUG) into an elog(PANIC) in order to capture a core with much better diagnostic info for a complex, hard-to-reproduce and intermittent memory management issue.
It would've been rather nice to be able to do so with a trace marker instead of a custom build.

> > There are a TON of probes I want to add, and I have a tree full of them
> > waiting to submit progressively. Yes, ability to probe all GUCs is in
> > there. So is detail on walsender, reorder buffer, and snapshot builder
> > activity. Heavyweight lock SDTs. A probe that identifies the backend type
> > at startup. SDT probe events emitted for every wait-event. Probes in elog.c
> > to let probes observe error unwinding, capture error messages,
> > etc. [...] A probe that fires whenever debug_query_string
> > changes. Lots. But I can't submit them all at once, especially witho
Re: [PATCH] Identify LWLocks in tracepoints
On Mon, 22 Mar 2021 at 16:38, Peter Eisentraut <peter.eisentr...@enterprisedb.com> wrote:

> First, a problem: 0002 doesn't build on macOS, because uint64 has been
> used in the probe definitions. That needs to be handled like the other
> nonnative types in that file.

Will fix.

> All the probe changes and additions should be accompanied by
> documentation changes.

Agreed, will fix.

> The probes used to have an argument to identify the lock, which was
> removed by 3761fe3c20bb040b15f0e8da58d824631da00caa.

Huh. That's exactly the functionality I was looking for. Damn. I understand why Robert removed it, but its removal makes it much harder to identify an LWLock, since it might fall in a DSM segment that could be mapped at different base addresses in different backends. Robert's patch didn't replace the offset within tranche with anything else to identify the lock. An LWLock* is imperfect due to ASLR and DSM, but it's better than nothing. In theory we could even remap them in trace tools if we had tracepoints on DSM attach and detach that showed their size and base address too.

CC'ing Andres, as he expressed interest in being able to globally identify LWLocks too.

> The 0001 patch is
> essentially trying to reinstate that, which seems sensible. Perhaps we
> should also use the argument order that used to be there. It used to be
>
> probe lwlock__acquire(const char *, int, LWLockMode);
>
> and now it would be
>
> probe lwlock__acquire(const char *, LWLockMode, LWLock*, int);
>
> Also, do we need both the tranche name and the tranche id?

Reasons to have the name:

* There is no easy way to look up the tranche name by ID from outside the backend
* A tranche ID by itself is pretty much meaningless, especially for dynamic tranches
* Any existing scripts will rely on the tranche name

So the tranche name is really required to generate useful output for any dynamic tranches, or simple and readable output from things like perf.
Reasons to have the tranche ID:

* The tranche name is not guaranteed to have the same address for a given value across backends in the presence of ASLR, even for built-in tranches. So tools need to read tranche names as user-space strings, which is much more expensive than consuming an int argument from the trace args. Storing and reporting maps of events by tranche name (string) in tools is also more expensive than having a tranche ID.
* When the trace tool or script wants to filter for only one particular tranche, particularly when it's a built-in tranche where the tranche ID is known, having the ID is much more useful and efficient.
* If we can avoid computing the tranche name, emitting just the tranche ID would be much faster.

It's annoying that we have to pay the cost of computing the tranche name though. It never used to matter, but now that T_NAME() expands to GetLWTrancheName() calls as of 29c3e2dd5a6 it's going to cost a little more on such a hot path. I might see if I can do a little comparison and see how much. I could add TRACE_POSTGRESQL_<>_ENABLED() guards, since we do in fact build with SDT semaphore support. That adds a branch for each tracepoint, but they're already marshalling arguments and making a function call that does lots more than a single branch, so that seems pretty sensible.

The main downside of using _ENABLED() USDT semaphore guards is that not all tools are guaranteed to understand or support them. So an older perf, for example, might simply fail to fire events on guarded probes. That seems OK to me; the onus should be on the probe tool to pay any costs, not on PostgreSQL.

Despite that, I don't want to mark the _ENABLED() guards unlikely(), since that'd increase the observer effect where probing LWLocks changes their timing and behaviour. Branch prediction should do a very good job in such cases without being forced. I wonder a little about the possible cache costs of the _ENABLED() macros though.
Their data is in a separate ELF segment and separate .o, with no locality to the traced code. It might be worth checking that before proceeding; I guess it's even possible that the GetLWTrancheName() calls could be cheaper. Will see if I can run some checks and report back.

BTW, if you want some of the details on how userspace SDTs work, https://leezhenghui.github.io/linux/2019/03/05/exploring-usdt-on-linux.html is interesting and useful. It helps explain uprobes, ftrace, bcc, etc.

> Or maybe we
> don't need the name, or can record it differently, which might also
> address your other concern that it's too expensive to compute. In any
> case, I think an argument order like
>
> probe lwlock__acquire(const char *, int, LWLock*, LWLockMode);
>
> would make more sense.

OK.

> In 0004, you add a probe to record the application_name setting? Would
> there be any value in making that a generic probe that can record any
> GUC change?

Yes, there would, but I didn't want to go and do that in the same patch, and a named probe on application_name is useful sep
Re: [PATCH] Identify LWLocks in tracepoints
On Mon, 22 Mar 2021 at 17:00, Peter Eisentraut <peter.eisentr...@enterprisedb.com> wrote:

> On 20.03.21 01:29, Craig Ringer wrote:
> > > There is already support for that. See the documentation at the end of
> > > this page:
> > > https://www.postgresql.org/docs/devel/dynamic-trace.html#DEFINING-TRACE-POINTS
> >
> > Pretty sure it won't work right now.
> >
> > To use systemtap semaphores (the _ENABLED macros) you need to run dtrace
> > -g to generate a probes.o then link that into postgres.
> >
> > I don't think we do that. I'll double check soon.
>
> We do that. (It's -G.)

Huh. I could've sworn we didn't. My mistake, it's there in src/backend/Makefile. In that case I'll amend the patch to use semaphore guards.

(On a side note, systemtap's semaphore support is actually a massive pain. The way it's implemented means that a single compilation unit may not both use probes.d-style markers produced by the dtrace script and use regular DTRACE_PROBE(providername, probename) preprocessor macros. If it attempts to do so, the DTRACE_PROBE macros will emit inline asm that tries to reference probename_semaphore symbols that will not exist, resulting in linker errors or runtime link errors. But that's really a systemtap problem. Core PostgreSQL doesn't use any explicit DTRACE_PROBE(...), STAP_PROBE(...) etc.)
Re: [PATCH] More docs on what to do and not do in extension code
On Fri, 26 Mar 2021 at 06:15, Bruce Momjian wrote:

> On Thu, Mar 25, 2021 at 08:49:44AM -0400, David Steele wrote:
> > On 1/22/21 1:36 AM, Craig Ringer wrote:
> > > Would you mind attaching a revised version of the patch with your edits?
> > > Otherwise I'll go and merge them in once you've had your say on my
> > > comments inline below.
> >
> > Bharath, do the revisions in [1] look OK to you?
> >
> > > Bruce, Robert, can I have an opinion from you on how best to locate and
> > > structure these docs, or whether you think they're suitable for the main
> > > docs at all? See patch upthread.
> >
> > Bruce, Robert, any thoughts here?
>
> I know I sent an email earlier this month saying we shouldn't
> over-document the backend hooks because the code could drift away from
> the README content:
>
> https://www.postgresql.org/message-id/20210309172049.GD26575%40momjian.us
>
> > Agreed. If you document the hooks too much, it allows them to drift
> > away from matching the code, which makes the hook documentation actually
> > worse than having no hook documentation at all.
>
> However, for this doc patch, the content seems to be more strategic, so
> less likely to change, and hard to figure out from the code directly.
> Therefore, I think this would be a useful addition to the docs.

Thanks for the kind words. It's good to hear that it may be useful. Let me know if anything further is needed.
Re: [PATCH] Identify LWLocks in tracepoints
On Sat, 20 Mar 2021, 04:21 Peter Eisentraut <peter.eisentr...@enterprisedb.com> wrote:

> On 18.03.21 07:34, Craig Ringer wrote:
> > The main reason I didn't want to add more tracepoints than were strictly
> > necessary is that Pg doesn't enable the systemtap semaphores feature, so
> > right now we do a T_NAME(lock) evaluation each time we pass a tracepoint
> > if --enable-dtrace is compiled in, whether or not anything is tracing.
> > This was fine on pg11 where it was just:
> >
> > #define T_NAME(lock) \
> > (LWLockTrancheArray[(lock)->tranche])
> >
> > but since pg13 it instead expands to
> >
> > GetLWTrancheName((lock)->tranche)
> >
> > where GetLWTrancheName isn't especially trivial. We'll run that function
> > every single time we pass any of these tracepoints and then discard the
> > result, which is ... not ideal. That applies so long as Pg is compiled
> > with --enable-dtrace. I've been meaning to look at enabling the
> > systemtap semaphores feature in our build so these can be wrapped in
> > unlikely(TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED()) guards, but I wanted
> > to wrap this patch set up first as there are some complexities around
> > enabling the semaphores feature.
>
> There is already support for that. See the documentation at the end of
> this page:
> https://www.postgresql.org/docs/devel/dynamic-trace.html#DEFINING-TRACE-POINTS

Pretty sure it won't work right now.

To use systemtap semaphores (the _ENABLED macros) you need to run dtrace -g to generate a probes.o then link that into postgres.

I don't think we do that. I'll double check soon.
Re: [PATCH] Identify LWLocks in tracepoints
On Thu, 11 Mar 2021 at 15:57, Peter Eisentraut <peter.eisentr...@enterprisedb.com> wrote:

> On 10.03.21 06:38, Craig Ringer wrote:
> > On Wed, 3 Mar 2021 at 20:50, David Steele <da...@pgmasters.net> wrote:
> > > On 1/22/21 6:02 AM, Peter Eisentraut wrote:
> > > This patch set no longer applies:
> > > http://cfbot.cputube.org/patch_32_2927.log.
> > > Can we get a rebase? Also marked Waiting on Author.
> >
> > Rebased as requested.
>
> In patch 0001, why was the TRACE_POSTGRESQL_LWLOCK_RELEASE() call moved?
> Is there some correctness issue? If so, we should explain that (at
> least in the commit message, or as a separate patch).

If you want I can split it out, or drop that change. I thought it was sufficiently inconsequential, but you're right to check.

The current tracepoint TRACE_POSTGRESQL_LWLOCK_RELEASE really means "releaseD". It's appropriate to emit this as soon as the lock could be acquired by anything else. By deferring it until we'd processed the waitlist and woken other backends, the window during which the lock was reported as "held" was longer than it truly was, and it was easy to see one backend acquire the lock while another still appeared to hold it.

It'd possibly make more sense to have a separate TRACE_POSTGRESQL_LWLOCK_RELEASING just before the `pg_atomic_sub_fetch_u32` call. But I didn't want to spam the tracepoints too hard, and there's always going to be some degree of overlap, because tracing tools cannot intercept and act during the atomic swap, so they'll always see a slightly premature or slightly delayed release. This window should be as short as possible though, hence moving the tracepoint.
Side note: The main reason I didn't want to add more tracepoints than were strictly necessary is that Pg doesn't enable the systemtap semaphores feature, so right now we do a T_NAME(lock) evaluation each time we pass a tracepoint if --enable-dtrace is compiled in, whether or not anything is tracing. This was fine on pg11 where it was just:

#define T_NAME(lock) \
	(LWLockTrancheArray[(lock)->tranche])

but since pg13 it instead expands to

GetLWTrancheName((lock)->tranche)

where GetLWTrancheName isn't especially trivial. We'll run that function every single time we pass any of these tracepoints and then discard the result, which is ... not ideal. That applies so long as Pg is compiled with --enable-dtrace. I've been meaning to look at enabling the systemtap semaphores feature in our build so these can be wrapped in unlikely(TRACE_POSTGRESQL_LWLOCK_RELEASE_ENABLED()) guards, but I wanted to wrap this patch set up first as there are some complexities around enabling the semaphores feature.
Re: Detecting File Damage & Inconsistencies
On Mon, 15 Mar 2021 at 21:01, David Steele wrote:

> On 11/18/20 5:23 AM, Simon Riggs wrote:
> > On Wed, 18 Nov 2020 at 06:42, Craig Ringer wrote:
> > > On Fri, Nov 13, 2020 at 7:24 PM Simon Riggs wrote:
> > > >
> > > > What I'm proposing is an option to add 16 bytes onto each COMMIT
> > > > record
> > >
> > > Would it make sense to write this at the time we write a topxid
> > > assignment to WAL instead?
> > >
> > > Otherwise it won't be accessible to streaming-mode logical decoding.
> >
> > Do you mean extend the xl_xact_assignment record? My understanding is
> > that is not sent in all cases, so not sure what you mean by "instead".
>
> Craig, can you clarify?

Right. Or write a separate WAL record when the feature is enabled. But it's probably sufficient to write it as an optional chunk on xl_xact_assignment records.

We often defer writing them so we can optimise away xacts that never actually wrote anything, but IIRC we still write one before we write any WAL that references the xid. That'd be fine, since we don't need the info any sooner than that during decoding.

I'd have to double check that we write it in all cases and won't get to that too soon, but I'm pretty sure we do...
Re: [PATCH] Identify LWLocks in tracepoints
On Wed, 3 Mar 2021 at 20:50, David Steele wrote:

> On 1/22/21 6:02 AM, Peter Eisentraut wrote:
> > This patch set no longer applies:
> > http://cfbot.cputube.org/patch_32_2927.log.
> > Can we get a rebase? Also marked Waiting on Author.

Rebased as requested.

I'm still interested in whether Andres will be able to do anything about identifying LWLocks in a cross-backend manner. But this work doesn't really depend on that; it'd benefit from it, but would be easily adapted to it later if needed.

From 36c7ddcbca2dbbcb2967f01cb92aa1f61620c838 Mon Sep 17 00:00:00 2001
From: Craig Ringer
Date: Thu, 19 Nov 2020 17:38:45 +0800
Subject: [PATCH 1/4] Pass the target LWLock* and tranche ID to LWLock
 tracepoints

Previously the TRACE_POSTGRESQL_LWLOCK_ tracepoints only received a
pointer to the LWLock tranche name. This made it impossible to identify
individual locks.

Passing the lock pointer itself isn't perfect. If the lock is allocated
inside a DSM segment then it might be mapped at a different address in
different backends. It's safe to compare lock pointers between backends
(assuming !EXEC_BACKEND) if they're in the individual lock tranches or
an extension-requested named tranche, but not necessarily for tranches
in BuiltinTrancheIds or tranches >= LWTRANCHE_FIRST_USER_DEFINED that
were directly assigned with LWLockNewTrancheId().

Still, it's better than nothing; the pointer is stable within a
backend, and usually between backends.
---
 src/backend/storage/lmgr/lwlock.c | 35 +++
 src/backend/utils/probes.d        | 18 +---
 2 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 8cb6a6f042..5c8744d316 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1321,7 +1321,8 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 #endif
 		LWLockReportWaitStart(lock);
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode, lock,
+										   lock->tranche);
 		for (;;)
 		{
@@ -1343,7 +1344,8 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 		}
 #endif
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode, lock,
+										  lock->tranche);
 		LWLockReportWaitEnd();
 		LOG_LWDEBUG("LWLockAcquire", lock, "awakened");
@@ -1352,7 +1354,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 		result = false;
 	}
-	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode);
+	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode, lock, lock->tranche);
 	/* Add lock to list of locks held by this backend */
 	held_lwlocks[num_held_lwlocks].lock = lock;
@@ -1403,14 +1405,16 @@ LWLockConditionalAcquire(LWLock *lock, LWLockMode mode)
 		RESUME_INTERRUPTS();
 		LOG_LWDEBUG("LWLockConditionalAcquire", lock, "failed");
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), mode, lock,
+												 lock->tranche);
 	}
 	else
 	{
 		/* Add lock to list of locks held by this backend */
 		held_lwlocks[num_held_lwlocks].lock = lock;
 		held_lwlocks[num_held_lwlocks++].mode = mode;
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), mode, lock,
+											lock->tranche);
 	}
 	return !mustwait;
 }
@@ -1482,7 +1486,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 #endif
 		LWLockReportWaitStart(lock);
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode, lock,
+										   lock->tranche);
 		for (;;)
 		{
@@ -1500,7 +1505,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 			Assert(nwaiters < MAX_BACKENDS);
 		}
 #endif
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode, lock,
+										  lock->tranche);
 		LWLockReportWaitEnd();
 		LOG_LWDEBUG("LWLockAcquireOrWait", lock, "awakened");
@@ -1530,7 +1536,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 		/* Failed to get lock, so release interrupt holdoff */
 		RESUME_INTERRUPTS();
 		LOG_LWDEBUG("LWLockAcquireOrWait", lock, "failed");
-		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), mode, lock,
+													 lock->tranche);
 	}
 	else
 	{
@@ -1538,7 +1545,8 @@
 		/* Add lock to list of locks held by this backend */
 		held_lwlocks[num_held_lwlocks].lock = lock;
 		held_lwlocks[num_held_lwlocks++].mode = mode;
-		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), mode, lock,
+												lock->tranch
Re: libpq compression
On Thu, 11 Feb 2021, 21:09 Daniil Zakhlystov wrote:

> 3. Chunked compression allows to compress only well compressible messages
> and save the CPU cycles by not compressing the others
> 4. Chunked compression introduces some traffic overhead compared to the
> permanent (1.2810G vs 1.2761G TX data on pg_restore of IMDB database dump,
> according to results in my previous message)
> 5. From the protocol point of view, chunked compression seems a little bit
> more flexible:
> - we can inject some uncompressed messages at any time without the need
> to decompress/compress the compressed data
> - we can potentially switch the compression algorithm at any time (but I
> think that this might be over-engineering)

Chunked compression also potentially makes it easier to handle non-blocking sockets, because you aren't worrying about yet another layer of buffering within the compression stream. This is a real pain with SSL, for example.

It simplifies protocol analysis.

It permits compression to be decided on the fly, heuristically, based on message size and potential compressibility.

It could relatively easily be extended to compress a group of pending small messages, e.g. by PQflush. That'd help mitigate the downsides with small messages.

So while stream compression offers better compression ratios, I'm inclined to suspect we'll want message-level compression.
Re: PATCH: Batch/pipelining support for libpq
On Wed, 17 Feb 2021, 07:13 Alvaro Herrera wrote:

> Here's a new version, where I've renamed everything to "pipeline".

Wow. Thanks.

> I think the docs could use some additional tweaks now in order to make a
> coherent story on pipeline mode, how it can be used in a batched
> fashion, etc.

I'll do that soon and send an update.
libpq PQresultErrorMessage vs PQerrorMessage API issue
Hi all

This morning, for the umpteenth time, I saw:

some error message: [blank here]

output from a libpq program. That's because passing a NULL PGresult to PQresultErrorMessage() returns "". But a NULL PGresult is a normal result from PQexec when it fails to submit a query due to an invalid connection, when PQgetResult can't get a result from an invalid connection, etc.

E.g. this pattern:

res = PQexec(conn, "something");
if (PQresultStatus(res) != PGRES_TUPLES_OK)
{
    report_an_error("some error message: %s", PQresultErrorMessage(res));
}

... where "res" is NULL because the connection was invalid, so PQresultErrorMessage(res) emits the empty string.

As a result, using PQerrorMessage(conn) is actually better in most cases, despite the static error buffer issues. It'll do the right thing when the connection itself is bad. Alternately you land up with the pattern

res == NULL ? PQerrorMessage(conn) : PQresultErrorMessage(res)

I'm not quite sure what to do about this. Ideally PQresultErrorMessage() would take the PGconn* too, but it doesn't, and it's not too friendly to add an extra argument. Plus, arguably, they mean different things.

Maybe it's as simple as changing the docs to say that you should prefer PQerrorMessage() if you aren't using multiple PGresults at a time, and mentioning the need to copy the error string.

But I'd kind of like to instead return a new non-NULL PGresult PGRES_RESULT_ERROR whenever we'd currently return a NULL PGresult due to a failure. Embed the error message into the PGresult, so PQresultErrorMessage() can fetch it. Because a PGresult needs to be owned by a PGconn and a NULL PGconn can't own anything, we'd instead return a pointer to a static const global PGresult with value PGRES_NO_CONNECTION if any function is passed a NULL PGconn*. That way it doesn't matter if it gets PQclear()ed or not. And PQclear() could test for (res == PGresult_no_connection) and not try to free it if found.
The main issue I see there is that existing code may expect a NULL PGresult and may test for (res == NULL) as an indicator of a query-sending failure from PQexec etc. So I suspect we'd need a libpq-global option to enable this behaviour for apps that are aware of it - we wouldn't want to add new function signature variants after all. Similar changes would make sense for returning NULL when there are no result sets remaining after a PQsendQuery, and for returning NULL after row-by-row fetch mode runs out of rows.
Re: PATCH: Batch/pipelining support for libpq
On Tue, 16 Feb 2021 at 09:19, Craig Ringer wrote: > > So long as there is a way to "send A", "send B", "send C", "read results > from A", "send D", and there's a way for the application to associate some > kind of state (an application specific id or index, a pointer to an > application query-queue struct, whatever) so it can match queries to > results ... then I'm happy. > FWIW I'm also thinking of revising the docs to mostly use the term "pipeline" instead of "batch". Use "pipelining and batching" in the chapter topic, and mention "batch" in the index, and add a para that explains how to run batches on top of pipelining, but otherwise use the term "pipeline" not "batch". That's because pipelining isn't actually batching, and using it as a naïve batch interface will get you in trouble. If you PQsendQuery(...) a long list of queries without consuming results, you'll block unless you're in libpq's nonblocking-send mode. You'll then be deadlocked because the server can't send results until you consume some (tx buffer full) and can't consume queries until it can send some results, but you can't consume results because you're blocked on a send that'll never end. An actual batch interface where you can bind and send a long list of queries might be worthwhile, but should be addressed separately, and it'd be confusing if this pipelining interface claimed the term "batch". I'm not convinced enough application developers actually code directly against libpq for it to be worth creating a pretty batch interface where you can submit (say) an array of struct PQbatchEntry { const char * query, int nparams, ... } to a PQsendBatch(...) and let libpq handle the socket I/O for you. But if someone wants to add one later it'll be easier if we don't use the term "batch" for the pipelining feature. I'll submit a reworded patch in a bit.
Re: PATCH: Batch/pipelining support for libpq
On Thu, 11 Feb 2021 at 07:51, Alvaro Herrera wrote: > On 2021-Jan-21, Alvaro Herrera wrote: > > > As you can see in an XXX comment in the libpq test program, the current > > implementation has the behavior that PQgetResult() returns NULL after a > > batch is finished and has reported PGRES_BATCH_END. I don't know if > > there's a hard reason to do that, but I'd like to supress it because it > > seems weird and out of place. > > Hello Craig, a question for you since this API is your devising. I've > been looking at changing the way this works to prevent those NULL > returns from PQgetResult. That is, instead of having what seems like a > "NULL separator" of query results, you'd just get the PGRES_BATCH_END to > signify a batch end (not a NULL result after the BATCH_END); and the > normal PGRES_COMMAND_OK or PGRES_TUPLES_OK etc when the result of a > command has been sent. It's not working yet so I'm not sending an > updated patch, but I wanted to know if you had a rationale for including > this NULL return "separator" or was it just a convenience because of how > the code grew together. > The existing API for libpq actually specifies[1] that you should call PQgetResult() until it returns NULL: > After successfully calling PQsendQuery, call PQgetResult one or more times to obtain the results. PQsendQuery cannot be called again (on the same connection) until PQgetResult has returned a null pointer, indicating that the command is done. Similarly, in single-row mode, the existing API specifies that you should call PQgetResult() until it returns NULL. Also, IIRC the protocol already permits multiple result sets to be returned, and the caller cannot safely assume that a single PQsendQuery() will produce only a single result set. (I really should write a test extension that exercises that and how libpq reacts to it.) That's why I went for the NULL return. I think. It's been a while now so it's a bit fuzzy. 
I would definitely like an API that does not rely on testing for a NULL return. Especially since NULL return can be ambiguous in the context of row-at-a-time mode. New explicit enumerations for PGresult would make a lot more sense. So +1 from me for the general idea. I need to look at the patch as it has evolved soon too. Remember that the original patch I submitted for this was a 1-day weekend hack and proof of concept to show that libpq could be modified to support query pipelining (and thus batching too), so I could illustrate the performance benefits that could be attained by doing so. I'd been aware of the benefits and the protocol's ability to support it for some time thanks to working on PgJDBC, but couldn't get anyone interested without some C code to demonstrate it, so I wrote some. I am not going to argue that the API I added for it is ideal in any way, and happy to see improvements. The only change I would very strongly object to would be anything that turned this into a *batch* mode without query-pipelining support. If you have to queue all the queries up in advance then send them as a batch and wait for all the results, you miss out on a lot of the potential round-trip-time optimisations and you add initial latency. So long as there is a way to "send A", "send B", "send C", "read results from A", "send D", and there's a way for the application to associate some kind of state (an application specific id or index, a pointer to an application query-queue struct, whatever) so it can match queries to results ... then I'm happy. [1] https://www.postgresql.org/docs/current/libpq-async.html#LIBPQ-PQSENDQUERY
Re: How to expose session vs txn lock info in pg_locks view?
On Sun, 24 Jan 2021 at 09:12, Andres Freund wrote: > Hi, > > On 2021-01-19 14:16:07 +0800, Craig Ringer wrote: > > AFAICS it'd be necessary to expand PROCLOG to expose this in shmem. > > Probably by adding a small bitfield where bit 0 is set if there's a txn > > level lock and bit 1 is set if there's a session level lock. But I'm not > > convinced that expanding PROCLOCK is justifiable for this. > sizeof(PROCLOCK) > > is 64 on a typical x64 machine. Adding anything to it increases it to 72 > > bytes. > > Indeed - I really don't want to increase the size, it's already a > problem. > > > > It's frustrating to be unable to tell the difference between > session-level > > and txn-level locks in diagnostic output. > > It'd be useful, I agree. > > > > And the deadlock detector has no way to tell the difference when > > selecting a victim for a deadlock abort - it'd probably make sense to > > prefer to send a deadlock abort for txn-only lockers. > > I'm doubtful this is worth going for. > > > > But I'm not sure I see a sensible way to add the info - PROCLOCK is > > already free of any padding, and I wouldn't want to use hacks like > > pointer-tagging. > > I think there's an easy way to squeeze out space: make groupLeader be an > integer index into allProcs instead. That requires only 4 bytes... > > Alternatively, I think it'd be reasonably easy to add the scope as a bit > in LOCKMASK - there's plenty space. > I was wondering about that, but concerned that there would be impacts I did not understand. I'm happy to pursue that angle.
Re: Preventing hangups in bgworker start/stop during DB shutdown
On Fri, 25 Dec 2020 at 06:07, Tom Lane wrote: > I wrote: > > Bharath Rupireddy writes: > >> 4) IIUC, in the patch we mark slot->terminate = true only for > >> BGW_NEVER_RESTART kind bg workers, what happens if a bg worker has > >> bgw_restart_time seconds and don't we hit the hanging issue(that we > >> are trying to solve here) for those bg workers? > > > The hang problem is not with the worker itself, it's with anything > > that might be waiting around for the worker to finish. It doesn't > > seem to me to make a whole lot of sense to wait for a restartable > > worker; what would that mean? > > Upon further looking around, I noted that autoprewarm's > autoprewarm_start_worker() function does that, so we can't really > dismiss it. > > However, what we can do instead is to change the condition to be > "cancel pending bgworker requests if there is a waiting process". > Almost all of the time, that means it's a dynamic bgworker with > BGW_NEVER_RESTART, so there's no difference. In the exceptional > cases like autoprewarm_start_worker, this would result in removing > a bgworker registration record for a restartable worker ... but > since we're shutting down, that record would have no effect before > the postmaster exits, anyway. I think we can live with that, at > least till such time as somebody redesigns this in a cleaner way. > > I pushed a fix along those lines. > > Thanks for the change. Cleanups like this in the BGW API definitely make life easier.
Re: [PATCH] More docs on what to do and not do in extension code
Hi Thanks so much for reading over this! Would you mind attaching a revised version of the patch with your edits? Otherwise I'll go and merge them in once you've had your say on my comments inline below. Bruce, Robert, can I have an opinion from you on how best to locate and structure these docs, or whether you think they're suitable for the main docs at all? See patch upthread. On Tue, 19 Jan 2021 at 22:33, Bharath Rupireddy < bharath.rupireddyforpostg...@gmail.com> wrote: > > Here are some comments: > > [1] >background worker's main function, and must be unblocked by it; this is > to > allow the process to customize its signal handlers, if necessary. > - Signals can be unblocked in the new process by calling > - BackgroundWorkerUnblockSignals and blocked by > calling > - BackgroundWorkerBlockSignals. > + It is important that all background workers set up and unblock signal > + handling before they enter their main loops. Signal handling in > background > + workers is discussed separately in . > > I wasn't sure either way on that, see your [3] below. [2] > + interupt-aware APIs for the purpose. Do not > usleep(), > + system(), make blocking system calls, etc. > + > > Is it "Do not use usleep(), > system() or make blocking system calls etc." ? > Right. Good catch. [3] IMO, we can remove following from "bgworker-signals" if we retain > it where currently it is, as discussed in [1]. > +Signals can be unblocked in the new process by calling > +BackgroundWorkerUnblockSignals and blocked by > calling > +BackgroundWorkerBlockSignals. > If so, need to mention that they start blocked and link to the main text where that's mentioned. That's part of why I moved this chunk into the signal section. 
[4] Can we say > +The default signal handlers set up for background workers do > +default background worker signal handlers, it should call > > instead of > +The default signal handlers installed for background workers > do > +default background worker signal handling it should call > Huh? I don't understand this proposal. s/install/set up/g? [5] IMO, we can have something like below > +request, etc. Set up these handlers before unblocking signals as > +shown below: > > instead of > +request, etc. To install these handlers, before unblocking interrupts > +run: > Ditto [6] I think logs and errors either elog() or ereport can be used, so how > about > +Use elog() or ereport() > for > +logging output or raising errors instead of any direct stdio > calls. > > instead of > +Use elog() and > ereport() for > +logging output and raising errors instead of any direct stdio > calls. > OK. [7] Can we use child processes instead of subprocess ? If okay in > other places in the patch as well. > Fine with me. The point is really that they're non-postgres processes being spawned by a backend, and that doing so must be done carefully. [8] Why should file descriptor manager API be used to execute > subprocesses/child processes? > +To execute subprocesses and open files, use the routines provided > by > +the file descriptor manager like > BasicOpenFile > +and OpenPipeStream instead of a direct > Yeah, that wording is confusing, agreed. The point was that you shouldn't use system() or popen(), you should OpenPipeStream(). And similarly, you should avoid fopen() etc and instead use the Pg wrapper BasicOpenFile(). " PostgreSQL backends are required to limit the number of file descriptors they open. To open files, use postgres file descriptor manager routines like BasicOpenFile() instead of directly using open() or fopen(). To open pipes to or from external processes, use OpenPipeStream() instead of popen(). " ? > [9] "should always be"? even if it's a blocking extesion, does it > work? 
If our intention is to recommend the developers, maybe we should > avoid using the term "should" in the patch in other places as well. > The trouble is that it's a bit ... fuzzy. You can get away with blocking for short periods without responding to signals, but it's a "how long is a piece of string" issue. "should be" is fine. A hard "must" or "must not" would be misleading. But this isn't the place to go into all the details of how time sensitive (or not) interrupt handling of different kinds is in different places for different worker types. > [11] I think it is > +block if this happens. So cleanup of resources is not > entirely managed by PostgreSQL, it > + must be handled using appropriate callbacks provided by PostgreSQL > > instead of > +block if this happens. So all cleanup of resources not already > +managed by the PostgreSQL runtime must be handled using > appropriate > I don't agree with the proposed new wording here. Delete the "So all" from my original, or ... Cleanup of any resources that are not managed > by the PostgreSQL runtime must be handled using appr
Re: Printing backtrace of postgres processes
On Thu, 21 Jan 2021 at 09:56, Tom Lane wrote: > Craig Ringer writes: > > On Wed, 20 Jan 2021 at 05:23, Tom Lane wrote: > >> BTW, it also looks like the patch is doing nothing to prevent the > >> backtrace from being sent to the connected client. > > > I don't see a good reason to send a bt to a client. Even though these > > backtraces won't be analysing debuginfo and populating args, locals, etc, > > it should still just go to the server log. > > Yeah. That's easier than I was thinking, we just need to > s/LOG/LOG_SERVER_ONLY/. > > >> Maybe, given all of these things, we should forget using elog at > >> all and just emit the trace with fprintf(stderr). > > > That causes quite a lot of pain with MemoryContextStats() already > > True. Given the changes discussed in the last couple messages, I don't > see any really killer reasons why we can't ship the trace through elog. > We can always try that first, and back off to fprintf if we do find > reasons why it's too unstable. > Yep, works for me. Thanks for being open to considering this. I know lots of this stuff can seem like a pointless sidetrack because the utility of it is not obvious on dev systems or when you're doing your own hands-on expert support on systems you own and operate yourself. These sorts of things really only start to make sense when you're touching many different postgres systems "in the wild" - such as commercial support services, helping people on -general, -bugs or stackoverflow, etc. I really appreciate your help with it.
Re: Add docs stub for recovery.conf
On Wed, 20 Jan 2021 at 02:45, Stephen Frost wrote: > Greetings, > > * Stephen Frost (sfr...@snowman.net) wrote: > > * Craig Ringer (craig.rin...@enterprisedb.com) wrote: > > > On Thu, 14 Jan 2021 at 03:44, Stephen Frost > wrote: > > > > Alright, how does this look? The new entries are all under the > > > > 'obsolete' section to keep it out of the main line, but should work > to > > > > 'fix' the links that currently 404 and provide a bit of a 'softer' > > > > landing for the other cases that currently just forcibly redirect > using > > > > the website doc alias capability. > > > > > > Thanks for expanding the change to other high profile obsoleted or > renamed > > > features and tools. > > > > Thanks for taking the time to review it and comment on it! > > > > > One minor point. I'm not sure this is quite the best way to spell the > index > > > entries: > > > > > > + <indexterm> > > > + <primary>obsolete</primary> > > > + <secondary>pg_receivexlog</secondary> > > > + </indexterm> > > > > > > as it will produce an index term "obsolete" with a list of various > > > components under it. While that concentrates them nicely, it means > people > > > won't actually find them if they're using the index alphabetically. > > > > Ah, yeah, that's definitely a good point and one that I hadn't really > > spent much time thinking about. > > > > > I'd slightly prefer > > > > > > + <indexterm> > > > + <primary>pg_receivexlog</primary> > > > + <seealso>pg_receivewal</seealso> > > > + </indexterm> > > > > > > even though that bulks the index up a little, because then people are > a bit > > > more likely to find it. > > > > Yup, makes sense, updated patch attached which makes that change. > > > > > > I ended up not actually doing this for the catalog -> view change of > > > > pg_replication_slots simply because I don't really think folks will > > > > misunderstand or be confused by that redirect since it's still the > same > > > > relation. If others disagree though, we could certainly change that > > > > too. > > > > > > I agree with you. > > > > Ok, great. > > > > How does the attached look then?
> Pretty good to me. Thanks so much for your help and support with this. Index entries render as e.g. pg_xlogdump, The pg_xlogdump command (see also pg_waldump) whereas with the obsolete subhead they would render as something like: obsolete, Obsolete or renamed features, settings and files pg_xlogdump, The pg_xlogdump command The see also spelling is much easier to find in the index but doesn't make it as obvious that it's obsoleted/replaced. A look at the DocBook docs suggests we should use <see> not <seealso> for these. A quick sed -i -e 's/<seealso>/<see>/g' -e 's/<\/seealso>/<\/see>/g' doc/src/sgml/appendix-obsolete* causes them to render much better: pg_receivexlog, The pg_receivexlog command (see pg_receivewal) It might be worth changing the <title>s too, so I've done so in the attached. The terms now render as: pg_receivexlog, pg_receivexlog renamed to pg_receivewal (see pg_receivewal) which is good enough in my opinion. The duplication is messy but an expected artifact of index generation. I don't see any DocBook attribute that lets you suppress insertion of the <title> of the section containing the <indexterm>, and it's not worth fiddling to try to eliminate it with structural hacks. The attached changes the titles, changes <seealso> to <see>, and also updates the comments in the obsolete entries SGML docs to specify that the id must be unchanged + give a recommended index term format. From c11e0f079c07c669abac0f00ae3f1ebdfc18eae7 Mon Sep 17 00:00:00 2001 From: Craig Ringer Date: Thu, 21 Jan 2021 10:01:29 +0800 Subject: [PATCH v3] Add a docs section for obsoleted and renamed functions and settings The new appendix groups information on renamed or removed settings, commands, etc into an out-of-the-way part of the docs. The original id elements are retained in each subsection to ensure that the same filenames are produced for HTML docs. This prevents /current/ links on the web from breaking, and allows users of the web docs to follow links from old version pages to info on the changes in the new version.
Prior to this change, a link to /current/ for renamed sections like the recovery.conf docs would just 404. Similarly if someone searched for recovery.conf they would find the pg11 docs, but there would be no /12/ or /current/ link, so they couldn't easily find out that it was removed in pg12 or how to adapt. In
Re: Printing backtrace of postgres processes
On Wed, 20 Jan 2021 at 05:23, Tom Lane wrote: > > Recursion is scary, but it should (I think) not be possible if this > is driven off CHECK_FOR_INTERRUPTS. There will certainly be none > of those in libbacktrace. > We can also hold interrupts for the call, and it might be wise to do so. One point here is that it might be a good idea to suppress elog.c's > calls to functions in the error context stack. As we saw in another > connection recently, allowing that to happen makes for a *very* > large increase in the footprint of code that you are expecting to > work at any random CHECK_FOR_INTERRUPTS call site. > I strongly agree. Treat it as errhidecontext(). BTW, it also looks like the patch is doing nothing to prevent the > backtrace from being sent to the connected client. I'm not sure > what I think about whether it'd be okay from a security standpoint > to do that on the connection that requested the trace, but I sure > as heck don't want it to happen on connections that didn't. I don't see a good reason to send a bt to a client. Even though these backtraces won't be analysing debuginfo and populating args, locals, etc, it should still just go to the server log. > Maybe, given all of these things, we should forget using elog at > all and just emit the trace with fprintf(stderr). That causes quite a lot of pain with MemoryContextStats() already as it's frequently difficult to actually locate the output given the variations that exist in customer logging configurations. Sometimes stderr goes to a separate file or to journald. It's also much harder to locate the desired output since there's no log_line_prefix. I have a WIP patch floating around somewhere that tries to teach MemoryContextStats to write to the ereport channel when not called during an actual out-of-memory situation for that reason; an early version is somewhere in the archives. This is one of those "ok in development, painful in production" situations. 
So I'm not a big fan of pushing it to stderr, though I'd rather have that than not have the ability at all.
Re: Printing backtrace of postgres processes
On Wed, 20 Jan 2021 at 01:31, Robert Haas wrote: > On Sat, Jan 16, 2021 at 3:21 PM Tom Lane wrote: > > I'd argue that backtraces for those processes aren't really essential, > > and indeed that trying to make the syslogger report its own backtrace > > is damn dangerous. > > I agree. Ideally I'd like to be able to use the same mechanism > everywhere and include those processes too, but surely regular > backends and parallel workers are going to be the things that come up > most often. > > > (Personally, I think this whole patch fails the safety-vs-usefulness > > tradeoff, but I expect I'll get shouted down.) > > You and I are frequently on opposite sides of these kinds of > questions, but I think this is a closer call than many cases. I'm > convinced that it's useful, but I'm not sure whether it's safe. On the > usefulness side, backtraces are often the only way to troubleshoot > problems that occur on production systems. I wish we had better > logging and tracing tools instead of having to ask for this sort of > thing, but we don't. Agreed. In theory we should be able to do this sort of thing using external trace and diagnostic tools like perf, systemtap, etc. In practice, these tools tend to be quite version-sensitive and hard to get right without multiple rounds of back-and-forth to deal with specifics of the site's setup, installed debuginfo or lack thereof, specific tool versions, etc. It's quite common to have to fall back on attaching gdb with a breakpoint on a function in the export symbol table (so it works w/o debuginfo), saving a core, and then analysing the core on a separate system on which debuginfo is available for all the loaded modules. It's a major pain. The ability to get a basic bt from within Pg is strongly desirable. IIRC gdb's basic unwinder works without external debuginfo, if not especially well. libunwind produces much better results, but that didn't pass the extra-dependency bar when backtracing support was introduced to core postgres. 
On a side note, to help get better diagnostics I've also been meaning to look into building --enable-debug with -ggdb3 so we can embed macro info, and using dwz to deduplicate+compress the debuginfo so we can encourage people to install it by default on production. I also want to start exporting pointers to all the important data symbols for diagnostic use, even if we do so in a separate ELF section just for debug use.
Re: [PATCH] ProcessInterrupts_hook
On Tue, 19 Jan 2021 at 12:44, Craig Ringer wrote: > > > We're about halfway there already, see 7e784d1dc. I didn't do the >> > other half because it wasn't necessary to the problem, but exposing >> > the shutdown state more fully seems reasonable. >> > > Excellent, I'll take a look. Thanks. > That looks very handy already. Extending it to be set before SIGTERM too would be useful. My suggestion, which I'm happy to post in patch form if you think it's reasonable:

* Change QuitSignalReason to ExitSignalReason to cover both SIGTERM (fast) and SIGQUIT (immediate)
* Rename PMQUIT_FOR_STOP to PMEXIT_IMMEDIATE_SHUTDOWN
* Add enumeration values PMEXIT_SMART_SHUTDOWN and PMEXIT_FAST_SHUTDOWN
* For a fast shutdown, in pmdie()'s SIGINT case call SetExitSignalReason(PMEXIT_FAST_SHUTDOWN), so that when PostmasterStateMachine() calls SignalSomeChildren(SIGTERM, ...) in response to PM_STOP_BACKENDS, the reason is already available.
* For smart shutdown, in pmdie()'s SIGTERM case call SetExitSignalReason(PMEXIT_SMART_SHUTDOWN) and set the latch of every live backend.

There isn't any appropriate PROCSIG unless we want to overload PROCSIG_WALSND_INIT_STOPPING (ick), but I think it'd generally be sufficient to check GetExitSignalReason() in backend main loops. The fast shutdown case seems like a no-brainer extension of your existing patch. I'm not entirely sure about the smart shutdown case. I don't want to add a ProcSignal slot just for this and the info isn't urgent anyway. I think that by checking for postmaster shutdown in the backend main loop we'd be able to support eager termination of idle backends on smart shutdown (immediately, or after an idle grace period), which is something I've wanted for quite some time. It shouldn't be significantly expensive especially in the idle loop. Thoughts? (Also I want a hook in PostgresMain's idle loop for things like this).
How to expose session vs txn lock info in pg_locks view?
Presently there doesn't seem to be a way to tell whether a lock is session-level or transaction-level in the pg_locks view. I was expecting this to be a quick patch, but the comment on the definition of PROCLOCKTAG in lock.h notes that shmem state for heavyweight locks does not track whether the lock is session-level or txn-level. That explains why it's not already exposed in pg_locks. AFAICS it'd be necessary to expand PROCLOCK to expose this in shmem. Probably by adding a small bitfield where bit 0 is set if there's a txn-level lock and bit 1 is set if there's a session-level lock. But I'm not convinced that expanding PROCLOCK is justifiable for this. sizeof(PROCLOCK) is 64 on a typical x64 machine. Adding anything to it increases it to 72 bytes:

(gdb) ptype /o struct PROCLOCK
/* offset |  size */  type = struct PROCLOCK {
/*      0 |    16 */    PROCLOCKTAG tag;
/*     16 |     8 */    PGPROC *groupLeader;
/*     24 |     4 */    LOCKMASK holdMask;
/*     28 |     4 */    LOCKMASK releaseMask;
/*     32 |    16 */    SHM_QUEUE lockLink;
/*     48 |    16 */    SHM_QUEUE procLink;
/*     64 |     1 */    unsigned char locktypes;
                        /* XXX 7-byte padding */
                        /* total size (bytes): 72 */
}

Going over 64 sets off possible alarm bells about cache line sizing to me, but maybe it's not that critical? It'd also require (8 * max_locks_per_xact * (MaxBackends+max_prepared_xacts)) extra shmem space; that could land up being 128k on a default setup and a couple of megabytes on a big system. Not huge, but not insignificant if it's hot data. It's frustrating to be unable to tell the difference between session-level and txn-level locks in diagnostic output. And the deadlock detector has no way to tell the difference when selecting a victim for a deadlock abort - it'd probably make sense to prefer to send a deadlock abort for txn-only lockers. But I'm not sure I see a sensible way to add the info - PROCLOCK is already free of any padding, and I wouldn't want to use hacks like pointer-tagging. Thoughts anyone?
Re: [PATCH] ProcessInterrupts_hook
On Tue, 19 Jan 2021, 02:01 Robert Haas, wrote: > On Mon, Jan 18, 2021 at 11:56 AM Tom Lane wrote: > > > I've wanted this in the past, too, so +1 from me. > > > > I dunno, this seems pretty scary and easily abusable. There's not all > > that much that can be done safely in ProcessInterrupts(), and we should > > not be encouraging extensions to think they can add random processing > > there. > > We've had this disagreement before about other things, and I just > don't agree. If somebody uses a hook for something wildly unsafe, that > will break their stuff, not ours. Generally yeah. And we have no shortage of hooks with plenty of error or abuse potential and few safeguards already. I'd argue that in C code any external code is inherently unsafe anyway. So it's mainly down to whether the hook actively encourages unsafe actions without providing commensurate benefits, and whether there's a better/safer way to achieve the same thing. That's not to say I endorse adding hooks for random purposes in random places. In particular, if it's > impossible to use a particular hook in a reasonably safe way, that's a > sign that the hook is badly-designed and that we should not have it. > Yep. Agreed. Any hook is possible to abuse or write incorrectly, from simple fmgr loadable functions right on up. The argument that a hook could be abused would apply just as well to exposing pqsignal() itself to extensions. Probably more so. Also to anything like ProcessUtility_hook. > > We're about halfway there already, see 7e784d1dc. I didn't do the > > other half because it wasn't necessary to the problem, but exposing > > the shutdown state more fully seems reasonable. > Excellent, I'll take a look. Thanks.
[PATCH] ProcessInterrupts_hook
Hi folks A few times lately I've been doing things in extensions that've made me want to be able to run my own code whenever InterruptPending is true and CHECK_FOR_INTERRUPTS() calls ProcessInterrupts(). So here's a simple patch to add ProcessInterrupts_hook. It follows the usual pattern like ProcessUtility_hook and standard_ProcessUtility. Why? Because sometimes I want most of the behaviour of die(), but the option to override it with some bgworker-specific choices occasionally. HOLD_INTERRUPTS() is too big a hammer. What I really want to go along with this is a way for any backend to observe the postmaster's pmState and its "Shutdown" variable's value, so any backend can tell if we're in FastShutdown, SmartShutdown, etc. Copies in shmem only obviously. But I'm not convinced it's right to just copy these vars as-is to shmem, and I don't want to use the memory for a ProcSignal slot for something that won't be relevant for most backends for most of the postmaster lifetime. Ideas welcomed. From 1c8c477a5814420011fa034323e82d8eabc6bc5f Mon Sep 17 00:00:00 2001 From: Craig Ringer Date: Mon, 18 Jan 2021 14:14:46 +0800 Subject: [PATCH v1 1/4] Provide a hook for ProcessInterrupts() Code may now register a ProcessInterrupts_hook to fire its own logic before and/or after the main ProcessInterrupts handler for signal processing. The hook must call standard_ProcessInterrupts to execute the normal interrupt handling or set InterruptPending to true to cause the interrupt to be re-processed at the next opportunity. This follows the consistent pattern used by other PostgreSQL hooks like the ProcessUtility_hook and standard_ProcessUtility() function. The purpose of this hook is to allow extensions to react to their own custom interrupt flags during CHECK_FOR_INTERRUPTS() calls invoked in main PostgreSQL code paths.
---
 src/backend/tcop/postgres.c | 20
 src/include/miscadmin.h     |  3 +++
 2 files changed, 23 insertions(+)

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 28055680aa..8cf1359e33 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -102,6 +102,8 @@ int max_stack_depth = 100;
 /* wait N seconds to allow attach from a debugger */
 int PostAuthDelay = 0;
+/* Intercept CHECK_FOR_INTERRUPTS() responses */
+ProcessInterrupts_hook_type ProcessInterrupts_hook = NULL;
 /*
@@ -3064,8 +3066,26 @@ ProcessInterrupts(void)
 	/* OK to accept any interrupts now? */
 	if (InterruptHoldoffCount != 0 || CritSectionCount != 0)
 		return;
+
 	InterruptPending = false;
+
+	if (ProcessInterrupts_hook)
+		ProcessInterrupts_hook();
+	else
+		standard_ProcessInterrupts();
+}
+
+/*
+ * Implement the default signal handling behaviour for most backend types
+ * including user backends and bgworkers.
+ *
+ * This is where CHECK_FOR_INTERRUPTS() eventually lands up unless intercepted
+ * by ProcessInterrupts_hook.
+ */
+void
+standard_ProcessInterrupts(void)
+{
+
 	if (ProcDiePending)
 	{
 		ProcDiePending = false;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bdc97e308..f082d04448 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -94,6 +94,9 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
+typedef void (*ProcessInterrupts_hook_type)(void);
+extern ProcessInterrupts_hook_type ProcessInterrupts_hook;
+extern void standard_ProcessInterrupts(void);
 #ifndef WIN32
-- 
2.29.2

From 58b9ac884ef5bd35a2aac9da85079e24097612be Mon Sep 17 00:00:00 2001
From: Craig Ringer
Date: Mon, 18 Jan 2021 15:26:43 +0800
Subject: [PATCH v1 2/4] Test for ProcessInterrupts_hook

---
 src/test/modules/Makefile                     |   3 +-
 src/test/modules/test_hooks/.gitignore        |   4 +
 src/test/modules/test_hooks/Makefile          |  28
 .../expected/test_processinterrupt_hook.out   |   0
 src/test/modules/test_hooks/hooks.conf        |   1 +
 .../sql/test_processinterrupt_hook.sql        |  16 ++
 .../modules/test_hooks/test_hooks--1.0.sql    |   8 +
 src/test/modules/test_hooks/test_hooks.c      |  37 +
 .../modules/test_hooks/test_hooks.control     |   4 +
 src/test/modules/test_hooks/test_hooks.h      |  20 +++
 .../test_hooks/test_process_interrupts_hook.c | 140 ++
 11 files changed, 260 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/test_hooks/.gitignore
 create mode 100644 src/test/modules/test_hooks/Makefile
 create mode 100644 src/test/modules/test_hooks/expected/test_processinterrupt_hook.out
 create mode 100644 src/test/modules/test_hooks/hooks.conf
 create mode 100644 src/test/modules/test_hooks/sql/test_processinterrupt_hook.sql
 create mode 100644 src/test/modules/test_hooks/test_hooks--1.0.sql
 create mode 100644 src/test/modules/test_hooks/test_hooks.c
 create mode 100644 src/test/modules/test_hooks/test_h
[PATCH] More docs on what to do and not do in extension code
Hi folks The attached patch expands the xfunc docs and bgworker docs a little, providing a starting point for developers to learn how to do some common tasks the Postgres Way. It mentions in brief these topics: * longjmp() based exception handling with elog(ERROR), PG_CATCH() and PG_RE_THROW() etc * Latches, spinlocks, LWLocks, heavyweight locks, condition variables * shm, DSM, DSA, shm_mq * syscache, relcache, relation_open(), invalidations * deferred signal handling, CHECK_FOR_INTERRUPTS() * Resource cleanup hooks and callbacks like on_exit, before_shmem_exit, the resowner callbacks, etc * signal handling in bgworkers All very superficial, but all things I really wish I'd known a little about, or even that I needed to learn about, when I started working on postgres. I'm not sure it's in quite the right place. I wonder if there should be a separate part of xfunc.sgml that covers the slightly more advanced bits of postgres backend and function coding like this, lists relevant README files in the source tree, etc. I avoided going into details like how resource owners work. I don't want the docs to have to cover all that in detail; what I hope to do is start providing people with clear references to the right place in the code, READMEs, etc to look when they need to understand specific topics. From 96ce89cb7e1a97d9d247fbacba73ade85a01ea38 Mon Sep 17 00:00:00 2001 From: Craig Ringer Date: Mon, 18 Jan 2021 14:05:15 +0800 Subject: [PATCH v1 2/2] Expand docs on PostgreSQL extension coding Expand the docs on PostgreSQL extension coding and background worker development a little so that key topics like allocation, interrupt handling, exit callbacks, transaction callbacks, PG_TRY()/PG_CATCH(), resource owners, transaction and snapshot state, etc are at least briefly mentioned with a few pointers to where to learn more. Add some detail on background worker signal handling. 
--- doc/src/sgml/bgworker.sgml | 86 ++-- doc/src/sgml/xfunc.sgml| 162 - 2 files changed, 239 insertions(+), 9 deletions(-) diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml index 7fd673ab54..9216b8e0ea 100644 --- a/doc/src/sgml/bgworker.sgml +++ b/doc/src/sgml/bgworker.sgml @@ -1,6 +1,6 @@ - + Background Worker Processes @@ -29,6 +29,12 @@ + + All code that will be executed in PostgreSQL background workers is expected + to follow the basic rules set out for all PostgreSQL extension code in . + + Background workers can be initialized at the time that PostgreSQL is started by including the module name in @@ -95,6 +101,11 @@ typedef struct BackgroundWorker buffers, or any custom data structures which the worker itself may wish to create and use. + + See for information on how to + request extension shared memory allocations, LWLocks, + etc. + @@ -212,9 +223,9 @@ typedef struct BackgroundWorker Signals are initially blocked when control reaches the background worker's main function, and must be unblocked by it; this is to allow the process to customize its signal handlers, if necessary. - Signals can be unblocked in the new process by calling - BackgroundWorkerUnblockSignals and blocked by calling - BackgroundWorkerBlockSignals. + It is important that all background workers set up and unblock signal + handling before they enter their main loops. Signal handling in background + workers is discussed separately in . @@ -296,13 +307,74 @@ typedef struct BackgroundWorker - The src/test/modules/worker_spi module - contains a working example, - which demonstrates some useful techniques. + Background workers that require inter-process communication and/or + synchronization should use PostgreSQL's built-in IPC and concurrency + features as discussed in + and . + + + + If a background worker needs to sleep or wait for activity it should + always use PostgreSQL's + interrupt-aware APIs for the purpose. 
Do not usleep(), + system(), make blocking system calls, etc. + + + + The src/test/modules/worker_spi and + src/test/modules/test_shm_mq modules contain working examples + that demonstrate some useful techniques. The maximum number of registered background workers is limited by . + + + Signal Handling in Background Workers + + +Signals can be unblocked in the new process by calling +BackgroundWorkerUnblockSignals and blocked by calling +BackgroundWorkerBlockSignals. +The default signal handlers installed for background workers do +not interrupt queries or blocking calls into other postgres code +so they are only suitable for simple background workers that frequently and +predictably return control to their main loop. If your worker uses the +default background worker signal handling it should call +Hand
[PATCH] Cross-reference comments on signal handling logic
Hi all The attached comments-only patch expands the signal handling section in miscadmin.h a bit so that it mentions ProcSignal, deferred signal handling during blocking calls, etc. It adds cross-refs between major signal handling routines and the miscadmin comment to help readers track the various scattered but inter-related code. I hope this helps some new developers in future. From a16d0a0f8f502ba3631d20d51c7bb50cedea6d57 Mon Sep 17 00:00:00 2001 From: Craig Ringer Date: Mon, 18 Jan 2021 12:21:18 +0800 Subject: [PATCH v1 1/2] Comments and cross-references for signal handling To help new developers come to terms with the complexity of signal handling in PostgreSQL's various types of backend, expand the main comment on signal handling in miscadmin.h. Cover the ProcSignal mechanism for SIGUSR1 multiplexing. Mention that blocking calls into 3rd party code and uninterruptible syscalls should be avoided in backend code because of Pg's deferred signal processing logic. Provide a set of cross-references to key routines related to signal handling. In various signal handling routines, cross-reference the high-level overview comment in miscadmin.h. This should possibly become part of the developer documentation rather than a comment in a header file, but since we already have the comment, it seemed sensible to start by updating it and making it more discoverable. 
--- src/backend/postmaster/interrupt.c | 21 src/backend/tcop/postgres.c| 26 - src/include/miscadmin.h| 84 -- src/port/pgsleep.c | 2 +- 4 files changed, 113 insertions(+), 20 deletions(-) diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c index dd9136a942..e5256179ec 100644 --- a/src/backend/postmaster/interrupt.c +++ b/src/backend/postmaster/interrupt.c @@ -10,6 +10,15 @@ * src/backend/postmaster/interrupt.c * *- + * + * This file defines bare-bones interrupt handlers for secondary helper + * processes run by the postmaster - the walwriter and bgwriter, and + * potentially some background workers. + * + * These handlers are NOT used by normal user backends as they do not support + * interruption of normal execution to respond to a signal, query cancellation, + * etc. See miscadmin.h for details on interrupt handling used by normal + * postgres backends - CHECK_FOR_INTERRUPTS(), ProcessInterrupts(), die(), etc. */ #include "postgres.h" @@ -28,6 +37,9 @@ volatile sig_atomic_t ShutdownRequestPending = false; /* * Simple interrupt handler for main loops of background processes. + * + * See also CHECK_FOR_INTERRUPTS() and ProcessInterrupts() for the user-backend + * variant of this function. */ void HandleMainLoopInterrupts(void) @@ -51,6 +63,8 @@ HandleMainLoopInterrupts(void) * Normally, this handler would be used for SIGHUP. The idea is that code * which uses it would arrange to check the ConfigReloadPending flag at * convenient places inside main loops, or else call HandleMainLoopInterrupts. + * + * Most backends use this handler. */ void SignalHandlerForConfigReload(SIGNAL_ARGS) @@ -67,6 +81,9 @@ SignalHandlerForConfigReload(SIGNAL_ARGS) * Simple signal handler for exiting quickly as if due to a crash. * * Normally, this would be used for handling SIGQUIT. + * + * See also quickdie() and die() for the separate signal handling logic + * used by normal user backends. 
*/ void SignalHandlerForCrashExit(SIGNAL_ARGS) @@ -99,6 +116,10 @@ SignalHandlerForCrashExit(SIGNAL_ARGS) * * ShutdownRequestPending should be checked at a convenient place within the * main loop, or else the main loop should call HandleMainLoopInterrupts. + * + * See also die() for the extended version of this handler that's used in + * backends that may need to be interrupted while performing long-running + * actions. */ void SignalHandlerForShutdownRequest(SIGNAL_ARGS) diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 8cf1359e33..907b524e0f 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -2852,8 +2852,15 @@ quickdie(SIGNAL_ARGS) } /* - * Shutdown signal from postmaster: abort transaction and exit - * at soonest convenient time + * Shutdown signal from postmaster: set flags to ask ProcessInterrupts() to + * abort the current transaction and exit at the soonest convenient time. + * + * This handler is used as the SIGTERM handler by most backend types that may + * run arbitrary backend code, user queries, etc. Some backends use different + * handlers and sometimes different flags. + * + * See "System interrupt and critical section handling" in miscadmin.h for a + * higher level explanation of signal handling. */ void die(SIGNAL_ARGS) @@ -2924,6 +2931,9 @@ FloatExceptionHandler(SIGNAL_ARGS) * handling following receipt of SIGUSR1. Desi
Re: Printing backtrace of postgres processes
On Mon, 18 Jan 2021 at 00:56, vignesh C wrote: > > Thanks for your comments Andres, I will ignore it for the processes > which do not have access to ProcSignal. I will make the changes and > post a patch for this soon. > I think that's sensible. I've had variable results with glibc's backtrace(), especially on older platforms and/or with external debuginfo, but it's much better than nothing. It's often not feasible to get someone to install gdb and run commands on their production systems - they can be isolated and firewalled or hobbled by painful change policies. Something built in to postgres, even if basic, is likely to come in very handy.
Re: Add docs stub for recovery.conf
On Thu, 14 Jan 2021 at 03:44, Stephen Frost wrote: > > Alright, how does this look? The new entries are all under the > 'obsolete' section to keep it out of the main line, but should work to > 'fix' the links that currently 404 and provide a bit of a 'softer' > landing for the other cases that currently just forcibly redirect using > the website doc alias capability. > Thanks for expanding the change to other high profile obsoleted or renamed features and tools. One minor point. I'm not sure this is quite the best way to spell the index entries: + + obsolete + pg_receivexlog + as it will produce an index term "obsolete" with a list of various components under it. While that concentrates them nicely, it means people won't actually find them if they're using the index alphabetically. I'd slightly prefer + + pg_receivexlog + pg_receivewal + even though that bulks the index up a little, because then people are a bit more likely to find it. Your extended and revised patch retains the above style for + + trigger_file + promote_trigger_file + ... + + standby_mode + standby.signal + so if you intend to change it, that entry needs changing too. > I ended up not actually doing this for the catalog -> view change of > pg_replication_slots simply because I don't really think folks will > misunderstand or be confused by that redirect since it's still the same > relation. If others disagree though, we could certainly change that > too. > I agree with you.
Re: [PATCH] Identify LWLocks in tracepoints
On Thu, 14 Jan 2021 at 15:56, Peter Eisentraut < peter.eisentr...@enterprisedb.com> wrote: > On 2020-12-19 06:00, Craig Ringer wrote: > > Patch 1 fixes a bogus tracepoint where an lwlock__acquire event would be > > fired from LWLockWaitForVar, despite that function never actually > > acquiring the lock. > > This was added in 68a2e52bbaf when LWLockWaitForVar() was first > introduced. It looks like a mistake to me too, but maybe Heikki wants > to comment. > I'm certain it's a copy/paste bug.
Re: [PATCH] Identify LWLocks in tracepoints
On Wed, 13 Jan 2021 at 19:19, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Sat, Dec 19, 2020 at 01:00:01PM +0800, Craig Ringer wrote: > > > > The attached patch set follows on from the discussion in [1] "Add LWLock > > blocker(s) information" by adding the actual LWLock* and the numeric > > tranche ID to each LWLock related TRACE_POSTGRESQL_foo tracepoint. > > > > This does not provide complete information on blockers, because it's not > > necessarily valid to compare any two LWLock* pointers between two process > > address spaces. The locks could be in DSM segments, and those DSM > segments > > could be mapped at different addresses. > > > > I wasn't able to work out a sensible way to map a LWLock* to any sort of > > (tranche-id, lock-index) because there's no requirement that locks in a > > tranche be contiguous or known individually to the lmgr. > > > > Despite that, the patches improve the information available for LWLock > > analysis significantly. > > Thanks for the patches, this could be indeed useful. I've looked through > and haven't noticed any issues with either the tracepoint extensions or > commentaries, except that I find it is not that clear how trance_id > indicates a re-initialization here? > > /* Re-initialization of individual LWLocks is not permitted */ > Assert(tranche_id >= NUM_INDIVIDUAL_LWLOCKS || !IsUnderPostmaster); > There should be no reason for anything to call LWLockInitialize(...) on an individual LWLock, since they are all initialized during postmaster startup. Doing so must be a bug. But that's a trivial change that can be done separately. > > Patch 2 adds the tranche id and lock pointer for each trace hit. This > makes > > it possible to differentiate between individual locks within a tranche, > and > > (so long as they aren't tranches in a DSM segment) compare locks between > > processes. That means you can do lock-order analysis etc, which was not > > previously especially feasible. 
> > I'm curious in which kind of situations lock-order analysis could be > helpful? > If code-path 1 does LWLockAcquire(LockA, LW_EXCLUSIVE); ... LWLockAcquire(LockB, LW_EXCLUSIVE); and code-path 2 does: LWLockAcquire(LockB, LW_EXCLUSIVE); ... LWLockAcquire(LockA, LW_EXCLUSIVE); then they're subject to deadlock. But you might not actually hit that often in test workloads if the timing required for the deadlock to occur is tight and/or occurs on infrequent operations. It's not always easy to reason about or prove things about lock order when they're potentially nested deep within many layers of other calls and callbacks. Obviously something we try to avoid with LWLocks, but not impossible. If you trace a workload and derive all possible nestings of lock acquire order, you can then prove things about whether there are any possible ordering conflicts and where they might arise. A PoC to do so is on my TODO. > Traces also don't have to do userspace reads for the tranche name all > > the time, so the trace can run with lower overhead. > > This one is also interesting. Just for me to clarify, wouldn't there be > a bit of overhead anyway (due to switching from kernel context to user > space when a tracepoint was hit) that will mask name read overhead? Or > are there any available numbers about it? > I don't have numbers on that. Whether it matters will depend way too much on how you're using the probe points and collecting/consuming the data anyway. It's a bit unfortunate (IMO) that we make a function call for each tracepoint invocation to get the tranche names. Ideally I'd prefer to be able to omit the tranche names lookups for these probes entirely for something as hot as LWLocks. But it's a bit of a pain to look up the tranche names from an external trace tool, so instead I'm inclined to see if we can enable systemtap's semaphores and only compute the tranche name if the target probe is actually enabled. 
But that'd be separate to this patch and require a build change in how systemtap support is compiled and linked. BTW, a user->kernel->user context switch only occurs when the trace tool's probes use kernel space - such as for perf based probes, or for systemtap's kernel-runtime probes. The same markers can be used by e.g. systemtap's "dyninst" runtime that runs entirely in userspace.
Re: Logical decoding without slots: decoding in lockstep with recovery
On Sat, 26 Dec 2020 at 06:51, Andres Freund wrote: > Hi, > > On 2020-12-23 14:56:07 +0800, Craig Ringer wrote: > > I want to share an idea I've looked at a few times where I've run into > > situations where logical slots were inadvertently dropped, or where it > > became necessary to decode changes in the past on a slot. > > > > As most of you will know you can't just create a logical slot in the > past. > > Even if it was permitted, it'd be unsafe due to catalog_xmin retention > > requirements and missing WAL. > > > > But if we can arrange a physical replica to replay the WAL of interest > and > > decode each commit as soon as it's replayed by the startup process, we > know > > the needed catalog rows must all exist, so it's safe to decode the > change. > > > > So it should be feasible to run logical decoding in standby, even > without a > > replication slot, so long as we: > > > > * pause startup process after each xl_xact_commit > > * wake the walsender running logical decoding > > * decode and process until ReorderBufferCommit for the just-committed > xact > > returns > > * wake the startup process to decode the up to the next commit > > I don't think it's safe to just do this for each xl_xact_commit - we can > remove needed rows at quite a few places, not just around transaction > commit. Good point. I vaguely recall spotting a possible decoding-on-standby issue with eager removal of rows that are still ahead of the global xmin but that the primary "knows" can't be needed, based on info about currently running backends. But when looking over code related to HOT, visibility, and vacuum now I can't for the life of me remember exactly what it was or find it. Hopefully I just misunderstood at the time or was getting confused between decoding on standby and xact streaming. > Rows needed to correctly decode rows earlier in the transaction > might not be available by the time the commit record was logged. > When can that happen? 
I think you'd basically have to run logical decoding in lockstep with > WAL replay, i.e. replay one record, then call logical decoding for that > record, replay the next record, ... > That sounds likely to be unusably slow. The only way I can see it having any hope of moving at a reasonable rate would be to run a decoding session inside the startup process itself so we don't have to switch back/forth for each record. But I imagine that'd probably cause a whole other set of problems. > Can anyone see any obvious problem with this? > > The patch for logical decoding on the standby > https://postgr.es/m/20181212204154.nsxf3gzqv3gesl32%40alap3.anarazel.de > should provide some of the infrastructure to do this properly. Should > really commit it. /me adds an entry to the top of the todo list. > That would certainly be helpful for quite a number of things.
Re: [PATCH] Identify LWLocks in tracepoints
On Sat, 19 Dec 2020 at 13:00, Craig Ringer wrote: > Hi all > > The attached patch set follows on from the discussion in [1] "Add LWLock > blocker(s) information" by adding the actual LWLock* and the numeric > tranche ID to each LWLock related TRACE_POSTGRESQL_foo tracepoint. > > I've attached a systemtap script that makes use of the information exported by the enhanced LWLock tracepoints. It offers something akin to dynamic -DLWLOCK_STATS with automatic statistical aggregation and some selective LWLOCK_DEBUG output. The script isn't super pretty. I didn't try to separate event-data collection from results output, and there's some duplication in places. But it gives you an idea what's possible when we report lock pointers and tranche IDs to tracepoints and add entry/exit tracepoints. Key features: * Collect statistical aggregates on lwlock hold and wait durations across all processes. Stats are grouped by lockmode (shared or exclusive) and by tranche name, as well as rollup stats across all tranches. * Report lock wait and hold durations for each process when that process exits. Again, reported by mode and tranche name. * For long lock waits, print the waiter pid and waiting pid, along with each process's backend type and application_name if known, the acquire mode, and the acquire function The output is intended to be human-readable, but it'd be quite simple to convert it into raw tsv-style output suitable for ingestion into statistical postprocessing or graphing tools. It should be fairly easy to break down the stats by acquire function if desired, so LWLockAcquire(), LWLockWaitForVar(), and LWLockAcquireOrWait() are reported separately. They're combined in the current output. Capturing the current query string is pretty simple if needed, but I didn't think it was likely to be especially useful. Sample output for a pg_regress run attached. Abridged version follows. Here the !!W!! lines are "waited a long time", the !!H!! lines are "held a long time". 
Then [pid]:MyBackendType tranche_name wait_time_us (wait_time) in wait_func (application_name) => [blocker_pid] (blocker_application_name) . If blocker pid wasn't identified it won't be reported - I know how to fix that and will do so soon. !!W!! [ 93030]:3 BufferContent12993 (0m0.012993s) in lwlock__acquire__start (pg_regress/text) !!W!! [ 93036]:3LockManager14540 (0m0.014540s) in lwlock__acquire__start (pg_regress/float8) => [ 93045] (pg_regress/regproc) !!W!! [ 93035]:3 BufferContent12608 (0m0.012608s) in lwlock__acquire__start (pg_regress/float4) => [ 93034] (pg_regress/oid) !!W!! [ 93036]:3LockManager10301 (0m0.010301s) in lwlock__acquire__start (pg_regress/float8) !!W!! [ 93043]:3LockManager10356 (0m0.010356s) in lwlock__acquire__start (pg_regress/pg_lsn) !!H!! [ 93033]:3 BufferContent20579 (0m0.020579s) (pg_regress/int8) !!W!! [ 93027]:3 BufferContent10766 (0m0.010766s) in lwlock__acquire__start (pg_regress/char) => [ 93037] (pg_regress/bit) !!W!! [ 93036]:3 OidGen12876 (0m0.012876s) in lwlock__acquire__start (pg_regress/float8) ... Then the summary rollup at the end of the run. This can also be output periodically during the run. Abbreviated for highlights: wait locks: all procstranche modecount total avg variance min max W LW_EXCLUSIVE (all)E54185 14062734 259 1850265144177 W LW_SHARED (all)S 3668 1116022 304 1527261218642 held locks: all procstranche modecount total avg variance min max H LW_EXCLUSIVE (all)E 10438060 153077259 14370351 195043 H LW_SHARED (all)S 14199902 654669344 5318144030 all procs by tranchetranche modecount total avg variance min max W tranche (all)S 3668 1116022 304 1527261218642 W tranche (all)E54185 14062734 259 1850265144177 W tranche WALInsertE 9839 2393229 243 1180294214209 W tranche BufferContentE 3012 1726543 573 3869409228186 W tranche BufferContentS 1664 657855 395 2185694218642 W trancheLockFastPathE28314 6327801 223 1278053126133 W trancheLockFastPathS 87 59401 682 3703217 19 9454 W tranche
Re: [PATCH] Identify LWLocks in tracepoints
On Mon, 28 Dec 2020 at 20:09, Masahiko Sawada wrote: > Hi Craig, > > On Sat, Dec 19, 2020 at 2:00 PM Craig Ringer > wrote: > > > > Hi all > > > > The attached patch set follows on from the discussion in [1] "Add LWLock > blocker(s) information" by adding the actual LWLock* and the numeric > tranche ID to each LWLock related TRACE_POSTGRESQL_foo tracepoint. > > > > This does not provide complete information on blockers, because it's not > necessarily valid to compare any two LWLock* pointers between two process > address spaces. The locks could be in DSM segments, and those DSM segments > could be mapped at different addresses. > > > > I wasn't able to work out a sensible way to map a LWLock* to any sort of > (tranche-id, lock-index) because there's no requirement that locks in a > tranche be contiguous or known individually to the lmgr. > > > > Despite that, the patches improve the information available for LWLock > analysis significantly. > > > > Patch 1 fixes a bogus tracepoint where an lwlock__acquire event would be > fired from LWLockWaitForVar, despite that function never actually acquiring > the lock. > > > > Patch 2 adds the tranche id and lock pointer for each trace hit. This > makes it possible to differentiate between individual locks within a > tranche, and (so long as they aren't tranches in a DSM segment) compare > locks between processes. That means you can do lock-order analysis etc, > which was not previously especially feasible. Traces also don't have to do > userspace reads for the tranche name all the time, so the trace can run > with lower overhead. > > > > Patch 3 adds a single-path tracepoint for all lock acquires and > releases, so you only have to probe the lwlock__acquired and > lwlock__release events to see all acquires/releases, whether conditional or > otherwise. It also adds start markers that can be used for timing wallclock > duration of LWLock acquires/releases. 
> > > > Patch 4 adds some comments on LWLock tranches to try to address some > points I found confusing and hard to understand when investigating this > topic. > > > > You sent in your patch to pgsql-hackers on Dec 19, but you did not > post it to the next CommitFest[1]. If this was intentional, then you > need to take no action. However, if you want your patch to be > reviewed as part of the upcoming CommitFest, then you need to add it > yourself before 2021-01-01 AoE[2]. Thanks for your contributions. > > Regards, > > [1] https://commitfest.postgresql.org/31/ > [2] https://en.wikipedia.org/wiki/Anywhere_on_Earth > Thanks. CF entry created at https://commitfest.postgresql.org/32/2927/ . I don't think it's urgent and will have limited review time so I didn't try to wedge it into the current CF.
Re: [PATCH] Runtime control of CLOBBER_CACHE_ALWAYS
On Wed, 6 Jan 2021 at 18:23, Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 2021-01-06 00:30, Craig Ringer wrote: > > Perhaps debug_invalidate_system_caches_always ? It's a bit long but we > > use the debug prefix elsewhere too. > > Committed with that name. > Thanks very much. > In your original patch, this hunk in pg_config_manual.h: > > + * You can define these by editing pg_config_manual.h but it's usually > + * sufficient to pass CFLAGS to configure, e.g. > + * > + * ./configure --enable-debug --enable-cassert CFLAGS="-DUSE_VALGRIND" > + * > + * The flags are persisted in Makefile.global so they will be correctly > + * applied to extensions, including those built by PGXS. > > I don't think that necessarily works for all settings. Maybe we should > make a pass through it and ensure it all works sensibly, but that seems > like a separate project. > It should work for everything, since we persist the CFLAGS string, rather than individual settings. But I didn't mean to leave that in the same patch anyway, it's separate.
Re: [PATCH] Runtime control of CLOBBER_CACHE_ALWAYS
On Tue, 5 Jan 2021, 22:41 Peter Eisentraut, < peter.eisentr...@2ndquadrant.com> wrote: > On 2020-12-03 07:01, Craig Ringer wrote: > > To try it out, apply the patch (git am), build with --enable-cassert, > > then compare: > > > > make -C src/test/regress check > > > > and > > > > PGOPTIONS="-c debug_clobber_cache_depth=1" \ > > make -C src/test/regress check > > > > The speed difference will be obvious if nothing else! > > This is a really useful feature change. I have a version that I'm happy > to commit, but I wanted to check on the name of the setting. The > proposed name arose during the discussion when it was just to set the > recursion depth but not enable the feature altogether, so I think that > name is a bit misleading now. We could reuse the old macro name, as in > clobber_cache_always=N, which is very recognizable. But the feature > itself doesn't clobber anything (that's done by CLOBBER_FREED_MEMORY), > so most accurate would be something like > invalidate_system_caches_always=N. Thoughts? > Modulo typo, I think that's a better name. Perhaps debug_invalidate_system_caches_always ? It's a bit long but we use the debug prefix elsewhere too. > -- > Peter Eisentraut > 2ndQuadrant, an EDB company > https://www.2ndquadrant.com/ > > >
Re: [PATCH] LWLock self-deadlock detection
On Wed, 30 Dec 2020 at 10:11, Andres Freund wrote: > Hi, > > On 2020-11-27 20:22:41 +0200, Heikki Linnakangas wrote: > > On 26/11/2020 04:50, Tom Lane wrote: > > > Craig Ringer writes: > > > > On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat < > ashutosh.bapat@gmail.com> > > > > wrote: > > > > > I'd prefer to make the lock self deadlock check run for production > > > > > builds, not just cassert builds. > > > > > > I'd like to register a strong objection to spending any cycles > whatsoever > > > on this. If you're using LWLocks in a way that could deadlock, you're > > > doing it wrong. The entire point of that mechanism is to be Light > Weight > > > and not have all the overhead that the heavyweight lock mechanism has. > > > Building in deadlock checks and such is exactly the wrong direction. > > > If you think you need that, you should be using a heavyweight lock. > > > > > > Maybe there's some case for a cassert-only check of this sort, but > running > > > it in production is right out. > > > > > > I'd also opine that checking for self-deadlock, but not any more > general > > > deadlock, seems pretty pointless. Careless coding is far more likely > > > to create opportunities for the latter. (Thus, I have little use for > > > the existing assertions of this sort, either.) > > > > I've made the mistake of forgetting to release an LWLock many times, > leading > > to self-deadlock. And I just reviewed a patch that did that this week > [1]. > > You usually find that mistake very quickly when you start testing > though, I > > don't think I've seen it happen in production. > > I think something roughly along Craig's original patch, basically adding > assert checks against holding a lock already, makes sense. Compared to > the other costs of running an assert build this isn't a huge cost, and > it's helpful. > > I entirely concur that doing this outside of assertion builds is a > seriously bad idea. > Yeah, given it only targets developer error that's sensible.
Re: Logical decoding without slots: decoding in lockstep with recovery
On Wed, 23 Dec 2020, 18:57 Amit Kapila, wrote: > On Wed, Dec 23, 2020 at 12:26 PM Craig Ringer > wrote: > > > > Hi all > > > > I want to share an idea I've looked at a few times where I've run into > situations where logical slots were inadvertently dropped, or where it > became necessary to decode changes in the past on a slot. > > > > As most of you will know you can't just create a logical slot in the > past. Even if it was permitted, it'd be unsafe due to catalog_xmin > retention requirements and missing WAL. > > > > But if we can arrange a physical replica to replay the WAL of interest > and decode each commit as soon as it's replayed by the startup process, we > know the needed catalog rows must all exist, so it's safe to decode the > change. > > > > So it should be feasible to run logical decoding in standby, even > without a replication slot, so long as we: > > > > * pause startup process after each xl_xact_commit > > * wake the walsender running logical decoding > > * decode and process until ReorderBufferCommit for the just-committed > xact returns > > * wake the startup process to decode the up to the next commit > > > > How will you deal with subscriber restart? I think you need some way > to remember confirmed_flush_lsn and restart_lsn and then need to teach > WAL machinery to restart from some previous point. > The simplest option, albeit slow, would be to require the subscriber to confirm flush before allowing more WAL redo. That's what I'd initially assumed. This is a bit of a corner case situation after all, it's never going to be fast with the switching back and forth. More efficient would be to decode and output to a local spool file and fsync it. Then separately send that to the subscriber, like has been discussed in other work on more efficient logical decoding. So the output plugin output would go to a local spool. That can be implemented with pg_recvlogical if needed. > -- > With Regards, > Amit Kapila. >
Re: Improving LWLock wait events
On Wed, 23 Dec 2020 at 15:51, Craig Ringer wrote: > > I've struggled with this quite a bit myself. > > By the way, I sent in a patch to enhance the static tracepoints available for LWLocks. See https://www.postgresql.org/message-id/cagry4nxjo+-hcc2i5h93ttsz4gzo-fsddcwvkb-qafq1zdx...@mail.gmail.com . It'd benefit significantly from the sort of changes you mentioned in #4. For most purposes I've been able to just use the raw LWLock* but having a nice neat (tranche,index) value would be ideal. The trace patch has helped me identify some excessively long LWLock waits in tools I work on. I'll share another of the systemtap scripts I used with it soon.
Re: Improving LWLock wait events
On Mon, 21 Dec 2020 at 05:27, Andres Freund wrote: > Hi, > > The current wait events are already pretty useful. But I think we could > make them more informative without adding real runtime overhead. > > All 1-3 sound pretty sensible to me. I also think there's a 4, but I think the tradeoffs are a bit more > complicated: > > 4) For a few types of lwlock just knowing the tranche isn't > sufficient. E.g. knowing whether it's one or different buffer mapping locks > are being waited on is important to judge contention. > I've struggled with this quite a bit myself. In particular, for tools that validate acquire-ordering safety it's desirable to be able to identify a specific lock in a backend-independent way. The hardest part would be to know how to identify individual locks. The > easiest would probably be to just mask in a parts of the lwlock address > (e.g. shift it right by INTALIGN, and then mask in the result into the > eventId). That seems a bit unsatisfying. > It also won't work reliably for locks in dsm segments, since the lock can be mapped to a different address in different backends. We could probably do a bit better: We could just store the information about > tranche / offset within tranche at LWLockInitialize() time, instead of > computing something just before waiting. While LWLock.tranche is only > 16bits > right now, the following two bytes are currently padding... > > That'd allow us to have proper numerical identification for nearly all > tranches, without needing to go back to the complexity of having tranches > specify base & stride. > That sounds appealing. It'd work for any lock in MainLWLockArray - all built-in individual LWLocks, LWTRANCHE_BUFFER_MAPPING, LWTRANCHE_LOCK_MANAGER, LWTRANCHE_PREDICATE_LOCK_MANAGER, any lock allocated by RequestNamedLWLockTranche(). Some of the other tranches allocate locks in contiguous fixed blocks or in ways that would let them maintain a counter. 
We'd need some kind of "unknown" placeholder value for LWLocks where that doesn't make sense, though - i.e. most locks allocated by callers that make their own LWLockNewTrancheId() call, and locks in some of the built-in tranches that aren't allocated in MainLWLockArray. So I suggest retaining the current LWLockInitialize() and making it a wrapper for LWLockInitializeWithIndex() or similar. Use a 1-based index and keep 0 as unknown, or use a 0-based index and reserve (max-1) as unknown.
TAP PostgresNode function to gdb stacks and optional cores for all backends
Hi all

I recently wrote a utility that adds a $node->gdb_backends() method to PostgresNode instances - figured I'd share it here in case anyone finds it useful, or wants to adopt it into the features of the TAP tools.

This function provides a one-line way to dump stacks for all running backends to per-pid files or to the main test log, as well as the values of various global variables that are potentially of interest. A default set of globals will be dumped for each backend, and the caller can specify additional expressions of interest. If requested, cores will be dumped for each running backend. A subset of backends may be passed by pid instead, so you can easily target specific backends you're interested in.

I initially wrote this to help debug a variety of issues with shutdown, where I hacked the PostgresNode stop() method to trap failed shutdowns and report stacks for all surviving processes + the postmaster in my wrapper class for PostgresNode:

sub stop
{
    my ($self, $mode) = @_;
    local ($@);
    eval { PostgresNode::stop($self, $mode); };
    if ($@)
    {
        $self->gdb_backends(want_cores => 1);
        die $@;
    }
}

#
# This is an excerpt from a subclass of PostgresNode
#

# Generate backtraces and optionally core files for all user backends and
# walsenders associated with this node. Requires gdb to be present. Cores
# will be labeled by node name.
sub gdb_backends
{
    my ($self, %kwargs) = @_;
    $kwargs{backtrace_timeout_s} //= '60';
    $kwargs{core_timeout_s}      //= '60';
    $kwargs{want_cores}          //= 0;
    $kwargs{core_name_pattern}   //= 'core.{{pid}}';
    $kwargs{gdb_logfile_pattern} //= '';

    my $postmaster_pid = $self->{_pid};
    my $pgname         = $self->name;

    # Globals
    # TODO make these conditional on an expression to filter them.
    # TODO handle statics that vary across files
    # TODO add typecasts for when we don't have debuginfo
    # TODO useful GUCs
    my @print_exprs = (
        # All backends
        'IsPostmasterEnvironment', 'IsUnderPostmaster', 'PostmasterPid',
        'LocalRecoveryInProgress', '*MyProc', 'MyAuxProcType',
        '*XLogCtl', '*ControlFile',

        # Generic signal handling
        'InterruptPending', 'ProcDiePending', 'ShutdownRequestPending',
        'ConfigReloadPending',

        # user backend / postgres
        'xact_started', 'doing_extended_query_message', 'ignore_till_sync',

        # startup process
        'ThisTimeLineID', 'LastRec', 'ArchiveRecoveryRequested',
        'InArchiveRecovery', 'PrimaryConnInfo', 'PrimarySlotName',
        'StandbyMode',

        # autovac
        'am_autovacuum_launcher', 'am_autovacuum_worker', 'got_SIGHUP',
        'got_SIGUSR2', 'got_SIGTERM', "'autovacuum.c':got_SIGTERM",

        # for walsenders
        'am_walsender', 'am_cascading_walsender', 'am_db_walsender',
        '*MyWalSnd', '*xlogreader', 'sendTimeLine', 'sentPtr',
        'streamingDoneSending', 'streamingDoneReceiving',
        "'walsender.c':got_SIGTERM", 'got_STOPPING', 'got_SIGUSR2',
        'replication_active', '*logical_decoding_ctx', 'logical_startptr',

        # walreceiver
        'recvFileTLI', '*wrconn',

        # checkpointer
        '*CheckpointerShmem', 'last_checkpoint_time', 'ckpt_active',

        # for bgworkers
        'IsBackgroundWorker',

        # for pgl backends
        '*MyPGLogicalWorker', '*MyPGLSubscription',

        # for bdr backends
        '*MyBdrSubscription',

        # postmaster
        'pmState',
    );

    # Add your own print expressions by passing print_exprs => ['var1', 'var2']
    push @print_exprs, @{$kwargs{print_exprs}}
        if (defined($kwargs{print_exprs}));

    my @pids;
    if (defined($kwargs{pids}))
    {
        if (ref($kwargs{pids}) eq 'ARRAY')
        {
            # arrayref pid-list
            @pids = @{$kwargs{pids}};
        }
        elsif (ref($kwargs{pids}) eq '')
        {
            # Scalar pid-list
            @pids = split(qr/[\r\n]/, $kwargs{pids});
        }
        else
        {
            die("keyword argument 'pids' must be undef, an arrayref,
Logical decoding without slots: decoding in lockstep with recovery
Hi all

I want to share an idea I've looked at a few times. I've run into situations where logical slots were inadvertently dropped, or where it became necessary to decode changes in the past on a slot.

As most of you will know, you can't just create a logical slot in the past. Even if it were permitted, it'd be unsafe due to catalog_xmin retention requirements and missing WAL.

But if we can arrange for a physical replica to replay the WAL of interest and decode each commit as soon as it's replayed by the startup process, we know the needed catalog rows must all exist, so it's safe to decode the change. So it should be feasible to run logical decoding on a standby, even without a replication slot, so long as we:

* pause the startup process after each xl_xact_commit
* wake the walsender running logical decoding
* decode and process until ReorderBufferCommit for the just-committed xact returns
* wake the startup process to replay up to the next commit

Can anyone see any obvious problem with this? I don't think the potential issues with WAL commit visibility order vs shmem commit visibility order should be a concern.

I see this as potentially useful in data recovery, where you might want to be able to extract a change stream for a subset of tables from PITR recovery, for example. Also for audit use.
Re: Preventing hangups in bgworker start/stop during DB shutdown
On Wed, 23 Dec 2020 at 05:40, Tom Lane wrote:

> Here's an attempt at closing the race condition discussed in [1]
> (and in some earlier threads, though I'm too lazy to find them).
>
> The core problem is that the bgworker management APIs were designed
> without any thought for exception conditions, notably "we're not
> gonna launch any more workers because we're shutting down the database".
> A process waiting for a worker in WaitForBackgroundWorkerStartup or
> WaitForBackgroundWorkerShutdown will wait forever, so that the database
> fails to shut down without manual intervention.
>
> I'd supposed that we would need some incompatible changes in those APIs
> in order to fix this, but after further study it seems like we could
> hack things by just acting as though a request that won't be serviced
> has already run to completion. I'm not terribly satisfied with that
> as a long-term solution --- it seems to me that callers should be told
> that there was a failure. But this looks to be enough to solve the
> lockup condition for existing callers, and it seems like it'd be okay
> to backpatch.
>
> Thoughts?

Callers that launch bgworkers already have to cope with conditions such as the worker failing immediately after launch, or before attaching to the shmem segment used for worker management by whatever extension is launching it. So I think it's reasonable to lie and say we launched it; the caller must already cope with this case to behave correctly.

Patch specifics:

> This function should only be called from the postmaster

It'd be good to Assert(IsPostmasterEnvironment && !IsUnderPostmaster) in these functions.

Otherwise, at first read, the patch and rationale look sensible to me.

(When it comes to the bgw APIs in general I have a laundry list of things I'd like to change or improve around signal handling, error trapping and recovery, and lots more, but that's for another thread.)
[PATCH] Identify LWLocks in tracepoints
Hi all The attached patch set follows on from the discussion in [1] "Add LWLock blocker(s) information" by adding the actual LWLock* and the numeric tranche ID to each LWLock related TRACE_POSTGRESQL_foo tracepoint. This does not provide complete information on blockers, because it's not necessarily valid to compare any two LWLock* pointers between two process address spaces. The locks could be in DSM segments, and those DSM segments could be mapped at different addresses. I wasn't able to work out a sensible way to map a LWLock* to any sort of (tranche-id, lock-index) because there's no requirement that locks in a tranche be contiguous or known individually to the lmgr. Despite that, the patches improve the information available for LWLock analysis significantly. Patch 1 fixes a bogus tracepoint where an lwlock__acquire event would be fired from LWLockWaitForVar, despite that function never actually acquiring the lock. Patch 2 adds the tranche id and lock pointer for each trace hit. This makes it possible to differentiate between individual locks within a tranche, and (so long as they aren't tranches in a DSM segment) compare locks between processes. That means you can do lock-order analysis etc, which was not previously especially feasible. Traces also don't have to do userspace reads for the tranche name all the time, so the trace can run with lower overhead. Patch 3 adds a single-path tracepoint for all lock acquires and releases, so you only have to probe the lwlock__acquired and lwlock__release events to see all acquires/releases, whether conditional or otherwise. It also adds start markers that can be used for timing wallclock duration of LWLock acquires/releases. Patch 4 adds some comments on LWLock tranches to try to address some points I found confusing and hard to understand when investigating this topic. [1] https://www.postgresql.org/message-id/CAGRY4nz%3DSEs3qc1R6xD3max7sg3kS-L81eJk2aLUWSQAeAFJTA%40mail.gmail.com . 
From 583c818e3121c0f7c6506b434497c81ae94ee9cb Mon Sep 17 00:00:00 2001
From: Craig Ringer
Date: Thu, 19 Nov 2020 17:30:47 +0800
Subject: [PATCH v1 4/4] Comments on LWLock tranches

---
 src/backend/storage/lmgr/lwlock.c | 49 +--
 1 file changed, 46 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index cfdfa7f328..123bcc463e 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -112,11 +112,14 @@ extern slock_t *ShmemLock;
  *
  * 1. The individually-named locks defined in lwlocknames.h each have their
  * own tranche. The names of these tranches appear in IndividualLWLockNames[]
- * in lwlocknames.c.
+ * in lwlocknames.c. The LWLock structs are allocated in MainLWLockArray.
  *
  * 2. There are some predefined tranches for built-in groups of locks.
  * These are listed in enum BuiltinTrancheIds in lwlock.h, and their names
- * appear in BuiltinTrancheNames[] below.
+ * appear in BuiltinTrancheNames[] below. The LWLock structs are allocated
+ * elsewhere under the control of the subsystem that manages the tranche. The
+ * LWLock code does not know or care where in shared memory they are allocated
+ * or how many there are in a tranche.
  *
  * 3. Extensions can create new tranches, via either RequestNamedLWLockTranche
  * or LWLockRegisterTranche. The names of these that are known in the current
@@ -196,6 +199,13 @@ static int LWLockTrancheNamesAllocated = 0;
  * This points to the main array of LWLocks in shared memory. Backends inherit
  * the pointer by fork from the postmaster (except in the EXEC_BACKEND case,
  * where we have special measures to pass it down).
+ *
+ * This array holds individual LWLocks and LWLocks allocated in named tranches.
+ *
+ * It does not hold locks for any LWLock that's separately initialized with
+ * LWLockInitialize(). Locks in tranches listed in BuiltinTrancheIds or
+ * allocated with LWLockNewTrancheId() can be embedded in other structs
+ * anywhere in shared memory.
  */
 LWLockPadded *MainLWLockArray = NULL;

@@ -593,6 +603,12 @@ InitLWLockAccess(void)
  * Caller needs to retrieve the requested number of LWLocks starting from
  * the base lock address returned by this API. This can be used for
  * tranches that are requested by using RequestNamedLWLockTranche() API.
+ *
+ * The locks are already initialized.
+ *
+ * This function can not be used for locks in builtin tranches or tranches
+ * registered with LWLockRegisterTranche(). There is no way to look those locks
+ * up by name.
  */
 LWLockPadded *
 GetNamedLWLockTranche(const char *tranche_name)

@@ -647,6 +663,14 @@ LWLockNewTrancheId(void)
  *
  * The tranche name will be user-visible as a wait event name, so try to
  * use a name that fits the style for those.
+ *
+ * The tranche ID should be a user-defined tranche ID acquired from
+ * LWLockNewTrancheId(). It is not necessary to
Re: [PATCH] Runtime control of CLOBBER_CACHE_ALWAYS
On Thu, 3 Dec 2020 at 15:53, Craig Ringer wrote: > > On Tue, 27 Oct 2020 at 16:34, Peter Eisentraut < > peter.eisentr...@2ndquadrant.com> wrote: > >> On 2020-09-25 09:40, Craig Ringer wrote: >> > While working on extensions I've often wanted to enable cache >> clobbering >> > for a targeted piece of code, without paying the price of running it >> for >> > all backends during postgres startup and any initial setup tasks that >> > are required. >> > >> > So here's a patch that, when CLOBBER_CACHE_ALWAYS or >> > CLOBBER_CACHE_RECURSIVE are defined, adds a GUC named >> > clobber_cache_depth . It defaults to 1 for CLOBBER_CACHE_ALWAYS or 3 >> for >> > CLOBBER_CACHE_RECURSIVE to match the existing compiled-in behaviour. >> But >> > with this change it's now possible to run Pg with clobber_cache_depth=0 >> > then set it to 1 only for targeted tests. >> >> I think it would be great if the cache clobber code is always compiled >> in (but turned off) when building with assertions. The would reduce the >> number of hoops to jump through to actually use this locally and reduce >> the number of surprises from the build farm. > > > Updated per your comments Peter, see mail immediately up-thread. Moved to this CF as https://commitfest.postgresql.org/31/2749/ .
Re: Logical archiving
On Tue, 8 Dec 2020 at 17:54, Andrey Borodin wrote:

> > In pglogical3 we already support streaming decoded WAL data to
> > alternative writer downstreams including RabbitMQ and Kafka via writer
> > plugins.
>
> Yes, Yandex.Cloud Transfer Manager supports it too. But it has to be
> resynced after physical failover. And internal installations of YC have
> mandatory drills: a few times a month one datacenter is disconnected and
> failover happens for thousands of DBs.

You'll want to look at https://wiki.postgresql.org/wiki/Logical_replication_and_physical_standby_failover#All-logical-replication_HA then.
Re: Blocking I/O, async I/O and io_uring
On Tue, 8 Dec 2020 at 15:04, Andres Freund wrote: > Hi, > > On 2020-12-08 04:24:44 +, tsunakawa.ta...@fujitsu.com wrote: > > I'm looking forward to this from the async+direct I/O, since the > > throughput of some write-heavy workload decreased by half or more > > during checkpointing (due to fsync?) > > Depends on why that is. The most common, I think, cause is that your WAL > volume increases drastically just after a checkpoint starts, because > initially all page modification will trigger full-page writes. There's > a significant slowdown even if you prevent the checkpointer from doing > *any* writes at that point. I got the WAL AIO stuff to the point that I > see a good bit of speedup at high WAL volumes, and I see it helping in > this scenario. > > There's of course also the issue that checkpoint writes cause other IO > (including WAL writes) to slow down and, importantly, cause a lot of > jitter leading to unpredictable latencies. I've seen some good and some > bad results around this with the patch, but there's a bunch of TODOs to > resolve before delving deeper really makes sense (the IO depth control > is not good enough right now). > > A third issue is that sometimes checkpointer can't really keep up - and > that I think I've seen pretty clearly addressed by the patch. I have > managed to get to ~80% of my NVMe disks top write speed (> 2.5GB/s) by > the checkpointer, and I think I know what to do for the remainder. > > Thanks for explaining this. I'm really glad you're looking into it. If I get the chance I'd like to try to apply some wait-analysis and blocking stats tooling to it. I'll report back if I make any progress there.
Re: Blocking I/O, async I/O and io_uring
On Tue, 8 Dec 2020 at 12:02, Andres Freund wrote:

> Hi,
>
> On 2020-12-08 10:55:37 +0800, Craig Ringer wrote:
> > A new kernel API called io_uring has recently come to my attention. I
> > assume some of you (Andres?) have been following it for a while.
>
> Yea, I've spent a *lot* of time working on AIO support, utilizing
> io_uring. Recently Thomas also joined in the fun. I've given two talks
> referencing it (last pgcon, last pgday brussels), but otherwise I've not
> yet written much about it. Things aren't *quite* right yet architecturally,
> but I think we're getting there.

That's wonderful. Thank you. I'm badly behind on the conference circuit due to geographic isolation and small children. I'll hunt up your talks.

> The current state is at https://github.com/anarazel/postgres/tree/aio
> (but it's not a very clean history at the moment).

Fantastic!

Have you done much bpf / systemtap / perf based work on measurement and tracing of latencies etc? If not, that's something I'd be keen to help with. I've mostly been using systemtap so far but I'm trying to pivot over to bpf.

I hope to submit a big tracepoints patch set for PostgreSQL soon to better expose our wait points and latencies, improve visibility of blocking, and help make activity traceable through all the stages of processing. I'll Cc you when I do.

> > io_uring appears to offer a way to make system calls including reads,
> > writes, fsync()s, and more in a non-blocking, batched and pipelined
> > manner, with or without O_DIRECT. Basically async I/O with usable
> > buffered I/O and fsync support. It has ordering support which is really
> > important for us.
>
> My results indicate that we really want to have, optional & not
> enabled by default of course, O_DIRECT support. We just can't benefit
> fully of modern SSDs otherwise. Buffered is also important, of course.

Even more so for NVDRAM, Optane and all that, where zero-copy and low context switches become important too.
We're a long way from that being a priority, but it's still not to be dismissed.

> I'm pretty sure that I've got the basics of this working pretty well. I
> don't think the executor architecture is as big an issue as you seem to
> think. There are further benefits that could be unlocked if we had a
> more flexible executor model (imagine switching between different parts
> of the query whenever blocked on IO - can't do that due to the stack
> right now).

Yep, that's what I'm talking about being an issue. Blocked on an index read? Move on to the next tuple and come back when the index read is done.

I really like what I see of the io_uring architecture so far. It's ideal for callback-based event-driven flow control. But that doesn't fit postgres well for the executor. It's better for redo etc.

> The way it currently works is that things like sequential scans, vacuum,
> etc use a prefetching helper which will try to use AIO to read ahead of
> the next needed block. That helper uses callbacks to determine the next
> needed block, which e.g. vacuum uses to skip over all-visible/frozen
> blocks. There's plenty other places that should use that helper, but we
> already can get considerably higher throughput for seqscans, vacuum on
> both very fast local storage, and high-latency cloud storage.
>
> Similarly, for writes there's a small helper to manage a write-queue of
> configurable depth, which currently is used by checkpointer and
> bgwriter (but should be used in more places). Especially with direct IO
> checkpointing can be a lot faster *and* less impactful on the "regular"
> load.

Sure sounds like a useful interim step. That's great.

> I've got asynchronous writing of WAL mostly working, but need to
> redesign the locking a bit further. Right now it's a win in some cases,
> but not others. The latter to a significant degree due to unnecessary
> blocking.

That's where io_uring's I/O ordering operations looked interesting.
But I haven't looked closely enough to see if they're going to help us with I/O ordering in a multiprocessing architecture like postgres. In an ideal world we could tell the kernel about WAL-to-heap I/O dependencies and even let it apply WAL then heap changes out-of-order so long as they didn't violate any ordering constraints we specify between particular WAL records or between WAL writes and their corresponding heap blocks. But I don't know if the io_uring interface is that capable. I did some basic experiments a while ago with using write barriers between WAL records and heap writes instead of fsync()ing, but as you note, the increased blocking and reduction in the ke
Re: Blocking I/O, async I/O and io_uring
References to get things started:

* https://lwn.net/Articles/810414/
* https://unixism.net/loti/what_is_io_uring.html
* https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
* https://thenewstack.io/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/

You'll probably notice how this parallels my sporadic activities around pipelining in other areas, and the PoC libpq pipelining patch I sent in a few years ago.
Blocking I/O, async I/O and io_uring
Hi all

A new kernel API called io_uring has recently come to my attention. I assume some of you (Andres?) have been following it for a while.

io_uring appears to offer a way to make system calls including reads, writes, fsync()s, and more in a non-blocking, batched and pipelined manner, with or without O_DIRECT. Basically async I/O with usable buffered I/O and fsync support. It has ordering support, which is really important for us. This should be on our radar.

The main barriers to benefiting from linux-aio based async I/O in postgres in the past have been its reliance on direct I/O, its various kernel-version quirks, platform portability, and its maybe-async-except-when-it's-randomly-not nature. The kernel version and portability concerns remain an issue with io_uring, so it's not like this is something we can pivot over to completely. But we should probably take a closer look at it.

PostgreSQL spends a huge amount of time waiting, doing nothing, for blocking I/O. If we can improve that then we could potentially realize some major increases in I/O utilization, especially for bigger, less concurrent workloads. The most obvious candidates to benefit would be redo, logical apply, and bulk loading.

But I have no idea how to even begin to fit this into PostgreSQL's executor pipeline. Almost all PostgreSQL's code is synchronous-blocking-imperative in nature, with a push/pull executor pipeline. It seems to have been recognised for some time that this is increasingly hurting our performance and scalability as platforms become more and more parallel.

To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc) we have to be able to dispatch I/O and do something else while we wait for the results. So we need the ability to pipeline the executor and pipeline redo. I thought I'd start the discussion on this and see where we can go with it. What incremental steps can be done to move us toward parallelisable I/O without having to redesign everything?
I'm thinking that redo is probably a good first candidate. It doesn't depend on the guts of the executor. It is much less sensitive to ordering between operations in shmem and on disk since it runs in the startup process. And it hurts REALLY BADLY from its single-threaded blocking approach to I/O - as shown by an extension written by 2ndQuadrant that can double redo performance by doing read-ahead on btree pages that will soon be needed. Thoughts anybody?
Re: POC: Better infrastructure for automated testing of concurrency issues
On Wed, 25 Nov 2020 at 22:11, Alexander Korotkov wrote: > Hackers, > > PostgreSQL is a complex multi-process system, and we are periodically > faced with complicated concurrency issues. While the postgres community > does a great job on investigating and fixing the problems, our ability to > reproduce concurrency issues in the source code test suites is limited. > > I think we currently have two general ways to reproduce the concurrency > issues. > 1. A text scenario for manual reproduction of the issue, which could > involve psql sessions, gdb sessions etc. Couple of examples are [1] and > [2]. This method provides reliable reproduction of concurrency issues. But > it's hard to automate, because it requires external instrumentation > (debugger) and it's not stable in terms of postgres code changes (that is > particular line numbers for breakpoints could be changed). I think this is > why we currently don't have such scenarios among postgres test suites. > 2. Another way is to reproduce the concurrency issue without directly > touching the database internals using pgbench or other way to simulate the > workload (see [3] for example). This way is easier to automate, because it > doesn't need external instrumentation and it's not so sensitive to source > code changes. But at the same time this way is not reliable and is > resource-consuming. > Agreed. For a useful but limited set of cases there's (3) the isolation tester and pg_isolation_regress. But IIRC the patches to teach it to support multiple upstream nodes never got in, so it's essentially useless for any replication related testing. There's also (4), write a TAP test that uses concurrent psql sessions via IPC::Run. Then play games with heavyweight or advisory lock waits to order events, use instance starts/stops, change ports or connstrings to simulate network issues, use SIGSTOP/SIGCONTs, add src/test/modules extensions that inject faults or provide custom blocking wait functions for the event you want, etc. 
I've done that more than I'd care to, and I don't want to do it any more than I have to in future. In some cases I've gone further and written tests that use systemtap in "guru" mode (read/write, with embedded C enabled) to twiddle the memory of the target process(es) when a probe is hit, e.g. to modify a function argument or return value or inject a fault. Not exactly portable or convenient, though very powerful. In the view of above, I'd like to propose a POC patch, which implements new > builtin infrastructure for reproduction of concurrency issues in automated > test suites. The general idea is so-called "stop events", which are > special places in the code, where the execution could be stopped on some > condition. Stop event also exposes a set of parameters, encapsulated into > jsonb value. The condition over stop event parameters is defined using > jsonpath language. > The patched PostgreSQL used by 2ndQuadrant internally has a feature called PROBE_POINT()s that is somewhat akin to this. Since it's not a customer facing feature I'm sure I can discuss it here, though I'll need to seek permission before I can show code. TL;DR: PROBE_POINT()s let you inject ERRORs, sleeps, crashes, and various other behaviour at points in the code marked by name, using GUCs, hooks loaded from test extensions, or even systemtap scripts to control what fires and when. Performance impact is essentially zero when no probes are currently enabled at runtime, so they're fine for cassert builds. Details: A PROBE_POINT() is a macro that works as a marker, a bit like a TRACE_POSTGRESQL_ dtrace macro. But instead of the super lightweight tracepoint that SDT marker points emit, a PROBE_POINT tests an unlikely(probe_points_enabled) flag, and if true, it prepares arguments for the probe handler: A probe name, probe action, sleep duration, and a hit counter. The default probe action and sleep duration come from GUCs. 
So your control of the probe is limited to the granularity at which you can easily manage GUCs. That's often sufficient. But if you want finer control for something, there are two ways to achieve it.

After picking the default arguments to the handler, the probe point checks for a hook. If defined, it calls it with the probe point name and pointers to the action and sleep duration values, so the hook function can modify them per probe-point hit. That way you can implement the hook in src/test/modules extensions or your own test extensions, taking the probe point name as an argument and the action and sleep duration as out-params, and consulting any accessible global state, custom GUCs you define in your test extension, etc. That's usually enough to target a probe very specifically, but it's a bit of a hassle.

Another option is to use a systemtap script. You can write your code in systemtap with its language. When the systemtap marker for a probe point event fires, decide if it's the one you want and twiddle the target process variables that store the probe acti
Re: Single transaction in the tablesync worker?
On Mon, 7 Dec 2020 at 11:44, Peter Smith wrote: > > Basically, I was wondering why can't the "tablesync" worker just > gather messages in a similar way to how the current streaming feature > gathers messages into a "changes" file, so that they can be replayed > later. > > See the related thread "Logical archiving" https://www.postgresql.org/message-id/20d9328b-a189-43d1-80e2-eb25b9284...@yandex-team.ru where I addressed some parts of this topic in detail earlier today. A) The "tablesync" worker (after the COPY) does not ever apply any of > the incoming messages, but instead it just gobbles them into a > "changes" file until it decides it has reached SYNCDONE state and > exits. > This has a few issues. Most importantly, the sync worker must cooperate with the main apply worker to achieve a consistent end-of-sync cutover. The sync worker must have replayed the pending changes in order to make this cut-over, because the non-sync apply worker will need to start applying changes on top of the resync'd table potentially as soon as the next transaction it starts applying, so it needs to see the rows there. Doing this would also add another round of write multiplication since the data would get spooled then applied to WAL then heap. Write multiplication is already an issue for logical replication so adding to it isn't particularly desirable without a really compelling reason. With the write multiplication comes disk space management issues for big transactions as well as the obvious performance/throughput impact. It adds even more latency between upstream commit and downstream apply, something that is again already an issue for logical replication. Right now we don't have any concept of a durable and locally flushed spool. It's not impossible to do as you suggest but the cutover requirement makes it far from simple. 
As discussed in the logical archiving thread I think it'd be good to have something like this, and there are times the write multiplication price would be well worth paying. But it's not easy. B) Then, when the "apply" worker proceeds, if it detects the existence > of the "changes" file it will replay/apply_dispatch all those gobbled > messages before just continuing as normal. > That's going to introduce a really big stall in the apply worker's progress in many cases. During that time it won't be receiving from upstream (since we don't spool logical changes to disk at this time) so the upstream lag will grow. That will impact synchronous replication, pg_wal size management, catalog bloat, etc. It'll also leave the upstream logical decoding session idle, so when it resumes it may create a spike of I/O and CPU load as it catches up, as well as a spike of network traffic. And depending on how close the upstream write rate is to the max decode speed, network throughput max, and downstream apply speed max, it may take some time to catch up over the resulting lag. Not a big fan of that approach.
RFC: Deadlock detector hooks for edge injection
Hi all

Related to my other post about deadlock detector hooks for victim selection, I'd like to gauge opinions here about whether it's feasible to inject edges into the deadlock detector's waits-for graph. Doing so would help with detecting deadlocks relating to shm_mq waits, help with implementing distributed deadlock detection solutions, make it possible to spot deadlocks relating to condition-variable waits, etc.

I'm not sure quite how the implementation would look yet; this is an early RFC and sanity check so I don't invest any work into it if it has no hope of going anywhere.

I expect we'd want to build the graph only when the detector is triggered, rather than proactively maintain such edges, so the code implementing the hook would be responsible for keeping track of whatever state it needs in order to do so. When called, it'd append "struct EDGE"s to the deadlock detector's waits-for list. We'd need to support a node representation other than a LOCK* for struct EDGE, and to abstract edge sorting (TopoSort etc) to allow for other edge types. So it wouldn't be a trivial change to make, hence opening with this RFC.

I expect it'd be fine to require each EDGE* to have a PGPROC, and to require that the PGPROCs for waits-for and waiting-for not be the same proc. Distributed systems that use libpq connections to remote nodes, or anything else, would have to register the local-side PGPROC as the involved waiter or waited-on object, and handle any mapping between the remote object and the local resource holder/acquirer themselves, probably using their own shmem state.

Bonus points if the callback could assign weights to the injected edges to bias victim selection more gently, or offer a way to tag a waited-for node as not being a candidate victim for cancellation.

General thoughts?
RFC: Deadlock detector hooks for victim selection and edge injection
Hi folks

Now that we're well on track for streaming logical decoding, it's becoming more practical to look at parallel logical apply. The support for this in pglogical3 benefits from a deadlock detector hook. It was added in the optionally patched postgres that pglogical can use to enable various extra features that weren't possible without core changes, but it isn't present in community postgres yet. I'd like to add it.

The main benefit is that it lets the logical replication support tell the deadlock detector that it should prefer to kill the victim whose txn has the higher upstream commit lsn. That helps encourage parallel logical apply to make progress in the face of deadlocks between concurrent txns.

Any in-principle objections?
RFC: Giving bgworkers walsender-like grace during shutdown (for logical replication)
Hi folks

TL;DR: Does anyone object to a new bgworker flag that exempts background workers (such as logical apply workers) from the first round of fast shutdown signals, and instead lets them finish their currently in-progress txn and exit?

This is a change proposal raised for comment before patch submission, so please consider it. An explanation of why I think we need it comes first, then the proposed implementation.

Rationale:

Currently a fast shutdown causes logical replication subscribers to abort their currently in-progress transaction and terminate along with user backends. This means they cannot finish receiving and flushing the currently in-progress transaction, possibly wasting a very large amount of work. After restart the subscriber must reconnect, decode and reorder-buffer from the restart_lsn up to the current confirmed_flush_lsn, receive the whole txn on the wire all over again, and apply the whole txn again locally. We don't currently spool received txn change-streams to disk on the subscriber and flush them, so we can't repeat just the local apply part (see the related thread "Logical archiving" for relevant discussion). This can create a lot of bloat, a lot of excess WAL, etc., if a big txn was in progress at the time.

I'd like to add a bgworker flag that tells the postmaster to treat the logical apply bgworker (or extension equivalents) somewhat like a walsender for the purpose of fast shutdown. Instead of immediately terminating it like user backends on fast shutdown, the bgworker should be sent a ProcSignal warning that shutdown is pending and instructing it to finish receiving and applying its current transaction, then exit gracefully.

It's not quite the same as the walsender case, since there we try to flush changes to downstreams up to the end of the last commit before shutting down. That doesn't make sense on a subscriber because the upstream is likely still generating txns. We just want to avoid wasting our effort on any in-flight txn.

Any objections?
Proposed implementation:

* Add a new bgworker flag like BGW_DELAYED_SHUTDOWN.

* Define a new ProcSignal PROCSIG_SHUTDOWN_REQUESTED. On fast shutdown, send this instead of a SIGTERM to bgworker backends flagged BGW_DELAYED_SHUTDOWN. On smart shutdown, send it to all backends when the shutdown request arrives, since that could be handy for other uses too.

* A flagged bgworker is expected to finish its current txn and exit promptly. Impose a grace period after which it gets SIGTERM'd anyway. Also send a SIGTERM if the postmaster receives a second fast shutdown request.

* Defer sending PROCSIG_WALSND_INIT_STOPPING to walsenders until all BGW_DELAYED_SHUTDOWN-flagged bgworkers have exited, so we can ensure that cascaded downstreams receive any txns applied from the upstream.

This doesn't look likely to be particularly complicated to implement.

It might be better to use a flag in PGPROC rather than in the bgworker struct, in case we want to extend this to other backend types in future, and to make it easier for the postmaster to check the flag during shutdown. We could just claim a bit from statusFlags for the purpose.

Thoughts?
Re: Logical archiving
Actually CC'd Petr this time.

On Mon, 7 Dec 2020 at 11:05, Craig Ringer wrote:

> Reply follows inline. I addressed your last point first, so it's out of
> order.
>
> On Fri, 4 Dec 2020 at 15:33, Andrey Borodin wrote:
>
> > If OLAP cannot consume data fast enough - we are out of space due to
> > repl slot.
>
> There is a much simpler solution to this than logical PITR.
>
> What we should be doing is teaching xlogreader how to invoke the
> restore_command to fetch archived WALs for decoding.
>
> Replication slots already have a WAL retention limit, but right now when
> that limit is reached the slot is invalidated and becomes useless; it's
> effectively dropped. Instead, if WAL archiving is enabled, we should leave
> the slot as valid. If a consumer of the slot needs WAL that no longer
> exists in pg_wal, we should have the walsender invoke the restore_command
> to read the missing WAL segment, decode it, and remove it again.
>
> This would not be a technically difficult patch, and it's IMO one of the
> more important ones for improving logical replication.
>
> > I was discussing problems of CDC with scientific community and they
> > asked this simple question: "So you have efficient WAL archive on a very
> > cheap storage, why don't you have a logical archive too?"
>
> I've done work in this area, as has Petr (CC'd).
>
> In short, logical archiving and PITR is very much desirable, but we're not
> nearly ready for it yet and we're missing a lot of the foundations needed
> to make it really useful.
>
> IMO the strongest pre-requisite is that we need integrated DDL capture and
> replication in Pg. While this could be implemented in the
> publisher/subscriber logic for logical replication, it would make much more
> sense (IMO) to make it easier to feed DDL events into any logical
> replication output plugin.
>
> pglogical3 (the closed one) has quite comprehensive DDL replication
> support.
> Doing it is not simple though - there are plenty of complexities:
>
> * Reliably identifying the target objects and mapping them to replication
>   set memberships for DML-replication
> * Capturing, replicating and managing the search_path and other DDL
>   execution context (DateStyle and much more) reliably
> * Each statement type needs specific logic to indicate whether it
>   needs DDL replication (and often filter functions, since we have lots of
>   sub-types where some need replication and some don't)
> * Handling DDL affecting global objects in pg_global correctly, like
>   those affecting roles, grants, database security labels etc. There's no
>   one right answer for this; it depends on the deployment and requires the
>   user to cooperate.
> * Correct handling of transactions that mix DDL and DML (mostly only
>   an issue for multimaster).
> * Identifying statements that target a mix of replicated and
>   non-replicated objects and handling them appropriately, including for
>   CASCADEs
> * Gracefully handling DDL statements that mix TEMPORARY and persistent
>   targets. We can do this OK for DROPs, but it still requires care. Anything
>   else gets messier.
> * The lack of hooks into table rewrite operations, and the extremely clumsy
>   and inefficient way logical decoding currently exposes the temp-table
>   data during decoding of rewrites, make handling table-rewriting DDL
>   difficult and impractical to do correctly. In pglogical we punt on
>   it entirely and refuse to permit DDL that would rewrite a table, except
>   where we can prove it's reliant only on immutable inputs so we can discard
>   the upstream rewrite and rely on statement replication.
> * As a consequence of the above, reliably determining whether a given
>   statement will cause a table rewrite.
> * Handling re-entrant ProcessUtility_hook calls for ALTER TABLE etc.
> * Handling TRUNCATE's pseudo-DDL pseudo-DML halfway state, and doing
>   something sensible for truncate cascade.
> * Probably more I've forgotten
>
> If we don't handle these, then any logical change-log archives will become
> largely useless as soon as there's any schema change.
>
> So we kind of have to solve DDL replication first, IMO.
>
> Some consideration is also required for metadata management. Right now
> relation and type metadata has session lifetime, but you'd want to be able
> to discard old logical change-stream archives and have the later ones still
> be usable. So we'd need to define some kind of restartpoint where we repeat
> the metadata, or we'd have to support externalizing the metadata so it can
> be retained when the main change a