Re: Incorrect snapshots while promoting hot standby node when 2PC is used
> On 3 May 2021, at 23:10, Andres Freund wrote:
>
> Hi,
>
> On 2021-05-01 17:35:09 +0500, Andrey Borodin wrote:
>> I'm investigating a somewhat resemblant case.
>> We have an OLTP sharded installation where shards are almost always under
>> rebalancing. Data movement is implemented with 2PC. Switchover happens
>> quite often due to datacenter drills. The installation is running on
>> PostgreSQL 12.6.
>
> If you still have the data it would be useful if you could check if the
> LSNs of the corrupted pages are LSNs from shortly after standby
> promotion/switchover?

That's a neat idea; I'll check if I can restore a backup with the corruptions. I have a test cluster with corruptions, but it has undergone tens of switchovers...

>> Or, perhaps, it looks more like a hardware problem? data_checksums are
>> on, but a few years ago we observed SSD firmware that was losing
>> updates, yet passing checksums. I'm sure that we would benefit from
>> having a separate relation fork for checksums or LSNs.
>
> Right - checksums are "page local". They can only detect if a page is
> corrupted, not if e.g. an older version of the page (with a correct
> checksum) has been restored. While there are schemes with stronger
> error detection properties, they do come with substantial overhead (at
> least the ones I can think of right now).

We can have a PTRACK-like fork with page LSNs. It can be flushed on checkpoint and restored from WAL on crash. So we can always detect a stale page version. Like "LSN-track" :) We would get much faster rewind and delta-backups for free. Though I don't think it's worth the effort until we at least checksum the CLOG.

Thanks!

Best regards, Andrey Borodin.
Re: Replication slot stats misgivings
On Tue, May 4, 2021 at 9:48 AM Masahiko Sawada wrote: > > On Mon, May 3, 2021 at 10:21 PM Amit Kapila wrote: > > > > On Mon, May 3, 2021 at 5:48 PM Masahiko Sawada > > wrote: > > > > > > On Mon, May 3, 2021 at 2:27 PM Amit Kapila > > > wrote: > > > > > > > > On Thu, Apr 29, 2021 at 10:37 AM Amit Kapila > > > > wrote: > > > > > > > > > > On Wed, Apr 28, 2021 at 7:43 PM Masahiko Sawada > > > > > wrote: > > > > > > > > > > > > On Wed, Apr 28, 2021 at 3:25 PM Amit Kapila > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > I am not sure if any of these alternatives are a good idea. What > > > > > > > do > > > > > > > you think? Do you have any other ideas for this? > > > > > > > > > > > > I've been considering some ideas but don't come up with a good one > > > > > > yet. It’s just an idea and not tested but how about having > > > > > > CreateDecodingContext() register before_shmem_exit() callback with > > > > > > the > > > > > > decoding context to ensure that we send slot stats even on > > > > > > interruption. And FreeDecodingContext() cancels the callback. > > > > > > > > > > > > > > > > Is it a good idea to send stats while exiting and rely on the same? I > > > > > think before_shmem_exit is mostly used for the cleanup purpose so not > > > > > sure if we can rely on it for this purpose. I think we can't be sure > > > > > that in all cases we will send all stats, so maybe Vignesh's patch is > > > > > sufficient to cover the cases where we avoid losing it in cases where > > > > > we would have sent a large amount of data. > > > > > > > > > > > > > Sawada-San, any thoughts on this point? > > > > > > before_shmem_exit is mostly used to the cleanup purpose but how about > > > on_shmem_exit()? pgstats relies on that to send stats at the > > > interruption. See pgstat_shutdown_hook(). > > > > > > > Yeah, that is worth trying. Would you like to give it a try? > > Yes. 
>
> In this approach, I think we will need to have a static pointer in
> logical.c pointing to the LogicalDecodingContext that we're using. At
> StartupDecodingContext(), we set the pointer to the just created
> LogicalDecodingContext and register the callback so that we can refer
> to the LogicalDecodingContext in that callback. And at
> FreeDecodingContext(), we reset the pointer to NULL (however, since
> FreeDecodingContext() is not called when an error happens we would
> need to ensure resetting it somehow). But, after more thought, if we
> have the static pointer in logical.c it would rather be better to have
> a global function that sends slot stats based on the
> LogicalDecodingContext pointed to by the static pointer and can be
> called by ReplicationSlotRelease(). That way, we don't need to worry
> about erroring-out cases as well as interruption cases, although we
> need to have a new static pointer.
>
> I've attached a quick-hacked patch. I also incorporated the change
> that calls UpdateDecodingStats() at FreeDecodingContext() so that we
> can send slot stats also in the case where we spilled/streamed changes
> but finished without a commit/abort/prepare record.
>
> > I think
> > it still might not cover the cases where we error out in the backend
> > while decoding via APIs because at that time we won't exit, maybe for
> > that we can consider Vignesh's patch.
>
> Agreed. It seems to me that the approach of the attached patch is
> better than the approach using on_shmem_exit(). So if we want to avoid
> having the new static pointer and function for this purpose we can
> consider Vignesh's patch.

I'm OK with using either my patch or Sawada-san's patch. I had the same thought about whether we should have a static variable, as pointed out by Sawada-san.
Apart from that, I had one minor comment: this comment needs to be corrected ("andu sed to sent"):

+/*
+ * Pointing to the currently-used logical decoding context andu sed to sent
+ * slot statistics on releasing slots.
+ */
+static LogicalDecodingContext *MyLogicalDecodingContext = NULL;
+

Regards,
Vignesh
Re: AlterSubscription_refresh "wrconn" wrong variable?
On Tue, May 4, 2021 at 2:31 PM Andres Freund wrote: > > Hi, > > On 2021-05-04 09:29:42 +1000, Peter Smith wrote: > > While reviewing some logical replication code I stumbled across a > > variable usage that looks suspicious to me. > > > Note that the AlterSubscription_refresh function (unlike other > > functions in the subscriptioncmds.c) is using the global variable > > "wrconn" instead of a local stack variable of the same name. I was > > unable to think of any good reason why it would be deliberately doing > > this, so my guess is that it is simply an accidental mistake that has > > gone unnoticed because the compiler was silently equally happy just > > using the global var. > > > Apparently, this is not causing any reported problems because it seems > > like the code has been this way for ~4 years [1]. > > This sounded vaguely familiar. After a bit of searching I found that's > because I debugged a crash related to it: > https://www.postgresql.org/message-id/2020215820.qihhrz7fayu6myfi%40alap3.anarazel.de > Oh! No wonder it sounded familiar. It looks like I've just re-discovered the identical problem 5 months after your post. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: AlterSubscription_refresh "wrconn" wrong variable?
On Tue, May 4, 2021 at 1:56 PM Bharath Rupireddy wrote: > > On Tue, May 4, 2021 at 5:00 AM Peter Smith wrote: > > > > While reviewing some logical replication code I stumbled across a > > variable usage that looks suspicious to me. > > > > Note that the AlterSubscription_refresh function (unlike other > > functions in the subscriptioncmds.c) is using the global variable > > "wrconn" instead of a local stack variable of the same name. I was > > unable to think of any good reason why it would be deliberately doing > > this, so my guess is that it is simply an accidental mistake that has > > gone unnoticed because the compiler was silently equally happy just > > using the global var. > > > > Apparently, this is not causing any reported problems because it seems > > like the code has been this way for ~4 years [1]. > > > > Even so, it doesn't look intentional to me and I felt that there may > > be unknown consequences (e.g. resource leakage?) of just blatting over > > the global var. So, PSA a small patch to make this > > AlterSubscription_refresh function use a stack variable consistent > > with the other nearby functions. > > > > Thoughts? > > +1. It looks like the global variable wrconn defined/declared in > worker_internal.h/worker.c, is for logical apply/table sync worker and > it doesn't make sense to use it for CREATE/ALTER subscription refresh > code that runs on a backend. And I couldn't think of any unknown > consequences/resource leakage, because that global variable is being > used by different processes which have their own copy. > > And, the patch basically looks good to me, except a bit of rewording > the commit message to something like "Use local variable wrconn in > AlterSubscription_refresh instead of global a variable with the same > name which is meant to be used for logical apply/table sync worker. 
> Having the wrconn global variable in AlterSubscription_refresh doesn't > cause any real issue as such but it keeps the code in > subscriptioncmds.c inconsistent with other functions which use a local > variable named wrconn." or some other better wording? > > Regression tests were passed on my dev system with the patch. > Thanks for your feedback. I can post another patch (or same patch with an improved commit comment) later, but I will just wait a day first in case there is more information to say about it. e.g. my suspicion that there would be "consequences" seems to have come to fruition after all [1] although I never would have thought of that tricky trigger / refresh scenario. -- [1] https://www.postgresql.org/message-id/20210504043149.vg4w66vuh4qjrbph%40alap3.anarazel.de Kind Regards, Peter Smith. Fujitsu Australia
Re: MaxOffsetNumber for Table AMs
Hi,

On 2021-04-30 11:51:07 -0700, Peter Geoghegan wrote:
> I think that it's reasonable to impose some cost on index AMs here,
> but that needs to be bounded sensibly and unambiguously. For example,
> it would probably be okay if you had either 6 byte or 8 byte TIDs, but
> no other variations. You could require index AMs (the subset of index
> AMs that are ever able to store 8 byte TIDs) to directly encode which
> width they're dealing with at the level of each IndexTuple. That would
> create some problems for nbtree deduplication, especially in boundary
> cases, but ISTM that you can manage the complexity by sensibly
> restricting how the TIDs work across the board.
>
> For example, the TIDs should always work like unsigned integers -- the
> table AM must be willing to work with that restriction.

Isn't that more a question of the encoding than the concrete representation?

> You'd then have posting list tuples in nbtree whose TIDs were all
> either 6 bytes or 8 bytes wide, with a mix of each possible (though
> not particularly likely) on the same leaf page. Say when you have a
> table that exceeds the current MaxBlockNumber restrictions. It would
> be relatively straightforward for nbtree deduplication to simply
> refuse to mix 6 byte and 8 byte datums together to avoid complexity in
> boundary cases. The deduplication pass logic has the flexibility that
> this requires already.

Which nbtree cases do you think would have an easier time supporting switching between 6 or 8 byte tids than supporting fully variable-width tids? Given that IndexTupleData already is variable-width, it's not clear to me why supporting two distinct sizes would be harder than a fully variable size? I assume it's things like BTDedupState->htids?

> > What's wrong with varlena headers? It would end up being a 1-byte
> > header in practically every case, and no variable-width representation
> > can do without a length word of some sort.
> > I'm not saying varlena is
> > as efficient as some new design could hypothetically be, but it
> > doesn't seem like it'd be a big enough problem to stress about. If you
> > used a variable-width representation for integers, you might actually
> > save bytes in a lot of cases. An awful lot of the TIDs people store in
> > practice probably contain several zero bytes, and if we make them
> > wider, that's going to be even more true.
>
> Maybe all of this is true, and maybe it works out to be the best path
> forward in the long term, all things considered. But whether or not
> that's true is crucially dependent on what real practical table AMs
> (of which there will only ever be a tiny number) actually need to do.
> Why should we assume that the table AM cannot accept some
> restrictions? What good does it do to legalistically define the
> problem as a problem for index AMs to solve?

I don't think anybody is arguing that AMs cannot accept any restrictions? I do think it's pretty clear that it's not entirely obvious what the concrete set of proper restrictions would be, i.e. one where we won't end up needing to re-evaluate the limits in a few years. If you add to that the fact that variable-width tids will often end up considerably smaller than our current tids, it's not obvious why we should use bitspace somewhere to indicate an 8 byte tid instead of a variable-width tid?

Greetings,

Andres Freund
Re: AlterSubscription_refresh "wrconn" wrong variable?
Hi, On 2021-05-04 09:29:42 +1000, Peter Smith wrote: > While reviewing some logical replication code I stumbled across a > variable usage that looks suspicious to me. > Note that the AlterSubscription_refresh function (unlike other > functions in the subscriptioncmds.c) is using the global variable > "wrconn" instead of a local stack variable of the same name. I was > unable to think of any good reason why it would be deliberately doing > this, so my guess is that it is simply an accidental mistake that has > gone unnoticed because the compiler was silently equally happy just > using the global var. > Apparently, this is not causing any reported problems because it seems > like the code has been this way for ~4 years [1]. This sounded vaguely familiar. After a bit of searching I found that's because I debugged a crash related to it: https://www.postgresql.org/message-id/2020215820.qihhrz7fayu6myfi%40alap3.anarazel.de Peter? Greetings, Andres Freund
Re: .ready and .done files considered harmful
Hi,

On 2021-05-03 16:49:16 -0400, Robert Haas wrote:
> I have two possible ideas for addressing this; perhaps other people
> will have further suggestions. A relatively non-invasive fix would be
> to teach pgarch.c how to increment a WAL file name. After archiving
> segment N, check using stat() whether there's an .ready file for
> segment N+1. If so, do that one next. If not, then fall back to
> performing a full directory scan.

Hm. I wonder if it'd not be better to determine multiple files to be archived in one readdir() pass?

> As far as I can see, this is just cheap insurance. If archiving is
> keeping up, the extra stat() won't matter much. If it's not, this will
> save more system calls than it costs. Since during normal operation it
> shouldn't really be possible for files to show up in pg_wal out of
> order, I don't really see a scenario where this changes the behavior,
> either. If there are gaps in the sequence at startup time, this will
> cope with it exactly the same as we do now, except with a better
> chance of finishing before I retire.

There are definitely gaps in practice :(. Due to the massive performance issues with archiving there are several tools that archive multiple files as part of one archive command invocation (and mark the additional archived files as .done immediately).

> However, that's still pretty wasteful. Every time we have to wait for
> the next file to be ready for archiving, we'll basically fall back to
> repeatedly scanning the whole directory, waiting for it to show up.

Hm. That seems like it's only an issue because .done and .ready are in the same directory? Otherwise the directory would be empty while we're waiting for the next file to be ready to be archived.

I hate that that's a thing, but given the serial nature of archiving, with high per-call overhead, I don't think it'd be OK to just break that without a replacement :(.
> But perhaps we could work around this by allowing pgarch.c to access
> shared memory, in which case it could examine the current timeline
> whenever it wants, and probably also whatever LSNs it needs to know
> what's safe to archive.

FWIW, the shared memory stats patch implies doing that, since the archiver reports stats.

> If we did that, could we just get rid of the .ready and .done files
> altogether? Are they just a really expensive IPC mechanism to avoid a
> shared memory connection, or is there some more fundamental reason why
> we need them?

What kind of shared memory mechanism are you thinking of? Due to timelines and history files I don't think simple position counters would be quite enough. I think the aforementioned "batching" archive commands are part of the problem :(.

> And is there any good reason why the archiver shouldn't be connected
> to shared memory? It is certainly nice to avoid having more processes
> connected to shared memory than necessary, but the current scheme is
> so inefficient that I think we end up worse off.

I think there is no fundamental reason for avoiding shared memory in the archiver. I guess there's a minor robustness advantage, because the forked shell to start the archive command won't be attached to shared memory. But that's only until the child exec()s to the archive command. There is some minor performance advantage as well; not having to process the often large and contended memory mapping for shared_buffers is probably measurable - but swamped by the cost of needing to actually archive the segment.

My only "concern" with doing anything around this is that I think the whole approach of archive_command is just hopelessly broken, with even just halfway busy servers only able to keep up archiving if they muck around with postgres internal data during archive command execution. Add to that how hard it is to write a robust archive command (e.g. the one in our docs still suggests test !
-f && cp, which means that copy failing in the middle yields an incomplete archive)... While I don't think it's all that hard to design a replacement, it's however likely still more work than addressing the O(n^2) issue, so ... Greetings, Andres Freund
Re: Replication slot stats misgivings
On Mon, May 3, 2021 at 10:21 PM Amit Kapila wrote: > > On Mon, May 3, 2021 at 5:48 PM Masahiko Sawada wrote: > > > > On Mon, May 3, 2021 at 2:27 PM Amit Kapila wrote: > > > > > > On Thu, Apr 29, 2021 at 10:37 AM Amit Kapila > > > wrote: > > > > > > > > On Wed, Apr 28, 2021 at 7:43 PM Masahiko Sawada > > > > wrote: > > > > > > > > > > On Wed, Apr 28, 2021 at 3:25 PM Amit Kapila > > > > > wrote: > > > > > > > > > > > > > > > > > I am not sure if any of these alternatives are a good idea. What do > > > > > > you think? Do you have any other ideas for this? > > > > > > > > > > I've been considering some ideas but don't come up with a good one > > > > > yet. It’s just an idea and not tested but how about having > > > > > CreateDecodingContext() register before_shmem_exit() callback with the > > > > > decoding context to ensure that we send slot stats even on > > > > > interruption. And FreeDecodingContext() cancels the callback. > > > > > > > > > > > > > Is it a good idea to send stats while exiting and rely on the same? I > > > > think before_shmem_exit is mostly used for the cleanup purpose so not > > > > sure if we can rely on it for this purpose. I think we can't be sure > > > > that in all cases we will send all stats, so maybe Vignesh's patch is > > > > sufficient to cover the cases where we avoid losing it in cases where > > > > we would have sent a large amount of data. > > > > > > > > > > Sawada-San, any thoughts on this point? > > > > before_shmem_exit is mostly used to the cleanup purpose but how about > > on_shmem_exit()? pgstats relies on that to send stats at the > > interruption. See pgstat_shutdown_hook(). > > > > Yeah, that is worth trying. Would you like to give it a try? Yes. In this approach, I think we will need to have a static pointer in logical.c pointing to LogicalDecodingContext that we’re using. 
At StartupDecodingContext(), we set the pointer to the just created LogicalDecodingContext and register the callback so that we can refer to the LogicalDecodingContext in that callback. And at FreeDecodingContext(), we reset the pointer to NULL (however, since FreeDecodingContext() is not called when an error happens we would need to ensure resetting it somehow). But, after more thought, if we have the static pointer in logical.c it would rather be better to have a global function that sends slot stats based on the LogicalDecodingContext pointed to by the static pointer and can be called by ReplicationSlotRelease(). That way, we don't need to worry about erroring-out cases as well as interruption cases, although we need to have a new static pointer.

I've attached a quick-hacked patch. I also incorporated the change that calls UpdateDecodingStats() at FreeDecodingContext() so that we can send slot stats also in the case where we spilled/streamed changes but finished without a commit/abort/prepare record.

> I think
> it still might not cover the cases where we error out in the backend
> while decoding via APIs because at that time we won't exit, maybe for
> that we can consider Vignesh's patch.

Agreed. It seems to me that the approach of the attached patch is better than the approach using on_shmem_exit(). So if we want to avoid having the new static pointer and function for this purpose we can consider Vignesh's patch.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 00543ede45..f32a2da565 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -51,6 +51,12 @@ typedef struct LogicalErrorCallbackState
 	XLogRecPtr	report_location;
 } LogicalErrorCallbackState;
 
+/*
+ * Pointing to the currently-used logical decoding context andu sed to sent
+ * slot statistics on releasing slots.
+ */
+static LogicalDecodingContext *MyLogicalDecodingContext = NULL;
+
 /* wrappers around output plugin callbacks */
 static void output_plugin_error_callback(void *arg);
 static void startup_cb_wrapper(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
@@ -290,6 +296,13 @@ StartupDecodingContext(List *output_plugin_options,
 
 	MemoryContextSwitchTo(old_context);
 
+	/*
+	 * Keep holding the decoding context until freeing the decoding context
+	 * or releasing the logical slot.
+	 */
+	Assert(MyLogicalDecodingContext == NULL);
+	MyLogicalDecodingContext = ctx;
+
 	return ctx;
 }
 
@@ -626,10 +639,12 @@ FreeDecodingContext(LogicalDecodingContext *ctx)
 	if (ctx->callbacks.shutdown_cb != NULL)
 		shutdown_cb_wrapper(ctx);
 
+	UpdateDecodingStats(ctx);
 	ReorderBufferFree(ctx->reorder);
 	FreeSnapshotBuilder(ctx->snapshot_builder);
 	XLogReaderFree(ctx->reader);
 	MemoryContextDelete(ctx->context);
+	MyLogicalDecodingContext = NULL;
 }
 
 /*
@@ -1811,3 +1826,17 @@ UpdateDecodingStats(LogicalDecodingContext *ctx)
 	rb->totalTxns = 0;
 	rb->totalBytes = 0;
 }
+
+/*
+ * The function called at releasing a logical replication slot, sending
Re: AlterSubscription_refresh "wrconn" wrong variable?
On Tue, May 4, 2021 at 5:00 AM Peter Smith wrote:
>
> While reviewing some logical replication code I stumbled across a
> variable usage that looks suspicious to me.
>
> Note that the AlterSubscription_refresh function (unlike other
> functions in the subscriptioncmds.c) is using the global variable
> "wrconn" instead of a local stack variable of the same name. I was
> unable to think of any good reason why it would be deliberately doing
> this, so my guess is that it is simply an accidental mistake that has
> gone unnoticed because the compiler was silently equally happy just
> using the global var.
>
> Apparently, this is not causing any reported problems because it seems
> like the code has been this way for ~4 years [1].
>
> Even so, it doesn't look intentional to me and I felt that there may
> be unknown consequences (e.g. resource leakage?) of just blatting over
> the global var. So, PSA a small patch to make this
> AlterSubscription_refresh function use a stack variable consistent
> with the other nearby functions.
>
> Thoughts?

+1. It looks like the global variable wrconn, defined/declared in worker_internal.h/worker.c, is for the logical apply/table sync worker, and it doesn't make sense to use it for the CREATE/ALTER subscription refresh code that runs in a backend. And I couldn't think of any unknown consequences/resource leakage, because that global variable is being used by different processes which have their own copy.

And, the patch basically looks good to me, except for a bit of rewording of the commit message to something like "Use a local variable wrconn in AlterSubscription_refresh instead of a global variable with the same name which is meant to be used for the logical apply/table sync worker. Having the wrconn global variable in AlterSubscription_refresh doesn't cause any real issue as such, but it keeps the code in subscriptioncmds.c inconsistent with other functions which use a local variable named wrconn." or some other better wording?
Regression tests were passed on my dev system with the patch. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: MaxOffsetNumber for Table AMs
On Mon, 2021-05-03 at 18:12 -0700, Peter Geoghegan wrote: > But look at the details: tidbitmap.c uses MaxHeapTuplesPerPage as its > MAX_TUPLES_PER_PAGE, which seems like a problem -- that's 291 with > default BLCKSZ. I doubt that that restriction is something that you > can afford to live with, even just for the time being. Oh, you're right. I missed that MaxHeapTuplesPerPage was an order of magnitude smaller. > I don't see why that's necessarily a problem. Why, in general, should > every table AM be able to support every index AM? I didn't propose that every table AM needs to support every index type, just that we should do something or at least document something. It's pretty frustrating to have to fall back to manually managing the indexes for dozens or hundreds of partitions when you make use of multiple table AMs. We might be conflating support for index AMs with support for features like bitmap scans. If a certain kind of index fails at CREATE INDEX time, that's painful for the partitioning case. But here it's more like the CREATE INDEX would succeed but it would just never be used, which is a different kind of frustrating. Whatever we do or don't do, we should try to avoid surprises. I expect table AMs to be used heavily with partitioning. Regards, Jeff Davis
Re: MaxOffsetNumber for Table AMs
On Mon, May 3, 2021 at 5:15 PM Jeff Davis wrote: > I don't see why in-core changes are a strict requirement. It doesn't > make too much difference if a lossy TID doesn't correspond exactly to > the columnar layout -- it should be fine as long as there's locality, > right? But look at the details: tidbitmap.c uses MaxHeapTuplesPerPage as its MAX_TUPLES_PER_PAGE, which seems like a problem -- that's 291 with default BLCKSZ. I doubt that that restriction is something that you can afford to live with, even just for the time being. > > It seems senseless to *require* table AMs to support something like a > > bitmap scan. > > I am not yet convinced that it's "senseless", but it is optional and > there's probably a reason that it's not required. I mean it's senseless to require it in the general case. > We still need to address the fact that two features have had a minor > collision: indexes on a partitioned table and table AMs that don't > necessarily support all index types. It's not good to just throw an > error, because we could be forcing the user to manually manage the > indexes on hundreds of partitions just because some tables have a > different AM and it doesn't support the index type. I don't see why that's necessarily a problem. Why, in general, should every table AM be able to support every index AM? I find it puzzling that nobody can find one single thing that the table AM interface *can't* do. What are the chances that the abstraction really is perfect? > > I don't think it's a coincidence that GIN is the index AM > > that looks like it presents at least 2 problems for the columnar > > table > > AM. To me this suggests that this will need a much higher level > > discussion. > > One problem is that ginpostinglist.c restricts the use of offset > numbers higher than MaxOffsetNumber - 1. At best, that's a confusing > and unnecessary off-by-one error that we happen to be stuck with > because it affects the on-disk format. 
> Now that I'm past that
> particular confusion, I can live with a workaround until we do
> something better.
>
> What is the other problem with GIN?

I just meant the tidbitmap.c stuff, and so on. There is really one big problem: GIN leverages the fact that bitmap scans are all that it supports in many different ways. The reality is that it was designed to work with heapam -- that's how it evolved. It seems rather unlikely that the problems are confined to this ginpostinglist.c representational issue -- which is very surface-level. The only way to figure it out is to try to make it work and see what happens, though, so perhaps it isn't worth discussing any further until that happens.

--
Peter Geoghegan
Re: MaxOffsetNumber for Table AMs
On Mon, 2021-05-03 at 15:07 -0700, Peter Geoghegan wrote: > Sure, but it either makes sense for the columnar table AM to support > bitmap scans (or some analogous type of scan that works only slightly > differently) or it doesn't. It's not at all clear which it is right > now. It makes sense for my columnar table AM -- there's TID locality. > If it makes sense then it will of course be necessary to describe > what > "bitmap scan" actually means with the columnar storage table AM (plus > you'll still need to make some in-core changes to places like > tidbitmap.c). OTOH if it doesn't make sense then that's that -- it's > going to be a bit annoying in the partitioning scenario you describe, > but some things are bound to be *inherently* impossible, so it can't > be > helped. I don't see why in-core changes are a strict requirement. It doesn't make too much difference if a lossy TID doesn't correspond exactly to the columnar layout -- it should be fine as long as there's locality, right? > It seems senseless to *require* table AMs to support something like a > bitmap scan. I am not yet convinced that it's "senseless", but it is optional and there's probably a reason that it's not required. We still need to address the fact that two features have had a minor collision: indexes on a partitioned table and table AMs that don't necessarily support all index types. It's not good to just throw an error, because we could be forcing the user to manually manage the indexes on hundreds of partitions just because some tables have a different AM and it doesn't support the index type. We probably want to do something about that, but as far as I can tell, it's not a problem for columnar right now. > I don't think it's a coincidence that GIN is the index AM > that looks like it presents at least 2 problems for the columnar > table > AM. To me this suggests that this will need a much higher level > discussion. 
One problem is that ginpostinglist.c restricts the use of offset numbers higher than MaxOffsetNumber - 1. At best, that's a confusing and unnecessary off-by-one error that we happen to be stuck with because it affects the on-disk format. Now that I'm past that particular confusion, I can live with a workaround until we do something better. What is the other problem with GIN? Regards, Jeff Davis
AlterSubscription_refresh "wrconn" wrong variable?
While reviewing some logical replication code I stumbled across a variable usage that looks suspicious to me. Note that the AlterSubscription_refresh function (unlike other functions in the subscriptioncmds.c) is using the global variable "wrconn" instead of a local stack variable of the same name. I was unable to think of any good reason why it would be deliberately doing this, so my guess is that it is simply an accidental mistake that has gone unnoticed because the compiler was silently equally happy just using the global var. Apparently, this is not causing any reported problems because it seems like the code has been this way for ~4 years [1]. Even so, it doesn't look intentional to me and I felt that there may be unknown consequences (e.g. resource leakage?) of just blatting over the global var. So, PSA a small patch to make this AlterSubscription_refresh function use a stack variable consistent with the other nearby functions. Thoughts? -- [1] https://github.com/postgres/postgres/commit/7c4f52409a8c7d85ed169bbbc1f6092274d03920# Kind Regards, Peter Smith. Fujitsu Australia v1-0001-Fix-wrconn.-Use-stack-variable.patch Description: Binary data
Re: Performance Evaluation of Result Cache by using TPC-DS
Thanks for doing further analysis on this.

On Mon, 26 Apr 2021 at 20:31, Yuya Watari wrote:
> Thank you for running experiments on your machine and I really
> appreciate your deep analysis.
>
> Your results are very interesting. In 5 queries, the result cache is
> cheaper but slower. Especially, in query 88, although the cost with
> result cache is cheaper, it has 34.23% degradation in query execution
> time. This is big regression.

That's certainly one side of it. On the other side, it's pretty important to also note that in 4 of 23 queries the result cache plan executed faster but the planner costed it as more expensive.

I'm not saying the costing is perfect, but what I am saying is, as you noted above, in 5 of 23 queries the result cache was cheaper and slower, and, as I just noted, in 4 of 23 queries, result cache was more expensive and faster. We know that costing is never going to be a perfect representation of what the execution time will be. However, in these examples, we've just happened to get quite a good balance. If we add a penalty to result cache then it'll just subtract from one problem group and add to the other. Overall, in my tests execution was 1.15% faster with result cache enabled than it was without.

I could maybe get on board with adding a small fixed cost penalty. I'm not sure exactly what it would be, maybe a cpu_tuple_cost instead of a cpu_operator_cost, counted in for forming/deforming cached tuples. I think the patch you wrote to add the resultcache_cost_factor is only suitable for running experiments with.

The bigger concerns I have with the costing are around what happens when an n_distinct estimate is far too low on one of the join columns. I think it is more likely to be concerns like that one which would cause us to default enable_resultcache to off.

David
Re: Simplify backend terminate and wait logic in postgres_fdw test
Michael Paquier writes:
> On Tue, Apr 13, 2021 at 04:39:58PM +0900, Michael Paquier wrote:
>> Looks fine to me. Let's wait a bit first to see if Fujii-san has any
>> objections to this cleanup as that's his code originally, from
>> 32a9c0bd.
> And hearing nothing, done. The tests of postgres_fdw are getting much
> faster for me now, from basically 6s to 4s (actually that's roughly
> 1.8s of gain as pg_wait_until_termination waits at least 100ms,
> twice), so that's a nice gain.

The buildfarm is showing that one of these test queries is not stable under CLOBBER_CACHE_ALWAYS:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-05-01%2007%3A44%3A47

of which the relevant part is:

diff -U3 /home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/expected/postgres_fdw.out /home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/results/postgres_fdw.out
--- /home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/expected/postgres_fdw.out	2021-05-01 03:44:45.022300613 -0400
+++ /home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/results/postgres_fdw.out	2021-05-03 09:11:24.051379288 -0400
@@ -9215,8 +9215,7 @@
 WHERE application_name = 'fdw_retry_check';
  pg_terminate_backend
 --
- t
-(1 row)
+(0 rows)
 -- This query should detect the broken connection when starting new remote
 -- transaction, reestablish new connection, and then succeed.

I can reproduce that locally by setting

alter system set debug_invalidate_system_caches_always = 1;

and running "make installcheck" in contrib/postgres_fdw. (It takes a good long time to run the whole test script though, so you might want to see if running just these few queries is enough.)

There's no evidence of distress in the postmaster log, so I suspect this might just be a timing instability, e.g. remote process already gone before local process looks. If so, it's probably hopeless to make this test stable as-is. Perhaps we should just take it out.

			regards, tom lane
Re: MaxOffsetNumber for Table AMs
On Mon, May 3, 2021 at 2:03 PM Jeff Davis wrote:
> Another point: the idea of supporting only some kinds of indexes
> doesn't mix well with partitioning. If you declare an index on the
> parent, we should do something reasonable if one partition's table AM
> doesn't support that index AM.

Sure, but it either makes sense for the columnar table AM to support bitmap scans (or some analogous type of scan that works only slightly differently) or it doesn't. It's not at all clear which it is right now.

If it makes sense then it will of course be necessary to describe what "bitmap scan" actually means with the columnar storage table AM (plus you'll still need to make some in-core changes to places like tidbitmap.c). OTOH if it doesn't make sense then that's that -- it's going to be a bit annoying in the partitioning scenario you describe, but some things are bound to be *inherently* impossible, so it can't be helped.

It seems senseless to *require* table AMs to support something like a bitmap scan. I don't think it's a coincidence that GIN is the index AM that looks like it presents at least 2 problems for the columnar table AM. To me this suggests that this will need a much higher level discussion.

--
Peter Geoghegan
Re: Avoiding smgrimmedsync() during nbtree index builds
So, I've written a patch which avoids doing the immediate fsync for index builds either by using shared buffers or by queueing sync requests for the checkpointer. If a checkpoint starts during the index build and the backend is not using shared buffers for the index build, it will need to do the fsync.

The reviewer will notice that _bt_load() extends the index relation for the metapage before beginning the actual load of leaf pages but does not actually write the metapage until the end of the index build. When using shared buffers, it was difficult to create block 0 of the index after creating all of the other blocks, as the block number is assigned inside of ReadBuffer_common(), and it doesn't really work with the current bufmgr API to extend a relation with a caller-specified block number.

I am not entirely sure of the correctness of doing an smgrextend() (when not using shared buffers) without writing any WAL. However, the metapage contents are not written until after WAL-logging them later in _bt_blwritepage(), so perhaps it is okay?

I am also not fond of the change to the signature of _bt_uppershutdown() that this implementation forces. Now I must reserve the shared buffer (pinned and locked) for the metapage -- or, when not using shared buffers, allocate the page for it -- before doing the index build, and then pass it to _bt_uppershutdown() once the rest of the index build is done. I don't know that it seems incorrect -- more that it feels a bit messy (and inefficient) to hold onto that shared buffer or memory for the duration of the index build, during which I have no intention of doing anything with it. However, the alternative I devised was to change ReadBuffer_common() or to add a new ReadBufferExtended() mode which indicated that the caller would specify the block number and whether or not it was an extend, which also didn't seem right.
For the extensions of the index done during index build, I use ReadBufferExtended() directly instead of _bt_getbuf() for a few reasons. I thought (but am not sure) that I don't need to do LockRelationForExtension() during index build. Also, I decided to use RBM_ZERO_AND_LOCK mode so that I had an exclusive lock on the buffer content instead of doing _bt_lockbuf() (which is what _bt_getbuf() does). And, in most of the places I added the call to ReadBufferExtended(), the non-shared-buffer code path is already initializing the page, so it made more sense to just share that code path.

I considered whether or not it made sense to add a new btree utility function which calls ReadBufferExtended() in this way; however, I wasn't sure how much that would buy me. The other place it might be able to be used is btvacuumpage(), but that case is different enough that I'm not even sure what the function would be called -- basically it would just be an alternative to _bt_getbuf() for a couple of somewhat unrelated edge cases.

On Thu, Jan 21, 2021 at 5:51 PM Andres Freund wrote:
>
> Hi,
>
> On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote:
> > On 21/01/2021 22:36, Andres Freund wrote:
> > > A quick hack (probably not quite correct!) to evaluate the benefit shows
> > > that the attached script takes 2m17.223s with the smgrimmedsync and
> > > 0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
> > > the former case, CPU bound in the latter.
> > >
> > > Creating lots of tables with indexes (directly or indirectly through
> > > relations having a toast table) is pretty common, particularly after the
> > > introduction of partitioning.

Moving index builds of indexes which would fit in shared buffers back into shared buffers has the benefit of eliminating the need to write them out and fsync them if they will be subsequently used and thus read right back into shared buffers. This avoids some of the unnecessary fsyncs Andres is talking about here, as well as avoiding some of the extra IO required to write them and then read them into shared buffers.

I have dummy criteria for whether or not to use shared buffers (whether the number of tuples to be indexed is > 1000). I am considering using a threshold of some percentage of the size of shared buffers as the actual criterion for determining where to do the index build.

> > > Thinking through the correctness of replacing smgrimmedsync() with sync
> > > requests, the potential problems that I can see are:
> > >
> > > 1) redo point falls between the log_newpage() and the
> > > write()/register_dirty_segment() in smgrextend/smgrwrite.
> > > 2) redo point falls between write() and register_dirty_segment()
> > >
> > > But both of these are fine in the context of initially filling a newly
> > > created relfilenode, as far as I can tell? Otherwise the current
> > > smgrimmedsync() approach wouldn't be safe either, as far as I can tell?
> >
> > Hmm. If the redo point falls between write() and the
> > register_dirty_segment(), and the checkpointer finishes t
Re: PG in container w/ pid namespace is init, process exits cause restart
Andres Freund writes:
> On 2021-05-03 16:20:43 -0400, Tom Lane wrote:
>> Hmm, by that argument, any unexpected child PID in reaper() ought to be
>> grounds for a restart, regardless of its exit code. Which'd be fine by
>> me. I'm on board with being more restrictive about this, not less so.

> Are there any holes / races that could lead to this "legitimately"
> happening? To me the signal blocking looks like it should prevent that?

If it did happen it would imply a bug in the postmaster's child-process bookkeeping. (Or, I guess, some preloaded module deciding that launching its own children was OK, whether or not it could find out whether they succeeded.)

> I'm a bit worried that we'd trip over some harmless corner cases and
> end up adding a new instability. So personally I'd be inclined to just
> make it a warning, but ...

Well, I wouldn't recommend adding such a check in released branches, but I'd be in favor of changing it in HEAD (or waiting till v15 opens).

Meanwhile, it seems like we both thought of complaining if the postmaster's PID is 1. I'm not quite sure if there are any portability hazards from that, but on the whole it sounds like a good way to detect badly-configured containers.

			regards, tom lane
Re: PG in container w/ pid namespace is init, process exits cause restart
Tom Lane writes:
> Alvaro Herrera writes:
>> On 2021-May-03, Andres Freund wrote:
>>> The issue turns out to be that postgres was in a container, with pid
>>> namespaces enabled. Because postgres was run directly in the container,
>>> without a parent process inside, it thus becomes pid 1. Which mostly
>>> works without a problem. Until, as the case here with the archive
>>> command, a sub-sub process exits while it still has a child. Then that
>>> child gets re-parented to postmaster (as init).
>
>> Hah .. interesting. I think we should definitely make this work, since
>> containerized stuff is going to become more and more prevalent.
>
> How would we make it "work"? The postmaster can't possibly be expected
> to know the right thing to do with unexpected children.
>
>> I guess we can do that in older releases, but do we really need it? As
>> I understand, the only thing we need to do is verify that the dying PID
>> is a backend PID, and not cause a crash cycle if it isn't.

> Maybe we should put in a startup-time check, analogous to the
> can't-run-as-root test, that the postmaster mustn't be PID 1.

Given that a number of minimal `init`s already exist specifically for the case of running a single application in a container, I don't think Postgres should reinvent that wheel. A quick eyeball of the output of `apt search container init` on a Debian Bullseye system reveals at least four:

- https://github.com/Yelp/dumb-init
- https://github.com/krallin/tini
- https://github.com/fpco/pid1
- https://github.com/openSUSE/catatonit

The first one also explains why there's more to being PID 1 than just handling reparented children.

- ilmari
--
"The surreality of the universe tends towards a maximum" -- Skud's Law
"Never formulate a law or axiom that you're not prepared to live with
the consequences of." -- Skud's Meta-Law
Re: MaxOffsetNumber for Table AMs
On Fri, 2021-04-30 at 10:55 -0700, Jeff Davis wrote:
> On Fri, 2021-04-30 at 12:35 -0400, Tom Lane wrote:
> > ISTM that would be up to the index AM. We'd need some interlocks on
> > which index AMs could be used with which table AMs in any case, I
> > think.
>
> I'm not sure why? It seems like we should be able to come up with
> something that's generic enough.

Another point: the idea of supporting only some kinds of indexes doesn't mix well with partitioning. If you declare an index on the parent, we should do something reasonable if one partition's table AM doesn't support that index AM.

Regards,
	Jeff Davis
.ready and .done files considered harmful
I and various colleagues of mine have from time to time encountered systems that got a bit behind on WAL archiving, because the archive_command started failing and nobody noticed right away. Ideally, people should have monitoring for this and put it to rights immediately, but some people don't. If those people happen to have a relatively small pg_wal partition, they will likely become aware of the issue when it fills up and takes down the server, but some users provision disk space pretty generously and therefore nothing compels them to notice the issue until they fill it up. In at least one case, on a system that was actually generating a reasonable amount of WAL, this took in excess of six months.

As you might imagine, pg_wal can get fairly large in such scenarios, but the user is generally less concerned with solving that problem than they are with getting the system back up. It is doubtless true that the user would prefer to shrink the disk usage down to something more reasonable over time, but on the facts as presented, it can't really be an urgent issue for them. What they really need is just to free up a little disk space somehow or other and then get archiving running fast enough to keep up with future WAL generation.

Regrettably, the archiver cannot do this, not even if you set archive_command = /bin/true, because the archiver will barely ever actually run the archive_command. Instead, it will spend virtually all of its time calling readdir(), because for some reason it feels a need to make a complete scan of the archive_status directory before archiving a WAL file, and then it has to make another scan before archiving the next one.

Someone - and it's probably for the best that the identity of that person remains unknown to me - came up with a clever solution to this problem, which is now used almost as a matter of routine whenever this comes up.
You just run pg_archivecleanup on your pg_wal directory, and then remove all the corresponding .ready files and call it a day. I haven't scrutinized the code for pg_archivecleanup, but evidently it avoids needing O(n^2) time for this and therefore can clean up the whole directory in something like the amount of time the archiver would take to deal with a single file. While this seems to be quite an effective procedure and I have not yet heard any user complaints, it seems disturbingly error-prone, and honestly shouldn't ever be necessary.

The issue here is only that pgarch.c acts as though after archiving 00010001, 00010002, and then 00010003, we have no idea what file we might need to archive next. Could it, perhaps, be 00010004? Only a full directory scan will tell us the answer!

I have two possible ideas for addressing this; perhaps other people will have further suggestions.

A relatively non-invasive fix would be to teach pgarch.c how to increment a WAL file name. After archiving segment N, check using stat() whether there's a .ready file for segment N+1. If so, do that one next. If not, then fall back to performing a full directory scan. As far as I can see, this is just cheap insurance. If archiving is keeping up, the extra stat() won't matter much. If it's not, this will save more system calls than it costs. Since during normal operation it shouldn't really be possible for files to show up in pg_wal out of order, I don't really see a scenario where this changes the behavior, either. If there are gaps in the sequence at startup time, this will cope with it exactly the same as we do now, except with a better chance of finishing before I retire.

However, that's still pretty wasteful. Every time we have to wait for the next file to be ready for archiving, we'll basically fall back to repeatedly scanning the whole directory, waiting for it to show up.
And I think that we can't get around that by just using stat() to look for the appearance of the file we expect to see, because it's possible that we might be doing all of this on a standby which then gets promoted, or some upstream primary gets promoted, and WAL files start appearing on a different timeline, making our prediction of what the next filename will be incorrect. But perhaps we could work around this by allowing pgarch.c to access shared memory, in which case it could examine the current timeline whenever it wants, and probably also whatever LSNs it needs to know what's safe to archive.

If we did that, could we just get rid of the .ready and .done files altogether? Are they just a really expensive IPC mechanism to avoid a shared memory connection, or is there some more fundamental reason why we need them? And is there any good reason why the archiver shouldn't be connected to shared memory? It is certainly nice to avoid having more processes connected to shared memory than necessary, but the current scheme is so inefficient that I think we end up worse off.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: PG in container w/ pid namespace is init, process exits cause restart
Hi,

On 2021-05-03 16:20:43 -0400, Tom Lane wrote:
> Andres Freund writes:
> > On 2021-05-03 15:37:24 -0400, Tom Lane wrote:
> >> And who's to say that ignoring unexpected child deaths is okay,
> >> anyway? We could hardly be sure that the dead process hadn't been
> >> connected to shared memory.
>
> > I don't think checking the exit status of unexpected children to see
> > whether we should crash-restart out of that concern is meaningful: We
> > don't know that the child didn't do anything bad with shared memory when
> > they exited with exit(1), instead of exit(2).
>
> Hmm, by that argument, any unexpected child PID in reaper() ought to be
> grounds for a restart, regardless of its exit code. Which'd be fine by
> me. I'm on board with being more restrictive about this, not less so.

Are there any holes / races that could lead to this "legitimately" happening? To me the signal blocking looks like it should prevent that?

I'm a bit worried that we'd trip over some harmless corner cases and end up adding a new instability. So personally I'd be inclined to just make it a warning, but ...

Greetings,

Andres Freund
Re: PG in container w/ pid namespace is init, process exits cause restart
On 5/3/21 3:07 PM, Andres Freund wrote:
> Hi,
>
> A colleague debugged an issue where their postgres was occasionally
> crash-restarting under load.
>
> The cause turned out to be that a relatively complex archive_command was
> used, which could in some rare circumstances have a bash subshell
> pipeline not succeed. It wasn't at all obvious why that'd cause a crash
> though - the archive command handles the error.
>
> The issue turns out to be that postgres was in a container, with pid
> namespaces enabled. Because postgres was run directly in the container,
> without a parent process inside, it thus becomes pid 1. Which mostly
> works without a problem. Until, as the case here with the archive
> command, a sub-sub process exits while it still has a child. Then that
> child gets re-parented to postmaster (as init).
>
> Such a child is likely to have exited not just with 0 or 1, but
> something else. As the pid won't match anything in reaper(), we'll go to
> CleanupBackend(). Where any exit status but 0/1 will unconditionally
> trigger a restart:
>
> 	if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
> 	{
> 		HandleChildCrash(pid, exitstatus, _("server process"));
> 		return;
> 	}
>
> This kind of thing is pretty hard to debug, because it's not easy to
> even figure out what the "crashing" pid belonged to.
>
> I wonder if we should work a bit harder to try to identify whether an
> exiting process was a "server process" before identifying it as such?
>
> And perhaps we ought to warn about postgres running as "init" unless we
> make that robust?

Hmm, my initial reaction was: if we detect very early on that we're PID 1, then fork and do all our work in the child, and in the parent just wait until there are no more children. Not sure if that's feasible, but I thought I'd throw it out there.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: PG in container w/ pid namespace is init, process exits cause restart
On 2021-May-03, Andres Freund wrote:
> Using / for a single statically linked binary that e.g. just serves a
> bunch of hardcoded files is one thing. Putting actual data in / for
> something like postgres another.

Yeah, I just had a word with them and I had misunderstood what they were doing. They were attempting something completely insane and pointless, so I'm going to leave it at that.

> I think there's a few more special cases when running as init, other
> than reparenting. E.g. I think the default signal handlers are
> different, the kernel kills the process in fewer cases etc. I am not
> opposed to adding support for it, but I think it'd need a bit of care.

Ok, we can leave that as future development then.

> Given that we probably shouldn't just break things in a minor release by
> refusing to run as 1, a warning seems to be the easiest thing for now?

WFM.

--
Álvaro Herrera       Valdivia, Chile
"Once again, thank you and all of the developers for your hard work on
PostgreSQL. This is by far the most pleasant management experience of
any database I've worked on." (Dan Harris)
http://archives.postgresql.org/pgsql-performance/2006-04/msg00247.php
Re: PG in container w/ pid namespace is init, process exits cause restart
Hi,

On 2021-05-03 15:25:53 -0400, Alvaro Herrera wrote:
> I also heard a story where things ran into trouble (I didn't get the
> whole story of *what* was the problem with that) because the datadir is /.
> I know -- nobody in their right mind would put the datadir in / -- but
> apparently in the container world that's not something as stupid as it
> sounds. That's of course not related to what you describe here
> code-wise, but the underlying reason is the same.

It still seems pretty insane in the container world too. Postgres needs shared libraries (even if you managed to link postgres itself statically, something we do not support). Postgres needs to write to the data directory. Putting shared libraries inside the data directory seems like a bad idea.

Using / for a single statically linked binary that e.g. just serves a bunch of hardcoded files is one thing. Putting actual data in / for something like postgres another.

> > I wonder if we should work a bit harder to try to identify whether an
> > exiting process was a "server process" before identifying it as such?
>
> Well, we've never made any effort there because it just wasn't possible.
> Nobody ever had postmaster also be init .. until containers. Let's fix
> it.

> > And perhaps we ought to warn about postgres running as "init" unless we
> > make that robust?
>
> I guess we can do that in older releases, but do we really need it? As
> I understand, the only thing we need to do is verify that the dying PID
> is a backend PID, and not cause a crash cycle if it isn't.

I think there's a few more special cases when running as init, other than reparenting. E.g. I think the default signal handlers are different, the kernel kills the process in fewer cases etc. I am not opposed to adding support for it, but I think it'd need a bit of care.

Given that we probably shouldn't just break things in a minor release by refusing to run as 1, a warning seems to be the easiest thing for now?

Greetings,

Andres Freund
Re: PG in container w/ pid namespace is init, process exits cause restart
Andres Freund writes:
> On 2021-05-03 15:37:24 -0400, Tom Lane wrote:
>> And who's to say that ignoring unexpected child deaths is okay,
>> anyway? We could hardly be sure that the dead process hadn't been
>> connected to shared memory.

> I don't think checking the exit status of unexpected children to see
> whether we should crash-restart out of that concern is meaningful: We
> don't know that the child didn't do anything bad with shared memory when
> they exited with exit(1), instead of exit(2).

Hmm, by that argument, any unexpected child PID in reaper() ought to be grounds for a restart, regardless of its exit code. Which'd be fine by me. I'm on board with being more restrictive about this, not less so.

> Do you feel the same about having different logging between the "known"
> and "unknown" child processes?

No objection to logging such cases more clearly, for sure.

			regards, tom lane
Re: PG in container w/ pid namespace is init, process exits cause restart
Hi,

On 2021-05-03 15:37:24 -0400, Tom Lane wrote:
> Alvaro Herrera writes:
> > On 2021-May-03, Andres Freund wrote:
> >> The issue turns out to be that postgres was in a container, with pid
> >> namespaces enabled. Because postgres was run directly in the container,
> >> without a parent process inside, it thus becomes pid 1. Which mostly
> >> works without a problem. Until, as the case here with the archive
> >> command, a sub-sub process exits while it still has a child. Then that
> >> child gets re-parented to postmaster (as init).
>
> > Hah .. interesting. I think we should definitely make this work, since
> > containerized stuff is going to become more and more prevalent.
>
> How would we make it "work"? The postmaster can't possibly be expected
> to know the right thing to do with unexpected children.

Not saying that we should, but we could check if we're pid 1 / init, and just warn about children we don't know anything about. Which we could detect by iterating over BackendList/BackgroundWorkerList before crash-restarting in CleanupBackend(). Then we'd not lose reliability in the "normal" case, while not reducing reliability in the container case.

I'm not quite sure I buy the reliability argument, tbh: The additional process exits we see as pid 1 are after all process exits that we'd not see if we weren't pid 1. And if we're not pid 1 then there really should never be any "unexpected children" - we know what processes postmaster itself forked, after all. So where would unexpected children come from, except reparenting?

> And who's to say that ignoring unexpected child deaths is okay,
> anyway? We could hardly be sure that the dead process hadn't been
> connected to shared memory.

I don't think checking the exit status of unexpected children to see whether we should crash-restart out of that concern is meaningful: We don't know that the child didn't do anything bad with shared memory when they exited with exit(1), instead of exit(2).

Random thought: I wonder if we ought to set madvise(MADV_DONTFORK) on shared memory in postmaster children, where available. Then we could be fairly certain that there aren't processes we don't know about that are attached to shared memory (unless there's some nasty shared_preload_library forking early during backend startup - but that's hard to get excited about).

> > I guess we can do that in older releases, but do we really need it? As
> > I understand, the only thing we need to do is verify that the dying PID
> > is a backend PID, and not cause a crash cycle if it isn't.
>
> I think that'd be a net reduction in reliability, not an improvement.
> In most scenarios it'd do little except mask bugs.

Do you feel the same about having different logging between the "known" and "unknown" child processes? Personally I don't think it's of utmost importance to support running as pid 1. But we should at least print useful log messages about what processes exited...

Greetings,

Andres Freund
Re: PG in container w/ pid namespace is init, process exits cause restart
On 2021-May-03, Tom Lane wrote:
> Alvaro Herrera writes:
> > I also heard a story where things ran into trouble (I didn't get the
> > whole story of *what* was the problem with that) because the datadir is /.
>
> BTW, as far as that goes, I think the general recommendation is that
> the datadir shouldn't be a mount point, because bad things happen if
> you mount or unmount the drive while the postmaster is up. I could
> see enforcing that, if we could find a reasonably platform-independent
> way to do it.

/ is not a mount point; it's just that the container system binds (?) some different directory as / for the process to run in. I suppose it must be similar to chrooting to /, but I'm not sure if it's exactly that.

> (Of course, / can't be unmounted, so I wonder exactly what bad thing
> happened in that story.)

It's not related to unmounting. I'll try to get the details.

--
Álvaro Herrera       Valdivia, Chile
Re: MaxOffsetNumber for Table AMs
On Mon, May 3, 2021 at 12:06 PM Matthias van de Meent wrote:
> One could relatively easily disable bitmap scans on the table AM by
> not installing the relevant bitmap support functions on the registered
> TableAM structure, and thus not touch that problem.

I have no idea how much it'll hurt things if the column store table AM supports no analogue of bitmap scans.

> Some indexes will
> then never be accessed due to the bitmap scan requirement of their
> IndexAM (gin, brin, bloom, to name a few), and as such won't make
> sense to create on that table, but that's about it I think.

Right. More formally: if this restriction is accepted by a table AM (say the column store table AM), then any index AM with amgettuple set to NULL cannot ever be used (it should probably be treated as an error condition at CREATE INDEX time). If this really is the best path forward (again, no idea if that's true) then that would conveniently make it pretty easy to solve the GIN posting list issue raised by Jeff. It just wouldn't matter -- GIN indexes cannot be used with the column store anyway.

--
Peter Geoghegan
Re: PG in container w/ pid namespace is init, process exits cause restart
Alvaro Herrera writes:
> I also heard a story where things ran into trouble (I didn't get the
> whole story of *what* was the problem with that) because the datadir is /.

BTW, as far as that goes, I think the general recommendation is that the datadir shouldn't be a mount point, because bad things happen if you mount or unmount the drive while the postmaster is up. I could see enforcing that, if we could find a reasonably platform-independent way to do it.

(Of course, / can't be unmounted, so I wonder exactly what bad thing happened in that story.)

			regards, tom lane
Re: Regex performance regression induced by match-all code
On Mon, May 3, 2021, at 21:38, Tom Lane wrote: > "Joel Jacobson" writes: > > On Sun, May 2, 2021, at 18:53, Tom Lane wrote: > >> fix-exponential-cost-of-checkmatchall-2.patch > > > Successfully tested. > > Again, thanks for checking! You're welcome, thanks for coding! /Joel
Re: Regex performance regression induced by match-all code
"Joel Jacobson" writes: > On Sun, May 2, 2021, at 18:53, Tom Lane wrote: >> fix-exponential-cost-of-checkmatchall-2.patch > Successfully tested. Again, thanks for checking! regards, tom lane
Re: PG in container w/ pid namespace is init, process exits cause restart
Alvaro Herrera writes: > On 2021-May-03, Andres Freund wrote: >> The issue turns out to be that postgres was in a container, with pid >> namespaces enabled. Because postgres was run directly in the container, >> without a parent process inside, it thus becomes pid 1. Which mostly >> works without a problem. Until, as the case here with the archive >> command, a sub-sub process exits while it still has a child. Then that >> child gets re-parented to postmaster (as init). > Hah .. interesting. I think we should definitely make this work, since > containerized stuff is going to become more and more prevalent. How would we make it "work"? The postmaster can't possibly be expected to know the right thing to do with unexpected children. > I guess we can do that in older releases, but do we really need it? As > I understand, the only thing we need to do is verify that the dying PID > is a backend PID, and not cause a crash cycle if it isn't. I think that'd be a net reduction in reliability, not an improvement. In most scenarios it'd do little except mask bugs. And who's to say that ignoring unexpected child deaths is okay, anyway? We could hardly be sure that the dead process hadn't been connected to shared memory. Maybe we should put in a startup-time check, analogous to the can't-run-as-root test, that the postmaster mustn't be PID 1. regards, tom lane
Re: Regex performance regression induced by match-all code
On Sun, May 2, 2021, at 18:53, Tom Lane wrote:
> fix-exponential-cost-of-checkmatchall-2.patch

Successfully tested.

SELECT is_match <> (subject ~ pattern),
       captured IS DISTINCT FROM regexp_match(subject, pattern, flags),
       COUNT(*)
FROM performance_test GROUP BY 1,2 ORDER BY 1,2;

 ?column? | ?column? |  count
----------+----------+---------
 f        | f        | 3253889
(1 row)

Time: 94149.542 ms (01:34.150)
Time: 91565.305 ms (01:31.565)

SELECT regexp_matches('', '(.|){20}','');
 regexp_matches
----------------
 {""}
(1 row)

Time: 0.541 ms

SELECT regexp_matches('', '(.|){25}','');
 regexp_matches
----------------
 {""}
(1 row)

Time: 0.724 ms

SELECT regexp_matches('', '(.|){27}','');
 regexp_matches
----------------
 {""}
(1 row)

Time: 0.782 ms

/Joel
Re: PG in container w/ pid namespace is init, process exits cause restart
On 2021-May-03, Andres Freund wrote: > The issue turns out to be that postgres was in a container, with pid > namespaces enabled. Because postgres was run directly in the container, > without a parent process inside, it thus becomes pid 1. Which mostly > works without a problem. Until, as the case here with the archive > command, a sub-sub process exits while it still has a child. Then that > child gets re-parented to postmaster (as init). Hah .. interesting. I think we should definitely make this work, since containerized stuff is going to become more and more prevalent. I also heard a story where things ran into trouble (I didn't get the whole story of *what* was the problem with that) because the datadir is /. I know -- nobody in their right mind would put the datadir in / -- but apparently in the container world that's not something as stupid as it sounds. That's of course not related to what you describe here code-wise, but the underlying reason is the same. > I wonder if we should work a bit harder to try to identify whether an > exiting process was a "server process" before identifying it as such? Well, we've never made any effort there because it just wasn't possible. Nobody ever had postmaster also be init .. until containers. Let's fix it. > And perhaps we ought to warn about postgres running as "init" unless we > make that robust? I guess we can do that in older releases, but do we really need it? As I understand, the only thing we need to do is verify that the dying PID is a backend PID, and not cause a crash cycle if it isn't. -- Álvaro Herrera Valdivia, Chile
Re: Granting control of SUSET gucs to non-superusers
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Mon, May 3, 2021 at 2:48 PM Tom Lane wrote: > > I'm still of the opinion that slicing and dicing this at the per-GUC > > level is a huge waste of effort. Just invent one role that lets > > grantees set any GUC, document it as being superuser-equivalent, > > and be done. > > If you want to grant someone full superuser rights, you can do that > already. The trick is what to do when you want someone to be able to > administer the cluster in a meaningful way without giving them full > superuser rights. I would suggest that both are useful, but the one-big-hammer does nothing to answer the use-case which was brought up on this particular thread (which is also certainly not the first time this has been desired). Instead, I would imagine that there would be a set of predefined roles for the capabilities and then we might have another role which is akin to 'pg_monitor' but is 'pg_admin' which is GRANT'd a bunch of those capabilities and explicitly documented to be able to become a superuser if they wished to. Perhaps we would also have a "pg_notsuperuser_admin" which would be GRANT'd all the capabilities, excluding the ones that could be used to gain superuser access. As has also been discussed recently, one of the big missing capabilities for a "pg_notsuperuser_admin" is a 'create role' capability. I realize that's not exactly the same as GUCs but it's a big part of what's missing to make all of this "run a service where the 'DBA' can do everything except get out to the OS" stuff work out of the box. Thanks, Stephen signature.asc Description: PGP signature
PG in container w/ pid namespace is init, process exits cause restart
Hi, A colleague debugged an issue where their postgres was occasionally crash-restarting under load. The cause turned out to be that a relatively complex archive_command was used, which could in some rare circumstances have a bash subshell pipeline not succeed. It wasn't at all obvious why that'd cause a crash though - the archive command handles the error. The issue turns out to be that postgres was in a container, with pid namespaces enabled. Because postgres was run directly in the container, without a parent process inside, it thus becomes pid 1. Which mostly works without a problem. Until, as the case here with the archive command, a sub-sub process exits while it still has a child. Then that child gets re-parented to postmaster (as init). Such a child is likely to have exited not just with 0 or 1, but something else. As the pid won't match anything in reaper(), we'll go to CleanupBackend(). Where any exit status but 0/1 will unconditionally trigger a restart:

    if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
    {
        HandleChildCrash(pid, exitstatus, _("server process"));
        return;
    }

This kind of thing is pretty hard to debug, because it's not easy to even figure out what the "crashing" pid belonged to. I wonder if we should work a bit harder to try to identify whether an exiting process was a "server process" before identifying it as such? And perhaps we ought to warn about postgres running as "init" unless we make that robust? Greetings, Andres Freund
Re: [PATCH] Identify LWLocks in tracepoints
On 30.04.21 05:22, Craig Ringer wrote: On Thu, 29 Apr 2021 at 15:31, Peter Eisentraut wrote: So if you could produce a separate patch that adds the _ENABLED guards targeting PG14 (and PG13), that would be helpful. Here is a proposed patch for this. LGTM. Applies and builds fine on master and (with default fuzz) on REL_13_STABLE. Works as expected. committed
Re: MaxOffsetNumber for Table AMs
On Mon, 3 May 2021 at 20:43, Peter Geoghegan wrote: > > On Mon, May 3, 2021 at 10:57 AM Jeff Davis wrote: > > For the purposes of this discussion, what's making my life difficult is > > that we don't have a good definition for TID, leaving me with two > > options: > > > > 1. be overly conservative, accept MaxOffsetNumber=2048, wasting a > > bunch of address space; or > > tidbitmap.c uses MaxHeapTuplesPerPage as its MAX_TUPLES_PER_PAGE, > which is much lower than MaxOffsetNumber (it's almost 10x lower). I > wonder what that means for your design. One could relatively easily disable bitmap scans on the table AM by not installing the relevant bitmap support functions on the registered TableAM structure, and thus not touch that problem. Some indexes will then never be accessed due to the bitmap scan requirement of their IndexAM (gin, brin, bloom, to name a few), and as such won't make sense to create on that table, but that's about it I think. We might want to add some safeguards that bitmapscan-only indexams aren't used on tableams that don't support it, but I believe that's a nice-to-have and not critical, on a similar level to the deduplication of constraint indexes. With regards, Matthias van de Meent
Re: Granting control of SUSET gucs to non-superusers
On Mon, May 3, 2021 at 2:48 PM Tom Lane wrote: > I'm still of the opinion that slicing and dicing this at the per-GUC > level is a huge waste of effort. Just invent one role that lets > grantees set any GUC, document it as being superuser-equivalent, > and be done. If you want to grant someone full superuser rights, you can do that already. The trick is what to do when you want someone to be able to administer the cluster in a meaningful way without giving them full superuser rights. I agree that in some cases it's fine to have predefined roles that are known to permit easy escalation to superuser privileges, like pg_execute_server_program. It doesn't provide any real security, but like you said, it can help prevent mistakes. However, there is a real use cases for a privileged user who cannot be permitted to escalate to superuser or to the OS account, but still needs to be able to do some administration of the cluster. The scenario Mark laid out in his original post is very common. In fact, it may already be the dominant model for PostgreSQL deployment, and if it isn't now, it will be in 5 years. Letting each individual company that's providing a hosted PostgreSQL solution hack up its own solution to that problem, all of which are subtly incompatible with each other and with upstream, is not good for users or the project. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Some oversights in query_id calculation
On Sun, May 2, 2021 at 12:27:37PM +0800, Julien Rouhaud wrote: > Hi Aleksander, > > On Wed, Apr 28, 2021 at 03:22:39PM +0300, Aleksander Alekseev wrote: > > Hi Julien, > > > > > You should see failures doing a check-world or simply a make -C > > > contrib/pg_stat_statements check > > > > Sorry, my bad. I was running make check-world, but did it with -j4 flag > > which was a mistake. > > > > The patch is OK. > > Thanks for reviewing! Patch applied, thanks. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Granting control of SUSET gucs to non-superusers
On Mon, May 3, 2021 at 2:41 PM Stephen Frost wrote: > In general, I agree that we should be looking at predefined roles as > being similar to the Linux capabilities system- defining certain kinds > of operations which the user who has that role is allowed to do, and > then both in-core and extensions can make decisions based on what > capabilities the user has been GRANT'd. Cool. > Hopefully that would limit the amount of cases where a given capability > ends up being overly broad while at the same time allowing extensions to > sensibly be able to use the defined capabilities for their own needs. Yeah. I think it will be a little tricky to get right, as some of the cases are a bit subjective, I think. > As we do in other places, we should make it clear when a certain > capability allows a user with that capability to gain superuser access > as that may not always be clear to a user. +1. > One thing that seems missing from this discussion and is part of what > paused my effort on the 'admin' role proposed towards the end of the > last cycle is that we really need to consider how this all plays with > ALTER SYSTEM and not just SUSET GUCs but also other (eg: POSTMASTER, > SIGHUP) GUCs. That is- imv we should have a sensible solution for > more-or-less all GUCs and which would allow a non-superuser to be able > to set POSTMASTER and SIGHUP GUCs (and perhaps others..) through > ALTER SYSTEM. I missed the earlier discussion on this topic, but I agree that this is very important. I think that the discussion of capabilities might help us get there. For instance, if I'm a service provider, and I give user "bob" the pg_put_whatever_you_want_in_the_server_log role, and GUCs are tagged so we know what GUCs that affects, then it seems natural to me to allow Bob to set those GUCs via ALTER SYSTEM as well as via ALTER USER or ALTER DATABASE. However, if I don't give him the pg_frob_shell_commands role, he can't set archive_command. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Granting control of SUSET gucs to non-superusers
Stephen Frost writes: > One thing that seems missing from this discussion and is part of what > paused my effort on the 'admin' role proposed towards the end of the > last cycle is that we really need to consider how this all plays with > ALTER SYSTEM and not just SUSET GUCs but also other (eg: POSTMASTER, > SIGHUP) GUCs. Yeah, I'd meant to bring that up too. The ability to use ALTER SYSTEM freely is probably a much bigger use-case than messing with SUSET variables within one's own session. I'm still of the opinion that slicing and dicing this at the per-GUC level is a huge waste of effort. Just invent one role that lets grantees set any GUC, document it as being superuser-equivalent, and be done. regards, tom lane
Re: MaxOffsetNumber for Table AMs
On Mon, May 3, 2021 at 10:57 AM Jeff Davis wrote: > For the purposes of this discussion, what's making my life difficult is > that we don't have a good definition for TID, leaving me with two > options: > > 1. be overly conservative, accept MaxOffsetNumber=2048, wasting a > bunch of address space; or tidbitmap.c uses MaxHeapTuplesPerPage as its MAX_TUPLES_PER_PAGE, which is much lower than MaxOffsetNumber (it's almost 10x lower). I wonder what that means for your design. > 2. risk the mapping between TID and row number could break at any > time Though this clearly is the immediate problem for you, I think that the real problem is that the table AM kind of tacitly assumes that there is a universality to item pointer TIDs -- which is obviously not true. It might be useful for you to know what assumptions index AMs can make about how TIDs work in general, but I think that you really need an index-AM level infrastructure that advertises the capabilities of each index AM with respect to handling each possible variation (I suppose you have heapam, 6 byte uint, and maybe 8 byte uint). The easiest reasonable short term design for you is probably to find a way to make 6 byte TIDs into 48-bit unsigned integers (perhaps only conceptually), at least in contexts where the columnar table AM is used. You'll still need the index AM for that. This at least makes 64-bit TID-like identifiers a relatively simple conceptual shift. -- Peter Geoghegan
Re: Granting control of SUSET gucs to non-superusers
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Mon, May 3, 2021 at 12:25 PM Mark Dilger > wrote: > > As things stand, all custom variables defined via the > > DefineCustom{Bool,Int,Real,String,Enum}Variable are placed in the > > CUSTOM_OPTIONS config_group. We could add a role for controlling any SUSET > > CUSTOM_OPTIONS GUCs, or we could extend those functions to take a > > config_group option, or perhaps some of both. I haven't thought too much > > yet about whether allowing extensions to place a custom GUC into one of the > > predefined groups would be problematic. Any thoughts on that? > > Well... > > One idea would be to get rid of PGC_SUSET altogether and instead have > a set of flags associated with each GUC, like PGF_SERVER_LOG, > PGF_CORRUPT_DATA, PGF_CRASH_SERVER. Then you could associate those > flags with particular predefined roles and grant them out to whoever > you want. > > So if a GUC is flagged PGF_SERVER_LOG|PGF_CRASH_SERVER, then the > assumption is that it's security-sensitive because it both lets you > alter the contents of the server log and also lets you crash the > server. If you are granted both pg_server_log and pg_crash_server, you > can set it, otherwise not. > > This is just wild brainstorming, but my point is that I don't think > doing it by options groups is particularly good, because it doesn't > really have any relationship to why those things are marked SUSET in > the first place. To take an example involving functions rather than > GUCs, the pageinspect functions are super-user only because you can > crash the server by inspecting malformed data that you supply as an > arbitrarily literal, but AFAIK the functions in pgstattuple have no > similar hazard, and are just super-only because we don't really know > who the superuser wants to authorize, and maybe it's not everybody. So > those cases are really different, even though both are extensions. I > think the same likely holds true for GUCs. 
In general, I agree that we should be looking at predefined roles as being similar to the Linux capabilities system- defining certain kinds of operations which the user who has that role is allowed to do, and then both in-core and extensions can make decisions based on what capabilities the user has been GRANT'd. Hopefully that would limit the amount of cases where a given capability ends up being overly broad while at the same time allowing extensions to sensibly be able to use the defined capabilities for their own needs. As we do in other places, we should make it clear when a certain capability allows a user with that capability to gain superuser access as that may not always be clear to a user. One thing that seems missing from this discussion and is part of what paused my effort on the 'admin' role proposed towards the end of the last cycle is that we really need to consider how this all plays with ALTER SYSTEM and not just SUSET GUCs but also other (eg: POSTMASTER, SIGHUP) GUCs. That is- imv we should have a sensible solution for more-or-less all GUCs and which would allow a non-superuser to be able to set POSTMASTER and SIGHUP GUCs (and perhaps others..) through ALTER SYSTEM. Thanks, Stephen signature.asc Description: PGP signature
Re: MaxOffsetNumber for Table AMs
On Mon, 2021-05-03 at 10:38 -0700, Peter Geoghegan wrote: > I don't think it's much good to just do that. You probably need a > full > 64-bits for something like a column store. But that's all you need. I would definitely like that for citus columnar, and it would definitely make it easier to manage the address space, but I won't demand it today. 48 bits is a workable tuple address space for many purposes, especially when you factor in logical partitioning. I will be dealing with gaps though, so wasting 5 bits of address space (2^16 / MaxOffsetNumber = 32) to bring it down to 43 bits is not great. Regards, Jeff Davis
Re: MaxOffsetNumber for Table AMs
On Mon, 2021-05-03 at 13:22 -0400, Robert Haas wrote: > to look and work like heap TIDs; that is, there had better be a block > number portion and an item number portion, Right (at least for now). > and the item number had > better be smaller than MaxOffsetNumber, That's not clear to me at all, and is the whole reason I began this thread. a. You say "smaller than MaxOffsetNumber", but that's a little weird. If an offset can't be MaxOffsetNumber, it's not really the maximum, is it? b. If you actually meant "less than or equal to MaxOffsetNumber", that will fail with the GIN posting list issue raised in my first email. Do you agree that's a bug? c. Why can't we go all the way up to MovedPartitionsOffsetNumber - 1? Right now, MaxOffsetNumber is poorly named, because it actually represents a number slightly higher than the maximum number of items that can fit on a page. That essentially wastes 5 bits of address space for no obvious reason. > and if you want bitmap scans > to run reasonably quickly, the block number had also better > correspond > to physical locality to some degree. Right (at least for now). Regards, Jeff Davis
Re: Incorrect snapshots while promoting hot standby node when 2PC is used
Hi, On 2021-05-01 17:35:09 +0500, Andrey Borodin wrote: > I'm investigating somewhat resemblant case. > We have an OLTP sharded installation where shards are almost always under > rebalancing. Data movement is implemented with 2pc. > Switchover happens quite often due to datacenter drills. The installation is > running on PostgreSQL 12.6. If you still have the data it would be useful if you could check if the LSNs of the corrupted pages are LSNs from shortly after standby promotion/switchover? > Can this case be related to the problem that you described? It seems possible, but it's hard to know without a lot more information. > Or, perhaps, it looks more like a hardware problem? Data_checksums are > on, but few years ago we observed ssd firmware that was loosing > updates, but passing checksums. I'm sure that we would benefit from > having separate relation fork for checksums or LSNs. Right - checksums are "page local". They can only detect if a page is corrupted, not if e.g. an older version of the page (with correct checksum) has been restored. While there are schemes to have stronger error detection properties, they do come with substantial overhead (at least the ones I can think of right now). Greetings, Andres Freund
Re: MaxOffsetNumber for Table AMs
On Mon, May 3, 2021 at 10:22 AM Matthias van de Meent wrote: > For IoT, as far as I know, one of the constraints is that there exists > some unique constraint on the table, which also defines the ordering. > Assuming that that is the case, we can use <unique key> + <transaction id> to identify tuple versions. Perhaps that's true in theory, but the resulting design seems likely to be useless in the end. In any case I believe that systems that generally use a heap but give you the choice of using an IoT (I'm really thinking of Oracle) tend to not have many users that actually avail of IoTs. On modern flash storage the trade-off made by an IoT or clustered index design seems like the wrong one on average. You're saving about 1 I/O on average with a PK lookup, which just isn't that much of an upside compared to the many downsides. -- Peter Geoghegan
Re: MaxOffsetNumber for Table AMs
On Mon, 2021-05-03 at 09:59 -0700, Peter Geoghegan wrote: > You don't accept any of that, though. Fair enough. I predict that > avoiding making a hard choice will make Jeff's work here a lot > harder, > though. For the purposes of this discussion, what's making my life difficult is that we don't have a good definition for TID, leaving me with two options: 1. be overly conservative, accept MaxOffsetNumber=2048, wasting a bunch of address space; or 2. risk the mapping between TID and row number could break at any time And compounding that, it seems that there's a bug in GIN that doesn't honor MaxOffsetNumber, so actually neither of the rules above work either. Instead, I need to use 2047 as the max offset number, which has no real basis in the postgres design, but I'd be stuck with it for a long time. What I'm looking for is: * A declaration of what the actual maximum valid offset number is, and that it will be stable enough to use for table AMs for now. (This maximum valid offset number may or may not be called MaxOffsetNumber, and may or may not be tied to the maximum number of items that fit on a page.) * A confirmation that this GIN behavior is a bug that should be fixed, now that there are table AMs in existence that need it fixed. Even if we fix this in v15, we still need some guidance for what table AMs should do in earlier versions. If we change the way tuple IDs are represented or the table AM in v15 or beyond, that may require a REINDEX for indexes on some table AMs. As long as we have some robust way to check that a REINDEX is necessary, that's fine with me. Regards, Jeff Davis
Re: function for testing that causes the backend to terminate
Joe, Thanks, This works and I don't have to install anything! Dave Cramer On Thu, 29 Apr 2021 at 16:16, Joe Conway wrote: > On 4/29/21 6:56 AM, Dave Cramer wrote: > > For testing unusual situations I'd like to be able to cause a backend to > > terminate due to something like a segfault. Do we currently have this in > > testing ? > > If you can run SQL as a superuser from that backend, try: > > COPY (SELECT pg_backend_pid()) > TO PROGRAM 'xargs kill -SIGSEGV'; > > HTH, > > Joe > > -- > Crunchy Data - http://crunchydata.com > PostgreSQL Support for Secure Enterprises > Consulting, Training, & Open Source Development >
Re: Granting control of SUSET gucs to non-superusers
On 05/03/21 13:23, Robert Haas wrote: > On Mon, May 3, 2021 at 11:45 AM Chapman Flack wrote: >> I guess I was thinking, but forgot to convey to the keyboard, that the >> existence of a non-empty extra_check_hooks chain on a SUSET GUC (which >> could only have been attached from a shared preload library) would >> implicitly change SUSET to mean settable whenever accepted by the hook(s). > > Sure, but the hook still needs a way to know which users are entitled > to set the GUC. I was contemplating a version 0 with only that minimal support in core for allowing a shared preload library to set such hooks (and allowing include-with-a-role in config files), assuming service providers already do sophisticated building of stuff to construct the environments they provide, and a C shared object with hooks that enforce their designed constraints wouldn't be an onerous burden on top of that. Such providers could then be the laboratories of democracy building various forms of such things, and if one of those ends up having a reasonably general configuration mechanism and gets offered as a contrib module later or for inclusion in core, well, that's version 0.5. Regards, -Chap
Re: MaxOffsetNumber for Table AMs
On Mon, May 3, 2021 at 10:22 AM Robert Haas wrote: > I don't really think so, or at least I don't see a reason why it > should. As things stand today, I don't think it's possible for a table > AM author to make any other choice than to assume that their TIDs have > to look and work like heap TIDs; that is, there had better be a block > number portion and an item number portion, and the item number had > better be smaller than MaxOffsetNumber, and if you want bitmap scans > to run reasonably quickly, the block number had also better correspond > to physical locality to some degree. It's not clear to me how exactly > someone would go about fixing all of that, but I think it would be > great if they did. Even if that person wanted to assume for purposes > of their own patch that fixed-width, integer-like TIDs are the only > thing we care about, that would be fine with me. Getting to a point > where the available 48 bits can be used in whatever way the table AM > author wants is clearly better than what we have now. I don't think it's much good to just do that. You probably need a full 64-bits for something like a column store. But that's all you need. > Now I'm personally of the opinion that we shouldn't be content to stop > there, but so what? I'm not insisting that Jeff or anyone else has to > work on this problem, or that they have to fix more of it rather than > less. I hope that nobody's going to try to back us into a corner by > making design decisions that deliberately complicate the possibility > of future improvements in that area, and that's about it. I don't > really understand why you think that's unreasonable, or even > problematic. I can't see that any way in which the assumption that we > will NEVER want to further generalize the TID concept simplifies > anything anyone wants to do today. It creates ambiguity of the kind that deters related improvements. 
I for one am not comfortable with (say) working on generalizing TID to the extent required to facilitate Jeff's work if that obligates me to make some legalistic and wholly insincere statement about future improvements to the definition of TID still being quite possible (to facilitate indirect indexes, or whatever). The truth is that I cannot possibly know if facilitating Jeff's work in the short term blocks off other things in the long term -- because I don't actually have a clue how these other things could ever really be implemented sensibly in any case. -- Peter Geoghegan
Re: Granting control of SUSET gucs to non-superusers
On Mon, May 3, 2021 at 12:25 PM Mark Dilger wrote: > As things stand, all custom variables defined via the > DefineCustom{Bool,Int,Real,String,Enum}Variable are placed in the > CUSTOM_OPTIONS config_group. We could add a role for controlling any SUSET > CUSTOM_OPTIONS GUCs, or we could extend those functions to take a > config_group option, or perhaps some of both. I haven't thought too much yet > about whether allowing extensions to place a custom GUC into one of the > predefined groups would be problematic. Any thoughts on that? Well... One idea would be to get rid of PGC_SUSET altogether and instead have a set of flags associated with each GUC, like PGF_SERVER_LOG, PGF_CORRUPT_DATA, PGF_CRASH_SERVER. Then you could associate those flags with particular predefined roles and grant them out to whoever you want. So if a GUC is flagged PGF_SERVER_LOG|PGF_CRASH_SERVER, then the assumption is that it's security-sensitive because it both lets you alter the contents of the server log and also lets you crash the server. If you are granted both pg_server_log and pg_crash_server, you can set it, otherwise not. This is just wild brainstorming, but my point is that I don't think doing it by options groups is particularly good, because it doesn't really have any relationship to why those things are marked SUSET in the first place. To take an example involving functions rather than GUCs, the pageinspect functions are super-user only because you can crash the server by inspecting malformed data that you supply as an arbitrary literal, but AFAIK the functions in pgstattuple have no similar hazard, and are just super-only because we don't really know who the superuser wants to authorize, and maybe it's not everybody. So those cases are really different, even though both are extensions. I think the same likely holds true for GUCs. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Granting control of SUSET gucs to non-superusers
On Mon, May 3, 2021 at 11:45 AM Chapman Flack wrote: > I guess I was thinking, but forgot to convey to the keyboard, that the > existence of a non-empty extra_check_hooks chain on a SUSET GUC (which > could only have been attached from a shared preload library) would > implicitly change SUSET to mean settable whenever accepted by the hook(s). Sure, but the hook still needs a way to know which users are entitled to set the GUC. -- Robert Haas EDB: http://www.enterprisedb.com
Re: MaxOffsetNumber for Table AMs
On Mon, 3 May 2021 at 19:00, Peter Geoghegan wrote: > > On Mon, May 3, 2021 at 9:45 AM Robert Haas wrote: > > But if you're saying those identifiers have to be fixed-width and 48 > > (or even 64) bits, I disagree that we wish to have such a requirement > > in perpetuity. > > Once you require that TID-like identifiers must point to particular > versions (as opposed to particular logical rows), you also virtually > require that the identifiers must always be integer-like (though not > necessarily block-based and not necessarily 6 bytes). You've > practically ensured that clustered index tables (and indirect indexes) > will never be possible by accepting this. For IoT, as far as I know, one of the constraints is that there exists some unique constraint on the table, which also defines the ordering. Assuming that that is the case, we can use <unique key> + <transaction id> to identify tuple versions. With regards, Matthias van de Meent.
Re: MaxOffsetNumber for Table AMs
On Mon, May 3, 2021 at 1:00 PM Peter Geoghegan wrote: > You don't accept any of that, though. Fair enough. I predict that > avoiding making a hard choice will make Jeff's work here a lot harder, > though. I don't really think so, or at least I don't see a reason why it should. As things stand today, I don't think it's possible for a table AM author to make any other choice than to assume that their TIDs have to look and work like heap TIDs; that is, there had better be a block number portion and an item number portion, and the item number had better be smaller than MaxOffsetNumber, and if you want bitmap scans to run reasonably quickly, the block number had also better correspond to physical locality to some degree. It's not clear to me how exactly someone would go about fixing all of that, but I think it would be great if they did. Even if that person wanted to assume for purposes of their own patch that fixed-width, integer-like TIDs are the only thing we care about, that would be fine with me. Getting to a point where the available 48 bits can be used in whatever way the table AM author wants is clearly better than what we have now. Now I'm personally of the opinion that we shouldn't be content to stop there, but so what? I'm not insisting that Jeff or anyone else has to work on this problem, or that they have to fix more of it rather than less. I hope that nobody's going to try to back us into a corner by making design decisions that deliberately complicate the possibility of future improvements in that area, and that's about it. I don't really understand why you think that's unreasonable, or even problematic. I can't see any way in which the assumption that we will NEVER want to further generalize the TID concept simplifies anything anyone wants to do today. -- Robert Haas EDB: http://www.enterprisedb.com
Re: MaxOffsetNumber for Table AMs
On Mon, May 3, 2021 at 9:45 AM Robert Haas wrote: > But if you're saying those identifiers have to be fixed-width and 48 > (or even 64) bits, I disagree that we wish to have such a requirement > in perpetuity. Once you require that TID-like identifiers must point to particular versions (as opposed to particular logical rows), you also virtually require that the identifiers must always be integer-like (though not necessarily block-based and not necessarily 6 bytes). You've practically ensured that clustered index tables (and indirect indexes) will never be possible by accepting this. Those designs are the only real reason to have truly variable-length TID-like identifiers IMV (as opposed to 2 or perhaps even 3 standard TID widths). You don't accept any of that, though. Fair enough. I predict that avoiding making a hard choice will make Jeff's work here a lot harder, though. -- Peter Geoghegan
Re: MaxOffsetNumber for Table AMs
On Mon, May 3, 2021 at 11:26 AM Peter Geoghegan wrote: > It just has to be able to accept the restriction that > indexes must have a unique TID-like identifier for each version (not > quite a version actually -- whatever the equivalent of a HOT chain > is). This is a restriction that Jeff had pretty much planned on > working within before starting this thread (I know this because we > spoke about it privately). Well, I think what I'm saying is that I'm not on board with such a restriction. If you're just saying that it has to be possible to identify rows somehow, I am in full agreement, and I think the universe is on board as well. But if you're saying those identifiers have to be fixed-width and 48 (or even 64) bits, I disagree that we wish to have such a requirement in perpetuity. That'd be like going around to automobile manufacturers in 1925 and asking them to agree that all future cars ever manufactured must have a clutch. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_amcheck contrib application
On Fri, Apr 30, 2021 at 5:07 PM Robert Haas wrote: > On Fri, Apr 30, 2021 at 4:26 PM Mark Dilger > wrote: > > After further reflection, no other verbiage seems any better. I'd say go > > ahead and commit it this way. > > OK. I'll plan to do that on Monday, barring objections. Done now. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Granting control of SUSET gucs to non-superusers
> On May 3, 2021, at 8:22 AM, Robert Haas wrote: > > One problem with having a separate predefined role for every PGC_SUSET > GUC is that it's no help for extensions. Both auto_explain and > pg_stat_statements have such GUCs, and there may be out-of-core > extensions that do as well. We should try to come up with a system > that doesn't leave them out in the cold. As things stand, all custom variables defined via the DefineCustom{Bool,Int,Real,String,Enum}Variable are placed in the CUSTOM_OPTIONS config_group. We could add a role for controlling any SUSET CUSTOM_OPTIONS GUCs, or we could extend those functions to take a config_group option, or perhaps some of both. I haven't thought too much yet about whether allowing extensions to place a custom GUC into one of the predefined groups would be problematic. Any thoughts on that? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: MaxOffsetNumber for Table AMs
On Mon, May 3, 2021 at 8:03 AM Robert Haas wrote: > It's reasonable to wonder. I think it depends on whether the problem > is bloat or just general slowness. To the extent that the problem is > bloat, bottom-index deletion will help a lot, but it's not going to > help with slowness because, as you say, we still have to dirty the > pages. And I am pretty confident that slowness is a very significant > part of the problem here. It's all going to depend on workload of course -- we'll need to wait and see what users still complain about with Postgres 14 to really have some idea. You only freshly dirty those leaf pages that weren't already dirty, and the costs will be much more linear, so it's a complicated question. Here is a more modest statement that might be more convincing: The *relative* importance of making something like HOT more robust to things like long-running xacts has increased now that we have bottom-up index deletion. We could improve things here by adding something like zheap, which allows a HOT chain to mostly live in UNDO, and therefore pretty much become arbitrarily long. This seems plausible because users will accept that UPDATEs that modify one or more indexed columns kinda suck, as long as there is never any truly pathological performance. Whereas users will not easily accept that HOT (or something like it) doesn't quite work well enough to make relation sizes stable when they avoid updating indexed columns. I don't think that even the absence of UPDATEs that logically modify indexes and the absence of long running transactions (that hold up cleanup) is sufficient to make HOT work well enough to keep table sizes stable over time. Minor inefficiencies (e.g. LP_DEAD line pointer bloat) will tend to aggregate over time, leading to heap fragmentation. -- Peter Geoghegan
Re: Granting control of SUSET gucs to non-superusers
On 05/03/21 11:22, Robert Haas wrote: >> The GUC system would have to expose a way for the shared object to >> chain extra_check_hooks off existing GUCs. An extra_check_hook can check >> both the value and the role of the caller. > > I think there are two parts to this problem. First, the SP needs to be > able to delegate to some users but not others the ability to set > superuser GUCs. Second, the SP needs to be able to control which GUCs > the privileged users get to set and perhaps to what values. A hook of > the type you propose here seems like it might work reasonably well for > that second part, but it's not totally obvious to me how it helps with > the first part. I guess I was thinking, but forgot to convey to the keyboard, that the existence of a non-empty extra_check_hooks chain on a SUSET GUC (which could only have been attached from a shared preload library) would implicitly change SUSET to mean settable whenever accepted by the hook(s). Regards, -Chap
Re: MaxOffsetNumber for Table AMs
On Mon, May 3, 2021 at 7:41 AM Robert Haas wrote: > So here. The complexity of getting a table AM that does anything > non-trivial working is formidable, and I don't expect it to happen > right away. Picking one that is essentially block-based and can use > 48-bit TIDs is very likely the right initial target because that's the > closest we have now, and there's no sense attacking the hardest > variant of the problem first. It doesn't have to be block-based -- that's not what Jeff is proposing. It just has to be able to accept the restriction that indexes must have a unique TID-like identifier for each version (not quite a version actually -- whatever the equivalent of a HOT chain is). This is a restriction that Jeff had pretty much planned on working within before starting this thread (I know this because we spoke about it privately). It's quite possible to rule out an index-organized table design without ruling out a column store with logical TID-like identifiers, that aren't block-based. It's fair to wonder if not tightening up the rules for TID-like identifiers is actually helping table AM authors in practice. I think it's actually making things harder. -- Peter Geoghegan
Re: Granting control of SUSET gucs to non-superusers
On Sat, May 1, 2021 at 12:37 PM Chapman Flack wrote: > Maybe version 0 is where the provider just builds a shared object > to go in shared_preload_libraries. The provider has probably already > done a bunch of other stuff more challenging than that. > > The GUC system would have to expose a way for the shared object to > chain extra_check_hooks off existing GUCs. An extra_check_hook can check > both the value and the role of the caller. I think there are two parts to this problem. First, the SP needs to be able to delegate to some users but not others the ability to set superuser GUCs. Second, the SP needs to be able to control which GUCs the privileged users get to set and perhaps to what values. A hook of the type you propose here seems like it might work reasonably well for that second part, but it's not totally obvious to me how it helps with the first part. Instead of going to the extreme of one predefined role per GUC, maybe we could see if the PGC_SUSET GUCs could be divided into buckets based on the reason they are so marked? For example, log_parser_stats, log_planner_stats, log_executor_stats, log_statement_stats, log_btree_build_stats, trace_locks, trace_userlocks, trace_lwlocks, log_min_duration_statement, and a bunch of others are probably all SUSET just on the theory that only the superuser should have the right to control what ends up in the log. But we could make a predefined role that represents the right to control what ends up in the log, and then all of those GUCs could be tied to that role. Is that too coarse-grained? It might be. One problem with having a separate predefined role for every PGC_SUSET GUC is that it's no help for extensions. Both auto_explain and pg_stat_statements have such GUCs, and there may be out-of-core extensions that do as well. We should try to come up with a system that doesn't leave them out in the cold. -- Robert Haas EDB: http://www.enterprisedb.com
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 6:19 PM Peter Geoghegan wrote: > A remaining problem is that we must generate a new round of index > tuples for each and every index when only one indexed column is > logically modified by an UPDATE statement. I think that this is much > less of a problem now due to bottom-up index deletion. Sure, it sucks > that we still have to dirty the page at all. But it's nevertheless > true that it all but eliminates version-driven page splits, which are > where almost all of the remaining downside is. It's very reasonable to > now wonder if this particular all-indexes problem is worth solving at > all in light of that. (Modern hardware characteristics also make a > comprehensive fix less valuable in practice.) It's reasonable to wonder. I think it depends on whether the problem is bloat or just general slowness. To the extent that the problem is bloat, bottom-up index deletion will help a lot, but it's not going to help with slowness because, as you say, we still have to dirty the pages. And I am pretty confident that slowness is a very significant part of the problem here. It's pretty common for people migrating from another database system to have, for example, a table with 10 indexes and then repeatedly update a column that is covered by only one of those indexes. Now, with bottom-up index deletion, this should cause a lot less bloat, and that's good. But you still have to update all 10 indexes in the foreground, and that's bad, because the alternative is to find just the one affected index and update it twice -- once to insert the new tuple, and a second time to delete-mark the old tuple. 10 is a lot more than 2, and that's even ignoring the cost of deferred cleanup on the other 9 indexes. So I don't really expect this to get us out of the woods. Somebody whose workload runs five times slower on a pristine data load is quite likely to give up on using PostgreSQL before bloat even enters the picture. -- Robert Haas EDB: http://www.enterprisedb.com
Re: strange error reporting
Robert Haas writes: > On Mon, May 3, 2021 at 6:08 AM Peter Eisentraut > wrote: >> Throwing the socket address in there seems a bit distracting and >> misleading, and it also pushes off the actual information very far to >> the end. (Also, in some cases the socket path is very long, making the >> actual information even harder to find.) By the time you get to this >> error, you have already connected, so mentioning the server address >> seems secondary at best. > It feels a little counterintuitive to me too but I am nevertheless > inclined to believe that it's an improvement. When multi-host > connection strings are used, the server address may not be clear. In > fact, even when they're not, it may not be clear to a new user that > socket communication is used, and it may not be clear where the socket > is located. Yeah. The specific problem I'm concerned about solving here is "I wasn't connecting to the server I thought I was", which could be a contributing factor in almost any connection-time failure. The multi-host-connection-string feature made that issue noticeably worse, but surely we've all seen trouble reports that boiled down to that even before that feature came in. As you say, we could perhaps redesign the messages to provide this info in another order. But it'd be difficult, and I think it might come out even more confusing in cases where libpq tried several servers on the way to finally failing. The old code's error reporting for such cases completely sucked, whereas now you get a reasonably complete trace of the attempts. As a quick example, for a case of bad hostname followed by wrong port:

$ psql -d "host=foo1,sss2 port=5432,5342"
psql: error: could not translate host name "foo1" to address: Name or service not known
connection to server at "sss2" (192.168.1.48), port 5342 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
v13 renders this as

$ psql -d "host=foo1,sss2 port=5432,5342"
psql: error: could not translate host name "foo1" to address: Name or service not known
could not connect to server: Connection refused
        Is the server running on host "sss2" (192.168.1.48) and accepting
        TCP/IP connections on port 5342?

Now, of course the big problem there is the lack of consistency about how the two errors are laid out; but I'd argue that putting the server identity info first is better than putting it later. Also, if you experiment with other cases such as some of the servers complaining about wrong user name, the old behavior is even harder to follow about which server said what. regards, tom lane
Re: MaxOffsetNumber for Table AMs
On Fri, Apr 30, 2021 at 5:22 PM Peter Geoghegan wrote: > I strongly suspect that index-organized tables (or indirect indexes, > or anything else that assumes that TID-like identifiers map directly > to logical rows as opposed to physical versions) are going to break > too many assumptions to ever be tractable. Assuming I have that right, > it would advance the discussion if we could all agree on that being a > non-goal for the tableam interface in general. I *emphatically* disagree with the idea of ruling such things out categorically. This is just as naive as the TODO's statement that we do not want "All backends running as threads in a single process". Does anyone really believe that we don't want that any more? I believed it 10 years ago, but not any more. It's costing us very substantially not only in that it makes parallel query more complicated and fragile, but more importantly in that we can't scale up to connection counts that other databases can handle because we use up too many operating system resources. Supporting threading in PostgreSQL isn't a project that someone will pull off over a long weekend and it's not something that has to be done tomorrow, but it's pretty clearly the future. So here. The complexity of getting a table AM that does anything non-trivial working is formidable, and I don't expect it to happen right away. Picking one that is essentially block-based and can use 48-bit TIDs is very likely the right initial target because that's the closest we have now, and there's no sense attacking the hardest variant of the problem first. However, as with the threads-vs-processes example, I strongly suspect that having only one table AM is leaving vast amounts of performance on the table. To say that we're never going to pursue the parts of that space that require a different kind of tuple identifier is to permanently write off tons of ideas that have produced promising results in other systems. Let's not do that. 
-- Robert Haas EDB: http://www.enterprisedb.com
Re: Lowering the ever-growing heap->pd_lower
On Mon, 3 May 2021 at 16:26, John Naylor wrote: > > On Sat, Apr 3, 2021 at 10:07 PM Peter Geoghegan wrote: > > I would like to deal with this work within the scope of the project > > we're discussing over on the "New IndexAM API controlling index vacuum > > strategies" thread. The latest revision of that patch series includes > > a modified version of your patch: > > > > https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopjmhfqj_g1raj4gwr3z...@mail.gmail.com > > > > Please take discussion around this project over to that other thread. > > There are a variety of issues that can only really be discussed in > > that context. > > Since that work has been committed as of 3c3b8a4b2689, I've marked this CF > entry as committed. I disagree that this work has been fully committed. A derivative was committed that would solve part of the problem, but it doesn't cover all problem cases. I believe that I voiced such concern in the other thread as well. As such, I am planning on fixing this patch sometime before the next commit fest so that we can truncate the LP array during HOT pruning as well, instead of only doing so in the 2nd VACUUM pass. This is especially relevant on pages where HOT is highly effective, but vacuum can't keep up and many unused (former HOT) line pointers now exist on the page. With regards, Matthias van de Meent
Re: Identify missing publications from publisher while create/alter subscription.
On Mon, May 3, 2021 at 11:11 AM Bharath Rupireddy wrote: > > On Sun, May 2, 2021 at 10:04 PM vignesh C wrote: > > > 5) Instead of adding a new file 021_validate_publications.pl for > > > tests, spawning a new test database which would make the overall > > > regression slower, can't we add with the existing database nodes in > > > 0001_rep_changes.pl? I would suggest adding the tests in there even if > > > the number of tests are many, I don't mind. > > > > 001_rep_changes.pl has the changes mainly for checking the replicated > > data. I did not find an appropriate file in the current tap tests, I > > preferred these tests to be in a separate file. Thoughts? > > If 001_rep_changes.pl is not the right place, how about adding them > into 007_ddl.pl? That file seems to be only for DDL changes, and since > the feature test cases are for CREATE/ALTER SUBSCRIPTION, it's the > right place. I strongly feel that we don't need a new file for these > tests. > Modified. > Comment on the tests: > 1) Instead of "pub_doesnt_exist" name, how about "non_existent_pub" or > just pub_non_existent" or some other? > + "CREATE SUBSCRIPTION testsub2 CONNECTION '$publisher_connstr' > PUBLICATION pub_doesnt_exist WITH (VALIDATE_PUBLICATION = TRUE)" > The error message with this name looks a bit odd to me. > + /ERROR: publication "pub_doesnt_exist" does not exist in > the publisher/, Modified. Thanks for the comments, these comments are handled in the v7 patch posted in my earlier mail. Regards, Vignesh
Re: Identify missing publications from publisher while create/alter subscription.
On Mon, May 3, 2021 at 1:46 PM Dilip Kumar wrote: > > On Mon, May 3, 2021 at 10:48 AM Dilip Kumar wrote: > > > > On Sun, May 2, 2021 at 10:04 PM vignesh C wrote: > > > > > > Thanks for the comments. > > > The Attached patch has the fixes for the same. > > > > I was reviewing the documentation part, I think in the below paragraph > > we should include validate_publication as well? > > > > > > connect (boolean) > > > > > > Specifies whether the CREATE SUBSCRIPTION > > should connect to the publisher at all. Setting this to > > false will change default values of > > enabled, create_slot and > > copy_data to false. > > > > Modified. > > I will review/test the other parts of the patch and let you know. > > I have reviewed it and it mostly looks good to me. I have some minor > suggestions though. > > 1. > +/* > + * Check the specified publication(s) is(are) present in the publisher. > + */ > > vs > > + > +/* > + * Connect to the publisher and check if the publications exist. > + */ > > I think the formatting of the comments are not uniform. Some places > we are using "publication(s) is(are)" whereas other places are just > "publications". > Modified. > 2. Add a error case for connect=false and VALIDATE_PUBLICATION = true Added. Thanks for the comments, attached v7 patch has the fixes for the same. Thoughts? Regards, Vignesh From 09ae1cac60320bde08a6c380d3637eecdbb3d6ff Mon Sep 17 00:00:00 2001 From: vignesh Date: Wed, 7 Apr 2021 22:05:53 +0530 Subject: [PATCH v7] Identify missing publications from publisher while create/alter subscription. Creating/altering subscription is successful when we specify a publication which does not exist in the publisher. This patch checks if the specified publications are present in the publisher and throws an error if any of the publication is missing in the publisher. 
--- doc/src/sgml/ref/alter_subscription.sgml | 13 ++ doc/src/sgml/ref/create_subscription.sgml | 18 +- src/backend/commands/subscriptioncmds.c | 234 +++--- src/bin/psql/tab-complete.c | 7 +- src/test/subscription/t/007_ddl.pl| 68 ++- 5 files changed, 305 insertions(+), 35 deletions(-) diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml index 367ac814f4..81e156437b 100644 --- a/doc/src/sgml/ref/alter_subscription.sgml +++ b/doc/src/sgml/ref/alter_subscription.sgml @@ -160,6 +160,19 @@ ALTER SUBSCRIPTION name RENAME TO < + + +validate_publication (boolean) + + + When true, the command verifies if all the specified publications + that are being subscribed to are present in the publisher and throws + an error if any of the publication doesn't exist. The default is + false. + + + + diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml index e812beee37..cad9285c16 100644 --- a/doc/src/sgml/ref/create_subscription.sgml +++ b/doc/src/sgml/ref/create_subscription.sgml @@ -207,8 +207,9 @@ CREATE SUBSCRIPTION subscription_nameCREATE SUBSCRIPTION should connect to the publisher at all. Setting this to false will change default values of - enabled, create_slot and - copy_data to false. + enabled, create_slot, + copy_data and + validate_publication to false. @@ -239,6 +240,19 @@ CREATE SUBSCRIPTION subscription_name + + +validate_publication (boolean) + + + When true, the command verifies if all the specified publications + that are being subscribed to are present in the publisher and throws + an error if any of the publication doesn't exist. The default is + false. 
+ + + + diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c index 517c8edd3b..3de4488e4d 100644 --- a/src/backend/commands/subscriptioncmds.c +++ b/src/backend/commands/subscriptioncmds.c @@ -69,7 +69,9 @@ parse_subscription_options(List *options, char **synchronous_commit, bool *refresh, bool *binary_given, bool *binary, - bool *streaming_given, bool *streaming) + bool *streaming_given, bool *streaming, + bool *validate_publication_given, + bool *validate_publication) { ListCell *lc; bool connect_given = false; @@ -111,6 +113,12 @@ parse_subscription_options(List *options, *streaming = false; } + if (validate_publication) + { + *validate_publication_given = false; + *validate_publication = false; + } + /* Parse options */ foreach(lc, options) { @@ -215,6 +223,17 @@ parse_subscription_options(List *options, *streaming_given = tr
Re: Lowering the ever-growing heap->pd_lower
On Sat, Apr 3, 2021 at 10:07 PM Peter Geoghegan wrote: > I would like to deal with this work within the scope of the project > we're discussing over on the "New IndexAM API controlling index vacuum > strategies" thread. The latest revision of that patch series includes > a modified version of your patch: > > https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopjmhfqj_g1raj4gwr3z...@mail.gmail.com > > Please take discussion around this project over to that other thread. > There are a variety of issues that can only really be discussed in > that context. Since that work has been committed as of 3c3b8a4b2689, I've marked this CF entry as committed. -- John Naylor EDB: http://www.enterprisedb.com
Re: strange error reporting
On Mon, May 3, 2021 at 6:08 AM Peter Eisentraut wrote: > I find these new error messages to be more distracting than before in > some cases. For example: > > PG13: > > clusterdb: error: could not connect to database typo: FATAL: database > "typo" does not exist > > PG14: > > clusterdb: error: connection to server on socket "/tmp/.s.PGSQL.65432" > failed: FATAL: database "typo" does not exist > > Throwing the socket address in there seems a bit distracting and > misleading, and it also pushes off the actual information very far to > the end. (Also, in some cases the socket path is very long, making the > actual information even harder to find.) By the time you get to this > error, you have already connected, so mentioning the server address > seems secondary at best. It feels a little counterintuitive to me too but I am nevertheless inclined to believe that it's an improvement. When multi-host connection strings are used, the server address may not be clear. In fact, even when they're not, it may not be clear to a new user that socket communication is used, and it may not be clear where the socket is located. New users may not even realize that there's a socket involved; I certainly didn't know that for quite a while. It's a lot harder for the database name to be unclear, because a particular connection attempt will never try more than one, and also because when it's relevant to understanding why the connection failed, the server will hopefully include it in the message string anyway, as here. So the PG13 message is really kind of silly: it tells us the same thing twice, which we must already know, instead of telling us something that we might not know. It might be more intuitive in some ways if the socket information were demoted to the end of the message, but I think we'd lose more than we gained. The standard way of reporting someone else's error is basically "what I have to say about the problem: %s" and that's exactly what we're doing here. 
We could find some way of gluing the information about the socket onto the end of the server message, but it seems unclear how to do that in a way that looks natural, and it would depart from our usual practice. So even though I also find this to be a bit distracting, I think we should just live with it, because everything else seems worse. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Transactions involving multiple postgres foreign servers, take 2
On Mon, May 3, 2021 at 5:25 AM Masahiko Sawada wrote: > On Sun, May 2, 2021 at 1:23 AM Zhihong Yu wrote: > > > > > > > > On Fri, Apr 30, 2021 at 9:09 PM Masahiko Sawada > wrote: > >> > >> On Wed, Mar 17, 2021 at 6:03 PM Zhihong Yu wrote: > >> > > >> > Hi, > >> > For v35-0007-Prepare-foreign-transactions-at-commit-time.patch : > >> > >> Thank you for reviewing the patch! > >> > >> > > >> > With this commit, the foreign server modified within the transaction > marked as 'modified'. > >> > > >> > transaction marked -> transaction is marked > >> > >> Will fix. > >> > >> > > >> > +#define IsForeignTwophaseCommitRequested() \ > >> > +(foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED) > >> > > >> > Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the > macro should be named: IsForeignTwophaseCommitRequired. > >> > >> But even if foreign_twophase_commit is > >> FOREIGN_TWOPHASE_COMMIT_REQUIRED, the two-phase commit is not used if > >> there is only one modified server, right? It seems the name > >> IsForeignTwophaseCommitRequested is fine. > >> > >> > > >> > +static bool > >> > +checkForeignTwophaseCommitRequired(bool local_modified) > >> > > >> > + if (!ServerSupportTwophaseCommit(fdw_part)) > >> > + have_no_twophase = true; > >> > ... > >> > + if (have_no_twophase) > >> > + ereport(ERROR, > >> > > >> > It seems the error case should be reported within the loop. This way, > we don't need to iterate the other participant(s). > >> > Accordingly, nserverswritten should be incremented for local server > prior to the loop. The condition in the loop would become if > (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1). > >> > have_no_twophase is no longer needed. > >> > >> Hmm, I think If we process one 2pc-non-capable server first and then > >> process another one 2pc-capable server, we should raise an error but > >> cannot detect that. 
> > > > Then the check would stay as what you have in the patch: > > > > if (!ServerSupportTwophaseCommit(fdw_part)) > > > > When the non-2pc-capable server is encountered, we would report the > error in place (following the ServerSupportTwophaseCommit check) and come > out of the loop. > > have_no_twophase can be dropped. > > But if we processed only one non-2pc-capable server, we would raise an > error but should not in that case. > > On second thought, I think we can track how many servers are modified > or not capable of 2PC during registration and unregistration. Then we > can determine both whether 2PC is required and whether a non-2pc-capable > server is involved, without looking through all participants. Thoughts? > That is something worth trying. Thanks > > Regards, > > -- > Masahiko Sawada > EDB: https://www.enterprisedb.com/ >
Re: Replication slot stats misgivings
On Mon, May 3, 2021 at 5:48 PM Masahiko Sawada wrote: > > On Mon, May 3, 2021 at 2:27 PM Amit Kapila wrote: > > > > On Thu, Apr 29, 2021 at 10:37 AM Amit Kapila > > wrote: > > > > > > On Wed, Apr 28, 2021 at 7:43 PM Masahiko Sawada > > > wrote: > > > > > > > > On Wed, Apr 28, 2021 at 3:25 PM Amit Kapila > > > > wrote: > > > > > > > > > > > > > > I am not sure if any of these alternatives are a good idea. What do > > > > > you think? Do you have any other ideas for this? > > > > > > > > I've been considering some ideas but don't come up with a good one > > > > yet. It’s just an idea and not tested but how about having > > > > CreateDecodingContext() register before_shmem_exit() callback with the > > > > decoding context to ensure that we send slot stats even on > > > > interruption. And FreeDecodingContext() cancels the callback. > > > > > > > > > > Is it a good idea to send stats while exiting and rely on the same? I > > > think before_shmem_exit is mostly used for the cleanup purpose so not > > > sure if we can rely on it for this purpose. I think we can't be sure > > > that in all cases we will send all stats, so maybe Vignesh's patch is > > > sufficient to cover the cases where we avoid losing it in cases where > > > we would have sent a large amount of data. > > > > > > > Sawada-San, any thoughts on this point? > > before_shmem_exit is mostly used to the cleanup purpose but how about > on_shmem_exit()? pgstats relies on that to send stats at the > interruption. See pgstat_shutdown_hook(). > Yeah, that is worth trying. Would you like to give it a try? I think it still might not cover the cases where we error out in the backend while decoding via APIs because at that time we won't exit, maybe for that we can consider Vignesh's patch. -- With Regards, Amit Kapila.
Re: [PATCH] Allow CustomScan nodes to signal projection support
Hi Sven, > The attached patch allows CustomScan nodes to signal whether they > support projection. I noticed that you didn't change custom-scan.sgml accordingly. The updated patch is attached. Otherwise, it seems to be fine in terms of compiling, passing tests, etc. > I named the flag CUSTOMPATH_SUPPORT_PROJECTION similar to the other > custom node flags, but this would revert the current logic This seems to be a typical Kobayashi Maru situation, i.e. any choice is a bad one. I suggest keeping the patch as is and hoping that the developers of existing extensions read the release notes. -- Best regards, Aleksander Alekseev v2-0001-costumscan_projection_flag.patch Description: Binary data
Re: Enhanced error message to include hint messages for redundant options error
On Mon, May 3, 2021 at 12:08 PM Bharath Rupireddy wrote: > > On Sun, May 2, 2021 at 8:44 PM vignesh C wrote: > > > > On Sat, May 1, 2021 at 9:02 PM Bharath Rupireddy > > wrote: > > > > > > On Sat, May 1, 2021 at 7:25 PM vignesh C wrote: > > > > > > I'm not attaching above one line change as a patch, maybe Vignesh > > > > > > can > > > > > > merge this into the main patch. > > > > > > > > Thanks for the comments. I have merged the change into the attached > > > > patch. > > > > Thoughts? > > > > > > Thanks! v4 basically LGTM. Can we park this in the current commitfest > > > if not done already? > > > > > > Upon looking at the number of places where we have the "option \"%s\" > > > specified more than once" error, I, now strongly feel that we should > > > use goto duplicate_error approach like in compute_common_attribute, so > > > that we will have only one ereport(ERROR. We can change it in > > > following files: copy.c, dbcommands.c, extension.c, > > > compute_function_attributes, sequence.c, subscriptioncmds.c, > > > typecmds.c, user.c, walsender.c, pgoutput.c. This will reduce the LOC > > > greatly. > > > > > > Thoughts? > > > > I have made the changes for this, I have posted the same in the v5 > > patch posted in my earlier mail. > > Thanks! The v5 patch looks good to me. Let's see if all agree on the > goto duplicate_error approach which could reduce the LOC by ~80. > > I don't see it in the current commitfest, can we park it there so that > the patch will get tested on cfbot systems? I have added an entry in commitfest: https://commitfest.postgresql.org/33/3103/ Regards, Vignesh
Toast compression method options
We have already pushed the configurable lz4 toast compression code[1]. In the custom compression thread, we already had a patch to support compression method options[2], but the design of the base patches was heavily modified before commit and I never rebased this patch on the new design. Now I have rebased it so that we don't lose track of it and can work on it for v15. This is still a WIP patch. For v15 I will work on improving the code, and I will also analyze the usage of compression method options (compression speed/ratio). [1] https://www.postgresql.org/message-id/E1lNKw9-0008DT-1L%40gemulon.postgresql.org [2] https://www.postgresql.org/message-id/CAFiTN-s7fno8pGwfK7jwSf7uNaVhPZ38C3LAcF%2B%3DWHu7jNvy7g%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com WIP-0001-Toast-compression-method-options.patch Description: Binary data
Re: Logical Replication - behavior of TRUNCATE ... CASCADE
On Mon, May 3, 2021 at 1:02 PM Dilip Kumar wrote: > > I think you are comparing the user-exposed behavior with the internal > code comments. The meaning of the comments is that it should not > truncate any table on subscriber using cascade, because there might be > some subscriber-specific relations that depend upon the primary table > and those should not get truncated as a side-effect of the cascade. > > For example, you can slightly change your example as below > > On subscriber: > > CREATE TABLE tbl_pk(id int primary key); > > CREATE TABLE tbl_fk_sub(fkey int references tbl_pk(id)); -> this table > > doesn't refer to tbl_pk on the publisher > > So now as part of the truncate tbl_pk the tbl_fk_sub should not get > truncated and that is what the comment is trying to say. Thanks. I was under the impression that a subscriber table could not have references to other subscriber-local tables, and that it should also have the same column constraints as the publisher table's columns. But I was wrong. I tried the use case [1] where the subscriber table tbl_pk, which was subscribed to the changes from the publisher, is referenced by another subscriber-local table tbl_fk. In this case, the comment and the code that applies only RESTRICT behaviour, ignoring the upstream CASCADE option, make sense. Having said that, wouldn't it be good if we could provide a subscription-level (CREATE/ALTER) option, say "cascade" (similar to other options such as binary, synchronous_commit, stream), defaulting to false, which when set to true makes us pass the upstream CASCADE option to ExecuteTruncateGuts in apply_handle_truncate? It would be useful to truncate all the dependent tables on the subscriber. Users would have to use it with caution though. Note that the comment already says this: /* * Even if we used CASCADE on the upstream primary we explicitly default * to replaying changes without further cascading. This might be later * changeable with a user specified option. */ Thoughts?
[1] On publisher: DROP TABLE tbl_pk CASCADE; CREATE TABLE tbl_pk(id int primary key); INSERT INTO tbl_pk (SELECT x FROM generate_series(1,10) x); DROP PUBLICATION testpub; CREATE PUBLICATION testpub FOR TABLE tbl_pk; On subscriber: DROP TABLE tbl_pk CASCADE; CREATE TABLE tbl_pk(id int primary key); DROP TABLE tbl_fk; CREATE TABLE tbl_fk(id1 int references tbl_pk(id)); DROP SUBSCRIPTION testsub; CREATE SUBSCRIPTION testsub CONNECTION 'host=localhost dbname=postgres user=bharath port=5432' PUBLICATION testpub; INSERT INTO tbl_fk (SELECT x FROM generate_series(1,10) x); On publisher: TRUNCATE tbl_pk CASCADE; SELECT count(id) FROM tbl_pk; -- 0 On subscriber we get error, because the RESTRICT option is passed to ExecuteTruncateGuts in logical apply worker and the table tbl_pk is referenced by tbl_fk, so tbl_pk is not truncated. SELECT count(id) FROM tbl_pk; -- non-zero With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: Transactions involving multiple postgres foreign servers, take 2
On Sun, May 2, 2021 at 1:23 AM Zhihong Yu wrote: > > > > On Fri, Apr 30, 2021 at 9:09 PM Masahiko Sawada wrote: >> >> On Wed, Mar 17, 2021 at 6:03 PM Zhihong Yu wrote: >> > >> > Hi, >> > For v35-0007-Prepare-foreign-transactions-at-commit-time.patch : >> >> Thank you for reviewing the patch! >> >> > >> > With this commit, the foreign server modified within the transaction >> > marked as 'modified'. >> > >> > transaction marked -> transaction is marked >> >> Will fix. >> >> > >> > +#define IsForeignTwophaseCommitRequested() \ >> > +(foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED) >> > >> > Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the >> > macro should be named: IsForeignTwophaseCommitRequired. >> >> But even if foreign_twophase_commit is >> FOREIGN_TWOPHASE_COMMIT_REQUIRED, the two-phase commit is not used if >> there is only one modified server, right? It seems the name >> IsForeignTwophaseCommitRequested is fine. >> >> > >> > +static bool >> > +checkForeignTwophaseCommitRequired(bool local_modified) >> > >> > + if (!ServerSupportTwophaseCommit(fdw_part)) >> > + have_no_twophase = true; >> > ... >> > + if (have_no_twophase) >> > + ereport(ERROR, >> > >> > It seems the error case should be reported within the loop. This way, we >> > don't need to iterate the other participant(s). >> > Accordingly, nserverswritten should be incremented for local server prior >> > to the loop. The condition in the loop would become if >> > (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1). >> > have_no_twophase is no longer needed. >> >> Hmm, I think If we process one 2pc-non-capable server first and then >> process another one 2pc-capable server, we should raise an error but >> cannot detect that. 
> > > Then the check would stay as what you have in the patch: > > if (!ServerSupportTwophaseCommit(fdw_part)) > > When the non-2pc-capable server is encountered, we would report the error in place (following the ServerSupportTwophaseCommit check) and come out of the loop. > have_no_twophase can be dropped. But if we processed only one non-2pc-capable server, we would raise an error when we should not in that case. On second thought, I think we can track how many servers are modified or not capable of 2PC during registration and unregistration. Then we can determine both that 2PC is required and that a non-2pc-capable server is involved, without looking through all participants. Thoughts? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
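The counter-tracking idea could be sketched as below. This is a minimal stand-alone model, not the actual patch code: the counter names, the registration functions, and the return-code convention are all assumptions made for illustration (the real code would ereport(ERROR) instead of returning -1). It shows how maintaining two counters at (un)registration time makes the commit-time decision O(1), and why a single non-2pc-capable writer does not error out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical per-backend counters, maintained at registration and
 * unregistration time so the commit path never has to walk the whole
 * participant list.
 */
static int nparticipants_modified = 0;
static int nparticipants_no_twophase = 0;

static void
register_participant(bool modified, bool supports_twophase)
{
    if (modified)
    {
        nparticipants_modified++;
        if (!supports_twophase)
            nparticipants_no_twophase++;
    }
}

static void
unregister_participant(bool modified, bool supports_twophase)
{
    if (modified)
    {
        nparticipants_modified--;
        if (!supports_twophase)
            nparticipants_no_twophase--;
    }
}

/*
 * Decide the commit strategy in O(1):
 *   0  -> at most one writer, plain commit is enough
 *   1  -> multiple writers, all 2PC-capable: use two-phase commit
 *  -1  -> 2PC would be required but some writer cannot prepare
 *         (the real code would raise an error here)
 */
static int
check_foreign_twophase(bool local_modified)
{
    int nwritten = nparticipants_modified + (local_modified ? 1 : 0);

    if (nwritten <= 1)
        return 0;
    if (nparticipants_no_twophase > 0)
        return -1;
    return 1;
}
```

Note how a lone non-2pc-capable writer yields 0 (plain commit), matching the concern that an error should not be raised in that case.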
Re: Replication slot stats misgivings
On Mon, May 3, 2021 at 2:29 PM Amit Kapila wrote: > > On Fri, Apr 30, 2021 at 1:47 PM Amit Kapila wrote: > > > > LGTM. I have slightly edited the comments in the attached. I'll push > > this early next week unless there are more comments. > > > > Pushed. Thank you! Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Replication slot stats misgivings
On Mon, May 3, 2021 at 2:27 PM Amit Kapila wrote: > > On Thu, Apr 29, 2021 at 10:37 AM Amit Kapila wrote: > > > > On Wed, Apr 28, 2021 at 7:43 PM Masahiko Sawada > > wrote: > > > > > > On Wed, Apr 28, 2021 at 3:25 PM Amit Kapila > > > wrote: > > > > > > > > > > > I am not sure if any of these alternatives are a good idea. What do > > > > you think? Do you have any other ideas for this? > > > > > > I've been considering some ideas but don't come up with a good one > > > yet. It’s just an idea and not tested but how about having > > > CreateDecodingContext() register before_shmem_exit() callback with the > > > decoding context to ensure that we send slot stats even on > > > interruption. And FreeDecodingContext() cancels the callback. > > > > > > > Is it a good idea to send stats while exiting and rely on the same? I > > think before_shmem_exit is mostly used for the cleanup purpose so not > > sure if we can rely on it for this purpose. I think we can't be sure > > that in all cases we will send all stats, so maybe Vignesh's patch is > > sufficient to cover the cases where we avoid losing it in cases where > > we would have sent a large amount of data. > > > > Sawada-San, any thoughts on this point? before_shmem_exit is mostly used for cleanup purposes, but how about on_shmem_exit()? pgstats relies on that to send stats at interruption. See pgstat_shutdown_hook(). That being said, I agree Vignesh's patch would cover most cases. If we don't find any better solution, I think we can go with Vignesh's patch. > Apart from this, I think you > have suggested somewhere in this thread to slightly update the > description of stream_bytes. I would like to update the description of > stream_bytes and total_bytes as below: > > stream_bytes > Amount of transaction data decoded for streaming in-progress > transactions to the decoding output plugin while decoding changes from > WAL for this slot.
This and other streaming counters for this slot can > be used to tune logical_decoding_work_mem. > > total_bytes > Amount of transaction data decoded for sending transactions to the > decoding output plugin while decoding changes from WAL for this slot. > Note that this includes data that is streamed and/or spilled. > > This update considers two points: > a. we don't send this data across the network because plugin might > decide to filter this data, ex. based on publications. > b. not all of the decoded changes are sent to plugin, consider > REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID, > REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT, etc. Looks good to me. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
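on_shmem_exit() and before_shmem_exit() are PostgreSQL-internal APIs (see src/backend/storage/ipc/ipc.c), so the proposal can't be run in isolation here. As a rough stand-alone model of the mechanism under discussion, the sketch below mimics how an exit callback registered at decoding-context creation could flush pending slot stats when the process shuts down, with callbacks invoked in reverse registration order as PostgreSQL's shmem_exit does. All of the names (fake_on_shmem_exit, FakeDecodingContext, etc.) are invented for illustration.

```c
#include <assert.h>

typedef void (*exit_callback) (void *arg);

/* Tiny stand-in for the on_shmem_exit callback stack (LIFO). */
#define MAX_EXIT_CALLBACKS 8
static struct
{
    exit_callback fn;
    void       *arg;
} callbacks[MAX_EXIT_CALLBACKS];
static int ncallbacks = 0;

static void
fake_on_shmem_exit(exit_callback fn, void *arg)
{
    assert(ncallbacks < MAX_EXIT_CALLBACKS);
    callbacks[ncallbacks].fn = fn;
    callbacks[ncallbacks].arg = arg;
    ncallbacks++;
}

/* Called once at process shutdown; runs callbacks in reverse order. */
static void
fake_shmem_exit(void)
{
    while (ncallbacks > 0)
    {
        ncallbacks--;
        callbacks[ncallbacks].fn(callbacks[ncallbacks].arg);
    }
}

/* Hypothetical decoding context carrying not-yet-sent slot stats. */
typedef struct
{
    long        pending_spill_bytes;
    int         stats_sent;
} FakeDecodingContext;

static void
send_slot_stats_on_exit(void *arg)
{
    FakeDecodingContext *ctx = (FakeDecodingContext *) arg;

    /* In the real proposal this would report stats to the collector. */
    ctx->stats_sent = 1;
}
```

The point of the design is that stats are flushed even when the walsender is interrupted, because the exit path always runs the registered callbacks.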
Re: strange error reporting
On 21.01.21 02:33, Tom Lane wrote: I'd be inclined to spell it "connection to server at ... failed", but that sort of wording is surely also possible. "connection to server" rather than "connection to database" works for me; in fact, I think I like it slightly better. If I don't hear any other opinions, I'll change these messages to "connection to server at socket \"%s\" failed:" "connection to server at \"%s\" (%s), port %s failed:" (or maybe "server on socket"? "at" sounds right for the IP address case, but it feels a little off in the socket pathname case.) I was just trying some stuff with PG14, which led me to this thread. I find these new error messages to be more distracting than before in some cases. For example: PG13: clusterdb: error: could not connect to database typo: FATAL: database "typo" does not exist PG14: clusterdb: error: connection to server on socket "/tmp/.s.PGSQL.65432" failed: FATAL: database "typo" does not exist Throwing the socket address in there seems a bit distracting and misleading, and it also pushes off the actual information very far to the end. (Also, in some cases the socket path is very long, making the actual information even harder to find.) By the time you get to this error, you have already connected, so mentioning the server address seems secondary at best.
Re: how to correctly cast json value to text?
On Mon, May 3, 2021 at 11:26 AM, Marko Tiikkaja wrote: > On Mon, May 3, 2021 at 12:24 PM Pavel Stehule > wrote: > >> Is it possible to do this with built-in functionality? >> >> I miss a cast function from a json scalar string value to string. >> > > #>>'{}' > It is working. Thank you. But this syntax is a little bit scary. Maybe we can introduce some functions for this case. Up to PG 14 this functionality was not necessary, but now it would be nice to have. DO $$ DECLARE v jsonb; BEGIN -- the value must be valid json v['a'] = '"Ahoj"'; RAISE NOTICE '%', v['a'] #>> '{}'; END; $$; NOTICE: Ahoj DO Any ideas about the name of this function? CREATE OR REPLACE FUNCTION jsonscalar_to_text(jsonb) RETURNS text AS $$ SELECT $1 #>> '{}' $$ LANGUAGE sql; > > .m >
Re: how to correctly cast json value to text?
On Mon, May 3, 2021 at 12:24 PM Pavel Stehule wrote: > Is it possible to do this with built functionality? > > I miss the cast function for json scalar string value to string. > #>>'{}' .m
Re: how to correctly cast json value to text?
Hi On Mon, May 3, 2021 at 11:15 AM, Pavel Stehule wrote: > Hi > > I am testing a new subscripting interface for jsonb, and I found one issue. > > DO $$ > DECLARE v jsonb; > BEGIN > v['a'] = '"Ahoj"'; > RAISE NOTICE '%', v['a']; > END; > $$; > NOTICE: "Ahoj" > DO > > When I use this interface for reading, the jsonb type is returned. What is > the correct way to cast from a jsonb text value to text? I would not want > double quotes inside the result. Casting to text doesn't help. For the operator > API we can use the "->>" symbol. But we have nothing similar for the subscript API. > For now, I need a function like CREATE OR REPLACE FUNCTION public.value_to_text(jsonb) RETURNS text LANGUAGE plpgsql IMMUTABLE AS $function$ DECLARE x jsonb; BEGIN x['x'] = $1; RETURN x->>'x'; END; $function$ DO $$ DECLARE v jsonb; BEGIN -- the value must be valid json v['a'] = '"Ahoj"'; RAISE NOTICE '%', value_to_text(v['a']); END; $$; NOTICE: Ahoj DO Is it possible to do this with built-in functionality? I miss a cast function from a json scalar string value to string. Regards Pavel > Regards > > Pavel > > >
how to correctly cast json value to text?
Hi I am testing a new subscripting interface for jsonb, and I found one issue. DO $$ DECLARE v jsonb; BEGIN v['a'] = '"Ahoj"'; RAISE NOTICE '%', v['a']; END; $$; NOTICE: "Ahoj" DO When I use this interface for reading, the jsonb type is returned. What is the correct way to cast from a jsonb text value to text? I would not want double quotes inside the result. Casting to text doesn't help. For the operator API we can use the "->>" symbol. But we have nothing similar for the subscript API. Regards Pavel
Re: Identify missing publications from publisher while create/alter subscription.
On Mon, May 3, 2021 at 10:48 AM Dilip Kumar wrote: > > On Sun, May 2, 2021 at 10:04 PM vignesh C wrote: > > > > Thanks for the comments. > > The attached patch has the fixes for the same. > > I was reviewing the documentation part, I think in the below paragraph > we should include validate_publication as well? > > > connect (boolean) > > > Specifies whether the CREATE SUBSCRIPTION > should connect to the publisher at all. Setting this to > false will change default values of > enabled, create_slot and > copy_data to false. > > > I will review/test the other parts of the patch and let you know. I have reviewed it and it mostly looks good to me. I have some minor suggestions though. 1. +/* + * Check the specified publication(s) is(are) present in the publisher. + */ vs + +/* + * Connect to the publisher and check if the publications exist. + */ I think the wording of these comments is not uniform: some places use "publication(s) is(are)" whereas others just use "publications". 2. Add an error case for connect=false and VALIDATE_PUBLICATION = true -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Enhanced error message to include hint messages for redundant options error
On Mon, May 3, 2021 at 12:08 PM Bharath Rupireddy wrote: > > On Sun, May 2, 2021 at 8:44 PM vignesh C wrote: > > > > On Sat, May 1, 2021 at 9:02 PM Bharath Rupireddy > > wrote: > > > > > > On Sat, May 1, 2021 at 7:25 PM vignesh C wrote: > > > > > > I'm not attaching above one line change as a patch, maybe Vignesh > > > > > > can > > > > > > merge this into the main patch. > > > > > > > > Thanks for the comments. I have merged the change into the attached > > > > patch. > > > > Thoughts? > > > > > > Thanks! v4 basically LGTM. Can we park this in the current commitfest > > > if not done already? > > > > > > Upon looking at the number of places where we have the "option \"%s\" > > > specified more than once" error, I, now strongly feel that we should > > > use goto duplicate_error approach like in compute_common_attribute, so > > > that we will have only one ereport(ERROR. We can change it in > > > following files: copy.c, dbcommands.c, extension.c, > > > compute_function_attributes, sequence.c, subscriptioncmds.c, > > > typecmds.c, user.c, walsender.c, pgoutput.c. This will reduce the LOC > > > greatly. > > > > > > Thoughts? > > > > I have made the changes for this, I have posted the same in the v5 > > patch posted in my earlier mail. > > Thanks! The v5 patch looks good to me. Let's see if all agree on the > goto duplicate_error approach which could reduce the LOC by ~80. I think the "goto duplicate_error" approach looks good, it avoids duplicating the same error code multiple times. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
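For readers unfamiliar with the pattern being discussed, here is a minimal stand-alone sketch of the goto duplicate_error style used in compute_common_attribute. The option names and the error handling are simplified stand-ins: the real code walks DefElem lists and calls ereport(ERROR, ...) with the "option \"%s\" specified more than once" message at the single label, which is exactly what removes the repeated ereport sites.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified option list; in PostgreSQL these are DefElem nodes. */
struct opt
{
    const char *name;
};

/*
 * Scans the options, tracking which ones were already seen.  On a
 * duplicate, jumps to a single duplicate_error label instead of
 * repeating the error report at every check site.  Returns the
 * duplicated option name, or NULL if there is no duplicate.
 */
static const char *
check_options(const struct opt *opts, int nopts)
{
    bool        copy_data_given = false;
    bool        synchronous_commit_given = false;
    int         i;

    for (i = 0; i < nopts; i++)
    {
        if (strcmp(opts[i].name, "copy_data") == 0)
        {
            if (copy_data_given)
                goto duplicate_error;
            copy_data_given = true;
        }
        else if (strcmp(opts[i].name, "synchronous_commit") == 0)
        {
            if (synchronous_commit_given)
                goto duplicate_error;
            synchronous_commit_given = true;
        }
    }
    return NULL;

duplicate_error:
    /* Single reporting site; the real code calls ereport(ERROR, ...). */
    return opts[i].name;
}
```

With ten or so option names, each check shrinks to two lines, which is where the ~80 LOC reduction mentioned above comes from.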
Re: Logical Replication - behavior of TRUNCATE ... CASCADE
On Mon, May 3, 2021 at 12:37 PM Bharath Rupireddy wrote: > > On Mon, May 3, 2021 at 11:59 AM Dilip Kumar wrote: > > > > On Mon, May 3, 2021 at 10:42 AM Bharath Rupireddy > > > wrote: > > > > > > Hi, > > > > > > In apply_handle_truncate, the following comment before > > > ExecuteTruncateGuts says that it defaults to RESTRICT even if the CASCADE > > > option has been specified in publisher's TRUNCATE command. > > > /* > > > * Even if we used CASCADE on the upstream primary we explicitly > > > default > > > * to replaying changes without further cascading. This might be later > > > * changeable with a user specified option. > > > */ > > > I tried the following use case to see if that's actually true: > > > 1) Created two tables tbl_pk (primary key), tbl_fk(references tbl_pk > > > primary key via foreign key) on both publisher and subscriber. > > > 2) In general, TRUNCATE tbl_pk; or TRUNCATE tbl_pk RESTRICT; would fail > > > because tbl_fk is dependent on tbl_pk. > > > 3) TRUNCATE tbl_pk, tbl_fk; would work because the dependent table is > > > specified in the command. > > > 4) TRUNCATE tbl_pk CASCADE; would work because of the CASCADE option and > > > both tbl_pk and tbl_fk are truncated. When this command is run on the > > > publisher, the CASCADE option is sent to the subscriber, see > > > DecodeTruncate. But the apply worker ignores it and passes DROP_RESTRICT > > > to ExecuteTruncateGuts. Therefore, the expectation(per the comment) is > > > that on the subscriber, the behavior should be equivalent to TRUNCATE > > > tbl_pk;, so an error is expected. But we are also receiving the tbl_fk in > > > the remote rels along with tbl_pk, so the behavior is equivalent to (3) > > > and both tbl_pk and tbl_fk are truncated. > > > > > > Does the comment still hold true? Does ignoring the CASCADE option make > > > sense in apply_handle_truncate, as we are receiving all the dependent > > > relations in the remote rels from the publisher? Am I missing something?
> > > > > > The commit id of the feature "Logical replication support for TRUNCATE" > > > is 039eb6e92f, and adding relevant people in cc. > > > > Assume this case > > publisher: tbl_pk -> tbl_fk_pub > > subscriber: tbl_pk-> tbl_fk_sub > > > > Now, in this case, this comment is true right because we are not > > supposed to truncate tbl_fk_sub on the subscriber side and this should > > error out. > > Here's what I tried, let me know if I'm wrong: > > On publisher: > CREATE TABLE tbl_pk(id int primary key); > CREATE TABLE tbl_fk(fkey int references tbl_pk(id)); > INSERT INTO tbl_pk (SELECT x FROM generate_series(1,10) x); > INSERT INTO tbl_fk (SELECT x % 10 + 1 FROM generate_series(5,25) x); > DROP PUBLICATION testpub; > CREATE PUBLICATION testpub FOR TABLE tbl_pk, tbl_fk; > > On subscriber: > CREATE TABLE tbl_pk(id int primary key); > CREATE TABLE tbl_fk(fkey int references tbl_pk(id)); > DROP SUBSCRIPTION testsub; > CREATE SUBSCRIPTION testsub CONNECTION 'host=localhost dbname=postgres > user=bharath port=5432' PUBLICATION testpub; > > On both publisher and subscriber to ensure that the initial rows were > replicated: > SELECT count(id) FROM tbl_pk; -- non zero > SELECT count(fkey) FROM tbl_fk; -- non zero > > On publisher: > TRUNCATE tbl_pk CASCADE; > SELECT count(id) FROM tbl_pk; -- 0 > SELECT count(fkey) FROM tbl_fk; -- 0 > > On subscriber also we get to see 0 rows: > SELECT count(id) FROM tbl_pk; -- 0 > SELECT count(fkey) FROM tbl_fk; -- 0 > > But the comment says that tbl_fk shouldn't be truncated as it doesn't > pass the cascade option to ExecuteTruncateGuts even though it was > received from the publisher. This behaviour is not in accordance with > the comment, right? I think you are comparing the user-exposed behavior with the internal code comments. 
The meaning of the comments is that it should not truncate any table on the subscriber using cascade, because there might be some subscriber-specific relations that depend upon the primary table, and those should not get truncated as a side-effect of the cascade. For example, you can slightly change your example as below > On subscriber: > CREATE TABLE tbl_pk(id int primary key); > CREATE TABLE tbl_fk_sub(fkey int references tbl_pk(id)); -> this table > doesn't refer to tbl_pk on the publisher So now, as part of truncating tbl_pk, tbl_fk_sub should not get truncated, and that is what the comment is trying to say. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Logical Replication - behavior of TRUNCATE ... CASCADE
On Mon, May 3, 2021 at 11:59 AM Dilip Kumar wrote: > > On Mon, May 3, 2021 at 10:42 AM Bharath Rupireddy > wrote: > > > > Hi, > > > > In apply_handle_truncate, the following comment before ExecuteTruncateGuts > > says that it defaults to RESTRICT even if the CASCADE option has been > > specified in publisher's TRUNCATE command. > > /* > > * Even if we used CASCADE on the upstream primary we explicitly default > > * to replaying changes without further cascading. This might be later > > * changeable with a user specified option. > > */ > > I tried the following use case to see if that's actually true: > > 1) Created two tables tbl_pk (primary key), tbl_fk(references tbl_pk > > primary key via foreign key) on both publisher and subscriber. > > 2) In general, TRUNCATE tbl_pk; or TRUNCATE tbl_pk RESTRICT; would fail > > because tbl_fk is dependent on tbl_pk. > > 3) TRUNCATE tbl_pk, tbl_fk; would work because the dependent table is > > specified in the command. > > 4) TRUNCATE tbl_pk CASCADE; would work because of the CASCADE option and > > both tbl_pk and tbl_fk are truncated. When this command is run on the > > publisher, the CASCADE option is sent to the subscriber, see > > DecodeTruncate. But the apply worker ignores it and passes DROP_RESTRICT to > > ExecuteTruncateGuts. Therefore, the expectation(per the comment) is that on > > the subscriber, the behavior should be equivalent to TRUNCATE tbl_pk;, so > > an error is expected. But we are also receiving the tbl_fk in the remote > > rels along with tbl_pk, so the behavior is equivalent to (3) and both > > tbl_pk and tbl_fk are truncated. > > > > Does the comment still hold true? Does ignoring the CASCADE option make > > sense in apply_handle_truncate, as we are receiving all the dependent > > relations in the remote rels from the publisher? Am I missing something? > > > > The commit id of the feature "Logical replication support for TRUNCATE" is > > 039eb6e92f, and adding relevant people in cc. 
> > Assume this case > publisher: tbl_pk -> tbl_fk_pub > subscriber: tbl_pk-> tbl_fk_sub > > Now, in this case, this comment is true right because we are not > supposed to truncate tbl_fk_sub on the subscriber side and this should > error out. Here's what I tried, let me know if I'm wrong: On publisher: CREATE TABLE tbl_pk(id int primary key); CREATE TABLE tbl_fk(fkey int references tbl_pk(id)); INSERT INTO tbl_pk (SELECT x FROM generate_series(1,10) x); INSERT INTO tbl_fk (SELECT x % 10 + 1 FROM generate_series(5,25) x); DROP PUBLICATION testpub; CREATE PUBLICATION testpub FOR TABLE tbl_pk, tbl_fk; On subscriber: CREATE TABLE tbl_pk(id int primary key); CREATE TABLE tbl_fk(fkey int references tbl_pk(id)); DROP SUBSCRIPTION testsub; CREATE SUBSCRIPTION testsub CONNECTION 'host=localhost dbname=postgres user=bharath port=5432' PUBLICATION testpub; On both publisher and subscriber to ensure that the initial rows were replicated: SELECT count(id) FROM tbl_pk; -- non zero SELECT count(fkey) FROM tbl_fk; -- non zero On publisher: TRUNCATE tbl_pk CASCADE; SELECT count(id) FROM tbl_pk; -- 0 SELECT count(fkey) FROM tbl_fk; -- 0 On subscriber also we get to see 0 rows: SELECT count(id) FROM tbl_pk; -- 0 SELECT count(fkey) FROM tbl_fk; -- 0 But the comment says that tbl_fk shouldn't be truncated as it doesn't pass the cascade option to ExecuteTruncateGuts even though it was received from the publisher. This behaviour is not in accordance with the comment, right? If we see why this is so: the publisher sends both tbl_pk and tbl_fk rels to the subscriber and the TRUNCATE tbl_pk, tbl_fk; is allowed (see the code in heap_truncate_check_FKs) even if RESTRICT option is specified. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
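The behaviour observed above — TRUNCATE under RESTRICT succeeds as long as every referencing table is itself part of the truncated set — can be modelled with a small stand-alone sketch. This is only a simplification of the rule that heap_truncate_check_FKs enforces; the data structures here are invented, while the real function walks pg_constraint for the given relations.

```c
#include <stdbool.h>

/* A foreign-key edge: table 'referencing' has an FK to 'referenced'. */
struct fk
{
    int         referencing;
    int         referenced;
};

static bool
in_set(const int *set, int n, int oid)
{
    for (int i = 0; i < n; i++)
        if (set[i] == oid)
            return true;
    return false;
}

/*
 * Under RESTRICT semantics, truncating 'targets' is allowed only if no
 * table outside the target set references a table inside it.  This
 * mirrors why "TRUNCATE tbl_pk, tbl_fk" succeeds while "TRUNCATE
 * tbl_pk" alone fails when tbl_fk references tbl_pk -- and why the
 * apply worker's DROP_RESTRICT call still truncates both tables when
 * the publisher ships both rels.
 */
static bool
truncate_restrict_ok(const int *targets, int ntargets,
                     const struct fk *fks, int nfks)
{
    for (int i = 0; i < nfks; i++)
    {
        if (in_set(targets, ntargets, fks[i].referenced) &&
            !in_set(targets, ntargets, fks[i].referencing))
            return false;       /* an outside dependent would be orphaned */
    }
    return true;
}
```

In the replication case the publisher sends the full rel list, so every referencing table is in the target set and the RESTRICT check passes; only subscriber-local dependents (like tbl_fk above) make it fail.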