pg_ctl start may return 0 even if the postmaster has already been started on Windows
Dear hackers,

While investigating the cfbot failure [1], I found strange behavior in the pg_ctl command. What do you think? Is this a bug to be fixed, or is it working as specified?

# Problem

The "pg_ctl start" command returns 0 (success) even if the cluster has already been started. This occurs in a Windows environment, when the command is executed just after the postmaster starts.

# Analysis

The root cause is in wait_for_postmaster_start(). In this function the postmaster.pid file is read and checked to determine whether the start command succeeded. Check (1) requires that the postmaster have started after our pg_ctl command, allowing a 2-second slop. On Linux, check (2) is also executed to ensure that the forked child process modified the file, so this time window is not very problematic. On Windows, however, (2) is skipped, *so pg_ctl may report success if any postmaster was started within the last 2 seconds.*

```
if ((optlines = readfile(pid_file, &numlines)) != NULL &&
    numlines >= LOCK_FILE_LINE_PM_STATUS)
{
    /* File is complete enough for us, parse it */
    pid_t       pmpid;
    time_t      pmstart;

    /*
     * Make sanity checks.  If it's for the wrong PID, or the recorded
     * start time is before pg_ctl started, then either we are looking
     * at the wrong data directory, or this is a pre-existing pidfile
     * that hasn't (yet?) been overwritten by our child postmaster.
     * Allow 2 seconds slop for possible cross-process clock skew.
     */
    pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
    pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
    if (pmstart >= start_time - 2 &&    // (1)
#ifndef WIN32
        pmpid == pm_pid                 // (2)
#else
        /* Windows can only reject standalone-backend PIDs */
        pmpid > 0
#endif
```

# Appendix - how I found this

I found it while investigating the failure. In the test, "pg_upgrade --check" is executed just after the old cluster has been started.
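To make the time window concrete, here is a minimal standalone sketch of the sanity check above. This is not PostgreSQL code; the function name and parameters are made up for illustration. The point is that on Windows, where there is no fork(), check (2) cannot compare against the spawned child's PID, so any positive PID within the 2-second slop passes.

```c
#include <stdbool.h>
#include <time.h>

/*
 * Illustrative model of the pidfile sanity check in
 * wait_for_postmaster_start() -- not PostgreSQL's actual API.
 */
bool
pidfile_looks_like_ours(long pmpid, time_t pmstart,
                        long child_pid, time_t start_time,
                        bool is_windows)
{
    if (pmstart < start_time - 2)   /* check (1): started well before us */
        return false;
    if (is_windows)
        return pmpid > 0;           /* check (2) skipped: any live PID passes */
    return pmpid == child_pid;      /* check (2): must be the child we spawned */
}
```

With start_time = 100, a pre-existing postmaster whose pid file says pmstart = 100 and pmpid = 1234 passes on Windows even though our child's PID is something else entirely, which is exactly the 2-second false-success window described above.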
I checked the output file [2] and found that the banner says "Performing Consistency Checks", which means the parameter live_check was set to false (see output_check_banner()). This parameter is set to true when the postmaster has already been started at that time and the pg_ctl start therefore fails. That's how I found it.

[1]: https://cirrus-ci.com/task/4634769732927488
[2]: https://api.cirrus-ci.com/v1/artifact/task/4634769732927488/testrun/build/testrun/pg_upgrade/003_logical_replication_slots/data/t_003_logical_replication_slots_new_publisher_data/pgdata/pg_upgrade_output.d/20230905T080645.548/log/pg_upgrade_internal.log

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Transaction timeout
On 2023-09-06 20:32, Andrey M. Borodin wrote:
> Thanks for looking into this!
>
>> On 6 Sep 2023, at 13:16, Fujii Masao wrote:
>>
>> While testing the v4 patch, I noticed it doesn't handle the COMMIT AND
>> CHAIN case correctly. When COMMIT AND CHAIN is executed, I believe the
>> transaction timeout counter should reset and start from zero with the
>> next transaction. However, it appears that the current v4 patch doesn't
>> reset the counter in this scenario. Can you confirm this?
>
> Yes, I was not aware of this feature. I'll test and fix this.
>
>> With the v4 patch, I found that timeout errors no longer occur during
>> the idle-in-transaction phase. Instead, they occur when the next
>> statement is executed. Is this the intended behavior?
>
> AFAIR I had been testing that the "idle in transaction" behaviour was
> intact. I'll check that again.
>
>> I thought some users might want to use the transaction timeout feature
>> to prevent prolonged transactions and promptly release resources (e.g.,
>> locks) in case of a timeout, similar to
>> idle_in_transaction_session_timeout.
>
> Yes, this is exactly how I was expecting the feature to behave: free up
> max_connections slots held by long-hanging transactions.
>
> Thanks for your findings, I'll check and post a new version!
>
> Best regards, Andrey Borodin.

Hi,

Thank you for implementing this nice feature! I tested the v4 patch in interactive transaction mode with the following 3 cases:

1. Start a transaction with transaction_timeout=0 (i.e., timeout disabled), and then change the timeout value to more than 0 during the transaction.

=# SET transaction_timeout TO 0;
=# BEGIN;               // timeout is not enabled
=# SELECT pg_sleep(5);
=# SET transaction_timeout TO '1s';
=# SELECT pg_sleep(10); // timeout is enabled with 1s

In this case, the transaction timeout happens during pg_sleep(10).

2. Start a transaction with transaction_timeout>0 (i.e., timeout enabled), and then change the timeout to a different value greater than 0 during the transaction.
=# SET transaction_timeout TO '1000s';
=# BEGIN;               // timeout is enabled with 1000s
=# SELECT pg_sleep(5);
=# SET transaction_timeout TO '1s';
=# SELECT pg_sleep(10); // timeout is not restarted and is still running with 1000s

In this case, the transaction timeout does NOT happen during pg_sleep(10).

3. Start a transaction with transaction_timeout>0 (i.e., timeout enabled), and then change the timeout value to 0 during the transaction.

=# SET transaction_timeout TO '10s';
=# BEGIN;               // timeout is enabled with 10s
=# SELECT pg_sleep(5);
=# SET transaction_timeout TO 0;
=# SELECT pg_sleep(10); // timeout is NOT disabled and is still running with 10s

In this case, the transaction timeout happens during pg_sleep(10).

The first case, where transaction_timeout is disabled before the transaction begins, is totally fine. However, in the second and third cases, where transaction_timeout is enabled before the transaction begins, since the timeout has already been enabled with a certain value, it is not re-armed with the new setting.

Furthermore, let's say I want to set a transaction_timeout value for all transactions in the postgresql.conf file so it affects all sessions. The same behavior happened in all 3 cases; here is one example with the second case:

=# BEGIN;
=# SHOW transaction_timeout;
=# SELECT pg_sleep(10);
=# SHOW transaction_timeout;
=# COMMIT;

BEGIN
 transaction_timeout
---------------------
 15s
(1 row)

2023-09-07 11:52:50.510 JST [23889] LOG:  received SIGHUP, reloading configuration files
2023-09-07 11:52:50.510 JST [23889] LOG:  parameter "transaction_timeout" changed to "5000"

 pg_sleep
----------

(1 row)

 transaction_timeout
---------------------
 5s
(1 row)

COMMIT

I am of the opinion that these behaviors might confuse users. Could you confirm whether these are the intended behaviors?

Additionally, I think the short description should be "Sets the maximum allowed time to commit a transaction." or "Sets the maximum allowed time to wait before aborting a transaction."
so that it would be clearer and consistent with the other %_timeout descriptions.

Also, there is a small whitespace error here:

src/backend/tcop/postgres.c:3373: space before tab in indent.
+			 stmt_reason, comma2, tx_reason)));

On a side note, while testing the patch with pgbench, I noticed that when many concurrent transactions run in a high-contention environment with relatively short timeout durations, cascading blocking can occur. This can cause multiple transactions to exceed their designated timeouts, degrading transaction processing performance. No? Do you think this feature should be co-implemented with the existing concurrency control protocol to maintain transaction performance (e.g., a transaction scheduling mechanism based on transaction timeout)?

Regards,
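The three interactive cases above can be summarized by a tiny model. This is purely illustrative (not the patch's code, and the function name is made up): the timer value that ends up governing the transaction is whatever was in effect at BEGIN, unless that value was zero, in which case a later SET can still arm it.

```c
/*
 * Effective transaction timeout (in ms) for a transaction that began with
 * begin_ms in effect and then ran SET transaction_timeout TO set_ms.
 * A return value of 0 means no timeout will fire. This mirrors the
 * observed v4 behavior; it is a made-up sketch, not PostgreSQL code.
 */
int
effective_timeout(int begin_ms, int set_ms)
{
    if (begin_ms > 0)
        return begin_ms;   /* cases 2 and 3: a running timer is untouched */
    return set_ms;         /* case 1: enabling mid-transaction does arm it */
}
```

So with 1000s at BEGIN, a mid-transaction SET to '1s' has no effect (case 2), and a SET to 0 does not disable a 10s timer (case 3), which is the confusing part.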
Re: [PoC] Reducing planning time when tables have many partitions
Hi Yuya,

On Fri, Aug 25, 2023 at 1:09 PM Yuya Watari wrote:
>
> 3. Future works
>
> 3.1. Redundant memory allocation of Lists
>
> When we need child EquivalenceMembers in a loop over ec_members, v20
> adds them to the list. However, since we cannot modify the ec_members,
> v20 always copies it. In most cases, there are only one or two child
> members, so this behavior is a waste of memory and time and not a good
> idea. I didn't address this problem in v20 because doing so could add
> much complexity to the code, but it is one of the major future works.
>
> I suspect that the degradation of Queries A and B is due to this
> problem. The difference between 'make installcheck' and Queries A and
> B is whether there are partitioned tables. Most of the tests in 'make
> installcheck' do not have partitions, so find_relids_top_parents()
> could immediately determine the given Relids are already top-level and
> keep degradation very small. However, since Queries A and B have
> partitions, too frequent allocations of Lists may have caused the
> regression. I hope we can reduce the degradation by avoiding these
> memory allocations. I will continue to investigate and fix this
> problem.
>
> 3.2. em_relids and pull_varnos
>
> I'm sorry that v20 did not address your 1st concern regarding
> em_relids and pull_varnos. I will try to look into this.
>
> 3.3. Indexes for RestrictInfos
>
> Indexes for RestrictInfos are still in RangeTblEntry in v20-0002. I
> will also investigate this issue.
>
> 3.4. Correctness
>
> v20 has passed all regression tests in my environment, but I'm not so
> sure if v20 is correct.
>
> 4. Conclusion
>
> I wrote v20 based on a new idea. It may have a lot of problems, but it
> has advantages. At least it solves your 3rd concern. Since we iterate
> Lists instead of Bitmapsets, we don't have to introduce an iterator
> mechanism. My experiment showed that the 'make installcheck'
> degradation was very small.
> For the 2nd concern, v20 no longer adds child EquivalenceMembers to
> ec_members. I'm sorry if this is not what you intended, but it
> effectively worked. Again, v20 is a new proof of concept. I hope the
> v20-based approach will be a good alternative solution if we can
> overcome several problems, including what I mentioned above.

It seems that you are still investigating and fixing issues. But the CF entry is marked as "needs review". I think a better status is "WoA". Do you agree with that?

--
Best Wishes,
Ashutosh Bapat
Re: psql help message contains excessive indentations
At Thu, 7 Sep 2023 15:02:49 +0900, Yugo NAGATA wrote in
> On Thu, 07 Sep 2023 14:29:56 +0900 (JST)
> Kyotaro Horiguchi wrote:
> > > \q                     quit psql
> > > \watch [[i=]SEC] [c=N] [m=MIN]
> > !>                        execute query every SEC seconds, up to N times
> > !>                        stop if less than MIN rows are returned
> >
> > The last two lines definitely have some extra indentation.
>
> Agreed.
>
> > I've attached a patch that fixes this.
>
> I wonder if it would be better to fix this in a similar way to other
> places where the description has multiple lines, like "\g [(OPTIONS)] [FILE]".

It looks correct to me:

> \errverbose            show most recent error message at maximum verbosity
> \g [(OPTIONS)] [FILE]  execute query (and send result to file or |pipe);
>                        \g with no arguments is equivalent to a semicolon
> \gdesc                 describe result of query, without executing it

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
pg_rewind with cascade standby doesn't work well
Hi there,

I tested pg_rewind's behavior and found a suspicious case. Consider a scenario like this:

Server A: primary
Server B: replica of A
Server C: replica of B

and somehow A goes down, so B gets promoted.

Server A: down
Server B: new primary
Server C: replica of B

In this case, pg_rewind can be used to reconstruct the cascade; the source is C and the target is A. However, running pg_rewind produces the following error:

```
pg_rewind: fetched file "global/pg_control", length 8192
pg_rewind: source and target cluster are on the same timeline
pg_rewind: no rewind required
```

Although A's timeline is 1 and C's should be 2, it says they are on the same timeline. This is because pg_rewind currently uses minRecoveryPointTLI and the latest checkpoint's TimelineID to compare the TLIs of source and target [1]. Neither C's minRecoveryPointTLI nor its latest checkpoint's TimelineID is updated until a checkpoint happens (even though B's are). So if you run pg_rewind immediately, it won't work because C and A appear to be on the same timeline, and we have to run CHECKPOINT on C before running pg_rewind.

BTW, immediate pg_rewind with a cascade standby seems to have already been raised in another discussion [2], but was unfortunately missed. Anyway, I don't think this behavior is user-friendly. To fix it, should we use another variable to compare the TLIs? Or somehow update the cascade standby's minRecoveryPointTLI?

Masaki Kuwamura

[1] https://www.postgresql.org/message-id/flat/9f568c97-87fe-a716-bd39-65299b8a60f4%40iki.fi
[2] https://www.postgresql.org/message-id/flat/aeb5f31a-8de2-40a8-64af-ab659a309d6b%40iki.fi
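To restate the failure mode above in code form, here is a schematic sketch. These are not pg_rewind's actual functions, just an illustration of why A and C look like they are on the same timeline until C performs a restartpoint.

```c
#include <stdbool.h>

/*
 * Illustrative only: the effective timeline of a cluster as pg_rewind
 * would derive it from the two control-file fields mentioned above.
 * On a cascade standby, both fields can still hold the pre-promotion
 * value until the next checkpoint/restartpoint updates them.
 */
int
effective_tli(int latest_checkpoint_tli, int min_recovery_point_tli)
{
    return (latest_checkpoint_tli > min_recovery_point_tli)
        ? latest_checkpoint_tli
        : min_recovery_point_tli;
}

bool
rewind_required(int source_tli, int target_tli)
{
    /* equal TLIs produce "no rewind required", as in the output above */
    return source_tli != target_tli;
}
```

Before CHECKPOINT on C, both of C's fields still read 1, so comparing it against A (also on timeline 1) reports no divergence, even though C has really followed B onto timeline 2.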
Re: pg_upgrade and logical replication
On Mon, Sep 04, 2023 at 02:12:58PM +0530, Amit Kapila wrote:
> Yeah, I agree that could be hacked quickly but note I haven't reviewed
> in detail if there are other design issues in this patch. Note that we
> thought first to support the upgrade of the publisher node, otherwise,
> immediately after upgrading the subscriber and publisher, the
> subscriptions won't work and start giving errors as they are dependent
> on slots in the publisher. One other point that needs some thought is
> that the LSN positions we are going to copy in the catalog may no
> longer be valid after the upgrade (of the publisher) because we reset
> WAL. Does that need some special consideration or are we okay with
> that in all cases?

In pg_upgrade, copy_xact_xlog_xid() puts the new node ahead of the old cluster by 8 segments on TLI 1, so how would it be a problem if the subscribers keep a remote confirmed LSN lower than that in their catalogs? (You've mentioned that to me offline, but I forgot the details in the code.)

> As of now, things are quite safe as documented in
> pg_dump doc page that it will be the user's responsibility to set up
> replication after dump/restore. I think it would be really helpful if
> you could share your thoughts on the publisher-side matter as we are
> facing a few tricky questions to be answered. For example, see a new
> thread [1].

In my experience, users are quite used to upgrading standbys *first*, even in simple scenarios like minor upgrades, because that's the only way to do things safely. For example, updating and/or upgrading primaries before the standbys could be a problem if an update introduces a slight change in the WAL record format that could be generated by the primary but not be processed by a standby, and we've done such tweaks in some records in the past for some bug fixes that had to be backpatched to stable branches.
IMO, the upgrade of subscriber nodes and the upgrade of publisher nodes need to be treated as two independent problems, dealt with separately. As you mentioned to me earlier offline, these two have, from what I understand, one dependency: during a publisher upgrade we need to make sure that there are no invalid slots when beginning to run pg_upgrade, and that the confirmed LSN of all the slots used by the subscribers matches the shutdown checkpoint's LSN, ensuring that the subscribers would not lose any data because everything has already been consumed by them when the publisher gets upgraded.

> The point raised by Jonathan for not having an option for pg_upgrade
> is that it will be easy for users, otherwise, users always need to
> enable this option. Consider a replication setup, wouldn't users want
> by default it to be upgraded? Asking them to do that via an option
> would be an inconvenience. So, that was the reason, we wanted to have
> an --exclude option and by default allow slots to be upgraded. I think
> the same theory applies here.
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1LV3%2B76CSOAk0h8Kv0AKb-OETsJHe6Sq6172-7DZXf0Qg%40mail.gmail.com

I saw this thread, and have some thoughts to share. Will reply there.
--
Michael
Re: persist logical slots to disk during shutdown checkpoint
On Thu, Sep 7, 2023 at 10:13 AM vignesh C wrote:
>
> On Wed, 6 Sept 2023 at 12:09, Dilip Kumar wrote:
> >
> > Other than that the patch LGTM.
>
> pgindent reported that the new comments added need to be re-adjusted,
> here is an updated patch for the same.

Thanks, the patch looks good to me as well.

--
With Regards,
Amit Kapila.
Re: Improve heapgetpage() performance, overhead from serializable
On Fri, Sep 1, 2023 at 1:08 PM tender wang wrote:
>
> This thread [1] also improves the heapgetpage function, and looks like this thread.
>
> [1] https://www.postgresql.org/message-id/a9f40066-3d25-a240-4229-ec2fbe94e7a5%40yeah.net

Please don't top-post. For the archives: that CF entry has been withdrawn, after the author looked at this one and did some testing.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: RFC: Logging plan of the running query
On 2023-09-06 11:17, James Coleman wrote:
> I've never been able to reproduce it (haven't tested the new version,
> but v28 at least) on my M1 Mac; where I've reproduced it is on Debian
> (first buster and now bullseye).
>
> I'm attaching several stacktraces in the hope that they provide some
> clues. These all match the ps output I sent earlier, though note in
> that output there is both the regress instance and my test instance
> (pid 3213249) running (different ports, of course, and they are from
> the exact same compilation run). I've attached ps output for the
> postgres processes under the make check process to simplify cross
> referencing.
>
> A few interesting things:
> - There's definitely a lock on a relation that seems to be what's
> blocking the processes.
> - When I try to connect with psql the process forks but then hangs
> (see the ps output with task names stuck in "authentication"). I've
> also included a trace from one of these.

Thanks for sharing them!

Many processes are waiting to acquire the LW lock, including the process trying to output the plan (select1.trace). I suspect that this is due to a lock that was acquired prior to being interrupted by ProcessLogQueryPlanInterrupt(), but I have not been able to reproduce the same situation.

> I don't have time immediately to dig into this, but perhaps loading up
> the core dumps would allow us to see what query is running in each
> backend process (if it hasn't already been discarded by that point)
> and thereby determine what point in each test process led to the error
> condition?

Thanks for the suggestion. I am concerned that core dumps may not be readable on different operating systems or other environments. (Unfortunately, I do not have Debian on hand.) It seems that we can tell what queries were running from the stack traces you shared. As described above, I suspect a lock acquired before ProcessLogQueryPlanInterrupt() caused the issue.
I will try a little more to see if I can devise a way to create the same situation.

> Alternatively we might be able to apply the same trick to the test
> client instead...
>
> BTW, for my own easy reference in this thread: relid 1259 is pg_class
> if I'm not mistaken.

Yeah, and I think it's strange that the lock on 1259 appears twice; that must be avoided.

#10 0x559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1) at lmgr.c:117
..
#49 0x559d61b4989d in ProcessLogQueryPlanInterrupt () at explain.c:5158
..
#53 0x559d61d8ee6e in LockRelationOid (relid=1259, lockmode=1) at lmgr.c:117

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
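The double appearance of LockRelationOid in the trace is the classic reentrancy hazard: an interrupt handler taking a lock that the interrupted code path may already hold. A toy non-reentrant lock (purely illustrative; PostgreSQL's lock manager is far more elaborate and heavyweight locks do support re-grant) shows the pattern:

```c
#include <stdbool.h>

typedef struct
{
    int held_by;    /* 0 = free, otherwise holder's pid */
} ToyLock;

/*
 * Returns true if the lock is granted immediately; false means the
 * caller would block. Re-entry by the same pid also "blocks" here,
 * modeling a code path with no recursion support: that is the
 * self-deadlock risk when an interrupt handler re-acquires a lock
 * in the middle of a critical section that already holds it.
 */
bool
toy_acquire(ToyLock *lock, int pid)
{
    if (lock->held_by == 0)
    {
        lock->held_by = pid;
        return true;
    }
    return false;
}
```

The second acquire from the interrupt path never succeeds, which matches the symptom of backends hanging once ProcessLogQueryPlanInterrupt() fires at the wrong moment.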
Re: psql help message contains excessive indentations
On Thu, 07 Sep 2023 14:29:56 +0900 (JST)
Kyotaro Horiguchi wrote:

> This topic is extracted from [1].
>
> As mentioned there, in psql, running \? displays the following lines.
>
> > \gdesc                 describe result of query, without executing it
> > \gexec                 execute query, then execute each value in its result
> > \gset [PREFIX]         execute query and store result in psql variables
> > \gx [(OPTIONS)] [FILE] as \g, but forces expanded output mode
> > \q                     quit psql
> > \watch [[i=]SEC] [c=N] [m=MIN]
> !>                        execute query every SEC seconds, up to N times
> !>                        stop if less than MIN rows are returned
>
> The last two lines definitely have some extra indentation.

Agreed.

> I've attached a patch that fixes this.

I wonder if it would be better to fix this in a similar way to other places where the description has multiple lines, like "\g [(OPTIONS)] [FILE]".

Regards,
Yugo Nagata

> [1] https://www.postgresql.org/message-id/20230830.103356.1909699532104167477.horikyota@gmail.com
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center

--
Yugo NAGATA