Re: [HACKERS] Determine state of cluster (HA)
On Mon, 16 Oct 2017 10:39:16 +0800 Craig Ringer wrote:
> On 13 October 2017 at 08:50, Joshua D. Drake wrote:
> > I had a long call with a firm developing front end proxy/cache/HA for
> > Postgres today. Essentially the software is a replacement for PGPool in
> > entirety but also supports analytics etc... When I was asking them about
> > pain points they talked about the below and I was wondering if this is a
> > problem we would like to solve.
>
> IMO: no one node knows the full state of the system, or can know it.

+1

> I'd love PostgreSQL to help users more with scaling, HA, etc. But I
> think it's a big job. We'd need:
>
> - a node topology of some kind, communicated between nodes
> - heartbeat and monitoring
> - failover coordination
> - pooling/proxying
> - STONITH/fencing
> - etc.

And some of the items on this list cannot live in core. However, there are
some things PostgreSQL could do to make HA easier to deal with.

> That said, I do think it'd be very desirable for us to introduce a
> greater link from a standby to master:
>
> - Get info about master. We should finish merging recovery.conf into
>   postgresql.conf.

Agreed, +1. To make things easier for the "cluster manager" piece outside of
PostgreSQL, I would add:

* being able to "demote" a master to a standby without a restart;
* being able to check the status of each node without consuming a backend
  connection (to avoid hitting the "max_connections" limit);
* being able to monitor each step of a switchover (or "controlled failover":
  a standby/master role swap between two nodes).

> > b. Attempt to connect to the host directly, if not...
> > c. use the slave and use the hostname via dblink to connect to the master,
> > as the hostname, i.e. select * from dblink('" + connInfo + "
> > dbname=postgres', 'select inet_server_addr()') AS t(inet_server_addr inet).
> > This is necessary in the event the hostname used in the recovery.conf file
> > is not resolvable from the outside.
> > OK, so "connect directly" here means from some 3rd party, the one
> > doing the querying of the replica.

It seems to me the failover process is what issues all the commands required
to move the master role to another available standby. The knowledge of the
orchestration and of the final status (if everything went well) lives in this
piece of software. If you want to know where your master is in an exotic or
complex setup, ask whatever was responsible for promoting it.

HA should stay as simple as possible. The more complex the architecture, the
more failure scenarios you will have.

> > 1. The dblink call doesn't have a way to specify a timeout, so we have to
> > use Java futures to control how long this may take to a reasonable amount
> > of time;
>
> statement_timeout doesn't work?
>
> If not, that sounds like a sensible, separate feature to add. Patches
> welcome!

> > 2. NAT mapping may result in us detecting IP ranges that are not
> > accessible to the application nodes.
>
> PostgreSQL can't do anything about this one.

You could get the master IP address from the "pg_stat_wal_receiver" view.
But even this is not enough: you might have separate dedicated networks for
the applications and for the pgsql replication traffic. If you want a
standby to tell the application where to connect to reach the master, you
will have to put this information somewhere yourself, accessible from the
application nodes.

> > 3. there is no easy way to monitor for state changes as they happen,
> > allowing faster failovers, everything has to be polled based on events;

In the corosync world (the clustering piece of the Pacemaker ecosystem),
node failures are detected really fast, in about 1s. For application
failures (pgsql here), this will be polling, yes. But I fail to imagine how
a dying application could warn the cluster before dying. Not only when
crashing (systemd could help there), but e.g. before entering an infinite
dummy loop or an exhausting one.
> It'd be pretty simple to write a function that sleeps in the backend
> until it's promoted. I don't know off the top of my head if we set all
> proc latches when we promote, but if we don't it's probably harmless
> and somewhat useful to do so.

As soon as the cluster manager has promoted a new master, it can trigger an
event to notify whatever you need.

> Either way, you'd do long-polling. Call the function and let the
> connection block until something interesting happens. Use TCP
> keepalives to make sure you notice if it dies. Have the function
> return when the state changes.

This would still rely on the TCP keepalive frequency, so back to polling :(

Regards,

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
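Either way — long-polling with keepalives or plain polling — the client side
boils down to a wait loop. A minimal sketch, with a hypothetical helper name;
in practice the check callable would run "SELECT pg_is_in_recovery()" over a
spare connection:

```python
import time

def wait_for_promotion(is_in_recovery, poll_interval=1.0, timeout=30.0):
    """Poll until the node reports it is no longer in recovery.

    is_in_recovery: a callable returning True while the node is a standby
    (in a real setup it would run "SELECT pg_is_in_recovery()").
    Returns True if promotion was observed, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_in_recovery():
            return True
        time.sleep(poll_interval)
    return False

# Simulate a standby that gets promoted after two checks.
states = iter([True, True, False])
print(wait_for_promotion(lambda: next(states), poll_interval=0.01))
# prints: True
```

The failover window is bounded below by poll_interval, which is exactly the
latency problem the thread is complaining about.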
[HACKERS] Replication stats from slaves
Hi hackers,

We are developing and maintaining a Pacemaker resource agent for PostgreSQL
here. This agent is "stateful", i.e. it knows where the master is and where
the slaves are. See [1].

Now that this resource agent is close to what we wanted to achieve, we will
make an official announcement soon, with details & stuff. We welcome
feedback, help with issues, etc., but on github please, not in this thread.

For the next version, we are now considering improving the switchover
mechanism with appropriate checks for every step. As a reminder, switchover
in PostgreSQL has been possible since the following commit:

  commit 985bd7d49726c9f178558491d31a570d47340459
  Author: Fujii Masao
  Date:   Wed Jun 26 02:14:37 2013 +0900

It requires:

  (1) shut down the master first
  (2) make sure the slave received the shutdown checkpoint from the old
      master
  (3) promote the slave as the new master
  (4) start the old master as a slave

The problem here is step (2). After discussing IRL with Magnus and Heikki,
they confirmed to me that checking this using pg_xlogdump is fine, e.g.
(reformatted):

  $ pg_xlogdump 0001000B0014 -s 'B/14D845F8' -n1
  rmgr: XLOG len (rec/tot): 72/104, tx: 0, lsn: B/14D845F8,
  prev B/14D84590, bkp: , desc: checkpoint: redo B/14D845F8; tli 1;
  prev tli 1; fpw true; xid 0/6422462; oid 1183646; multi 1; offset 0;
  oldest xid 712 in DB 1; oldest multi 1 in DB 1; oldest running xid 0;
  shutdown

This is possible from the resource agent's point of view, but not really in
a clean way. It requires:

  * keeping in memory the last LSN of the master after shutdown
  * checking on the slave that this LSN has been received
  * checking the record is an rmgr XLOG record with shutdown information as
    its payload
  * checking this is the very last WAL record received (nothing after it).

First, looking at the last LSN and creating a cluster attribute (in the
Pacemaker context) on the old master to share it with the slaves is
possible, but not really elegant for a resource agent.
Then, the -n1 in the sample command above keeps pg_xlogdump from exiting
with an error and rc=1. But it is not compatible with the last check (very
last WAL record), and I need to check that the command succeeded.

A better solution would be to be able to check, from a view on the slave,
say pg_stat_standby: when it was last connected to the master, the last WAL
restored by log shipping, the last LSN received by streaming replication,
the last LSN flushed, and how/why the SR has been disconnected. For
instance, reasons for an SR disconnection might be: master shutdown, too
much lag, connection reset.

I can give such a patch a try after some acceptance and after discussing
what exactly we should expose in such a view.

Comments? Guidance? Thoughts? Other solutions?

Thanks!

Regards,

[1] https://github.com/dalibo/pgsql-resource-agent/tree/master/multistate

--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com
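The manual checks listed above could be scripted against pg_xlogdump's
output. A sketch, assuming the field layout of the sample output quoted in
the previous message (it may differ across versions; is_shutdown_checkpoint
is a hypothetical name, not an existing tool):

```python
import re

def is_shutdown_checkpoint(line, expected_lsn):
    """Check one pg_xlogdump output line for a shutdown checkpoint.

    Mirrors the manual checks: the record must belong to the XLOG resource
    manager, sit at the LSN the old master reported after shutdown, and
    carry shutdown-checkpoint information in its description.
    """
    m = re.search(r"rmgr:\s*(\w+).*?lsn:\s*([0-9A-F/]+).*?desc:\s*(.*)", line)
    if not m:
        return False
    rmgr, lsn, desc = m.groups()
    return (rmgr == "XLOG"
            and lsn == expected_lsn
            and desc.startswith("checkpoint")
            and desc.rstrip().endswith("shutdown"))

sample = ("rmgr: XLOG len (rec/tot): 72/104, tx: 0, lsn: B/14D845F8, "
          "prev B/14D84590, bkp: , desc: checkpoint: redo B/14D845F8; "
          "tli 1; prev tli 1; fpw true; oldest running xid 0; shutdown")
print(is_shutdown_checkpoint(sample, "B/14D845F8"))
# prints: True
```

The "nothing after it" check would still need a second pg_xlogdump run
without -n1, which is the awkward part the proposed view would remove.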
[HACKERS] Bad row estimation with indexed func returning bool
Hi,

I faced a correlation problem on a query today and tried the usual trick of
creating a functional index and rewriting the query to use it. However,
after writing the function, indexing it and rewriting the query, I hit an
optimizer behavior I was not expecting. I wrote a short scenario to
reproduce it on current HEAD:

  CREATE TABLE correl AS
    SELECT (i-1)%2 AS i, i%2 AS j FROM generate_series(1,100000) AS i;
  ANALYZE correl;

  EXPLAIN ANALYZE SELECT * FROM correl WHERE i = 1 AND j = 1;
  -- Seq Scan on correl (rows=25000) (rows=0)
  --   Filter: ((i = 1) AND (j = 1))
  --   Rows Removed by Filter: 100000
  -- Planning time: 0.356 ms
  -- Execution time: 21.937 ms

  CREATE FUNCTION fix_correl(int, int) RETURNS bool AS '
  BEGIN
    RETURN $1 = 1 AND $2 = 1;
  END' IMMUTABLE CALLED ON NULL INPUT LANGUAGE plpgsql;

  CREATE INDEX ON correl ( fix_correl(i, j) );
  ANALYZE correl;

  EXPLAIN ANALYZE SELECT * FROM correl WHERE fix_correl(i, j);
  -- Index Scan using correl_fix_correl_idx on correl (rows=3) (rows=0)
  --   Index Cond: (fix_correl(i, j) = true)
  --   Filter: fix_correl(i, j)
  -- Planning time: 0.421 ms
  -- Execution time: 0.102 ms

Using a function returning integer works as expected:

  CREATE FUNCTION fix_correl_add(int, int) RETURNS int AS '
  BEGIN
    RETURN $1 + $2;
  END' IMMUTABLE CALLED ON NULL INPUT LANGUAGE plpgsql;

  CREATE INDEX ON correl ( fix_correl_add(i, j) );
  ANALYZE correl;

  EXPLAIN ANALYZE SELECT * FROM correl WHERE fix_correl_add(i, j) = 2;
  -- Index Scan using correl_fix_correl_add_idx on correl (rows=1) (rows=0)
  --   Index Cond: (fix_correl_add(i, j) = 2)
  -- Planning time: 0.462 ms
  -- Execution time: 0.102 ms

It also works as expected with a simple index on (i + j) with no function,
but I wanted the same conditions in both tests.

Why does the optimizer behave differently in the two cases? Why does it add
a Filter step when index-scanning on correl_fix_correl_idx, which indexes
booleans?

Please find the complete scenario in attachment.
Regards,
--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com

ioguix@erg:~$ dropdb correl
ioguix@erg:~$ createdb correl
ioguix@erg:~$ psql -qtX correl

correl=# CREATE TABLE correl AS
           SELECT (i-1)%2 AS i, i%2 AS j FROM generate_series(1,100000) AS i;
correl=# ANALYZE correl ;
correl=# EXPLAIN ANALYZE SELECT * FROM correl WHERE i = 1 AND j = 1;
 Seq Scan on correl  (cost=0.00..1943.00 rows=25000 width=8)
                     (actual time=21.898..21.898 rows=0 loops=1)
   Filter: ((i = 1) AND (j = 1))
   Rows Removed by Filter: 100000
 Planning time: 0.356 ms
 Execution time: 21.937 ms

correl=# CREATE FUNCTION fix_correl(int, int) RETURNS bool AS '
BEGIN
  RETURN $1 = 1 AND $2 = 1;
END' IMMUTABLE CALLED ON NULL INPUT LANGUAGE plpgsql;
correl=# CREATE INDEX ON correl ( fix_correl(i, j) );
correl=# ANALYZE correl ;
correl=# EXPLAIN ANALYZE SELECT * FROM correl WHERE fix_correl(i, j);
 Index Scan using correl_fix_correl_idx on correl
   (cost=0.29..4.56 rows=3 width=8) (actual time=0.053..0.053 rows=0 loops=1)
   Index Cond: (fix_correl(i, j) = true)
   Filter: fix_correl(i, j)
 Planning time: 0.421 ms
 Execution time: 0.102 ms

correl=# SELECT * FROM pg_stats WHERE tablename ~ '^correl_';
 schemaname             | public
 tablename              | correl_fix_correl_idx
 attname                | fix_correl
 inherited              | f
 null_frac              | 0
 avg_width              | 1
 n_distinct             | 1
 most_common_vals       | {f}
 most_common_freqs      | {1}
 histogram_bounds       |
 correlation            | 1
 most_common_elems      |
 most_common_elem_freqs |
 elem_count_histogram   |

correl=# CREATE INDEX ON correl ( (i + j) );
correl=# ANALYZE correl ;
correl=# EXPLAIN ANALYZE SELECT * FROM correl WHERE ( i + j ) = 2;
 Index Scan using correl_expr_idx on correl
   (cost=0.29..4.31 rows=1 width=8) (actual time=0.027..0.027 rows=0 loops=1)
   Index Cond: ((i + j) = 2)
 Planning time: 0.175 ms
 Execution time: 0.076 ms

correl=# CREATE FUNCTION fix_correl_add(int, int) RETURNS int AS '
BEGIN
  RETURN $1 + $2 ;
END' IMMUTABLE CALLED ON NULL INPUT LANGUAGE plpgsql;
correl=# CREATE INDEX ON correl ( fix_correl_add( i, j ) );
correl=# ANALYZE correl ;
correl=# EXPLAIN ANALYZE SELECT * FROM correl WHERE fix_correl_add( i, j ) = 2;
 Index Scan using correl_fix_correl_add_idx on correl
   (cost=0.29..4.31 rows=1 width=8) (actual time=0.052..0.052 rows=0 loops=1)
   Index Cond: (fix_correl_add(i, j) = 2)
 Planning time: 0.462 ms
 Execution time: 0.102 ms

correl=# SELECT * FROM pg_stats WHERE tablename ~ '^correl_';
 schemaname             | public
 tablename              | correl_fix_correl_idx
 attname                | fix_correl
 inherited              | f
 null_frac              | 0
 avg_width              | 1
 n_distinct             | 1
 most_common_vals       | {f}
 most_common_freqs      | {1}
 histogram_
Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory
Hi,

As I'm writing a doc patch for 9.4 -> 9.0, I'll discuss this formula below,
as it is the last one accepted by most of you.

On Mon, 3 Nov 2014 12:39:26 -0800 Jeff Janes wrote:
> It looked to me that the formula, when descending from a previously
> stressed state, would be:
>
> greatest(1 + checkpoint_completion_target) * checkpoint_segments,
> wal_keep_segments) + 1 +
> 2 * checkpoint_segments + 1

It lacks an opening parenthesis. I guess the formula is:

  greatest (
    (1 + checkpoint_completion_target) * checkpoint_segments,
    wal_keep_segments
  ) + 1 +
  2 * checkpoint_segments + 1

> This assumes logs are filled evenly over a checkpoint cycle, which is
> probably not true because there is a spike in full page writes right after
> a checkpoint starts.
>
> But I didn't have a great deal of confidence in my analysis.

The only problem I have with this formula is that, considering
checkpoint_completion_target ~ 1 and wal_keep_segments = 0, it becomes:

  4 * checkpoint_segments + 2

which violates the well-known, observed and documented one:

  3 * checkpoint_segments + 1

A value above this formula means the system cannot cope with the number of
files to flush. The doc is right about that:

  If, due to a short-term peak of log output rate, there are more than
  3 * checkpoint_segments + 1 segment files, the unneeded segment files
  will be deleted

The formula in the doc is only wrong when wal_keep_segments <> 0.

> The first line reflects the number of WAL that will be retained as-is,

I agree these files MUST be retained: the set of checkpoint_segments WALs
being flushed, plus the checkpoint_completion_target ones written in the
meantime.

> the second is the number that will be recycled for future use before
> starting to delete them.

I disagree, because the WAL files being written actually consume recycled
WALs in the meantime. Your formula expects that written files are newly
created and that recycled ones are never touched, leading to this
checkpoint_segments + 1 difference between the formulas.
> My reading of the code is that wal_keep_segments is computed from the
> current end of WAL (i.e the checkpoint record), not from the checkpoint
> redo point. If I distribute the part outside the 'greatest' into both
> branches of the 'greatest', I don't get the same answer as you do for
> either branch.

So the formula, using checkpoint_completion_target = 1, should be:

  greatest (
    checkpoint_segments,
    wal_keep_segments
  ) + 2 * checkpoint_segments + 1

Please find attached to this email a documentation patch for 9.4 using this
formula.

Regards,
--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com

diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index d2392b2..1ed780b 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,17 +546,18 @@
     There will always be at least one WAL segment file, and will normally
-    not be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1
-    or checkpoint_segments + wal_keep_segments + 1
-    files.  Each segment file is normally 16 MB (though this size can be
+    not be more than greatest(checkpoint_segments, wal_keep_segments) +
+    (1 + checkpoint_completion_target) * checkpoint_segments +
+    1 files.  Each segment file is normally 16 MB (though this size can be
     altered when building the server).  You can use this to estimate space
     requirements for WAL.
     Ordinarily, when old log segment files are no longer needed, they
     are recycled (that is, renamed to become future segments in the numbered
     sequence). If, due to a short-term peak of log output rate, there
-    are more than 3 * checkpoint_segments + 1
-    segment files, the unneeded segment files will be deleted instead
-    of recycled until the system gets back under this limit.
+    are more than greatest(checkpoint_segments, wal_keep_segments) +
+    2 * checkpoint_segments + 1 segment files, the
+    unneeded segment files will be deleted instead of recycled until the
+    system gets back under this limit.
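As a sanity check, the two formulas proposed in the patch can be computed
side by side (a sketch; the function names are shorthand invented here, not
anything in the tree):

```python
def normal_max_wal_files(checkpoint_segments, wal_keep_segments=0,
                         checkpoint_completion_target=0.5):
    """Usual upper bound on files in pg_xlog, per the patched doc text."""
    return (max(checkpoint_segments, wal_keep_segments)
            + int((1 + checkpoint_completion_target) * checkpoint_segments)
            + 1)

def deletion_threshold(checkpoint_segments, wal_keep_segments=0):
    """Above this many segment files, unneeded ones are deleted instead of
    recycled."""
    return (max(checkpoint_segments, wal_keep_segments)
            + 2 * checkpoint_segments + 1)

# With wal_keep_segments = 0 the deletion threshold collapses to the
# well-known 3 * checkpoint_segments + 1.
print(deletion_threshold(3))
# prints: 10
```

This makes the consistency argument concrete: the patched formula agrees
with the historical 3 * checkpoint_segments + 1 whenever wal_keep_segments
is zero.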
Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory
On Tue, 31 Mar 2015 08:24:15 +0200 Jehan-Guillaume de Rorthais wrote:
> On Tue, 3 Mar 2015 11:15:13 -0500 Bruce Momjian wrote:
> > On Tue, Oct 14, 2014 at 01:21:53PM -0400, Bruce Momjian wrote:
> > > On Tue, Oct 14, 2014 at 09:20:22AM -0700, Jeff Janes wrote:
> > > > On Mon, Oct 13, 2014 at 12:11 PM, Bruce Momjian wrote:
> > > >
> > > > I looked into this, and came up with more questions.  Why is
> > > > checkpoint_completion_target involved in the total number of WAL
> > > > segments?  If checkpoint_completion_target is 0.5 (the default), the
> > > > calculation is:
> > > >
> > > >     (2 + 0.5) * checkpoint_segments + 1
> > > >
> > > > while if it is 0.9, it is:
> > > >
> > > >     (2 + 0.9) * checkpoint_segments + 1
> > > >
> > > > Is this trying to estimate how many WAL files are going to be
> > > > created during the checkpoint?  If so, wouldn't it be (1 +
> > > > checkpoint_completion_target), not "2 +".  My logic is you have the
> > > > old WAL files being checkpointed (that's the "1"), plus you have new
> > > > WAL files being created during the checkpoint, which would be
> > > > checkpoint_completion_target * checkpoint_segments, plus one for the
> > > > current WAL file.
> > > >
> > > > WAL is not eligible to be recycled until there have been 2 successful
> > > > checkpoints.
> > > >
> > > > So at the end of a checkpoint, you have 1 cycle of WAL which has just
> > > > become eligible for recycling,
> > > > 1 cycle of WAL which is now expendable but which is kept anyway, and
> > > > checkpoint_completion_target worth of WAL which has occurred while
> > > > the checkpoint was occurring and is still needed for crash recovery.
> > >
> > > OK, so based on this analysis, what is the right calculation?  This?
> > >
> > >     (1 + checkpoint_completion_target) * checkpoint_segments + 1 +
> > >     max(wal_keep_segments, checkpoint_segments)
> >
> > Now that we have min_wal_size and max_wal_size in 9.5, I don't see any
> > value to figuring out the proper formula for backpatching.
>
> I guess it worth backpatching the documentation as 9.4 -> 9.1 will be
> supported for somes the next 4 years

Sorry, lack of caffeine this morning; I fired the mail before correcting and
finishing it. I meant: I guess it's worth backpatching the documentation, as
9.4 -> 9.1 will be supported for some more years.

I'll give it a try this week.

Regards,
--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com
Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory
On Tue, 3 Mar 2015 11:15:13 -0500 Bruce Momjian wrote:
> On Tue, Oct 14, 2014 at 01:21:53PM -0400, Bruce Momjian wrote:
> > On Tue, Oct 14, 2014 at 09:20:22AM -0700, Jeff Janes wrote:
> > > On Mon, Oct 13, 2014 at 12:11 PM, Bruce Momjian wrote:
> > >
> > > I looked into this, and came up with more questions.  Why is
> > > checkpoint_completion_target involved in the total number of WAL
> > > segments?  If checkpoint_completion_target is 0.5 (the default), the
> > > calculation is:
> > >
> > >     (2 + 0.5) * checkpoint_segments + 1
> > >
> > > while if it is 0.9, it is:
> > >
> > >     (2 + 0.9) * checkpoint_segments + 1
> > >
> > > Is this trying to estimate how many WAL files are going to be created
> > > during the checkpoint?  If so, wouldn't it be (1 +
> > > checkpoint_completion_target), not "2 +".  My logic is you have the
> > > old WAL files being checkpointed (that's the "1"), plus you have new
> > > WAL files being created during the checkpoint, which would be
> > > checkpoint_completion_target * checkpoint_segments, plus one for the
> > > current WAL file.
> > >
> > > WAL is not eligible to be recycled until there have been 2 successful
> > > checkpoints.
> > >
> > > So at the end of a checkpoint, you have 1 cycle of WAL which has just
> > > become eligible for recycling,
> > > 1 cycle of WAL which is now expendable but which is kept anyway, and
> > > checkpoint_completion_target worth of WAL which has occurred while the
> > > checkpoint was occurring and is still needed for crash recovery.
> >
> > OK, so based on this analysis, what is the right calculation?  This?
> >
> >     (1 + checkpoint_completion_target) * checkpoint_segments + 1 +
> >     max(wal_keep_segments, checkpoint_segments)
>
> Now that we have min_wal_size and max_wal_size in 9.5, I don't see any
> value to figuring out the proper formula for backpatching.
I guess it worth backpatching the documentation as 9.4 -> 9.1 will be
supported for somes the next 4 years

--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com
Re: [HACKERS] [RFC] Incremental backup v3: incremental PoC
On Tue, 6 Jan 2015 08:26:22 -0500 Robert Haas wrote:
> Three, scan the WAL generated since the incremental backup and summarize
> it into a list of blocks that need to be backed up.

This can be done from the archive side. I was talking about this some
months ago:

  http://www.postgresql.org/message-id/51c4dd20.3000...@free.fr

One of the traps I could think of is that it requires "full_page_writes=on"
so we can forge each block correctly. So the corollary is that we need to
start a diff backup right after a checkpoint.

And even without "full_page_writes=on", maybe we could add a function, say
"pg_start_backupdiff()", which would force full pages to be logged right
after it, the same way full_page_writes does after a checkpoint. Diff
backups would then be possible from each LSN where we called
pg_start_backupdiff(), till whenever.

Building this backup by merging versions of blocks from the WAL is one big
step. But then, there is a file format to define, a way to restore it, and
a decision about what tools/functions/GUCs to expose to admins.

After discussing with Magnus, he advised me to wait for a diff backup file
format to emerge from online tools, like the ones discussed here (at the
time, it was Michael's proposal based on pg_basebackup that was discussed).

But I wonder how much easier it would be to do this the opposite way? If
this idea of building a diff backup offline from archives is possible,
wouldn't it remove a lot of the trouble you are discussing here?

Regards,
--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com
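At its core, the WAL-scanning step Robert describes reduces to
deduplicating block references. A toy sketch, using a hypothetical tuple
representation of WAL block references and plain integers for LSNs (real
records and LSNs are of course more involved):

```python
def blocks_to_backup(wal_records, start_lsn):
    """Distinct (relfilenode, blockno) touched at or after start_lsn.

    wal_records: iterable of (lsn, relfilenode, blockno) tuples, a stand-in
    for scanning archived WAL. Returns the sorted, distinct set of blocks an
    incremental backup must include.
    """
    changed = set()
    for lsn, relfilenode, blockno in wal_records:
        if lsn >= start_lsn:
            changed.add((relfilenode, blockno))
    return sorted(changed)

records = [(100, 16384, 0), (150, 16384, 7), (200, 16384, 0), (250, 16385, 3)]
print(blocks_to_backup(records, start_lsn=150))
# prints: [(16384, 0), (16384, 7), (16385, 3)]
```

The deduplication is the whole point: a block rewritten many times since
the reference backup is shipped only once, with its latest content.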
[HACKERS] BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)
Hi hackers,

We spent some time with Guillaume Lelarge studying this issue.

CreateRestartPoint() calls RemoveOldXlogFiles() to drop/recycle old WALs.
This one calls XLogArchiveCheckDone() to check whether the given WAL can be
dropped. As our slave has "archive_mode" and "archive_command" set,
XLogArchivingActive() returns true, forcing XLogArchiveCheckDone() to look
for a ".done" file. If the corresponding ".done" file doesn't exist,
XLogArchiveCheckDone() calls XLogArchiveNotify() to actually create the
".ready" file.

Now, on a standby, we are supposed to create the ".done" files by calling
XLogArchiveForceDone() either after a successful restore_command or when
streaming replication finishes and flushes a WAL. This seems to be the root
cause of the trouble: maybe a race condition where XLogArchiveForceDone()
is not called for a WAL. As the ".done" file is never created, we end up
calling XLogArchiveNotify() to create a ".ready", and this WAL will never
be archived by the slave, sitting in the pg_xlog dir forever.

But worse, we experienced this on two 9.2.8 clusters and realized this
version is not supposed to have 256 segments per WAL "xlogid": the sequence
finishes at the *FE segment. I'm quoting back the history of my previous
mail as we clearly have some *FF files on the slave side!

> > 0001040E00FF
> > 0001041400DA
> > 0001046E00FF
> > 0001047000FF
> > 00010485000F
> > 000104850010
> > [...normal WAL sequence...]
> > 000104850052

Note that it seems to happen only with streaming replication, not with pure
log-shipping replication. We added a debug message on the master side to
make sure it never archives a *FF file.

We kept the WAL files and log files for further analysis. How can we help
with this issue?

Regards,

On Mon, 15 Sep 2014 17:37:24 +0200 Jehan-Guillaume de Rorthais wrote:
> Hi hackers,
>
> An issue that seems related to this has been posted on pgsql-admin. See:
>
> http://www.postgresql.org/message-id/caas3tylnxaydz0+zhxlpdvtovhqovr+jsphp30o8kvwqqs0...@mail.gmail.com
>
> How can we help on this issue?
>
> Cheers,
>
> On Thu, 4 Sep 2014 17:50:36 +0200 Jehan-Guillaume de Rorthais wrote:
>
> > Hi hackers,
> >
> > Since a few months, we occasionally see .ready files appearing on some
> > slave instances in various contexts. The two I have in mind are under
> > 9.2.x.
> >
> > I tried to investigate a bit. These .ready files are created when a WAL
> > file from pg_xlog has no corresponding file in pg_xlog/archive_status.
> > I could easily reproduce this by deleting such a file: it is created
> > again at the next restartpoint or checkpoint received from the master.
> >
> > Looking at the WALs in the pg_xlog folder corresponding to these .ready
> > files, they are all much older than the current WAL "cycle" in both
> > mtime and name sequence. For instance, on one of these boxes we
> > currently have 6 of those "ghost" WALs:
> >
> > 00021E5300FF
> > 00021F1800FF
> > 0002204700FF
> > 000220BF00FF
> > 0002214000FF
> > 0002237000FF
> > 0002255D00A8
> > 0002255D00A9
> > [...normal WAL sequence...]
> > 0002255E009D
> >
> > And on another box:
> >
> > 0001040E00FF
> > 0001041400DA
> > 0001046E00FF
> > 0001047000FF
> > 00010485000F
> > 000104850010
> > [...normal WAL sequence...]
> > 000104850052
> >
> > So it seems that for some reason these old WALs were "forgotten" by the
> > restartpoint mechanism when they should have been recycled/deleted.
> >
> > For one of these servers, I could correlate this with some brutal
> > disconnections of the streaming replication appearing in its logs. But
> > there was no known SR disconnection on the second one.
> >
> > Any idea about this weird behaviour? What can we do to help you
> > investigate further?
> >
> > Regards,

--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com
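For context on why *FF files are suspicious on 9.2, the pre-9.3
segment-name arithmetic can be sketched as follows. Names here use the full
24-hex-digit form (unlike the zero-stripped names quoted above), and
next_segment is a hypothetical helper, not the server's actual code:

```python
def next_segment(tli, log, seg):
    """Name of the WAL segment following (tli, log, seg), pre-9.3 rules.

    Up to 9.2, the segment part of a WAL file name runs from 00000000 to
    000000FE: the *FF segment is skipped when moving to the next "xlogid",
    so the normal sequence can never produce a *FF file.
    (9.3+ uses the full 00..FF range.)
    """
    seg += 1
    if seg > 0xFE:          # pre-9.3: FE is the last segment of an xlogid
        seg = 0
        log += 1
    return "%08X%08X%08X" % (tli, log, seg)

print(next_segment(1, 0x485, 0xFE))
# prints: 000000010000048600000000
```

So on a 9.2 standby, any *FF file in pg_xlog must have been created by some
other path than the normal WAL sequence, which is what makes them a red
flag in this report.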
Re: [HACKERS] .ready files appearing on slaves
Hi hackers,

An issue that seems related to this has been posted on pgsql-admin. See:

http://www.postgresql.org/message-id/caas3tylnxaydz0+zhxlpdvtovhqovr+jsphp30o8kvwqqs0...@mail.gmail.com

How can we help on this issue?

Cheers,

On Thu, 4 Sep 2014 17:50:36 +0200 Jehan-Guillaume de Rorthais wrote:
> Hi hackers,
>
> Since a few months, we occasionally see .ready files appearing on some
> slave instances in various contexts. The two I have in mind are under
> 9.2.x.
>
> I tried to investigate a bit. These .ready files are created when a WAL
> file from pg_xlog has no corresponding file in pg_xlog/archive_status. I
> could easily reproduce this by deleting such a file: it is created again
> at the next restartpoint or checkpoint received from the master.
>
> Looking at the WALs in the pg_xlog folder corresponding to these .ready
> files, they are all much older than the current WAL "cycle" in both mtime
> and name sequence. For instance, on one of these boxes we currently have
> 6 of those "ghost" WALs:
>
> 00021E5300FF
> 00021F1800FF
> 0002204700FF
> 000220BF00FF
> 0002214000FF
> 0002237000FF
> 0002255D00A8
> 0002255D00A9
> [...normal WAL sequence...]
> 0002255E009D
>
> And on another box:
>
> 0001040E00FF
> 0001041400DA
> 0001046E00FF
> 0001047000FF
> 00010485000F
> 000104850010
> [...normal WAL sequence...]
> 000104850052
>
> So it seems that for some reason these old WALs were "forgotten" by the
> restartpoint mechanism when they should have been recycled/deleted.
>
> For one of these servers, I could correlate this with some brutal
> disconnections of the streaming replication appearing in its logs. But
> there was no known SR disconnection on the second one.
>
> Any idea about this weird behaviour? What can we do to help you
> investigate further?
>
> Regards,

--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com
[HACKERS] .ready files appearing on slaves
Hi hackers,

Since a few months, we occasionally see .ready files appearing on some
slave instances in various contexts. The two I have in mind are under
9.2.x.

I tried to investigate a bit. These .ready files are created when a WAL
file from pg_xlog has no corresponding file in pg_xlog/archive_status. I
could easily reproduce this by deleting such a file: it is created again at
the next restartpoint or checkpoint received from the master.

Looking at the WALs in the pg_xlog folder corresponding to these .ready
files, they are all much older than the current WAL "cycle" in both mtime
and name sequence. For instance, on one of these boxes we currently have 6
of those "ghost" WALs:

00021E5300FF
00021F1800FF
0002204700FF
000220BF00FF
0002214000FF
0002237000FF
0002255D00A8
0002255D00A9
[...normal WAL sequence...]
0002255E009D

And on another box:

0001040E00FF
0001041400DA
0001046E00FF
0001047000FF
00010485000F
000104850010
[...normal WAL sequence...]
000104850052

So it seems that for some reason these old WALs were "forgotten" by the
restartpoint mechanism when they should have been recycled/deleted.

For one of these servers, I could correlate this with some brutal
disconnections of the streaming replication appearing in its logs. But
there was no known SR disconnection on the second one.

Any idea about this weird behaviour? What can we do to help you investigate
further?

Regards,
--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com
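The .ready-file resurrection described above follows directly from the
file-level behaviour of XLogArchiveCheckDone() when archiving is active. A
sketch (a Python stand-in for the C code, simplified to the two status
files; not the server's actual implementation):

```python
import os

def archive_check_done(xlog_dir, wal_name):
    """File-level sketch of XLogArchiveCheckDone() with archiving enabled.

    Returns True (the segment may be recycled/deleted) only if the .done
    file exists. Otherwise it (re)creates the .ready file, as
    XLogArchiveNotify() does, and returns False -- which is why deleting a
    WAL's status file on a standby makes a .ready file reappear at the next
    restartpoint, and why a missed XLogArchiveForceDone() pins the segment
    forever.
    """
    status_dir = os.path.join(xlog_dir, "archive_status")
    if os.path.exists(os.path.join(status_dir, wal_name + ".done")):
        return True
    ready = os.path.join(status_dir, wal_name + ".ready")
    if not os.path.exists(ready):
        open(ready, "w").close()    # XLogArchiveNotify() equivalent
    return False
```

Run against a scratch directory, this reproduces the observed behaviour: a
segment with no .done file is never reported as removable, and gains a
.ready file instead.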
Re: [HACKERS] How can we make beta testing better?
On Thu, 17 Apr 2014 16:42:21 -0700 Josh Berkus wrote:
> On 04/15/2014 09:53 PM, Rod Taylor wrote:
> > A documented beta test process/toolset which does the following would
> > help:
> > 1) Enables full query logging
> > 2) Creates a replica of a production DB, record $TIME when it stops.
> > 3) Allow user to make changes (upgrade to 9.4, change hardware, change
> >    kernel settings, ...)
> > 4) Plays queries from the CSV logs starting from $TIME mimicking actual
> >    timing and transaction boundaries
> >
> > If Pg can make it easy to duplicate activities currently going on in
> > production inside another environment, I would be pleased to fire a
> > couple billion queries through it over the next few weeks.
> >
> > #4 should include reporting useful to the project, such as a sampling of
> > queries which performed significantly worse and a few relative
> > performance stats for overall execution time.
>
> So we have some software we've been procrastinating on OSS'ing, which
> does:
>
> 1) Takes full query CSV logs from a running postgres instance
> 2) Runs them against a target instance in parallel
> 3) Records response times for all queries
>
> tsung and pgreplay also do this, but have some limitations which make
> them impractical for a general set of logs to replay.

I've been working on another tool able to replay scenarios recorded
directly from a network dump (see [pgshark]). It works, it can be totally
transparent from the application's point of view, the tcpdump can run
anywhere, and **ALL** the real traffic can be replayed... but it needs some
more work on reporting and on handling parallel sessions.

The drawback of using libpcap is that you can lose packets while capturing,
and even a very large capture buffer cannot keep you safe over hours of
high-speed traffic. So it might require multiple captures and adjusting the
buffer size to capture 100% of the traffic for the required period.

I tried to quickly write a simple proxy using Perl POE to capture ALL the
traffic safely.
My POC was doing nothing but forwarding packets, and IIRC a 30s stress test
with 10 or 20 sessions using pgbench showed a drop of ~60% in performance. But
it was a very quick, mono-process/mono-threaded POC.

Maybe another path would be to generate this traffic dump from PostgreSQL
itself (which only has the application level to deal with), in a format we can
feed to pgbench.

> What it would need is:
>
> A) scripting around coordinated backups
> B) Scripting for single-command runs, including changing pg.conf to
> record data.

Changing the pg.conf is pretty easy with ALTER SYSTEM now. But I'm sure we all
have some scripts out there doing this (at least I do).

> C) tools to *analyze* the output data, including error messages.

That's what I lack in pgshark so far.

[pgshark] https://github.com/dalibo/pgshark

Cheers,
--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unarchived WALs deleted after crash
Just a quick top-post to thank you all for this fix, guys!

Cheers,

On 15/02/2013 18:43, Heikki Linnakangas wrote:
> On 15.02.2013 19:16, Fujii Masao wrote:
>> On Sat, Feb 16, 2013 at 2:07 AM, Heikki Linnakangas wrote:
>>> On 15.02.2013 18:10, Fujii Masao wrote:
>>>> At least in 9.2, when the archived file is restored into
>>>> pg_xlog, its xxx.done archive status file is created. So we
>>>> don't need to check InArchiveRecovery when deleting old WAL
>>>> files. Checking whether xxx.done exists is enough.
>>>
>>> Hmm, what about streamed WAL files? I guess we could go back to
>>> the pre-9.2 coding, and check WalRcvInProgress(). But I didn't
>>> actually like that too much; it seems rather random that old
>>> streamed files are recycled when the wal receiver is running at the
>>> time of the restartpoint, and otherwise not, because whether the wal
>>> receiver is running at the time the restartpoint happens has
>>> little to do with which files were created by streaming
>>> replication. With the right pattern of streaming files from the
>>> master, but always being temporarily disconnected when the
>>> restartpoint runs, you could still accumulate WAL files
>>> infinitely.
>>
>> Walreceiver always creates a .done file when it closes the
>> already-flushed WAL file and switches to the next one. So we
>> also don't need to check WalRcvInProgress().
>
> Ah, I missed that part of the patch.
>
> Okay, agreed, that's a better fix. I committed your forward-port of
> the 9.2 patch to master, reverted my earlier fix for this bug, and
> simply removed the
> InArchiveRecovery/ArchiveRecoveryInProgress()/RecoveryInProgress()
> condition from RemoveOldXlogFiles().
> - Heikki

--
Jehan-Guillaume de Rorthais
http://www.dalibo.com
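The rule Fujii describes above (an old WAL segment may be removed or recycled only once its xxx.done archive status file exists, which covers archived, restored, and streamed files alike) can be modeled outside the server. This is an illustrative sketch, not the actual RemoveOldXlogFiles() code; `removable_wal_segments` is a made-up name:

```python
import os

def removable_wal_segments(pg_xlog: str, horizon: str) -> list:
    """Model of the 9.2+ rule: a segment older than the checkpoint
    horizon may be removed/recycled only if archive_status/<seg>.done
    exists, i.e. it has already been archived (or was restored/streamed,
    which also writes a .done file)."""
    status_dir = os.path.join(pg_xlog, "archive_status")
    removable = []
    for seg in sorted(os.listdir(pg_xlog)):
        if seg == "archive_status":
            continue  # skip the status directory itself
        done = os.path.join(status_dir, seg + ".done")
        if seg < horizon and os.path.exists(done):
            removable.append(seg)
    return removable
```

A segment still waiting on a failing archive_command has only a .ready file, so under this rule it survives the checkpoint, which is exactly the behavior the bug report below asks for.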
[HACKERS] Unarchived WALs deleted after crash
Hi,

I am facing an unexpected behavior on a 9.2.2 cluster that I can reproduce on
current HEAD.

On a cluster with archiving enabled but failing, after a crash of the
postmaster, the checkpoint occurring before leaving recovery mode deletes any
additional WALs, even those still waiting to be archived. Because of this,
after recovering from the crash, a previous PITR backup can not be used to
restore the instance to a time where archiving was failing. Any slaves fed by
WAL shipping or lagging in SR need to be recreated.

AFAICT, this is not documented, and I would expect the WALs to be archived by
the archiver process when the cluster exits the recovery step.

Here is a simple scenario to reproduce this. Configuration:

  wal_level = archive
  archive_mode = on
  archive_command = '/bin/false'
  log_checkpoints = on

Scenario:

  createdb test
  psql -c 'create table test as select i, md5(i::text) from generate_series(1,300) as i;' test
  kill -9 $(head -1 $PGDATA/postmaster.pid)
  pg_ctl start

Using this scenario, the log files show:

  LOG: archive command failed with exit code 1
  DETAIL: The failed archive command was: /bin/false
  WARNING: transaction log file "00010001" could not be archived: too many failures
  LOG: database system was interrupted; last known up at 2013-02-14 16:12:58 CET
  LOG: database system was not properly shut down; automatic recovery in progress
  LOG: crash recovery starts in timeline 1 and has target timeline 1
  LOG: redo starts at 0/11400078
  LOG: record with zero length at 0/13397190
  LOG: redo done at 0/13397160
  LOG: last completed transaction was at log time 2013-02-14 16:12:58.49303+01
  LOG: checkpoint starting: end-of-recovery immediate
  LOG: checkpoint complete: wrote 2869 buffers (17.5%); 0 transaction log file(s) added, 9 removed, 7 recycled; write=0.023 s, sync=0.468 s, total=0.739 s; sync files=2, longest=0.426 s, average=0.234 s
  LOG: autovacuum launcher started
  LOG: database system is ready to accept connections
  LOG: archive command failed with exit code 1
  DETAIL: The failed archive command was: /bin/false
  LOG: archive command failed with exit code 1
  DETAIL: The failed archive command was: /bin/false
  LOG: archive command failed with exit code 1
  DETAIL: The failed archive command was: /bin/false
  WARNING: transaction log file "00010011" could not be archived: too many failures

Before the kill, "00010001" was the WAL to archive. After the kill, the
checkpoint deleted 9 files before exiting recovery mode and "00010011" became
the first WAL to archive: "00010001" through "000100000010" were removed or
recycled.

Is this expected?
--
Jehan-Guillaume de Rorthais
http://www.dalibo.com
[HACKERS] Process waiting for ExclusiveLock on INSERT
Hello,

I had a surprising question today from a customer about exclusive locks on
INSERT. His log file shows something like:

==>
LOG: process 1881 still waiting for ExclusiveLock on extension of relation 168318652 of database 58429185 after 1000.065 ms
STATEMENT: INSERT INTO (...cols...) VALUES (...values...)
<==

While asking on #postgresql and investigating in the code, I came up with a
theory. When the system is I/O bound and backends are doing shared buffer
cleanups, at some point they have to extend a relation with new pages, which
requires an exclusive lock on the relation to prevent anyone else from
extending it at the same time. At this point, if multiple backends try to
extend the relation, one wins and all the others wait for the lock, leading to
messages in the log file when log_lock_waits is enabled.

This lock comes from src/backend/access/heap/hio.c:432:

==>
	/*
	 * Have to extend the relation.
	 *
	 * We have to use a lock to ensure no one else is extending the rel at the
	 * same time, else we will both try to initialize the same new page. We
	 * can skip locking for new or temp relations, however, since no one else
	 * could be accessing them.
	 */
	needLock = !RELATION_IS_LOCAL(relation);

	if (needLock)
		LockRelationForExtension(relation, ExclusiveLock);
<==

Is this theory correct, or should this issue be discussed a bit more on this
list?
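The "one wins, all the others wait" pattern described above is easy to mimic outside the server. A toy Python model (illustrative only, nothing like PostgreSQL's actual LockRelationForExtension() implementation):

```python
import threading
import time

relation = []                      # toy "relation": a growing list of pages
extension_lock = threading.Lock()  # stands in for the relation extension lock
waits = []                         # (backend, seconds waited), like log_lock_waits

def insert_needing_new_page(name: str) -> None:
    """Each backend needing a fresh page must take the extension lock;
    only one wins at a time, the others queue up behind it."""
    start = time.monotonic()
    with extension_lock:
        waits.append((name, time.monotonic() - start))
        time.sleep(0.01)           # pretend page init + I/O under the lock
        relation.append(len(relation))

threads = [threading.Thread(target=insert_needing_new_page, args=("backend%d" % i,))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With five concurrent "backends" each holding the lock for ~10 ms, the last one in line waits roughly 40 ms, which is exactly the kind of wait that trips the log_lock_waits threshold under real I/O pressure.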
PFA a small bash script, with a small PostgreSQL configuration in comments,
that reproduces this behavior very often on my laptop, eg:

==>
2012-11-13 23:15:51 CET [23137]: [1-1] user=postgres,db=test LOG: process 23137 still waiting for ExclusiveLock on extension of relation 28118 of database 28115 after 100.086 ms
2012-11-13 23:15:51 CET [23137]: [2-1] user=postgres,db=test STATEMENT: insert into test select i, md5(i::text) from generate_series(1,1000) AS i
<==

Regards,
--
Jehan-Guillaume de Rorthais
http://www.dalibo.com

test_exclusivelock_on_insert.sh
Description: application/shellscript
[HACKERS] Logging parameters values on bind error
Hi,

We were struggling today with some Java code binding values violating some
constraints in a prepared statement. We don't own the code and couldn't make
tests with it. So we tried to find out whether PostgreSQL was able to log
bound values when the BIND operation fails, and found that this is not
possible in the current release: the error raised while planning the statement
is not caught.

PFA a patch that catches any error while creating the query plan and adds the
parameter values to the error message if log_statement or
log_min_duration_statement would have logged it.

Comments?
--
Jehan-Guillaume de Rorthais
http://www.dalibo.com

*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
***************
*** 1715,1726 **** exec_bind_message(StringInfo input_message)
  	pq_getmsgend(input_message);
  
! 	/*
! 	 * Obtain a plan from the CachedPlanSource. Any cruft from (re)planning
! 	 * will be generated in MessageContext. The plan refcount will be
! 	 * assigned to the Portal, so it will be released at portal destruction.
! 	 */
! 	cplan = GetCachedPlan(psrc, params, false);
  
  	/*
  	 * Now we can define the portal.
--- 1715,1738 ----
  	pq_getmsgend(input_message);
  
! 	PG_TRY();
! 	{
! 		/*
! 		 * Obtain a plan from the CachedPlanSource. Any cruft from (re)planning
! 		 * will be generated in MessageContext. The plan refcount will be
! 		 * assigned to the Portal, so it will be released at portal destruction.
! 		 */
! 		cplan = GetCachedPlan(psrc, params, false);
! 	}
! 	PG_CATCH();
! 	{
! 		if (check_log_statement(portal->stmts) || check_log_duration(msec_str, false) == 2)
! 			errdetail_params(params);
! 
! 		PG_RE_THROW();
! 	}
! 	PG_END_TRY();
  
  	/*
  	 * Now we can define the portal.
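The shape of the patch is a classic catch, annotate, re-throw. In Python terms (hypothetical helper names, purely to illustrate the PG_TRY/PG_CATCH/PG_RE_THROW control flow, not any PostgreSQL API):

```python
def plan_with_param_logging(plan_query, query, params, log,
                            should_log=lambda: True):
    """Try to plan the statement; on failure, log the bound parameter
    values (if the logging settings would have logged the statement
    anyway) and re-raise the original error untouched."""
    try:
        return plan_query(query, params)    # stands in for GetCachedPlan()
    except Exception:
        if should_log():                    # stands in for check_log_statement()
            log("parameters: %r" % (params,))  # stands in for errdetail_params()
        raise                               # stands in for PG_RE_THROW()
```

The important property, in both the C patch and this sketch, is that the original error still propagates unchanged; the handler only enriches what gets logged.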