[GENERAL] Long-running query on replica not timing out
Hello,

On PG 9.0.8, we just observed a long-running query executing on a replica (~1 hour), which was effectively blocking replication. I say effectively, as checks on streaming replication appeared as if everything was up to date (using SELECT pg_current_xlog_location() on the primary and SELECT pg_last_xlog_receive_location() on the replica). However, when we checked a frequently updated table on the replica, it was ~1 hour behind the primary.

It has been our experience (and configuration) that long-running queries that block replication get cancelled after at most 10 minutes. These are the relevant settings from our postgresql.conf on the replica:

  # - Standby Servers -
  hot_standby = on                      # "on" allows queries during recovery
                                        # (change requires restart)
  max_standby_archive_delay = 600s      # max delay before canceling queries
                                        # when reading WAL from archive;
                                        # -1 allows indefinite delay
  max_standby_streaming_delay = 600s    # max delay before canceling queries
                                        # when reading streaming WAL;
                                        # -1 allows indefinite delay

It is worth noting that the query was joining a table from the main schema that is not updated often with a table from another schema (that may or may not be updated often). However, it appears that replication ground to a halt until we terminated the query (which triggered crash recovery).

Are we seeing something abnormal or unexpected here? It caught us by surprise…

Thank you,

Andrew Hannon
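One way to see the lag that the xlog-location check above misses is to compare the receive and replay positions on the standby itself. A minimal sketch using functions available in 9.0; the receive location keeps advancing while a conflicting query holds back replay, so only the replay side exposes the delay:

  -- on the primary
  SELECT pg_current_xlog_location();

  -- on the replica: what has been received vs. what has actually been replayed.
  -- A blocking query stalls replay, not receipt, so the replay location
  -- (or a check against a frequently updated table) is what shows the lag.
  SELECT pg_last_xlog_receive_location() AS received,
         pg_last_xlog_replay_location()  AS replayed;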
Re: [GENERAL] Amazon High I/O instances
Just looking into High I/O instances for a DB deployment. In order to get past 1TB, we are looking at RAID-0. I have heard (http://hackerne.ws/item?id=4266119) there might be a problem if TRIM isn't supported. Does anyone know if it is, and has anyone used RAID-0 on these instances? (Linux of course…)

On Aug 21, 2012, at 9:36 AM, Merlin Moncure wrote:

> On Tue, Aug 21, 2012 at 12:33 AM, Sébastien Lorion wrote:
>> Hello,
>>
>> Since Amazon has added new high I/O instance types and EBS volumes, has anyone
>> done some benchmarks of PostgreSQL on them?
>>
>> http://perspectives.mvdirona.com/2012/07/20/IOPerformanceNoLongerSucksInTheCloud.aspx
>> http://perspectives.mvdirona.com/2012/08/01/EBSProvisionedIOPSOptimizedInstanceTypes.aspx
>> http://aws.typepad.com/aws/2012/08/fast-forward-provisioned-iops-ebs.html
>>
>> I will be testing my app soon, but was curious to know if others have done
>> some tests so I can compare / have a rough idea of what to expect. Looking
>> on Google, I found an article about MySQL
>> (http://palominodb.com/blog/2012/07/24/palomino-evaluates-amazon%E2%80%99s-new-high-io-ssd-instances),
>> but nothing about PostgreSQL.
>
> Here's a datapoint, stock config:
>
> pgbench -i -s 500
> pgbench -c 16 -T 60
> number of transactions actually processed: 418012
> tps = 6962.607292 (including connections establishing)
> tps = 6973.154593 (excluding connections establishing)
>
> Not too shabby. This was run by a friend who is evaluating high I/O
> instances for their high-load db servers. We didn't have time to
> kick off a high-scale read-only test, unfortunately.
>
> Regarding 'AWS vs bare metal', I think high I/O instances fill a huge
> niche in their lineup. Dollar for dollar, I'm coming around to the
> point of view that dealing with AWS is a cheaper/more effective
> solution than renting out space from a data center or (even worse)
> running your own data center, unless you're very large or have other
> special requirements. Historically the problem with AWS was that you
> had no solution for highly transaction-bound systems, which forced you
> to split your environment and ruined most of the benefit; they have
> now fixed that.
>
> merlin
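For the RAID-0/TRIM question, a rough sketch of what the setup and check might look like on Linux. The device names (/dev/xvdb, /dev/xvdc) and mount point are assumptions, and whether TRIM actually passes through an md RAID-0 array depends on the kernel in use:

  # assemble the two ephemeral SSDs (device names are assumptions) into RAID-0
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
  mkfs.ext4 /dev/md0
  mount -o noatime /dev/md0 /data

  # check whether the md device advertises discard (TRIM) support;
  # 0 means the kernel will not pass TRIM through the array
  cat /sys/block/md0/queue/discard_max_bytes

  # if supported, TRIM can be run out of band instead of mounting with "discard"
  fstrim -v /data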
[GENERAL] Streaming Replication Error
Hello,

We were auditing our logs on one of our PG 9.0.6 standby servers that we use for nightly snapshotting. The high-level process is:

1. Stop PG
2. Snapshot
3. Start PG

where "Snapshot" includes several steps to ensure data/filesystem integrity. The archive command on the master continues throughout this process, so the standby does have all of the log files. When we restart the cluster, we see the typical startup messages about restoring files from the archive. However, we have noticed that occasionally the following occurs:

  LOG: restored log file "00014456007F" from archive
  LOG: restored log file "000144560080" from archive
  cp: cannot stat `/ebs-raid0/archive/000144560081': No such file or directory
  LOG: unexpected pageaddr 4454/7400 in log file 17494, segment 129, offset 0
  cp: cannot stat `/ebs-raid0/archive/000144560081': No such file or directory
  LOG: streaming replication successfully connected to primary
  FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000144560091 has already been removed
  LOG: restored log file "000144560091" from archive
  LOG: restored log file "000144560092" from archive
  LOG: restored log file "000144560093" from archive
  …
  LOG: restored log file "000144570092" from archive
  cp: cannot stat `/ebs-raid0/archive/000144570093': No such file or directory
  LOG: streaming replication successfully connected to primary

The concerning bit here is that we receive the FATAL message "requested WAL segment 000144560091 has already been removed" after streaming replication connects successfully, which seems to trigger an additional sequence of log restores. The questions we have are:

1. Is our data intact? PG eventually starts up, and it seems like once streaming suffers the FATAL error, it falls back to performing log restores.
2. What triggers this error? Too much time between log recovery and streaming startup, combined with a low wal_keep_segments value (currently 128)?

Thank you very much,

Andrew Hannon
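For reference, a minimal sketch of the settings involved on each side; the connection string is an assumption, while the archive path is the one from the log above. The idea is that wal_keep_segments on the master has to cover the whole stop/snapshot/start window, otherwise the walreceiver asks for a segment the master has already recycled and recovery falls back to the restore_command:

  # recovery.conf on the standby (connection string is an assumption)
  standby_mode     = 'on'
  primary_conninfo = 'host=primary.example.com port=5432 user=replication'
  restore_command  = 'cp /ebs-raid0/archive/%f %p'

  # postgresql.conf on the master: keep enough segments (16 MB each)
  # to cover the longest expected snapshot window
  wal_keep_segments = 512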
[GENERAL] PG 9.0 EBS Snapshot Backups on Slave
Hello,

I am playing with a script that implements physical backups by snapshotting the EBS-backed software RAID. My basic workflow is this:

1. Stop PG on the slave
2. pg_start_backup on the master
3. On the slave:
   A. Unmount the PG RAID
   B. Snapshot each disk in the RAID
   C. Mount the PG RAID
4. pg_stop_backup
5. Restart PG on the slave

Step 3 is actually quite fast; however, on the master I end up seeing the following warning:

  WARNING: transaction log file "000100CC0076" could not be archived: too many failures

I am guessing (I will confirm with timestamps later) that this warning happens during steps 3A-3C, but my questions below stand regardless of when the failure occurs. It is worth noting that the slave (seemingly) catches up eventually, recovering later log files, with streaming replication current.

Can I trust this state? Should I be concerned about this warning? Is it a simple blip that can safely be ignored, or have I lost data? From googling, it looks like the number of retry attempts is not a configurable parameter (it appears to have retried a handful of times).

If this is indeed a real problem, am I best off changing my archive_command to retain logs in a transient location while I am in "snapshot mode", and then ship them in bulk once the snapshot has completed? Are there any other remedies that I am missing?

Thank you very much for your time,

Andrew Hannon
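One hypothetical shape for the "retain logs in a transient location" idea: a wrapper used as the archive_command that parks segments locally while a snapshot flag file exists and ships the backlog afterwards. The flag file, spool directory, and rsync destination are all assumptions, not anything PostgreSQL provides:

  #!/bin/sh
  # hypothetical wrapper, invoked as: archive_command = '/usr/local/bin/archive_wal.sh %p %f'
  set -e
  WAL_PATH="$1"   # %p: path to the segment, relative to the data directory
  WAL_NAME="$2"   # %f: file name of the segment
  if [ -e /var/run/pg_snapshot_in_progress ]; then
      # snapshot window open: park the segment locally and report success,
      # so the master does not accumulate archiving failures
      cp "$WAL_PATH" "/var/spool/wal_pending/$WAL_NAME"
  else
      # normal operation: ship anything parked earlier, then the current segment
      for f in /var/spool/wal_pending/*; do
          [ -e "$f" ] || continue
          rsync -a "$f" standby:/ebs-raid0/archive/
          rm "$f"
      done
      rsync -a "$WAL_PATH" "standby:/ebs-raid0/archive/$WAL_NAME"
  fi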
[GENERAL] Repercussions of Cancelled Autovacuum
Hello,

We have a large(ish) table that had a series of DELETE statements executed against it (culling most of the rows in the table). This triggered a time-consuming autovacuum, which we allowed to run for 2+ days (and were planning to allow to run to completion). However, it ended up getting cancelled unexpectedly, and is now running once again. I have some questions:

1. My understanding is that autovacuum will block most (all?) DDL statements executed on the table, and PG will cancel it if a user requests a conflicting lock. I don't believe either of these things occurred. Are there any other circumstances that will cancel an autovacuum?
2. Is autovacuum incremental? Or does it effectively roll back and start fresh, having to redo all of the work of the last couple of days?

Thank you!

Andrew
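A couple of read-only checks that can help answer both questions after the fact; the table name is a placeholder, and the pg_stat_activity column names are the 9.0 ones:

  -- is an autovacuum worker running right now, and since when?
  SELECT procpid, query_start, current_query
  FROM pg_stat_activity
  WHERE current_query LIKE 'autovacuum:%';

  -- when did autovacuum last complete on the table, and how many dead rows remain?
  SELECT relname, last_autovacuum, n_dead_tup
  FROM pg_stat_user_tables
  WHERE relname = 'my_big_table';   -- placeholder name

Setting log_autovacuum_min_duration = 0 also makes the server log every autovacuum run, which helps correlate timing with any "canceling autovacuum task" messages in the log.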
[GENERAL] Two 9.1 Questions
Hello,

Now that 9.1 has been released, I have two questions:

1. Can we upgrade a 9.0 hot standby (replicating from a 9.0 master) to a 9.1 standby while still replicating from the 9.0 master? Are there any version differences that would cause a problem? We would then eventually promote the 9.1 machine to the master...
2. Is it possible to do something like the following:
   A. Pause replication on a hot standby (using the pause_at_recovery_target recovery.conf setting)
   B. Call pg_dumpall on the slave
   C. Resume replication (by disabling the pause_at_recovery_target setting)

During the pause, we would likely continue shipping WAL to the slave (unless this is inadvisable). Is this a reasonable approach?

Thank you,

Andrew
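For question 2, 9.1 also exposes replay pause/resume as SQL functions on the standby, which may be simpler than toggling pause_at_recovery_target; a minimal sketch, with the dump path as an assumption:

  -- on the 9.1 standby: freeze replay at the current point
  SELECT pg_xlog_replay_pause();
  SELECT pg_is_xlog_replay_paused();   -- should now return true

  -- take the dump from the frozen snapshot (shell):
  --   pg_dumpall -h standby-host -f /backups/standby_dump.sql

  -- resume applying the WAL that accumulated during the dump
  SELECT pg_xlog_replay_resume();

WAL received while replay is paused simply queues up on the standby, so continuing to ship it during the dump should be fine.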
[GENERAL] Adding Additional Standby
Hello,

We have an existing master-slave setup running on 9.0. I was looking into adding a new hot standby to this setup. I have added a script that satisfies the requirements of the archive_command, but it isn't clear to me how to add this new standby into the mix.

Is it as simple as calling pg_start_backup to create a new base backup and then starting PostgreSQL on the new machine once the base backup has completed? Aren't there syncing issues to consider? My intuition tells me that the standbys need to be in sync for the archive_command to work (since it simply passes in a filename that is expected to get copied over).

Initially, I was planning on pausing replication on the existing standby and running replication on the new standby until it caught up to the same xid as the old standby. However, I then discovered that pausing isn't supported on 9.0 (but will be in 9.1). Does stopping PostgreSQL on the existing standby have the same effect? Am I doing something unusual here? Is it possible to do what I am attempting?

Thank you very much!

Andrew
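A minimal sketch of the usual 9.0 procedure for seeding a new standby from the master, without involving the existing standby at all; hostnames, paths, and the replication role are assumptions:

  -- on the master (psql): enter backup mode ("true" requests an immediate checkpoint)
  SELECT pg_start_backup('new_standby_base', true);

  # on the new standby (shell): copy the master's data directory
  rsync -a --exclude=pg_xlog --exclude=postmaster.pid \
      master:/var/lib/postgresql/9.0/main/ /var/lib/postgresql/9.0/main/

  -- on the master (psql): leave backup mode
  SELECT pg_stop_backup();

  # recovery.conf on the new standby
  standby_mode     = 'on'
  primary_conninfo = 'host=master port=5432 user=replication'
  restore_command  = 'cp /path/to/archive/%f %p'

The new standby then replays from the archive and catches up on its own, so the existing standby should not need to be paused or stopped.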