[GENERAL] Long-running query on replica not timing out

2012-10-02 Thread Andrew Hannon
Hello,

On PG 9.0.8, we just observed a long-running query executing on a replica (~1 
hour), which was effectively blocking replication. I say effectively, as checks 
on streaming replication appeared as if everything was up-to-date (using SELECT 
pg_current_xlog_location() on the primary and SELECT 
pg_last_xlog_receive_location() on the replica). However, when we checked a 
frequently updated table on the replica, it was ~1 hour behind the primary. 

It has been our experience (and configuration) that long running queries that 
block replication get cancelled after at most 10 minutes. These are the 
relevant settings from our postgresql.conf on the replica:

# - Standby Servers -

hot_standby = on# on allows queries during recovery
# (change requires restart)
max_standby_archive_delay = 600s# max delay before canceling queries
# when reading WAL from archive;
# -1 allows indefinite delay
max_standby_streaming_delay = 600s  # max delay before canceling queries
# when reading streaming WAL;
# -1 allows indefinite delay



It is worth noting that the query was joining a table from the main schema that 
is not updated often with a table from another schema (that may or may not be 
updated often). However, it appears that replication ground to a halt until we 
terminated the query (which triggered crash recovery).

Are we seeing something abnormal or unexpected here? It caught us by surprise…

Thank you,

Andrew Hannon 

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Amazon High I/O instances

2012-08-22 Thread Andrew Hannon
Just looking into High IO instances for a DB deployment. In order to get past 
1TB, we are looking at RAID-0. I have heard 
(http://hackerne.ws/item?id=4266119) there might be a problem if TRIM isn't 
supported. Does anyone know if it is and has anyone used RAID-0 on these 
instances? (Linux of course…) 

On Aug 21, 2012, at 9:36 AM, Merlin Moncure wrote:
 On Tue, Aug 21, 2012 at 12:33 AM, Sébastien Lorion
 s...@thestrangefactory.com wrote:
 Hello,
 
 Since Amazon has added new high I/O instance types and EBS volumes, anyone
 has done some benchmark of PostgreSQL on them ?
 
 http://perspectives.mvdirona.com/2012/07/20/IOPerformanceNoLongerSucksInTheCloud.aspx
 http://perspectives.mvdirona.com/2012/08/01/EBSProvisionedIOPSOptimizedInstanceTypes.aspx
 http://aws.typepad.com/aws/2012/08/fast-forward-provisioned-iops-ebs.html
 
 I will be testing my app soon, but was curious to know if others have done
 some tests so I can compare / have a rough idea to what to expect. Looking
 on Google, I found an article about MySQL
 (http://palominodb.com/blog/2012/07/24/palomino-evaluates-amazon%E2%80%99s-new-high-io-ssd-instances),
 but nothing about PostgresSQL.
 
 here's a datapoint, stock config:
 pgbench -i -s 500
 pgbench -c 16 -T 60
 number of transactions actually processed: 418012
 tps = 6962.607292 (including connections establishing)
 tps = 6973.154593 (excluding connections establishing)
 
 not too shabby.  this was run by a friend who is evaluating high i/o
 instances for their high load db servers.   we didn't have time to
 kick off a high scale read only test unfortunately.
 
 Regarding 'AWS vs bare metal', I think high i/o instances full a huge
 niche in their lineup.   Dollar for dollar, I'm coming around to the
 point of view that dealing with aws is a cheaper/more effective
 solution than renting out space from a data center or (even worse)
 running your own data center unless you're very large or have other
 special requirements.  Historically the problem with AWS is that you
 had no solution for highly transaction bound systems which forced you
 to split your environment which ruined most of the benefit, and they
 fixed that.
 
 merlin


[GENERAL] Streaming Replication Error

2012-04-30 Thread Andrew Hannon
Hello,

We were auditing our logs on one of our PG 9.0.6 standby servers that we use 
for nightly snapshotting. The high-level process is:

1. Stop PG
2. Snapshot
3. Start PG

Where Snapshot includes several steps to ensure data/filesystem integrity. 
The archive command on the master continues throughout this process, so the 
standby does have all of the log files. When we restart the cluster, we see the 
typical startup message about restoring files from the archive. However, we 
have noticed that occasionally the following occurs:

LOG:  restored log file 00014456007F from archive
LOG:  restored log file 000144560080 from archive
cp: cannot stat `/ebs-raid0/archive/000144560081': No such file or 
directory
LOG:  unexpected pageaddr 4454/7400 in log file 17494, segment 129, offset 0
cp: cannot stat `/ebs-raid0/archive/000144560081': No such file or 
directory
LOG:  streaming replication successfully connected to primary
FATAL:  could not receive data from WAL stream: FATAL:  requested WAL segment 
000144560091 has already been removed

LOG:  restored log file 000144560091 from archive
LOG:  restored log file 000144560092 from archive
LOG:  restored log file 000144560093 from archive
…
LOG:  restored log file 000144570092 from archive
cp: cannot stat `/ebs-raid0/archive/000144570093': No such file or 
directory
LOG:  streaming replication successfully connected to primary

--

The concerning bit here is that we receive the FATAL message requested WAL 
segment 000144560091 has already been removed after streaming 
replication connects successfully, which seems to trigger an additional 
sequence of log restores.

The questions we have are:

1. Is our data intact? PG eventually starts up, and it seems like once the 
streaming suffers the FATAL error, it falls back to performing log restores.
2. What triggers this error? Too much time between log recovery, streaming 
startup and a low wal_keep_segments value (currently 128)?

Thank you very much,

Andrew Hannon
-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] PG 9.0 EBS Snapshot Backups on Slave

2012-01-23 Thread Andrew Hannon
Hello,

I am playing with a script that implements physical backups by snapshotting the 
EBS-backed software RAID. My basic workflow is this:

1. Stop PG on the slave
2. pg_start_backup on the master
3. On the slave:
   A. unmount the PG RAID
   B. snapshot each disk in the raid
   C. mount the PG RAID 
4. pg_stop_backup
5. Restart PG on the slave

Step 3 is actually quite fast, however, on the master, I end up seeing the 
following warning:

WARNING:  transaction log file 000100CC0076 could not be 
archived: too many failures

I am guessing (I will confirm with timestamps later) this warning happens 
during steps 3A-3C, however my questions below stand regardless of when this 
failure occurs.

It is worth noting that, the slave (seemingly) catches up eventually, 
recovering later log files with streaming replication current. Can I trust this 
state?

Should I be concerned about this warning? Is it a simple blip that can easily 
be ignored, or have I lost data? From googling, it looks like retry attempts is 
not a configurable parameter (it appears to have retried a handful of times).

If this is indeed a real problem, am I best off changing my archive_command to 
retain logs in a transient location when I am in snapshot mode, and then ship 
them in bulk once the snapshot has completed? Are there any other remedies that 
I am missing?

Thank you very much for your time,

Andrew Hannon  
-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Repercussions of Cancelled Autovacuum

2012-01-15 Thread Andrew Hannon
Hello,

We have a large(ish) table that had a series of delete statements executed 
against it (culling most of the rows in the table). This triggered a time 
consuming autovacuum, which we allowed to run for 2+ days (and were planning to 
allow to run to completion). However, it ended up getting cancelled 
unexpectedly, and is now running once again. I have some questions:

1. My understanding is the autovacuum will block most (all?) DDL statements 
executed on it, and pg will cancel it if a user requests a conflicting lock. I 
don't believe either of these things occurred. Are there any other 
circumstances that will cancel an autovacuum?
2. Is autovacuum incremental? Or, does it effectively roll back and start 
fresh, having to redo all of the work of the last couple of days?

Thank you!

Andrew
-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Two 9.1 Questions

2011-09-12 Thread Andrew Hannon
Hello,

Now that 9.1 has been released, I have two questions:

1. Can we upgrade a 9.0 hot standby (replicating from a 9.0 master) to a 9.1 
standby while still replicating from the 9.0 master? Are there any version 
differences that would cause a problem? We would then eventually promote the 
9.1 machine to the master...
2. Is it possible to do something like the following:
  A. Pause Replication on a hot standby (using the pause_at_recovery_target 
recovery target setting)
  B. Call pg_dumpall on the slave
  C. Resume replication (by disabling the  pause_at_recovery_target recovery 
target setting)

During the pause, we would likely continue shipping WAL to the slave (unless 
this is unadvisable).

Is this a reasonable approach?

Thank you,

Andrew
-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Adding Additional Standby

2011-09-03 Thread Andrew Hannon
Hello,

We have an existing master-slave setup running on 9.0. I was looking into 
adding a new hot standby into this setup — I have added a script that satisfies 
the requirements of the archive_command, but it isn't clear to me how to add 
this new standby into the mix. Is it as simple as calling pg_start_backup to 
create a new base backup and then starting postgresql on the new machine once 
the base backup has completed? Aren't there syncing issues to consider? My 
intuition tells me that the standbys need to be in sync for the archive_command 
to work (since it simply passes in a filename that is expected to get copied 
over).

Initially, I was planning on pausing replication on the existing standby, and 
running replication on the new stanby to catch up to the same xid as the old 
standby. However, I then discovered that pausing isn't supported on 9.0 (but 
will be on 9.1). Does stopping postgresql on the existing standby have the same 
effect? Am I doing something unusual here? Is it possible to do what I am 
attempting to do? 

Thank you very much!

Andrew
-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general