The only thing I see that is a possibility for the issue is in the slave log:

LOG: unexpected EOF on client connection
LOG: could not receive data from client: Connection reset by peer

I don't know whether that's related, since it could just be somebody running a query. The log file is riddled with these, but the replication failures don't happen constantly.

As far as I know I'm not swallowing any errors. The logging is all set to the defaults:

log_destination = 'stderr'
logging_collector = on
#client_min_messages = notice
#log_min_messages = warning
#log_min_error_statement = error
#log_min_duration_statement = -1
#log_checkpoints = off
#log_connections = off
#log_disconnections = off
#log_error_verbosity = default

I'm going to have a look at the NICs to make sure there's no issue there.

Thanks again for your help!

On Thu, Aug 15, 2013 at 11:51 AM, Lonni J Friedman <netll...@gmail.com> wrote:

> Are you certain that there are no relevant errors in the database logs
> (on both master & slave)? Also, are you sure that you didn't
> misconfigure logging such that errors wouldn't appear?
>
> On Thu, Aug 15, 2013 at 11:45 AM, Andrew Berman <rexx...@gmail.com> wrote:
> > Hi Lonni,
> >
> > Yes, I am using PG 9.1.9.
> > Yes, 1 slave syncing from the master
> > CentOS 6.4
> > I don't see any network or hardware issues (e.g. NIC) but will look more
> > into this. They are communicating on a private network and switch.
> >
> > I forgot to mention that after I restart the slave, everything syncs right
> > back up and all is working again so if it is a network issue, the
> > replication is just stopping after some hiccup instead of retrying and
> > resuming when things are back up.
> >
> > Thanks!
> >
> > On Thu, Aug 15, 2013 at 11:32 AM, Lonni J Friedman <netll...@gmail.com>
> > wrote:
> >>
> >> I've never seen this happen. Looks like you might be using 9.1? Are
> >> you up to date on all the 9.1.x releases?
> >>
> >> Do you have just 1 slave syncing from the master?
> >> Which OS are you using?
> >> Did you verify that there aren't any network problems between the
> >> slave & master?
> >> Or hardware problems (like the NIC dying, or dropping packets)?
> >>
> >> On Thu, Aug 15, 2013 at 11:07 AM, Andrew Berman <rexx...@gmail.com> wrote:
> >> > Hello,
> >> >
> >> > I'm having an issue where streaming replication just randomly stops
> >> > working.
> >> > I haven't been able to find anything in the logs which point to an
> >> > issue,
> >> > but the Postgres process shows a "waiting" status on the slave:
> >> >
> >> > postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres: startup process recovering 000000010000053D0000003F waiting
> >> > postgres 5642 0.0 21.4 3428356 2613252 ? Ss Aug14 0:30 postgres: writer process
> >> > postgres 5659 0.0 0.0 177524 788 ? Ss Aug14 0:03 postgres: stats collector process
> >> > postgres 7159 1.2 0.1 3451360 18352 ? Ss Aug14 17:31 postgres: wal receiver process streaming 549/216B3730
> >> >
> >> > The replication works great for days, but randomly seems to lock up and
> >> > replication halts. I verified that the two databases were out of sync
> >> > with
> >> > a query on both of them. Has anyone experienced this issue before?
> >> >
> >> > Here are some relevant config settings:
> >> >
> >> > Master:
> >> >
> >> > wal_level = hot_standby
> >> > checkpoint_segments = 32
> >> > checkpoint_completion_target = 0.9
> >> > archive_mode = on
> >> > archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f </dev/null'
> >> > max_wal_senders = 2
> >> > wal_keep_segments = 32
> >> >
> >> > Slave:
> >> >
> >> > wal_level = hot_standby
> >> > checkpoint_segments = 32
> >> > #checkpoint_completion_target = 0.5
> >> > hot_standby = on
> >> > max_standby_archive_delay = -1
> >> > max_standby_streaming_delay = -1
> >> > #wal_receiver_status_interval = 10s
> >> > #hot_standby_feedback = off
> >> >
> >> > Thank you for any help you can provide!
> >> >
> >> > Andrew
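
P.S. The ps output quoted in this thread may already hint at how far replay is behind what the WAL receiver has streamed: the startup process names the segment it is recovering (000000010000053D0000003F) and the receiver reports a location (549/216B3730). Below is a rough Python sketch for turning both into byte positions and subtracting them. It assumes the 9.1-era WAL layout (16 MiB segments, 255 segments per log ID, so one log ID spans 0xFF000000 bytes); that layout and the function names are my assumptions, not something taken from the server, and since the segment name only gives the start of the segment the result can overstate the gap by up to one segment.

```python
# Rough sketch: estimate the distance between the WAL location the receiver
# has streamed and the segment the startup process is still replaying.
# ASSUMPTION: pre-9.3 WAL layout, i.e. 16 MiB segments and 255 segments per
# log ID (the 0xFF segment is skipped), so one log ID covers 0xFF000000 bytes.
# All function names here are hypothetical helpers, not PostgreSQL APIs.

SEGMENT_SIZE = 16 * 1024 * 1024            # 16 MiB per WAL segment
SEGMENTS_PER_LOG = 0xFF                    # 255 segments per log ID (assumed)
BYTES_PER_LOG = SEGMENTS_PER_LOG * SEGMENT_SIZE


def location_to_bytes(location):
    """Convert a 'LOGID/OFFSET' hex location such as '549/216B3730' to bytes."""
    log_id, offset = location.split("/")
    return int(log_id, 16) * BYTES_PER_LOG + int(offset, 16)


def segment_start_bytes(wal_file_name):
    """Return the byte position at which a 24-hex-digit WAL segment begins."""
    if len(wal_file_name) != 24:
        raise ValueError("expected a 24-character WAL segment file name")
    # Characters 0-7 are the timeline, 8-15 the log ID, 16-23 the segment.
    log_id = int(wal_file_name[8:16], 16)
    segment = int(wal_file_name[16:24], 16)
    return log_id * BYTES_PER_LOG + segment * SEGMENT_SIZE


def receive_replay_gap(received_location, replaying_file):
    """Approximate bytes between the received location and the replayed segment."""
    return location_to_bytes(received_location) - segment_start_bytes(replaying_file)


# Values copied from the ps output in the original message.
gap = receive_replay_gap("549/216B3730", "000000010000053D0000003F")
print("approximate receive/replay gap: %.1f MiB" % (gap / float(1024 * 1024)))
```

If the gap keeps growing while the receiver location advances, that would suggest streaming itself is fine and it is replay that has stalled; comparing pg_last_xlog_receive_location() and pg_last_xlog_replay_location() on the slave should show the same thing more precisely.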