Hi,
Has anyone encountered this issue? Why would the WAL receiver process not
stop when postmaster is shutdown?
Any suggestions on how to avoid running into this error or ways to recover
from it?
Thank you in advance for any inputs on this,
Dipti
On Fri, Aug 31, 2012 at 1:17 PM, Dipti Bharvirkar wrote:
Hi,
In our project, we use Postgres 9.1.3 version and asynchronous streaming
replication.
In recent times, on couple of our setups, we saw issues stopping Postgres
service on the standby server after streaming replication was setup.
The command service postgresql stop returned with a failure message. We
use pg_ctl stop -D '$PGDATA' -s -m fast in the Postgres service script to
stop the server.
To see if there were some active client connections that were causing a
failure in stopping Postgres service, I ran the query SELECT * FROM
pg_stat_activity;.
It failed with the following error: psql: FATAL: the database system is
shutting down
ps -ef | grep postgres returned the following:
postgres 14033 1 0 Aug28 ?00:00:01
/usr/pgsql-9.1/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres 14044 14033 0 Aug28 ?00:00:00 postgres: logger process
postgres 14046 14033 0 Aug28 ?00:00:00 postgres: writer process
postgres 14047 14033 0 Aug28 ?00:00:00 postgres: stats collector
process
postgres 14912 14033 0 Aug28 ?00:00:00 postgres: wal receiver
process
root 31519 3003 0 06:18 pts/200:00:00 grep postgres
netstat -anp | grep 5432 returns the following:
tcp0 0 0.0.0.0:54320.0.0.0:*
LISTEN 14033/postmaster
tcp0 0 127.0.0.1:5432 127.0.0.1:60597
TIME_WAIT -
tcp0 0 127.0.0.1:5432 127.0.0.1:60589
TIME_WAIT -
tcp67288 0 1.1.1.1:61500 http://47.11.49.176:61500 2
.2.2.2:5432 http://47.11.49.190:5432 ESTABLISHED
14912/postgres: wal
I had a few queries based on some of the observations -
1. On one of the setups where similar issue was observed, we stopped
Postgres service on the master server. As a result of this, the sender
process on the master server and consequently the receiver process on
standby stopped. After this, Postgres service could successfully be stopped
on the standby server. This fact coupled with the output of the two
commands mentioned above makes me believe that it is the wal receiver
process that is not getting terminated because of which the Postgres
service on standby server does not stop. Is this assumption right?
2. If yes, what could be the possible cause for the receiver process
to not terminate? Shouldn't it stop gracefully when a shutdown command is
received? When the issue occurred, we had minimal activity on the master
server. There were no long running transactions being committed to the
master and streamed to the standby when the issue occurred. Even if there
were, could it cause the receiver process to not terminate?
3. How can we avoid running into this issue? Could we be missing some
step that is essential for a graceful shutdown of the service on a standby?
4. On one setup where the issue was seen, since -m fast option with
pg_ctl stop did not help in stopping the service, I used the -m
immediate option. The service stopped (I understand that this option
aborts the processes without a clean shutdown and so is not a safe option).
The service would not start back up. We saw the invalid record length
error during the startup (I guess this was expected since it wasn't a clean
shutdown). A pg_resetxlog helped recover from this issue. However, that
seems risky too since there is a chance of data inconsistency. What is the
best way to recover from this error if it occurs again?
Thanks,
Dipti