On Fri, Nov 9, 2012 at 3:03 PM, Amit Kapila <amit.kap...@huawei.com> wrote:
> On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote:
>> On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila <amit.kap...@huawei.com> wrote:
>> > On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:
>> >> On 19.10.2012 14:42, Amit kapila wrote:
>> >> > On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:
>> >> >> Before implementing the timeout parameter, I think that it's
>> >> >> better to change both pg_basebackup background process and
>> >> >> pg_receivexlog so that they send back the reply message
>> >> >> immediately when they receive the keepalive message requesting
>> >> >> the reply. Currently, they always ignore such keepalive message,
>> >> >> so status interval parameter (-s) in them always must be set to
>> >> >> the value less than replication timeout. We can avoid this
>> >> >> troublesome parameter setting by introducing the same logic of
>> >> >> walreceiver into both pg_basebackup background process and
>> >> >> pg_receivexlog.
>> >> >
>> >> > Please find the patch attached to address the modification
>> >> > mentioned by you (send immediate reply for keepalive).
>> >> > Both basebackup and pg_receivexlog uses the same function
>> >> > ReceiveXLogStream, so single change for both will address the issue.
>> >>
>> >> Thanks, committed this one after shuffling it around the changes I
>> >> committed yesterday. I also updated the docs to not claim that -s
>> >> option is required to avoid timeout disconnects anymore.
>> >
>> > Thank you.
>> > However I think still the issue will not be completely solved.
>> > pg_basebackup/pg_receivexlog can still take long time to
>> > detect network break as they don't have timeout concept.
>> > To do that I have sent one proposal which is mentioned at end of
>> > mail chain:
>> > http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853BBED@szxeml509-mbs
>> >
>> > Do you think there is any need to introduce such mechanism in
>> > pg_basebackup/pg_receivexlog?
>>
>> Are you planning to introduce the timeout mechanism in pg_basebackup
>> main process? Or background process? It's useful to implement both.
>
> By background process, you mean ReceiveXlogStream?
> For both.
>
> I think for background process, it can be done in a way similar to what we
> have done for walreceiver.

Yes.

> But I have some doubts for how to do for main process:
>
> Logic similar to walreceiver can not be used in case network goes down
> during getting other database file from server.
> The reason for the same is to receive the data files PQgetCopyData() is
> called in synchronous mode, so it keeps waiting for infinite time till it
> gets some data.
> In order to solve this issue, I can think of following options:
> 1. Making this call also asynchronous (but not sure about impact of this).

+1

Walreceiver already calls PQgetCopyData() asynchronously. ISTM you can
solve the issue in the similar way to walreceiver's.

> 2. In function pqWait, instead of passing hard-coded value -1 (i.e. infinite
>    wait), we can send some finite time. This time can be received as command
>    line argument from respective utility and set the same in PGconn structure.
>    In order to have timeout value in PGconn, we can have:
>    a. Add new parameter in PGconn to indicate the receive timeout.
>    b. Use the existing parameter connect_timeout for receive timeout
>       also but this may lead to confusion.
> 3. Any other better option?
>
> Apart from above issue, there is possibility that if during connect time
> network goes down, then it might hang, because connect_timeout by default
> will be NULL and connectDBComplete will start waiting infinitely for
> connection to become successful.
> So shall we have command line argument separately for this also or any other
> way as you suggest.

Yes, I think that we should add something like --conninfo option to
pg_basebackup and pg_receivexlog. We can easily set not only
connect_timeout but also sslmode, application_name, ... by using such
option accepting conninfo string.

>> BTW, IIRC the walsender has no timeout mechanism during sending
>> backup data to pg_basebackup. So it's also useful to implement the
>> timeout mechanism for the walsender during backup.
>
> Yes, it's useful, but for walsender the main problem is that it uses
> blocking send call to send the data.
> I have tried using tcp_keepalive settings, but the send call doesn't come
> out in case of network break.
> The only way I could get it out is:
> change in the corresponding file /proc/sys/net/ipv4/tcp_retries2 by using
> the command
> echo "8" > /proc/sys/net/ipv4/tcp_retries2
> As per recommendation, its value should be at least 8 (equivalent to 100
> sec)
>
> Do you have any idea, how it can be achieved?

What about using pq_putmessage_noblock()?

Regards,

-- 
Fujii Masao
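
For reference, the "asynchronous PQgetCopyData() plus a bounded wait" idea
from option 1 above can be sketched roughly as follows. This is only an
illustration under stated assumptions, not the committed walreceiver code:
wait_readable() is a hypothetical helper name, the timeout is assumed to
come from some not-yet-existing command line option, and the descriptor is
assumed to be below FD_SETSIZE. The libpq calls named in the trailing
comment (PQgetCopyData, PQsocket, PQconsumeInput) are real, but that outline
is untested.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of libpq or PostgreSQL): wait until fd
 * becomes readable or timeout_ms milliseconds elapse.
 *
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
static int
wait_readable(int fd, int timeout_ms)
{
    fd_set          input_mask;
    struct timeval  tv;
    int             rc;

    assert(timeout_ms >= 0);

    do
    {
        /* select() may modify its arguments, so set them up on every try */
        FD_ZERO(&input_mask);
        FD_SET(fd, &input_mask);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        rc = select(fd + 1, &input_mask, NULL, NULL, &tv);
        /* a retry after EINTR restarts the full timeout; fine for a sketch */
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}

/*
 * Untested outline of how the main process could use it with libpq:
 *
 *   r = PQgetCopyData(conn, &buf, 1);     -- async: 0 means "no row yet"
 *   if (r == 0)
 *   {
 *       if (wait_readable(PQsocket(conn), timeout_ms) == 0)
 *           -- nothing arrived in time: report a timeout and give up
 *       if (PQconsumeInput(conn) == 0)
 *           -- connection trouble: report the error
 *       -- otherwise loop back and call PQgetCopyData() again
 *   }
 */
```

The point of this shape is that the wait happens in the caller, where a
finite timeout can be chosen, instead of inside pqWait() with -1.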