> Yeah.
>
> +1 for this patch, with a few adjustments including making the test
> use pg_sleep() as mentioned. It does something useful, namely
> cancelling very long running queries sooner if the client has gone
> away instead of discovering that potentially much later when sending a
> response. It does so with a portable kernel interface (though we
> haven't heard from a Windows tester), and I think it's using it in a
> safe way (we're not doing the various bad things you can do with
> MSG_PEEK, and the fd is expected to be valid for the process's
> lifetime, and the socket is always in non-blocking mode*, so I don't
> think there is any bad time for pg_check_client_connection() to run).
> It reuses the existing timer infrastructure so there isn't really any
> new overhead. One syscall every 10 seconds or whatever at the next
> available CFI is basically nothing. On its own, this patch will
> reliably detect clients that closed abruptly or exited/crashed (so
> the client kernel sends a FIN packet). In combination with TCP
> keepalive, it'll also detect clients that went away because the
> network or client kernel ceased to exist.
>
> *There are apparently no callers of pg_set_block(), so if you survived
> pq_init() you have a non-blocking socket. If you're on Windows, the
> code always sets the magic pgwin32_noblock global flag before trying
> to peek. I wondered if it's OK that the CFI would effectively clobber
> that with 0 on its way out, but that seems to be OK because every
> place in the code that sets that flag does so immediately before an IO
> operation without a CFI in between. As the comment in pgstat.c says
> "This is extremely broken and should be fixed someday." I wonder if
> we even need that flag at all now that all socket IO is non-blocking.
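For anyone skimming the thread, the check described above boils down to a
single recv() with MSG_PEEK on the client socket (which is already in
non-blocking mode), run from the next CHECK_FOR_INTERRUPTS when the timer
fires. A minimal POSIX-style sketch of that idea follows; this is not the
patch's actual pg_check_client_connection(), and the function name and
error handling here are only illustrative:

#include <stdbool.h>
#include <errno.h>
#include <sys/socket.h>

/*
 * Illustrative only: peek at the client socket (expected to already be
 * in non-blocking mode, per the discussion above) to see whether the
 * peer has closed its end of the connection.
 */
static bool
client_still_connected(int sock)
{
    char        buf;
    ssize_t     r;

    r = recv(sock, &buf, 1, MSG_PEEK);

    if (r == 0)
        return false;   /* orderly shutdown: the client sent FIN */
    if (r > 0)
        return true;    /* data is waiting to be read: client alive */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return true;    /* nothing to read yet: still connected */
    return false;       /* ECONNRESET and friends: treat as gone */
}

The patch itself wires such a check into the existing timeout/CFI
machinery rather than calling anything like this directly, as described
in the quoted message.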
I have looked into the patch and tested it a little bit. First of all, I
had to grab the February snapshot to test the patch because it does not
apply to current HEAD.

I noticed that there is some confusion in the doc and the code regarding
what the new configuration parameter means. According to the doc:

+        Default value is <literal>zero</literal> - it disables connection
+        checks, so the backend will detect client disconnection only when trying
+        to send a response to the query.

But the guc.c comment says:

+            gettext_noop("Sets the time interval for checking connection with the client."),
+            gettext_noop("A value of -1 disables this feature. Zero selects a suitable default value."),

Probably the doc is correct, since that is what the code actually does.
Also I found this in the postgresql.conf default:

#client_connection_check_interval = 1000   # set time interval between

So here the default value seems to be 1000. If so, guc.c should be
adjusted and the doc should be changed accordingly. I am not sure which
is intended.

Next I tested the patch using standard pgbench.

With the feature enabled (1000ms check interval):

$ pgbench -c 10 -T 300 -S test
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
duration: 300 s
number of transactions actually processed: 19347995
latency average = 0.155 ms
tps = 64493.278581 (including connections establishing)
tps = 64493.811945 (excluding connections establishing)

Without the feature (client_connection_check_interval = 0):

$ pgbench -c 10 -T 300 -S test
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
duration: 300 s
number of transactions actually processed: 20314812
latency average = 0.148 ms
tps = 67715.993428 (including connections establishing)
tps = 67717.251843 (excluding connections establishing)

So the performance is about 5% down with the feature enabled in this case
(64493.8 vs. 67717.3 tps excluding connection establishment, roughly a
4.8% drop). For me, a 5% drop is not negligible. Probably we should warn
about this in the doc.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp