Re: [GENERAL] errors with high connections rate

2012-07-03 Thread Craig Ringer

Here's the test program, btw: is a home rolled fork() horror. is the same thing done with Python's multiprocessing module.

Craig Ringer

Re: [GENERAL] errors with high connections rate

2012-07-03 Thread Craig Ringer

On 07/03/2012 04:26 PM, Pawel S. Veselov wrote:

> That's the thing, no segfaults (dmesg), nothing in the server logs.
>
> It may as well be some sort of an anti-fork-bomb measure, only judging 
> by the fact that with enough attempts, things do clear out, though I 
> wish there would be some indication of that, and I'm still confused 
> about the error code being ENOTCONN.

I've managed to produce the endpoint not connected errors with a little 
test I wrote here. Only once so far and only during an abnormal test run 
where I signalled the test workers as they were starting up, so that's 
not really very helpful.

I have no problem using a little Python test program to create 800 
connections in about a second. It forks some workers (100 by default) 
which grab enough connections each to reach the target connection count.

Ooh, handy. I just triggered it again now. The "Transport endpoint is 
not connected" messages were intermixed with some "FATAL:  sorry, too 
many clients already" messages. The PostgreSQL log is full of "FATAL:  
sorry, too many clients already" messages intermixed with "LOG:  
unexpected EOF on client connection" messages. Again it was an abnormal 
run where I signalled my workers midway through startup.

Interesting, that. I've never seen it on a run where I don't send a 
signal. You know what that makes me think? You're using a multithreaded 
approach, and there's something going wrong in your app's innards. Yes, 
that's a lot of hot air and handwaving, but it fits - you're getting an 
error saying that psql is trying to operate on a socket that isn't there.

The fact that there's nothing in the system logs or Pg logs just adds 
weight to that. I'm guessing you have a threading bug, possibly signal 

Craig Ringer

Sent via pgsql-general mailing list (
To make changes to your subscription:

Re: [GENERAL] errors with high connections rate

2012-07-03 Thread Kevin Grittner
John R Pierce wrote:
> On 07/03/12 12:34 AM, Craig Ringer wrote:
>> I'm seriously impressed that your system is working under load at
>> all with 800 concurrent connections fighting to write all at once.
>
> indeed, in my transactional benchmarks on 12 core, 24 thread dual
> xeon x5600 class systems, with 16 or 20 spindle raid10, I find
> somewhere around 50 to 80 database connection threads has the
> highest overall throughput (several thousand OLTP
> transactions/second). this hardware has vastly better IO and CPU
> performance than any AWS virtual machine.
>
> as craig suggested, your network threads could put the incoming
> requests into queue(s), and run a tunable number of database
> connection threads that take requests out of the queue and send
> them to the database, and if necessary, return results to the
> network thread. doing this will give better CPU utilization, and you
> can try different database worker thread counts til you hit the
> optimal number for your hardware.

We (at the Wisconsin courts) have definitely found that the best
model for us is to have a separate layer for running database
transactions, with one thread per database connection and each of
those threads pulling from a prioritized FIFO queue into which
*other* layers place requests.

This comes up so often that I threw together a Wiki page for it:
Of course, everyone should feel free to improve the page.


Re: [GENERAL] errors with high connections rate

2012-07-03 Thread Pawel S. Veselov

On 07/03/2012 12:34 AM, Craig Ringer wrote:

> On 07/03/2012 03:19 PM, Pawel Veselov wrote:
>
>> -- problem 1 --
>>
>> I have an application, using libpq, connecting to postgres 9.1.3 
>> (Amazon AMI distro).
>> The application writes data at a high rate (at this point it's 500 
>> transactions per second), using multiple threads (at this point it's 800).
>>
>> These are "worker" threads, that receive "messages" that are then 
>> written out to the DB. There is no connection pool, instead, each 
>> worker thread maintains its own connection that it uses to write 
>> data to the database. The connections are kept in pthread's "specific" 
>> data blocks.

[skipped, replied to separately]

>> Can't connect to DB: could not send data to server: Transport 
>> endpoint is not connected
>>
>> could not send startup packet: Transport endpoint is not connected
>
> postmaster forking and failing because of operating system resource 
> limits like max proc count, anti-forkbomb measures, max file handles, etc?

If accept() succeeded and fork() failed, the socket would be closed by 
the process (the parent would close it, and the child would never exist 
to inherit it). Wouldn't that result in ECONNRESET rather than ENOTCONN?
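
For reference, ENOTCONN can be provoked in isolation by sending on a stream socket that was never connected. The sketch below assumes Linux and a Unix-domain stream socket; the helper name is hypothetical, it is not libpq code, and other socket families or platforms may report a different error (a TCP socket in the same state can report EPIPE instead).

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a Unix-domain stream socket, never connect it,
 * and try to send one byte. Returns the errno reported by send(), or 0 if
 * send() unexpectedly succeeds. Assumes Linux (MSG_NOSIGNAL is used). */
int errno_from_unconnected_send(void)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return errno;

    const char byte = 'x';
    /* MSG_NOSIGNAL keeps a failed send from raising SIGPIPE. */
    ssize_t n = send(fd, &byte, 1, MSG_NOSIGNAL);
    int err = (n < 0) ? errno : 0;   /* save errno before close() can change it */

    close(fd);
    return err;
}
```

This only shows that the kernel reports ENOTCONN when the socket has no peer at send time; it does not establish how libpq got into that state.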

>> -- problem 2 --
>>
>> As I'm trying to debug this (with strace), I could never reproduce 
>> it, at least to see what's going on, but sometimes I get another 
>> error: "too many users connected". Even restarting postmaster 
>> doesn't help. The postmaster is running with -N810, and the role has 
>> a connection limit of 1000. Yet, the "too many" error starts creeping 
>> up only after 275 connections are opened (counted by successful 
>> connect() from strace).
>>
>> Any idea where I should dig?
>
> See how many connections the *server* thinks exist by examining 
> pg_stat_activity.
>
> Check dmesg and the PostgreSQL server logs to see if you're hitting 
> operating system limits. Look for fork() failures, unexplained 
> segfaults, etc.

That's the thing, no segfaults (dmesg), nothing in the server logs.

It may as well be some sort of an anti-fork-bomb measure, only judging 
by the fact that with enough attempts, things do clear out, though I 
wish there would be some indication of that, and I'm still confused 
about the error code being ENOTCONN.

Re: [GENERAL] errors with high connections rate

2012-07-03 Thread Pawel S. Veselov

On 07/03/2012 12:54 AM, John R Pierce wrote:

> On 07/03/12 12:34 AM, Craig Ringer wrote:
>> I'm seriously impressed that your system is working under load at all 
>> with 800 concurrent connections fighting to write all at once.
>
> indeed, in my transactional benchmarks on 12 core, 24 thread dual 
> xeon x5600 class systems, with 16 or 20 spindle raid10, I find 
> somewhere around 50 to 80 database connection threads has the highest 
> overall throughput (several thousand OLTP transactions/second). 
> this hardware has vastly better IO and CPU performance than any AWS 
> virtual machine.
>
> as craig suggested, your network threads could put the incoming 
> requests into queue(s), and run a tunable number of database 
> connection threads that take requests out of the queue and send them 
> to the database, and if necessary, return results to the network 
> thread. doing this will give better CPU utilization, and you can try 
> different database worker thread counts til you hit the optimal number 
> for your hardware.

Just to clear the air on this, this is almost exactly what I'm doing. 
The number 800 came out of experimenting with numbers (I'm sure it 
took you some time to find the optimum of 50-80 for your configuration). 
The number of "worker" threads is configurable, and they do receive 
their work from a shared queue. By the way, on the operations that I'm 
doing, postgres is performing very well, with an average of less than 
10ms per transaction and throughput at times over 600 tps.

However, writing data to postgres is not the only thing I need to do to 
process the data. If the time to process rises for other reasons, a low 
number of threads may not be able to keep up with the constant stream of 
incoming data, and I have to raise the worker thread count to 
compensate. As I was doing this, I ran into the problem described in the 
original email, and it puzzled me. However, just because I opened 800 
connections doesn't mean that all of the connections are being 
actively used concurrently (so not that much fighting). I should indeed 
switch to a connection pool model in such a case, just to not over-fork 
postgres; however, I don't see postgres consuming any significant 
amount of system resources through forked server processes.

Thank you,


Re: [GENERAL] errors with high connections rate

2012-07-03 Thread John R Pierce

On 07/03/12 12:34 AM, Craig Ringer wrote:
> I'm seriously impressed that your system is working under load at all 
> with 800 concurrent connections fighting to write all at once.

indeed, in my transactional benchmarks on 12 core, 24 thread dual xeon 
x5600 class systems, with 16 or 20 spindle raid10, I find somewhere 
around 50 to 80 database connection threads has the highest overall 
throughput (several thousand OLTP transactions/second). this hardware 
has vastly better IO and CPU performance than any AWS virtual machine.

as craig suggested, your network threads could put the incoming requests 
into queue(s), and run a tunable number of database connection threads 
that take requests out of the queue and send them to the database, and 
if necessary, return results to the network thread. doing this will 
give better CPU utilization, and you can try different database worker 
thread counts til you hit the optimal number for your hardware.
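
A minimal sketch of that queue-plus-workers arrangement, using pthreads. All names here (work_queue, run_workers, QUEUE_CAP, MAX_WORKERS) are hypothetical, the requests are plain ints, and the database write is a stub, so this only illustrates the threading shape rather than anyone's actual implementation.

```c
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 64      /* assumed queue depth */
#define MAX_WORKERS 32    /* assumed upper bound on worker threads */

/* Fixed-size ring buffer guarded by one mutex and two condition variables. */
typedef struct {
    int items[QUEUE_CAP];
    size_t head, tail, count;
    long delivered;             /* requests handed to workers so far */
    int closed;                 /* set once no more work will arrive */
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} work_queue;

/* Blocks while the queue is full, which throttles the producers. */
void queue_push(work_queue *q, int item)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Returns 1 and stores an item, or 0 once the queue is closed and drained. */
int queue_pop(work_queue *q, int *item)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->closed)
        pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count == 0) {
        pthread_mutex_unlock(&q->lock);
        return 0;
    }
    *item = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    q->delivered++;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return 1;
}

/* Tells every waiting worker that no further requests are coming. */
void queue_close(work_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->closed = 1;
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Each worker would own one database connection; the write is stubbed out. */
static void *db_worker(void *arg)
{
    work_queue *q = arg;
    int req;
    while (queue_pop(q, &req))
        (void)req;              /* placeholder for "send req to the database" */
    return NULL;
}

/* Pushes n_requests through n_workers threads; returns how many were handled,
 * or -1 if n_workers is out of range or a thread could not be started. */
long run_workers(int n_requests, int n_workers)
{
    work_queue q = { .lock = PTHREAD_MUTEX_INITIALIZER,
                     .not_empty = PTHREAD_COND_INITIALIZER,
                     .not_full = PTHREAD_COND_INITIALIZER };
    pthread_t tids[MAX_WORKERS];
    int started = 0;

    if (n_workers < 1 || n_workers > MAX_WORKERS)
        return -1;
    while (started < n_workers &&
           pthread_create(&tids[started], NULL, db_worker, &q) == 0)
        started++;
    if (started == n_workers)
        for (int i = 0; i < n_requests; i++)
            queue_push(&q, i);
    queue_close(&q);
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    return started == n_workers ? q.delivered : -1;
}
```

The worker count is the tunable: calling something like run_workers(1000, 50) and then run_workers(1000, 80) under a real workload is the kind of experiment described above. Because queue_push blocks when the queue is full, bursts are absorbed by the queue instead of by extra connections.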

john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast


Re: [GENERAL] errors with high connections rate

2012-07-03 Thread Craig Ringer

On 07/03/2012 03:19 PM, Pawel Veselov wrote:


> -- problem 1 --
>
> I have an application, using libpq, connecting to postgres 9.1.3 
> (Amazon AMI distro).
> The application writes data at a high rate (at this point it's 500 
> transactions per second), using multiple threads (at this point it's 800).
>
> These are "worker" threads, that receive "messages" that are then 
> written out to the DB. There is no connection pool, instead, each 
> worker thread maintains its own connection that it uses to write data 
> to the database. The connections are kept in pthread's "specific" data 
> blocks.

Hmm. To get that kind of TPS with that design are you running with 
fsync=off or on storage that claims to flush I/O without actually doing 
so? Have you checked your crash safety? Is it just fairly big hardware?

Why are you using so many connections? Unless you have truly monstrous 
hardware your system should achieve considerably greater throughput by 
reducing the connection count and queueing bursts of writes. You 
wouldn't even need an external pool in your case; just switch to a 
producer/consumer model where your accepting threads hand work to a 
separate and much smaller set of writer threads for sending to the DB. 
Writer threads could then do useful optimisations like multi-value 
inserts or COPYing data, doing small batches in transactions, etc.
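
As an illustration of the multi-value insert idea, here is a small sketch that builds one parameterised INSERT covering several queued rows. The function name, the single-column layout, and the table and column names used in the usage note are hypothetical; nothing here comes from the original application.

```c
#include <stdio.h>
#include <string.h>

/* Writes "INSERT INTO <table> (<column>) VALUES ($1),($2),...,($n)" into buf.
 * Returns the statement length, or -1 if n_rows < 1 or buf is too small.
 * table and column must be trusted identifiers; they are not escaped here. */
int build_multi_insert(char *buf, size_t buf_size,
                       const char *table, const char *column, int n_rows)
{
    if (n_rows < 1)
        return -1;

    int len = snprintf(buf, buf_size, "INSERT INTO %s (%s) VALUES ",
                       table, column);
    if (len < 0 || (size_t)len >= buf_size)
        return -1;

    for (int i = 1; i <= n_rows; i++) {
        size_t room = buf_size - (size_t)len;
        /* Every placeholder after the first is preceded by a comma. */
        int n = snprintf(buf + len, room, "%s($%d)", i == 1 ? "" : ",", i);
        if (n < 0 || (size_t)n >= room)
            return -1;
        len += n;
    }
    return len;
}
```

For three rows and the made-up names "messages" and "payload", this produces `INSERT INTO messages (payload) VALUES ($1),($2),($3)`. A writer thread could pass that string to libpq's PQexecParams() with three parameter values, turning three round trips into one.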

I'm seriously impressed that your system is working under load at all 
with 800 concurrent connections fighting to write all at once.

> Can't connect to DB: could not send data to server: Transport endpoint 
> is not connected
>
> could not send startup packet: Transport endpoint is not connected

postmaster forking and failing because of operating system resource 
limits like max proc count, anti-forkbomb measures, max file handles, etc?

> -- problem 2 --
>
> As I'm trying to debug this (with strace), I could never reproduce it, 
> at least to see what's going on, but sometimes I get another error: 
> "too many users connected". Even restarting postmaster doesn't help. 
> The postmaster is running with -N810, and the role has a connection 
> limit of 1000. Yet, the "too many" error starts creeping up only after 
> 275 connections are opened (counted by successful connect() from strace).
>
> Any idea where I should dig?

See how many connections the *server* thinks exist by examining 
pg_stat_activity.

Check dmesg and the PostgreSQL server logs to see if you're hitting 
operating system limits. Look for fork() failures, unexplained 
segfaults, etc.

Craig Ringer

[GENERAL] errors with high connections rate

2012-07-03 Thread Pawel Veselov

-- problem 1 --

I have an application, using libpq, connecting to postgres 9.1.3 (Amazon
AMI distro).
The application writes data at a high rate (at this point it's 500
transactions per second), using multiple threads (at this point it's 800).

These are "worker" threads, that receive "messages" that are then written
out to the DB. There is no connection pool, instead, each worker thread
maintains its own connection that it uses to write data to the database.
The connections are kept in pthread's "specific" data blocks.

Each thread connects to the DB when the first work message is
received, or when the "error" flag is set on its connection. The error
flag is set any time there is any error running a database statement.

When the work is "slow", I don't see any problem (slow was ~250 messages
per second). Once I increased the load, whenever I restart the process the
threads start grabbing work at a high enough rate that, with each one first
opening a connection to the database, these errors start popping up:

Can't connect to DB: could not send data to server: Transport endpoint is
not connected
could not send startup packet: Transport endpoint is not connected

This is a result of executing the following code:

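    wi->pg_conn = PQconnectdb(conn_str);
    ConnStatusType cst = PQstatus(wi->pg_conn);

    /* PQconnectdb() returns a non-NULL handle even on failure unless
       memory allocation itself fails, so PQstatus() is the real check. */
    if (cst != CONNECTION_OK) {
        ERR("Can't connect to DB: %s\n", PQerrorMessage(wi->pg_conn));
    }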

Eventually, the errors go away (when worker threads fail to connect,
they just pass the message to another thread, wait for their turn, and
try reconnecting again), so it does seem that the remedy is just
spreading the connections out in time.
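
One common way to spread reconnect attempts out in time is exponential backoff with random jitter. That is a substitute technique for illustration, not what the application above does, and every name and number in this sketch (backoff_delay_ms, the base and cap values in the usage note) is an assumption.

```c
#include <stdint.h>

/* Small xorshift generator so each thread can keep its own state.
 * The state must be seeded with a non-zero value. */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Delay in milliseconds before reconnect attempt number `attempt` (0-based).
 * The ceiling doubles from base_ms on each attempt up to cap_ms, and the
 * result is a random point in [0, ceiling] so threads do not retry in
 * lockstep. */
unsigned backoff_delay_ms(unsigned attempt, unsigned base_ms, unsigned cap_ms,
                          uint32_t *rng_state)
{
    uint64_t ceiling = base_ms;
    for (unsigned i = 0; i < attempt && ceiling < cap_ms; i++)
        ceiling *= 2;
    if (ceiling > cap_ms)
        ceiling = cap_ms;
    return (unsigned)(next_random(rng_state) % (ceiling + 1));
}
```

A worker whose connection attempt failed could sleep for backoff_delay_ms(attempt, 50, 5000, &state) milliseconds, with state seeded differently per thread, before calling PQconnectdb() again, so that several hundred threads starting together do not all hit the server in the same instant.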

The connection string is '' (empty), the connection is made through

I don't see these errors when:
1) the number of worker threads is reduced (I could never reproduce it with
200 or fewer, but have seen them with 300 and more)
2) the amount of load is reduced

-- problem 2 --

As I'm trying to debug this (with strace), I could never reproduce it, at
least to see what's going on, but sometimes I get another error: "too many
users connected". Even restarting postmaster doesn't help. The postmaster
is running with -N810, and the role has a connection limit of 1000. Yet, the
"too many" error starts creeping up only after 275 connections are opened
(counted by successful connect() from strace).

Any idea where I should dig?

P.S. I looked at fe-connect.c, and I'm wondering if there is a potential race
condition between poll() and the socket actually finishing the connection. When
running under strace, I never see EINPROGRESS returned from connect(), and
the only reason sendto() would result in ENOTCONN is if the connect
didn't finish and the socket was deemed "connected" using