On Tue, Apr 18, 2017 at 2:42 AM, Moreno Andreo <moreno.and...@evolu-s.it>
wrote:

> Hi all,
>     As many of you read last Friday (and many tried to help, too, for which
> I still thank you very much), I had a bad service outage.
> I was advised to reduce the maximum number of connections using a pooler,
> and that's what I'm building in my test lab, but I'm wondering if there's
> something I can do with my overall architecture design.
> ATM we host one database per customer (about 400 now), and every customer
> has two points of access to the data:
> - Directly to the database, via rubyrep, to replicate the database he has
> on his own machine
> - Via WCF self-hosted web services, to read other customers' data
> Every customer can access (and replicate) his database from a number of
> different locations (max 3).
> Customers are organized in groups (max 10 per group), and there is the
> chance that someone accesses someone else's data via WCF.
> For example, take a group of 5: everyone running rubyrep with only one
> location enabled, and reading data from the others' databases.
> If I'm not wrong, that's 5 connections (rubyrep) plus 5*4 = 20 (everyone
> connecting to everyone else's database) for WCF, so 25 connections in total.
> Now imagine a group of 10...
> Last Friday I was told that 350 connections is quite a big number and that
> things can begin to slow down. OK. When something slows down, I usually
> search for and find the bottleneck (CPU, RAM, I/O, etc.). If everything was
> apparently running fine (CPU < 10%, RAM used < 20%, I/O rarely over 20%),
> how can I tell there's a bottleneck slowing things down? Am I missing
> something?
> Another thing: on a customer server (with a similar, smaller architecture)
> I _do_ have a connection leak problem that's under investigation, but when
> things begin to slow down I simply run pg_terminate_backend on all
> connections with an age > 10 min and everything goes back to normal. On my
> server, last Friday, that did not help, so I thought it was not the main
> problem.
> I've got no problem splitting this architecture across as many servers as I
> need, but I think I need some tips on how to design it, in order to avoid
> major issues in the near future (ask for details if needed).
>
> The current PostgreSQL 9.5.6 server is an 8-core VM with 52 GB RAM, running
> Debian 8.
> The WCF server is a 4-core Windows 2012 R2 machine with 16 GB RAM.
>
> While we were facing the issue, neither of them showed any kind of overload
> and their logs were clean.
>
> I'm a bit scared it could happen again...
>
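The per-group arithmetic above generalizes: a fully meshed group of n
customers needs n rubyrep connections plus n*(n-1) WCF connections. A quick
sketch of that estimate (the function name is illustrative, not from the
thread):

```python
def group_connections(n: int) -> int:
    """Estimated connections for a fully meshed group of n customers:
    n rubyrep connections plus n*(n-1) WCF connections (everyone
    reading everyone else's database)."""
    return n + n * (n - 1)

print(group_connections(5))   # 25, matching the group-of-5 example
print(group_connections(10))  # 100, the "group of 10" case
```

The quadratic n*(n-1) term is why connection counts climb so quickly as
groups grow, and why a pooler helps.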

The logs being clean doesn't help much if your log settings are too terse.

Is log_lock_waits on?  log_checkpoints?  track_io_timing (which doesn't show
up in the logs; you have to query the statistics views)?

Is log_min_duration_statement set to a reasonable value?
log_autovacuum_min_duration?

Are you using pg_stat_statements (which also doesn't show up in the logs; you
have to query it), and perhaps auto_explain?
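For reference, the settings and views mentioned above can be inspected with
queries along these lines (a sketch against PostgreSQL 9.5; pg_stat_statements
must be listed in shared_preload_libraries and created as an extension first):

```sql
-- Current values of the logging/instrumentation settings
SELECT name, setting
FROM pg_settings
WHERE name IN ('log_lock_waits', 'log_checkpoints', 'track_io_timing',
               'log_min_duration_statement', 'log_autovacuum_min_duration');

-- With track_io_timing on, per-database I/O time shows up here (ms)
SELECT datname, blk_read_time, blk_write_time
FROM pg_stat_database;

-- Top statements by cumulative time (9.x column name; requires the
-- pg_stat_statements extension)
SELECT query, calls, total_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```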

Cheers,

Jeff
