Vladimir, And, any effect on your problem?
On Thu, Feb 13, 2014 at 9:35 AM, Бородин Владимир <r...@simply.name> wrote: > I have limited max connections to 1000, reduced shared buffers to 8G and > restarted postgres. 1000 is still to much in most cases. With pgbouncer in transaction pooling mode normaly pool size 8-32, max_connections = 100 (default value) and client_connections 500-1500 looks more reasonable. > I have also noticed that this big tables stopped vacuuming automatically a > couple of weeks ago. It could be the reason of the problem, I will now try to > tune autovacuum parameters to turn it back. But yesterday I ran "vacuum > analyze" for all relations manually but that did not help. How do your autovacuum parameters look like now? > 13.02.2014, в 0:14, Ilya Kosmodemiansky <hydrobi...@gmail.com> написал(а): > > On Wed, Feb 12, 2014 at 8:57 PM, Бородин Владимир <r...@simply.name> wrote: > > > Yes, this is legacy, I will fix it. We had lots of inactive connections but > right now we use pgbouncer for this. When the workload is normal we have some > kind of 80-120 backends. Less than 10 of them are in active state. Having > problem with locks we get lots of sessions (sometimes more than 1000 of them > are in active state). According to vmstat the number of context switches is > not so big (less than 20k), so I don't think it is the main reason. Yes, it > can aggravate the problem, but imho not create it. > > > > I'am afraid that is the problem. More than 1000 backends, most of them > are simply waiting. > > > > I don't understand the correlation of shared buffers size and > synchronous_commit. Could you please explain your statement? > > > > You need to fsync your huge shared buffers any time your database > performs checkpoint. By default it usually happens too often because > checkpoint_timeout is 5min by default. Without bbu, on software raid > that leads to io spike and you commit waits for wal. > > > > 12.02.2014, в 23:37, Ilya Kosmodemiansky <hydrobi...@gmail.com> написал(а): > > another thing which is arguable - concurrency degree. How many of your > max_connections = 4000 are actually running? 4000 definitely looks like an > overkill and they could be a serious source of concurrency, especially then > you have had barrier enabled and software raid. > > Plus for 32Gb of shared buffers with synchronous_commit = on especially on > heavy workload one should definitely have bbu, otherwise performance will be > poor. > > > On Wed, Feb 12, 2014 at 8:20 PM, Бородин Владимир <r...@simply.name> wrote: > > > Oh, I haven't thought about barriers, sorry. Although I use soft raid without > batteries I have turned barriers off on one cluster shard to try. > > root@rpopdb01e ~ # mount | fgrep data > /dev/md2 on /var/lib/pgsql/9.3/data type ext4 (rw,noatime,nodiratime) > root@rpopdb01e ~ # mount -o remount,nobarrier /dev/md2 > root@rpopdb01e ~ # mount | fgrep data > /dev/md2 on /var/lib/pgsql/9.3/data type ext4 > (rw,noatime,nodiratime,nobarrier) > root@rpopdb01e ~ # > > 12.02.2014, в 21:56, Ilya Kosmodemiansky <hydrobi...@gmail.com> написал(а): > > My question was actually about barrier option, by default it is enabled on > RHEL6/ext4 and could cause serious bottleneck on io before disks are actually > involved. What says mount without arguments? > > On Feb 12, 2014, at 18:43, Бородин Владимир <r...@simply.name> wrote: > > root@rpopdb01e ~ # fgrep data /etc/fstab > UUID=f815fd3f-e4e4-43a6-a6a1-bce1203db3e0 /var/lib/pgsql/9.3/data ext4 > noatime,nodiratime 0 1 > root@rpopdb01e ~ # > > According to iostat the disks are not the bottleneck. > > 12.02.2014, в 21:30, Ilya Kosmodemiansky <hydrobi...@gmail.com> написал(а): > > Hi Vladimir, > > Just in case: how is your ext4 mount? > > Best regards, > Ilya > > On Feb 12, 2014, at 17:59, Бородин Владимир <r...@simply.name> wrote: > > Hi all. > > Today I have started getting errors like below in logs (seems that I have not > changed anything for last week). When it happens the db gets lots of > connections in state active, eats 100% cpu and clients get errors (due to > timeout). > > 2014-02-12 15:44:24.562 > MSK,"rpop","rpopdb_p6",30061,"localhost:58350",52fb5e53.756d,1,"SELECT > waiting",2014-02-12 15:43:15 MSK,143/264877,1002850566,LOG,00000,"process > 30061 still waiting for ExclusiveLock on extension of relation 26118 of > database 24590 after 1000.082 ms",,,,,"SQL statement ""insert into > rpop.rpop_imap_uidls (folder_id, uidl) values (i_folder_id, i_uidl)"" > > I have read several topics [1, 2, 3, 4] with similar problems but haven't > find a good solution. Below is some more diagnostics. > > I am running PostgreSQL 9.3.2 installed from RPM packages on RHEL 6.4. Host > is running with the following CPU (32 cores) and memory: > > root@rpopdb01e ~ # fgrep -m1 'model name' /proc/cpuinfo > model name : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz > root@rpopdb01e ~ # free -m > total used free shared buffers cached > Mem: 129028 123558 5469 0 135 119504 > -/+ buffers/cache: 3918 125110 > Swap: 16378 0 16378 > root@rpopdb01e ~ # > > PGDATA lives on RAID6 array of 8 ssd-disks with ext4, iostat and atop say the > disks are really free. Right now PGDATA takes only 95G. > The settings changed in postgresql.conf are here [5]. > > When it happens the last query from here [6] shows that almost all queries > are waiting for ExclusiveLock, but they do a simple insert. > > (extend,26647,26825,,,,,,,) | 5459 | ExclusiveLock | 1 | > (extend,26647,26825,,,,,,,) | 8053 | ExclusiveLock | 5459,8053 > (extend,26647,26828,,,,,,,) | 5567 | ExclusiveLock | 1 | > (extend,26647,26828,,,,,,,) | 5490 | ExclusiveLock | 5567,5490 > (extend,24584,25626,,,,,,,) | 5611 | ExclusiveLock | 1 | > (extend,24584,25626,,,,,,,) | 3963 | ExclusiveLock | 5611,3963 > > I have several databases running on one host with one postmaster process and > ExclusiveLock is being waited by many oids. I suppose the only common thing > for all of them is that they are bigger than others and they almost do not > get updates and deletes (only inserts and reads). Some more info about one of > such tables is here [7]. > > I have tried to look at the source code (src/backend/access/heap/hio.c) to > understand when the exclusive lock can be taken, but I could only read > comments :) I have also examined FSM for this tables and their indexes and > found that for most of them there are free pages but there are, for example, > such cases: > > rpopdb_p0=# select count(*) from pg_freespace('rpop.rpop_uidl') where avail > != 0; > count > -------- > 115953 > (1 row) > > rpopdb_p0=# select count(*) from pg_freespace('rpop.pk_rpop_uidl') where > avail != 0; > count > ------- > 0 > (1 row) > > rpopdb_p0=# \dS+ rpop.rpop_uidl > Table "rpop.rpop_uidl" > Column | Type | Modifiers | Storage | Stats target | > Description > --------+------------------------+-----------+----------+--------------+------------- > popid | bigint | not null | plain | | > uidl | character varying(200) | not null | extended | | > Indexes: > "pk_rpop_uidl" PRIMARY KEY, btree (popid, uidl) > Has OIDs: no > > rpopdb_p0=# > > > My questions are: > 1. Do we consume 100% cpu (in system) trying to get page from FSM? Or does it > happen during exclusive lock acquiring? How can I dig it? > 2. How much space do we extend to the relation when we get exclusive lock on > it? > 3. Why extended page is not visible for other backends? > 4. Is there any possibility of situation where backend A got exclusive lock > on some relation to extend it. Then OS CPU scheduler made a context switch to > backend B while backend B is waiting for exclusive lock on the same relation. > And so on for many backends. > 5. (and the main question) what can I do to get rid of such situations? It is > a production cluster and I do not have any ideas what to do with this > situation :( Any help would be really appropriate. > > [1] > http://www.postgresql.org/message-id/8bca3aa10906011613l8ac2423h8153bbd2513dc...@mail.gmail.com > [2] > http://pgsql.performance.narkive.com/IrkPbl3f/postgresql-9-2-3-performance-problem-caused-exclusive-locks > [3] http://www.postgresql.org/message-id/50a2c93e.9070...@dalibo.com > [4] > http://www.postgresql.org/message-id/cal_0b1sypyeoynkynv95nnv2d+4jxtug3hkkf6fahfw7gvg...@mail.gmail.com > [5] http://pastebin.com/raw.php?i=Bd40Vn6h > [6] http://wiki.postgresql.org/wiki/Lock_dependency_information > [7 http://pastebin.com/raw.php?i=eGrtG524] > > -- > Vladimir > > > > > > > -- > Vladimir > > > > > > > -- > Да пребудет с вами сила... > http://simply.name > > > > > > > > > -- > Vladimir > > > > > > > -- > Vladimir > > > > -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance