On Tue, Jan 13, 2015 at 5:42 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> On 2015-01-13 17:39:09 -0600, Merlin Moncure wrote:
>> On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund <and...@2ndquadrant.com> 
>> wrote:
>> > On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
>> >> I'm inclined to think that this is a livelock, and so the problem
>> >> isn't evident from the structure of the B-Tree, but it can't hurt to
>> >> check.
>> >
>> > My guess is rather that it's contention on the freelist lock via
>> > StrategyGetBuffer. I've seen profiles like this due to exactly that
>> > before - and it fits parallel loading quite well.
>>
>> I think I've got it to pop again. s_lock is only showing 35%
>> (increasing very slowly if at all) but performance is mostly halted.
>> Frame pointer is compiled out.  perf report attached.
>
>>  35.82%  postgres                            [.] s_lock
>>  23.71%  postgres                            [.] tas
>>  14.01%  postgres                            [.] tas
>>   6.82%  postgres                            [.] spin_delay
>>   5.93%  postgres                            [.] LWLockRelease
>>   4.36%  postgres                            [.] LWLockAcquireCommon
>
> Interesting. This profile looks quite different?

yep, it's very stable, and the database is approximately frozen.

> What kind of hardware is this on?

Linux on a crap workstation box, except I have a good SSD (Intel 3500).

 model name : Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz


top - 17:44:00 up 11 days,  6:25, 15 users,  load average: 6.94, 6.97, 5.73
Tasks: 259 total,   8 running, 250 sleeping,   0 stopped,   1 zombie
%Cpu0  : 95.7 us,  0.0 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 93.5 us,  2.2 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 95.7 us,  0.0 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 93.6 us,  2.1 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   8131028 total,  7977608 used,   153420 free,    21424 buffers
KiB Swap:  8340476 total,  1884900 used,  6455576 free,  5201188 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
32492 mmoncure  20   0  175m  12m 9040 R  64.0  0.2   9:28.42 postgres
32367 mmoncure  20   0  175m  12m 9128 R  57.3  0.2   9:35.87 postgres
31792 mmoncure  20   0  175m  92m  88m R  52.9  1.2   9:57.50 postgres
32725 mmoncure  20   0  175m  13m 9852 R  52.9  0.2   9:08.50 postgres
31899 mmoncure  20   0  179m 104m  97m R  50.7  1.3   9:49.51 postgres
32368 mmoncure  20   0  175m  12m 9200 R  50.7  0.2   9:33.77 postgres
31282 mmoncure  20   0  184m 152m 140m R  46.3  1.9  10:26.29 postgres
  666 mmoncure  20   0 20640 1684 1116 R   2.2  0.0   0:00.01 top

Some more information about what's happening:
This is a ghetto logical replication engine that migrates data from
SQL Server to Postgres, consolidating a sharded database into a single
set of tables (of which there are only two). There is only one index
on each destination table, and it's a composite (int, int) index in
both cases.
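
For concreteness, each destination table is shaped roughly like this
(made-up names; the only part that matters here is the single
composite int/int index):

  create table dest_table (
      source_id int not null,
      row_id    int not null,
      payload   text
  );
  create index dest_table_idx on dest_table (source_id, row_id);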

A single source table is replicated in a single transaction, and it's
staged to a set of two temp tables, one for deletes and one for
inserts (roughly the pattern sketched below).
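
The apply step is roughly this shape (a simplified sketch using the
made-up names from above; the real code differs in the details):

  begin;

  create temp table stage_deletes (source_id int, row_id int) on commit drop;
  create temp table stage_inserts (like dest_table) on commit drop;

  -- bulk load both staging tables here (copy from the source extract)

  delete from dest_table d
  using stage_deletes s
  where d.source_id = s.source_id and d.row_id = s.row_id;

  insert into dest_table select * from stage_inserts;

  commit;
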
I log each step in the database.  Here are the times:

cds2=# select cdsruntableid, finished - started, inserted from
cdsruntable where cdsrunid = 62 order by started;
 cdsruntableid │    ?column?     │ inserted
───────────────┼─────────────────┼──────────
           833 │ 00:00:33.24044  │    11860
           834 │ 00:00:35.44981  │    19099
           835 │ 00:02:01.725186 │   530923
           836 │ 00:01:29.578811 │   211737
           838 │ 00:01:17.393461 │    64258
           837 │ 00:00:56.849106 │   227844
           839 │ 00:02:12.016504 │   630309
           840 │ 00:00:00.00111  │
           841 │ 00:01:09.058806 │   155309
           842 │ 00:01:54.723747 │   704422
           843 │ 00:01:14.965304 │
           844 │ 00:00:45.410953 │    59934
           845 │ 00:00:34.302632 │    14848
           846 │ 00:00:46.913832 │    89893
           848 │ 00:05:22.155189 │  2410622
           847 │ 00:01:25.199141 │   268157
           849 │ 00:01:16.168352 │   117511
           850 │ 00:00:34.809862 │     1175
           851 │ 00:01:12.565397 │    67214
           852 │ 00:01:03.734776 │    20129
           853 │ 00:00:41.038456 │    62765
           854 │ 00:01:03.478917 │    14967
           855 │ 00:00:33.88803  │     6901
           856 │ 00:00:36.381886 │     6670
           857 │ 00:00:36.626623 │     8268
           858 │ 00:01:14.391584 │   363448
           859 │ 00:01:44.619005 │   533395
           860 │ 00:01:02.042255 │   212202
           861 │ 00:00:00.001065 │
           863 │ 00:00:58.265506 │   215234
           862 │ 00:02:45.572635 │   827941
           865 │ 00:01:28.049165 │   241020
           864 │ 00:01:51.465643 │   531012
           866 │ 00:01:20.273391 │   419357
           868 │ 00:01:25.479443 │   294262
           869 │ 00:00:46.400825 │    46136
           870 │ 00:00:42.337286 │    25934
           871 │ 00:00:48.350152 │    91375
           872 │ 00:00:00.001743 │
           873 │ 00:01:04.596749 │   199744
           876 │ 00:01:15.210301 │   276276
           877 │ 00:02:36.447723 │  1693887
           878 │ 00:00:55.925104 │    36938
           879 │ 00:00:00.016874 │
           882 │ 00:00:50.479248 │   171901
           881 │                 │
           884 │                 │
           883 │                 │
           867 │                 │
           874 │                 │
           875 │                 │
           880 │                 │
(52 rows)


There is an 8-thread worker that walks that queue in id order (roughly
as sketched below) and migrates one source table of data at a time to
the destination Postgres server.  It's been stuck for quite some time
now, and I'm certain that if I restarted the database and reran the
incremental processing routine, it would work through those last seven
tables with no issues (at least that's what happened last time).  It's
been bogged down with no progress for a half hour now, so there's some
kind of weird bug here.
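
(For reference, each worker claims the next table with something of
this shape -- a simplified sketch, not the literal code, and it
assumes "not yet started" is tracked via a null started column:)

  update cdsruntable
     set started = now()
   where cdsruntableid = (select min(cdsruntableid)
                            from cdsruntable
                           where cdsrunid = 62
                             and started is null)
  returning cdsruntableid;
  -- plus enough locking that two workers can't claim the same id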

merlin

