Re: [sqlsmith] Unpinning error in parallel worker

2018-04-24 Thread Jonathan Rudenberg


On Tue, Apr 24, 2018, at 16:06, Thomas Munro wrote:
> On Wed, Apr 25, 2018 at 2:21 AM, Jonathan Rudenberg
> <jonat...@titanous.com> wrote:
> > This issue happened again in production; here are the stack traces from
> > three of the backends we grabbed before nuking the >400 hanging backends.
> >
> > [...]
> > #4  0x55fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
> > #5  0x55fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
> > #6  0x55fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
> > [...]
> 
> Thank you.  These stacks are all blocked trying to acquire
> DynamicSharedMemoryControlLock.  My theory is that they can't because
> one backend -- the one that emitted the error "FATAL:  cannot unpin a
> segment that is not pinned" -- is deadlocked against itself.  After
> emitting that error you can see from Andreas's "seabisquit" stack that
> shmem_exit() runs dsm_backend_shutdown(), which runs dsm_detach(), which
> tries to acquire DynamicSharedMemoryControlLock again, even though we
> already hold it at that point.
> 
> I'll write a patch to fix that unpleasant symptom.  While holding
> DynamicSharedMemoryControlLock we shouldn't raise any errors without
> releasing it first, because the error handling path will try to
> acquire it again.  That's a horrible failure mode as you have
> discovered.
> 
> But that isn't the root problem: we shouldn't be raising that error,
> and I'd love to see the stack of the one process that did that and
> then self-deadlocked.  I will have another go at trying to reproduce
> it here today.
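To make the hazard concrete: it can be modeled in miniature outside PostgreSQL. The sketch below is a toy model, not the actual dsm.c code — a pthread mutex stands in for DynamicSharedMemoryControlLock, fail() stands in for ereport(FATAL, ...), and an atexit() hook stands in for the shmem_exit() -> dsm_backend_shutdown() -> dsm_detach() path.

/* deadlock_model.c -- toy model of the dsm self-deadlock (not PostgreSQL code).
 * Build with: cc deadlock_model.c -o deadlock_model -lpthread
 */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for DynamicSharedMemoryControlLock (non-recursive, like an LWLock). */
static pthread_mutex_t control_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the shmem_exit() -> dsm_backend_shutdown() -> dsm_detach()
 * path, which unconditionally takes the control lock again. */
static void cleanup(void)
{
	if (pthread_mutex_trylock(&control_lock) == EBUSY)
	{
		/* We still hold the lock from the error site.  The real code calls
		 * LWLockAcquire() here and simply hangs forever: self-deadlock. */
		fprintf(stderr, "cleanup: would self-deadlock, lock already held\n");
		return;
	}
	pthread_mutex_unlock(&control_lock);
}

/* Stand-in for ereport(FATAL, ...): the error exit runs the cleanup hooks. */
static void fail(const char *msg)
{
	fprintf(stderr, "FATAL:  %s\n", msg);
	exit(1);					/* atexit() hooks fire, like shmem_exit() */
}

int main(void)
{
	atexit(cleanup);

	pthread_mutex_lock(&control_lock);	/* e.g. while manipulating DSM state */

	/* The bug: raising an error while still holding the lock.  The fix is to
	 * release the lock first, i.e.:
	 *     pthread_mutex_unlock(&control_lock);
	 *     fail("cannot unpin a segment that is not pinned");
	 */
	fail("cannot unpin a segment that is not pinned");
}

The fix described above amounts to moving the unlock ahead of the error, so the exit path can reacquire the lock and the failing backend dies cleanly instead of wedging every other backend behind DynamicSharedMemoryControlLock.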

Thanks for the update!

We have turned off parallel query (by setting max_parallel_workers_per_gather = 0)
for now, as the production impact of this bug is unfortunately severe.

What will this failure look like with the patch you've proposed?

Thanks again,

Jonathan



Re: [sqlsmith] Unpinning error in parallel worker

2018-04-24 Thread Jonathan Rudenberg


On Fri, Apr 20, 2018, at 00:42, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg
> <jonat...@titanous.com> wrote:
> > On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> >> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg
> >> <jonat...@titanous.com> wrote:
> >> > Yep, I think I know approximately what it looked like; I've attached a
> >> > lightly redacted plan. All of the hung queries were running some variant
> >> > of this plan as far as I can tell.
> >>
> >> Hmm, that isn't a parallel query.  I was expecting to see "Gather" and
> >> "Parallel" in there.
> >
> > Oops, I'm really sorry about that. I only have the first part of the hung 
> > queries, and there are a few variants. Here's one that's parallel.
> 
> I spent some time trying to reproduce this failure without any luck,
> using query plans similar to your Gather plan fragment, and using some
> test harness code for the allocator stuff in isolation.
> 
> I had an idea that (1) freeing a large object that releases and unpins
> a segment in one backend and then (2) freeing it again in another
> backend (illegally) might produce this effect with sufficiently bad
> luck.  I'm still trying to reproduce that without any success, but I
> get other kinds of failures which I think you'd be seeing too if that
> hunch were right.  Still looking...


This issue happened again in production; here are the stack traces from three
of the backends we grabbed before nuking the >400 hanging backends.

process: /proc/24194/mem
thread: 0x7f46263648c0, lwp: 24194, type: 1
#0  0x7f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1  0x7f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2  0x7f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_wait.c:29
#3  0x55fccb8c8832 in PGSemaphoreLock+34() at /usr/lib/postgresql/10/bin/postgres at pg_sema.c:310
#4  0x55fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
#5  0x55fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
#6  0x55fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
#7  0x55fccb815f27 in ExecInitParallelPlan+807() at /usr/lib/postgresql/10/bin/postgres at execParallel.c:470
#8  0x55fccb825a50 in ExecGather+1168() at /usr/lib/postgresql/10/bin/postgres at nodeGather.c:158
#9  0x55fccb81dbd3 in ExecAppend+83() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#10 0x55fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#11 0x55fccb820293 in ExecAgg+1235() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2539
#12 0x55fccb835235 in CteScanNext+229() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#13 0x55fccb817f0d in ExecScan+557() at /usr/lib/postgresql/10/bin/postgres at execScan.c:97
#14 0x55fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#15 0x55fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#16 0x55fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#17 0x55fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#18 0x55fccb8201eb in ExecAgg+1067() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2347
#19 0x55fccb811dfb in standard_ExecutorRun+379() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#20 0x55fccb94d19b in PortalRunSelect+507() at /usr/lib/postgresql/10/bin/postgres at pquery.c:932
#21 0x55fccb94e798 in PortalRun+824() at /usr/lib/postgresql/10/bin/postgres at pquery.c:773
#22 0x55fccb94b329 in PostgresMain+3721() at /usr/lib/postgresql/10/bin/postgres at postgres.c:1984
#23 0x55fccb6840d9 in ServerLoop+3412() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:4405
#24 0x55fccb8da78b in PostmasterMain+4235() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:1363
#25 0x55fccb6854d5 in main+2165() at /usr/lib/postgresql/10/bin/postgres at main.c:228
#26 0x7f4623c07830 in __libc_start_main+240() at /lib/x86_64-linux-gnu/libc.so.6 at libc-start.c:291
#27 0x55fccb685559 in _start!+41() at /usr/lib/postgresql/10/bin/postgres

process: /proc/24231/mem
thread: 0x7f46263648c0, lwp: 24231, type: 1
#0  0x7f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1  0x7f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2  0x7f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_wait.c:29

Re: [sqlsmith] Unpinning error in parallel worker

2018-04-20 Thread Jonathan Rudenberg


On Fri, Apr 20, 2018, at 00:42, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg
> <jonat...@titanous.com> wrote:
> > On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> >> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg
> >> <jonat...@titanous.com> wrote:
> >> > Yep, I think I know approximately what it looked like; I've attached a
> >> > lightly redacted plan. All of the hung queries were running some variant
> >> > of this plan as far as I can tell.
> >>
> >> Hmm, that isn't a parallel query.  I was expecting to see "Gather" and
> >> "Parallel" in there.
> >
> > Oops, I'm really sorry about that. I only have the first part of the hung 
> > queries, and there are a few variants. Here's one that's parallel.
> 
> I spent some time trying to reproduce this failure without any luck,
> using query plans similar to your Gather plan fragment, and using some
> test harness code for the allocator stuff in isolation.
> 
> I had an idea that (1) freeing a large object that releases and unpins
> a segment in one backend and then (2) freeing it again in another
> backend (illegally) might produce this effect with sufficiently bad
> luck.  I'm still trying to reproduce that without any success, but I
> get other kinds of failures which I think you'd be seeing too if that
> hunch were right.  Still looking...
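The hypothesized sequence reduces to a few lines. Below is a minimal single-process C model of that hunch — the pinned flag stands in for the per-segment pin state PostgreSQL keeps in the DSM control segment, and unpin_segment() here is a hypothetical stand-in, not PostgreSQL source:

/* double_unpin_model.c -- toy model of the double-free hunch (not PostgreSQL code).
 * Build with: cc double_unpin_model.c -o double_unpin_model
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	int		handle;
	bool	pinned;				/* stand-in for the control segment's pin flag */
} segment;

static void unpin_segment(segment *seg)
{
	if (!seg->pinned)
	{
		/* The condition that produces the error seen in the server log. */
		fprintf(stderr, "FATAL:  cannot unpin a segment that is not pinned\n");
		exit(1);
	}
	seg->pinned = false;
	printf("segment %d unpinned\n", seg->handle);
}

int main(void)
{
	segment		seg = {.handle = 42, .pinned = true};

	/* (1) one backend frees the large object, releasing and unpinning: OK */
	unpin_segment(&seg);

	/* (2) another backend frees it again (illegally): trips the error */
	unpin_segment(&seg);
	return 0;
}

A real reproduction would need two backends and unlucky timing; the model only shows why the second, illegal unpin produces exactly the FATAL from the logs.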

Thanks for investigating! We have monitoring in place and a plan to grab stack
traces if this happens again. Is there anything else that would be useful for
us to capture, in addition to a stack trace from the backend running the stuck
query?



Re: [sqlsmith] Unpinning error in parallel worker

2018-04-17 Thread Jonathan Rudenberg


On Tue, Apr 17, 2018, at 18:38, Thomas Munro wrote:

> I don't have any theories about how that could be going wrong right
> now, but I'm looking into it. 

Thank you!

> > I don't have backtraces yet, but I will provide them if/when the issue
> > happens again.
> 
> Thanks, that would be much appreciated, as would any clues about what
> workload you're running.  Do you know what the query plan looks like
> for the queries that crashed?

Yep, I think I know approximately what it looked like; I've attached a lightly
redacted plan. All of the hung queries were running some variant of this plan
as far as I can tell.

Thanks again!

Jonathan



                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1896.13..1896.13 rows=2 width=4336) (actual time=29.396..29.447 rows=410 loops=1)
   Sort Key: assets."position"
   Sort Method: quicksort  Memory: 1671kB
   CTE assets
     ->  HashAggregate  (cost=1861.88..1863.61 rows=173 width=864) (actual time=23.328..23.883 rows=1242 loops=1)
           Group Key: assets_7.id, assets_7.name, assets_7.description, assets_7.asset_key, assets_7.type, assets_7.created_at, assets_7.updated_at, assets_7.deleted_at, assets_7.redacted_id, assets_7.section_id, assets_7."position", assets_7.creator_id, assets_7.modifier_id, assets_7.data, assets_7.thumbnail_override, assets_7.availability_start, assets_7.availability_end, assets_7.availability
           ->  Append  (cost=853.88..1854.10 rows=173 width=864) (actual time=10.871..22.002 rows=1242 loops=1)
                 ->  CTE Scan on assets assets_7  (cost=853.88..861.60 rows=2 width=864) (actual time=10.870..12.339 rows=1242 loops=1)
                       Filter: (redacted_id = 32456)
                       CTE assets
                         ->  HashAggregate  (cost=850.45..853.88 rows=343 width=864) (actual time=10.867..11.392 rows=1242 loops=1)
                               Group Key: assets_1.id, assets_1.name, assets_1.description, assets_1.asset_key, assets_1.type, assets_1.created_at, assets_1.updated_at, assets_1.deleted_at, assets_1.redacted_id, assets_1.section_id, assets_1."position", assets_1.creator_id, assets_1.modifier_id, assets_1.data, assets_1.thumbnail_override, assets_1.availability_start, assets_1.availability_end, assets_1.availability
                               ->  Append  (cost=1.41..835.02 rows=343 width=864) (actual time=0.380..9.456 rows=1261 loops=1)
                                     ->  Nested Loop  (cost=1.41..174.78 rows=1 width=249) (actual time=0.046..0.046 rows=0 loops=1)
                                           ->  Nested Loop  (cost=0.98..5.71 rows=1 width=4) (actual time=0.046..0.046 rows=0 loops=1)
                                                 ->  Nested Loop  (cost=0.70..5.14 rows=1 width=8) (actual time=0.046..0.046 rows=0 loops=1)
                                                       ->  Index Scan using index_user_id_on_permissible_type on user_permissions  (cost=0.42..2.64 rows=1 width=4) (actual time=0.045..0.045 rows=0 loops=1)
                                                             Index Cond: ((user_id = 98789) AND ((permissible_type)::text = 'Organization'::text))
                                                             Filter: ((permission_level)::text = 'guest'::text)
                                                       ->  Index Only Scan using organizations_pkey on organizations  (cost=0.28..2.50 rows=1 width=4) (never executed)
                                                             Index Cond: (id = user_permissions.permissible_id)
                                                             Heap Fetches: 0
                                                 ->  Index Scan using index_redacteds_on_organization_id on redacteds redacteds_1  (cost=0.28..0.43 rows=1 width=8) (never executed)
                                                       Index Cond: (organization_id = organizations.id)

Re: [sqlsmith] Unpinning error in parallel worker

2018-04-17 Thread Jonathan Rudenberg


On Wed, Mar 29, 2017, at 10:50, Robert Haas wrote:
> On Wed, Mar 29, 2017 at 1:31 AM, Thomas Munro
>  wrote:
> > I considered whether the error message could be improved but it
> > matches the message for an existing similar case (where you try to
> > attach to an unknown handle).
> 
> Ugh, OK.  I committed this, but I think this whole file needs a visit
> from the message style police.

Twice in the past week, I ran into an issue on a production cluster running
10.3 (Ubuntu 10.3-1.pgdg16.04+1) that matches the symptoms of this bug.

Hundreds of queries were stuck with a wait_event of
DynamicSharedMemoryControlLock, and pg_terminate_backend did not terminate them.

In the log:

> FATAL:  cannot unpin a segment that is not pinned

I don't have backtraces yet, but I will provide them if/when the issue happens
again.

Jonathan