Hello —

We had two incidents today (one this morning and one a few minutes ago) in which 
our primary database (RDS running PostgreSQL 10.1, 32 cores, 240 GB RAM, 5 TB 
total disk space, 20k PIOPS) suddenly crashed and went into recovery mode. The 
first time, we restarted the server after about 5 minutes in an attempt to get 
the system live again; the second time, we let it stay in recovery mode until it 
recovered on its own (about 10 minutes). The system was not under high load in 
either case.

Both times that the server crashed, we saw this in the logs:

2018-06-05 23:08:44 UTC:[12173]:ERROR:  
canceling statement due to statement timeout
2018-06-05 23:08:44 UTC::@:[48863]:LOG:  worker process: parallel worker for 
PID 12173 (PID 20238) exited with exit code 1
2018-06-05 23:08:49 UTC::@:[48863]:LOG:  server process (PID 12173) was 
terminated by signal 11: Segmentation fault
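
A rough sketch of how the PIDs in these lines can be tied together: the 
backend that hit the statement timeout, its parallel worker, and the process 
that segfaulted. The regular expressions assume the log format shown above, 
and the function name `correlate` is made up for illustration.

```python
import re

# Patterns assume the log format in the excerpt above; adjust for other prefixes.
TIMEOUT_RE = re.compile(r"\[(\d+)\]:ERROR:\s+canceling statement due to statement timeout")
WORKER_RE = re.compile(r"parallel worker for PID (\d+) \(PID (\d+)\) exited with exit code (\d+)")
CRASH_RE = re.compile(r"server process \(PID (\d+)\) was terminated by signal (\d+)")


def correlate(log_text):
    """Return one summary dict per backend that was terminated by a signal."""
    # Join wrapped log lines so each pattern can match across a line break.
    flat = " ".join(log_text.split())
    timed_out = set(TIMEOUT_RE.findall(flat))
    workers = {}
    for leader, worker, _code in WORKER_RE.findall(flat):
        workers.setdefault(leader, []).append(worker)
    return [
        {
            "pid": pid,
            "signal": int(sig),
            "hit_statement_timeout": pid in timed_out,
            "parallel_workers": workers.get(pid, []),
        }
        for pid, sig in CRASH_RE.findall(flat)
    ]


SAMPLE = """\
2018-06-05 23:08:44 UTC:[12173]:ERROR:
canceling statement due to statement timeout
2018-06-05 23:08:44 UTC::@:[48863]:LOG:  worker process: parallel worker for
PID 12173 (PID 20238) exited with exit code 1
2018-06-05 23:08:49 UTC::@:[48863]:LOG:  server process (PID 12173) was
terminated by signal 11: Segmentation fault
"""

for row in correlate(SAMPLE):
    print(row)
```

For the excerpt above, this reports that the backend that segfaulted (PID 
12173) is the same one whose statement was cancelled by the timeout, and that 
it had a parallel worker attached at the time.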

After the first crash, we started getting errors like:

2018-06-05 23:08:45 UTC:[11888]:ERROR:  
unexpected chunk number 0 (expected 1) for toast value 1592283014 in 

We were able to identify 15 rows that are corrupted and the exact fields that 
are being TOASTED. We’re following Josh Berkus’ post here: 
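
A generic way to locate rows like these is to read them one at a time and 
record which reads fail. The sketch below shows only that loop, and it is not 
necessarily the exact procedure from the post. `read_row` is a hypothetical 
callable, not part of any library; `fake_read_row` is a stand-in so the 
example runs without a database.

```python
def find_unreadable_rows(row_ids, read_row):
    """Probe rows one at a time and return (row_id, error) for each failed read."""
    bad = []
    for row_id in row_ids:
        try:
            # read_row is hypothetical: it should fetch the row and force every
            # TOASTed column to be fully read, raising if the database errors.
            read_row(row_id)
        except Exception as exc:
            bad.append((row_id, str(exc)))
    return bad


# Stand-in reader for demonstration only; a real one would query the database
# and roll back the failed transaction before the next probe.
def fake_read_row(row_id):
    if row_id in {3, 7}:
        raise RuntimeError("unexpected chunk number 0 (expected 1)")
    return {"id": row_id}


print(find_unreadable_rows(range(1, 10), fake_read_row))
```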

We have tried to fix those rows, both by using UPDATE to overwrite the bad 
fields and by using DELETE to remove the rows, but every attempt fails with: 
ERROR: tuple concurrently updated

We intend to reindex the TOAST table this evening, then try the DELETE again, 
and then run pg_repack. However, while that may resolve the TOAST corruption, 
we don’t believe the corruption is the root cause of the crashes. We can in 
theory restore from one of our backups, but that would result in data loss for 
our clients and may not necessarily resolve the issue. We’re worried that this 
is a Postgres bug, perhaps related to parallel query. We would appreciate any 
guidance people can give.

Thank you!
