On Thu, Nov 30, 2023 at 9:03 AM Abdul Qoyyuum <aqoyy...@cardaccess.com.bn> wrote:
> Hi Chris, > > On Wed, Nov 29, 2023 at 7:38 PM Chris Travers <chris.trav...@gmail.com> > wrote: > >> >> >> On Wed, Nov 29, 2023 at 4:36 PM Abdul Qoyyuum <aqoyy...@cardaccess.com.bn> >> wrote: >> >>> Hi all, >>> >>> Knowing that it's a data corruption issue, the only way to fix this is >>> to vacuum and reindex the database. What was suggested was the following: >>> >>> SET zero_damaged_pages = 0; # This is so that we can have the >>> application to continue to run >>> VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem >>> if possible. >>> REINDEX DATABASE "core"; # Then do a reindex and clean it up. >>> >> >> So first, to clear up some confusion on my part here: >> >> This procedure doesn't make a lot of sense to me. But did it clear up >> the issue? >> > Yeah it did fix the issue before (same issue as last year) and it has > fixed the problem that just happened a few days ago (almost exactly a year > ago). > >> >> In any of these cases, it is extremely important to diagnose the system >> properly. If you have a fault in your storage device or RAID controller, >> for example, you are asking for more corruption and data loss later. >> > >> At first I thought maybe you mistyped something and then realized there >> were a few issues with the process so it actually didn't make sense. >> >> First, zero_damaged_pages defaults to 0, and I can think of no reason to >> set it explicitly. >> Secondly, a vacuum full has to reindex, so there is no reason to do a >> reindex following. Your whole procedure is limited to a vacuum full, when >> a reindex is the only part that could affect this. If it did work, >> reindexing is the only part that would have been helpful. >> > Oh that makes sense actually. Thanks. > So for a temporary workaround, it sounds like reindexing helps for now, but yeah this really needs deeper investigation. > >> On to the question of what to do next.... >> >>> >>> We're on Postgresql 12. This has worked before it happened (almost >>> exactly a year ago) and I think this needs a more permanent solution. I've >>> looked at routine vacuuming and checked the autovacuum is set to on and the >>> following configurations: >>> >>> core=> select name, setting from pg_settings where name like >>> 'autovacuum%'; >>> name | setting >>> -------------------------------------+----------- >>> autovacuum | on >>> autovacuum_analyze_scale_factor | 0.1 >>> autovacuum_analyze_threshold | 50 >>> autovacuum_freeze_max_age | 200000000 >>> autovacuum_max_workers | 3 >>> autovacuum_multixact_freeze_max_age | 400000000 >>> autovacuum_naptime | 60 >>> autovacuum_vacuum_cost_delay | 2 >>> autovacuum_vacuum_cost_limit | -1 >>> autovacuum_vacuum_scale_factor | 0.2 >>> autovacuum_vacuum_threshold | 50 >>> autovacuum_work_mem | -1 >>> (12 rows) >>> >>> Can anyone advise if there's anything else we can do? We have no clue >>> what causes the invalid page block and we are running a High Availability >>> cluster set up but we are hoping that there may be a way to mitigate it. >>> >>> >> You need to figure out why the corruption is happening. This is most >> likely, in my experience, not a PostgreSQL bug, but usually something that >> happens on the hardware layer or an environmental factor. It could be >> failin storage or CPU. Or it could be something like bad electrical input >> or insufficient cooling (I have seen index and even table corruption issues >> from both of these). >> >> If this is a server you run, the first things I would check are: >> 1. Is there a good-quality UPS that the server is plugged into? Are the >> batteries in good working order? >> > The servers are dual powered and hooked up to both supplied electricity, > with a backup generator and if that fails, it will switch over to the UPS. > All of these are supplied and maintained by the data centre that the > servers are at. There have been no electrical problems so far. > >> 2. Is the server somewhere that may be sitting in a pocket of hot air? >> > As you can imagine, the data centre has air-conditioning and floored fans > blowing hot air up and out, keeping all servers cooled. Checking on the > blade servers that the database VM sits on shows that the temperature is > optimally sitting at 65 degrees celsius. So I don't think it's a > temperature problem either. > Ok so this is in a professionally run datacenter. That does indeed eliminate at least some of these issues. > >> Once you have ruled these out, the next things to check are CPU, memory, >> and storage health. Unfortunately checking these is harder but you can >> check SMART indications, and other diagnostic indicators. >> > Would this top stats be useful? The database does get busy especially at > peak hours. > > top - 10:04:25 up 782 days, 43 min, 2 users, load average: 1.20, 1.14, > 1.10 > Tasks: 415 total, 3 running, 412 sleeping, 0 stopped, 0 zombie > Cpu(s): 22.7%us, 13.8%sy, 0.0%ni, 63.2%id, 0.2%wa, 0.0%hi, 0.1%si, > 0.1%st > Mem: 8160156k total, 7832828k used, 327328k free, 34584k buffers > Swap: 1048572k total, 304396k used, 744176k free, 6674428k cached > > We are running on a SAN group disk storage. I guess I should take a closer > look at the disks. Thanks for the pointer. > anything in mcelog? That would be a place to check for memory issues if you have ECC RAM (which I assume you do in a datacenter). As far as the SAN goes, another thing I would pay attention to would be whether there are any indication of fsync errors in the dmesg, or other logs. In addition, of course, to a review of disk errors themselves. > >> >> However, once these errors start happening, you are in danger territory >> and need to find out why (and correct the underlying problem) before you >> get data loss. >> > Thanks for the warning. We'll see what we can find. > > >> -- >> Best Wishes, >> Chris Travers >> >> Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor >> lock-in. >> http://www.efficito.com/learn_more >> > -- Best Wishes, Chris Travers Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor lock-in. http://www.efficito.com/learn_more