3192429

Chris Travers Wed, 29 Nov 2023 20:16:05 -0800

On Thu, Nov 30, 2023 at 9:03 AM Abdul Qoyyuum <[email protected]>
wrote:


> Hi Chris,
>
> On Wed, Nov 29, 2023 at 7:38 PM Chris Travers <[email protected]>
> wrote:
>
>>
>>
>> On Wed, Nov 29, 2023 at 4:36 PM Abdul Qoyyuum <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> Knowing that it's a data corruption issue, the only way to fix this is
>>> to vacuum and reindex the database. What was suggested was the following:
>>>
>>> SET zero_damaged_pages = 0; # This is so that we can have the
>>> application to continue to run
>>> VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem
>>> if possible.
>>> REINDEX DATABASE "core"; # Then do a reindex and clean it up.
>>>
>>
>> So first, to clear up some confusion on my part here:
>>
>> This procedure doesn't make a lot of sense to me.  But did it clear up
>> the issue?
>>
> Yeah it did fix the issue before (same issue as last year) and it has
> fixed the problem that just happened a few days ago (almost exactly a year
> ago).
>
>>
>> In any of these cases, it is extremely important to diagnose the system
>> properly.  If you have a fault in your storage device or RAID controller,
>> for example, you are asking for more corruption and data loss later.
>>
>
>> At first I thought maybe you mistyped something and then realized there
>> were a few issues with the process so it actually didn't make sense.
>>
>> First, zero_damaged_pages defaults to 0, and I can think of no reason to
>> set  it explicitly.
>> Secondly, a vacuum full has to reindex, so there is no reason to do a
>> reindex following.  Your whole procedure is limited to a vacuum full, when
>> a reindex is the only part that could affect this.   If it did work,
>> reindexing is the only part that would have been helpful.
>>
> Oh that makes sense actually. Thanks.
>

So for a temporary workaround, it sounds like reindexing helps for now, but
yeah this really needs deeper investigation.

>
>> On to the question of what to do next....
>>
>>>
>>> We're on Postgresql 12. This has worked before it happened (almost
>>> exactly a year ago) and I think this needs a more permanent solution. I've
>>> looked at routine vacuuming and checked the autovacuum is set to on and the
>>> following configurations:
>>>
>>> core=> select name, setting from pg_settings where name like
>>> 'autovacuum%';
>>>                 name                 |  setting
>>> -------------------------------------+-----------
>>>  autovacuum                          | on
>>>  autovacuum_analyze_scale_factor     | 0.1
>>>  autovacuum_analyze_threshold        | 50
>>>  autovacuum_freeze_max_age           | 200000000
>>>  autovacuum_max_workers              | 3
>>>  autovacuum_multixact_freeze_max_age | 400000000
>>>  autovacuum_naptime                  | 60
>>>  autovacuum_vacuum_cost_delay        | 2
>>>  autovacuum_vacuum_cost_limit        | -1
>>>  autovacuum_vacuum_scale_factor      | 0.2
>>>  autovacuum_vacuum_threshold         | 50
>>>  autovacuum_work_mem                 | -1
>>> (12 rows)
>>>
>>> Can anyone advise if there's anything else we can do? We have no clue
>>> what causes the invalid page block and we are running a High Availability
>>> cluster set up but we are hoping that there may be a way to mitigate it.
>>>
>>>
>> You need to figure out why the corruption is happening.  This is most
>> likely, in my experience, not a PostgreSQL bug, but usually something that
>> happens on the hardware layer or an environmental factor.  It could be
>> failin storage or CPU.  Or it could be something like bad electrical input
>> or insufficient cooling (I have seen index and even table corruption issues
>> from both of these).
>>
>> If this is a server you run, the first things I would check are:
>> 1.  Is there a good-quality UPS that the server is plugged into?  Are the
>> batteries in good working order?
>>
> The servers are dual powered and hooked up to both supplied electricity,
> with a backup generator and if that fails, it will switch over to the UPS.
> All of these are supplied and maintained by the data centre that the
> servers are at. There have been no electrical problems so far.
>
>> 2.  Is the server somewhere that may be sitting in a pocket of hot air?
>>
> As you can imagine, the data centre has air-conditioning and floored fans
> blowing hot air up and out, keeping all servers cooled. Checking on the
> blade servers that the database VM sits on shows that the temperature is
> optimally sitting at 65 degrees celsius. So I don't think it's a
> temperature problem either.
>

Ok so this is in a professionally run datacenter.  That does indeed
eliminate at least some of these issues.

>
>> Once you have ruled these out, the next things to check are CPU, memory,
>> and storage health.  Unfortunately checking these is harder but you can
>> check SMART indications, and other diagnostic indicators.
>>
> Would this top stats be useful? The database does get busy especially at
> peak hours.
>
> top - 10:04:25 up 782 days, 43 min,  2 users,  load average: 1.20, 1.14,
> 1.10
> Tasks: 415 total,   3 running, 412 sleeping,   0 stopped,   0 zombie
> Cpu(s): 22.7%us, 13.8%sy,  0.0%ni, 63.2%id,  0.2%wa,  0.0%hi,  0.1%si,
>  0.1%st
> Mem:   8160156k total,  7832828k used,   327328k free,    34584k buffers
> Swap:  1048572k total,   304396k used,   744176k free,  6674428k cached
>
> We are running on a SAN group disk storage. I guess I should take a closer
> look at the disks. Thanks for the pointer.
>

anything in mcelog?  That would be a place to check for memory issues if
you have ECC RAM (which I assume you do in a datacenter).

As far as the SAN goes, another thing I would pay attention to would be
whether there are any indication of fsync errors in the dmesg, or other
logs.   In addition, of course, to a review of disk errors themselves.


>
>>
>> However, once these errors start happening, you are in danger territory
>> and need to find out why (and correct the underlying problem) before you
>> get data loss.
>>
> Thanks for the warning. We'll see what we can find.
>



>
>> --
>> Best Wishes,
>> Chris Travers
>>
>> Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
>> lock-in.
>> http://www.efficito.com/learn_more
>>
>

-- 
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more

Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429

Reply via email to