Re: [GENERAL] Re: corruption issue after server crash - ERROR: unexpected chunk number 0

2013-11-26 Thread Shaun Thomas
> So if this problem replicated to our standby servers, does that indicate
> that the potentially problematic fsync occurred during a pg_xlog write?

Pretty much. You have a couple of issues here, and no easy way to approach them. 
Primarily, you got data corruption during a sync operation. This means either 
the OS or the hardware somewhere along the line lied about the write, or the 
write was corrupted and the filesystem log replayed incorrectly upon reboot. 
Once that happens, you can't trust *any* data in your database. Pre-checksum 
PostgreSQL has no way to verify the integrity of existing data, and a system 
crash can corrupt quite a bit of data that was only tangentially involved.

What likely happens in these scenarios is that the database startup succeeds, 
and then it reads some rows in from a corrupted table. By corrupted, I mean even 
a single data page with a mangled pointer. That mangled pointer gives the 
database incorrect information about the state of that data page's contents, 
and the database carries on with that bad information. That means subsequent 
transaction logs from that point are *also* corrupt, and hence any streaming or 
warm standby replicas are subsequently damaged as well. But they'll be damaged 
differently, because they likely didn't have the initial corruption, just the 
byte changes dictated by the WAL stream.
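
As a rough illustration, you can spot-check whether a standby has drifted from 
its master by hashing a table's contents on both nodes and comparing. This is 
only a sketch: "some_table" and its key "id" are placeholders, and you'd want a 
WHERE clause on anything large:

-- Run the same statement on the master and on the standby; differing hashes
-- mean the two copies have diverged. Casting each row to text also forces any
-- TOAST data to be read, so corruption can surface here as an error.
SELECT md5(string_agg(t::text, '' ORDER BY t.id)) AS content_hash,
       count(*)                                   AS row_count
  FROM some_table AS t;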

Unless you know where the initial corruption came from, the system that caused 
it should be quarantined for verification. RAM, disk, CPU, everything should 
pass integrity checks before putting it back into production.

> Would breaking replication at the time of the crash have prevented
> this from cascading, or was it already too late at that point?

Most likely. If, at the time of the crash, you had switched to one of your 
replicas and made it the new master, it would have given you the opportunity to 
check out the crashed system before it spread the love. Even if you don't have 
a true STONITH model, starting up a potentially data-compromised node in an 
active cluster is a gamble.
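
When you do flip roles, it's worth confirming what each node actually thinks it 
is. A minimal check, run on every node (both functions are built in on 9.1 and 
later):

-- 'f' means the node is acting as a master; on a standby, the second column
-- shows how far WAL replay has progressed (it is NULL on a master).
SELECT pg_is_in_recovery()             AS in_recovery,
       pg_last_xact_replay_timestamp() AS last_replay;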

I did something similar once. One of our DRBD nodes crashed and came back up 
and re-attached to the DRBD pair after a quick data discard and replay. I 
continued with some scheduled system maintenance, and performed a node failover 
with no incident. It wasn't until 20 minutes later that the corrupt disk pages 
started making their presence felt, and by then it was too late. Luckily we 
were still verifying, but with our secondaries ruined, we had to restore from 
backup. A 30-minute outage became a 4-hour one.

Afterwards, we put in a new policy that any crash means a DRBD verify at 
minimum, and until the node passes, it is to be considered invalid and 
unusable. If you haven't already, I suggest something similar for your setup. 
Verify a crashed node before using it again, no matter how much pressure you're 
under. It can always get worse.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com






Re: [GENERAL] Re: corruption issue after server crash - ERROR: unexpected chunk number 0

2013-11-26 Thread Mike Broers
The restore of a post-crash production backup worked as hoped, and the 2nd
replication slave is back in its happy hot standby state.

So if this problem replicated to our standby servers, does that indicate
that the potentially problematic fsync occurred during a pg_xlog write?
Would breaking replication at the time of the crash have prevented this
from cascading, or was it already too late at that point?

Thanks again for the input, it's been very helpful!
Mike




On Mon, Nov 25, 2013 at 12:20 PM, Mike Broers  wrote:

> Thanks Shaun,
>
> I'm planning to schedule a time to do the vacuum freeze suggested
> previously.  So far the extent of the problem seems limited to the one
> session table and the one session row that was being used by a heavy bot
> scan at the time of the crash.  Currently I'm testing a recovery of a
> production backup from today to rebase one of the replication targets that
> I was using to test fixes last week.  Hopefully that validates the current
> backups, and I can proceed with asking our managed services provider
> about the false notification of the disk write and ways to prevent that
> going forward.
>
> I'll update the list if I uncover anything interesting in the process
> and/or need more advice; thanks again for your input - it's much appreciated
> as always.  Nothing like a little crash corruption to get the blood flowing!
>
> Mike
>
>
> On Mon, Nov 25, 2013 at 10:29 AM, Shaun Thomas 
> wrote:
>
>> > Update - I have two hot replication slaves of this db, and both have the
>> > problem.  I took one out of recovery and ran REINDEX TABLE session_session,
>> > and it fixed the errors about this row.  Now I'm going to run vacuum and
>> > see if there are other tables that complain; if so, I'm guessing I will
>> > need to see if there is a way to force vacuum to continue on error - worst
>> > case I might have to script a table-by-table vacuum, I guess.  If anyone
>> > has a better suggestion for determining the extent of the damage, I'd
>> > appreciate it.
>>
>> Oh man. I'm sorry, Mike.
>>
>> One of the cardinal rules I have is to disconnect any replication
>> following a database crash. It's just too easy for damaged replicated rows
>> to be propagated unless you're on 9.3 and have checksums enabled. If you
>> want to perform a  table-by-table check, don't vacuum the database, but the
>> individual tables. I'd go with a DO loop and have it raise notices into the
>> log so you can investigate further:
>>
>> COPY (
>> SELECT 'VACUUM ' || oid::regclass::text || ';'
>>   FROM pg_class
>>  WHERE relkind = 'r'
>> ) to '/tmp/vac_all.sql';
>>
>> Run the /tmp/vac_all.sql through psql and pipe the contents into a log
>> file. Any table that doesn't vacuum successfully will need to be repaired
>> manually. One way you can do this if there are dupes, is by checking the
>> ctid value after disabling index scans:
>>
>> SET enable_indexscan TO False;
>>
>> SELECT ctid, * FROM [broken_table] WHERE ...;
>>
>> Just construct the WHERE clause based on the error output, and you should
>> get all rows if there are dupes. You'll need to figure out which row to
>> keep, then delete the bad row based on the ctid. Do this as many times as
>> it takes, then reindex to make sure the proper row versions are indexed.
>>
>> It's also a good idea to dump any table that came back with an error,
>> just in case.
>>
>> After you've done all of that, you should re-base your replicas once
>> you've determined your production system is usable. In the meantime, I
>> highly recommend you set up a VIP you can assign to one of your replicas if
>> your production system dies again, and remove any autostart code. If your
>> production system crashes, switch the VIP immediately to a replica, and
>> invalidate your old production system. Data corruption is insidious when
>> streaming replication is involved.
>>
>> Look into tools like repmgr to handle managing your replicas as a cluster
>> to make forced invalidation and re-basing easier.
>>
>> Good luck!
>>
>> --
>> Shaun Thomas
>> OptionsHouse | 141 W. Jackson Blvd | Suite 500 | Chicago IL, 60604
>> 312-676-8870
>> stho...@optionshouse.com
>>
>>
>
>


Re: [GENERAL] Re: corruption issue after server crash - ERROR: unexpected chunk number 0

2013-11-25 Thread Mike Broers
Thanks Shaun,

I'm planning to schedule a time to do the vacuum freeze suggested
previously.  So far the extent of the problem seems limited to the one
session table and the one session row that was being used by a heavy bot
scan at the time of the crash.  Currently I'm testing a recovery of a
production backup from today to rebase one of the replication targets that
I was using to test fixes last week.  Hopefully that validates the current
backups, and I can proceed with asking our managed services provider
about the false notification of the disk write and ways to prevent that
going forward.

I'll update the list if I uncover anything interesting in the process
and/or need more advice; thanks again for your input - it's much appreciated
as always.  Nothing like a little crash corruption to get the blood flowing!

Mike


On Mon, Nov 25, 2013 at 10:29 AM, Shaun Thomas wrote:

> > Update - I have two hot replication slaves of this db, and both have the
> > problem.  I took one out of recovery and ran REINDEX TABLE session_session,
> > and it fixed the errors about this row.  Now I'm going to run vacuum and
> > see if there are other tables that complain; if so, I'm guessing I will
> > need to see if there is a way to force vacuum to continue on error - worst
> > case I might have to script a table-by-table vacuum, I guess.  If anyone
> > has a better suggestion for determining the extent of the damage, I'd
> > appreciate it.
>
> Oh man. I'm sorry, Mike.
>
> One of the cardinal rules I have is to disconnect any replication
> following a database crash. It's just too easy for damaged replicated rows
> to be propagated unless you're on 9.3 and have checksums enabled. If you
> want to perform a  table-by-table check, don't vacuum the database, but the
> individual tables. I'd go with a DO loop and have it raise notices into the
> log so you can investigate further:
>
> COPY (
> SELECT 'VACUUM ' || oid::regclass::text || ';'
>   FROM pg_class
>  WHERE relkind = 'r'
> ) to '/tmp/vac_all.sql';
>
> Run the /tmp/vac_all.sql through psql and pipe the contents into a log
> file. Any table that doesn't vacuum successfully will need to be repaired
> manually. One way you can do this if there are dupes, is by checking the
> ctid value after disabling index scans:
>
> SET enable_indexscan TO False;
>
> SELECT ctid, * FROM [broken_table] WHERE ...;
>
> Just construct the WHERE clause based on the error output, and you should
> get all rows if there are dupes. You'll need to figure out which row to
> keep, then delete the bad row based on the ctid. Do this as many times as
> it takes, then reindex to make sure the proper row versions are indexed.
>
> It's also a good idea to dump any table that came back with an error, just
> in case.
>
> After you've done all of that, you should re-base your replicas once
> you've determined your production system is usable. In the meantime, I
> highly recommend you set up a VIP you can assign to one of your replicas if
> your production system dies again, and remove any autostart code. If your
> production system crashes, switch the VIP immediately to a replica, and
> invalidate your old production system. Data corruption is insidious when
> streaming replication is involved.
>
> Look into tools like repmgr to handle managing your replicas as a cluster
> to make forced invalidation and re-basing easier.
>
> Good luck!
>
> --
> Shaun Thomas
> OptionsHouse | 141 W. Jackson Blvd | Suite 500 | Chicago IL, 60604
> 312-676-8870
> stho...@optionshouse.com
>
>


Re: [GENERAL] Re: corruption issue after server crash - ERROR: unexpected chunk number 0

2013-11-25 Thread Shaun Thomas
> Update - I have two hot replication slaves of this db, and both have the problem.
> I took one out of recovery and ran REINDEX TABLE session_session, and it
> fixed the errors about this row.  Now I'm going to run vacuum and see if
> there are other tables that complain; if so, I'm guessing I will need to see
> if there is a way to force vacuum to continue on error - worst case I might
> have to script a table-by-table vacuum, I guess.  If anyone has a better
> suggestion for determining the extent of the damage, I'd appreciate it.

Oh man. I'm sorry, Mike.

One of the cardinal rules I have is to disconnect any replication following a 
database crash. It's just too easy for damaged replicated rows to be propagated 
unless you're on 9.3 and have checksums enabled. If you want to perform a  
table-by-table check, don't vacuum the database, but the individual tables. I'd 
go with a DO loop and have it raise notices into the log so you can investigate 
further:

COPY (
SELECT 'VACUUM ' || oid::regclass::text || ';'
  FROM pg_class
 WHERE relkind = 'r'
) to '/tmp/vac_all.sql';
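
If you'd rather go the DO-loop route, keep in mind that VACUUM itself can't run 
inside a DO block, so a minimal sketch instead forces a full read (and detoast) 
of every table and raises a notice for anything that errors out; the 
length()-based read here is just an illustrative way to touch every column:

DO $$
DECLARE
    tab regclass;
BEGIN
    FOR tab IN SELECT oid::regclass FROM pg_class WHERE relkind = 'r'
    LOOP
        BEGIN
            -- Casting each row to text detoasts every column, so chunk
            -- errors and other read failures surface here.
            EXECUTE format('SELECT sum(length(t::text)) FROM %s AS t', tab);
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'check failed for %: %', tab, SQLERRM;
        END;
    END LOOP;
END
$$;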

Run /tmp/vac_all.sql through psql and pipe the output into a log file. 
Any table that doesn't vacuum successfully will need to be repaired manually. 
One way you can do this, if there are dupes, is by checking the ctid value after 
disabling index scans:

SET enable_indexscan TO False;

SELECT ctid, * FROM [broken_table] WHERE ...;

Just construct the WHERE clause based on the error output, and you should get 
all rows if there are dupes. You'll need to figure out which row to keep, then 
delete the bad row based on the ctid. Do this as many times as it takes, then 
reindex to make sure the proper row versions are indexed.
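
For example, with a placeholder table name and ctids, if the copy at ctid 
(102,4) is the one you decide to keep:

-- Drop the bad duplicate by its physical location.
DELETE FROM broken_table WHERE ctid = '(102,3)';

-- Once only the good row versions remain, rebuild the indexes.
REINDEX TABLE broken_table;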

It's also a good idea to dump any table that came back with an error, just in 
case.
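
A quick way to do that is a server-side copy (superuser only; the path, table 
name, and ctid are placeholders). Any row that still can't be read will make 
the COPY fail, so take the dump after removing the bad rows, or exclude them:

COPY (SELECT * FROM broken_table WHERE NOT (ctid = '(102,3)'))
  TO '/tmp/broken_table_backup.copy';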

After you've done all of that, you should re-base your replicas once you've 
determined your production system is usable. In the meantime, I highly 
recommend you set up a VIP you can assign to one of your replicas if your 
production system dies again, and remove any autostart code. If your production 
system crashes, switch the VIP immediately to a replica, and invalidate your 
old production system. Data corruption is insidious when streaming replication 
is involved.

Look into tools like repmgr to handle managing your replicas as a cluster to 
make forced invalidation and re-basing easier.

Good luck!

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com





[GENERAL] Re: corruption issue after server crash - ERROR: unexpected chunk number 0

2013-11-21 Thread Mike Broers
Update - I have two hot replication slaves of this db, and both have the
problem.  I took one out of recovery and ran REINDEX TABLE session_session,
and it fixed the errors about this row.  Now I'm going to run vacuum and see
if there are other tables that complain; if so, I'm guessing I will need
to see if there is a way to force vacuum to continue on error - worst case I
might have to script a table-by-table vacuum, I guess.  If anyone
has a better suggestion for determining the extent of the damage, I'd
appreciate it.
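
For reference, the command in question was simply the following; REINDEX TABLE 
also rebuilds the index of the table's TOAST relation (pg_toast_19122 in the 
error quoted below), so no separate toast reindex was needed:

REINDEX TABLE session_session;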


On Thu, Nov 21, 2013 at 2:10 PM, Mike Broers  wrote:

> Hello, we are running Postgres 9.2.5 on RHEL6.  Our production server
> crashed hard, and when it came back up our logs were flooded with:
>
> STATEMENT:  SELECT "session_session"."session_key",
> "session_session"."session_data", "session_session"."expire_date",
> "session_session"."nonce" FROM "session_session" WHERE
> ("session_session"."session_key" = 'gk9aap5d7btp6tzquh0kf73gpfmik5w3'  AND
> "session_session"."expire_date" > '2013-11-21 13:27:33.107913' )
>
> ERROR:  unexpected chunk number 0 (expected 1) for toast value 117927127
> in pg_toast_19122
>
> We restarted the application and whatever session was constantly hitting
> that row stopped, but I'm concerned about remediation.  When I attempt to
> read from that row, the error occurs.
>
> select * from session_session where session_key =
> 'gk9aap5d7btp6tzquh0kf73gpfmik5w3';
> ERROR:  unexpected chunk number 0 (expected 1) for toast value 117927127
> in pg_toast_19122
>
> When I attempt to delete this row I get this error:
> delete from session_session where session_key =
> 'gk9aap5d7btp6tzquh0kf73gpfmik5w3';
> ERROR:  tuple concurrently updated
>
> We happen to have a maintenance window tonight, so I will have some time
> when the app is down to run some database fixes.  I saw other threads
> suggesting a reindex of the toast table, but this is a 14GB table and I'm
> not sure how long that will take or if it will even be successful.  We also
> have a full db vacuum/analyze scheduled nightly for 2am, so I am expecting
> to learn if there are other impacted tables, but it's troubling that I don't
> know what the remediation is.  This particular table could be truncated if
> necessary, if that is an option, but I'm not sure about other tables.
>
> Any suggestions for how to handle the "tuple concurrently updated" error? Or
> whether a reindex is likely to help with the unexpected chunk error?
>
> Thanks
> Mike
>