Re: [firebird-support] internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 239)

Alexey Kovyazin a...@ib-aid.com [firebird-support] Tue, 04 Oct 2016 03:09:17 -0700

Hi Ian,

It seems that you have a multi-volume database, is that correct?


Regards,
Alexey Kovyazin
IBSurgeon

Hi FB Support!
We have been running FB since IB4.2 days and have only previously had any database corruption when we've had hardware issues... until recently, when I've seen this error on 2 different servers that I'm reasonably confident are both fine! However, our setup is probably fairly unique. I'm not panicking at the moment as we recovered all the data from the backup, but as I've had this twice in a month on different machines I'm keen to adjust the setup so it doesn't happen again.
We are running about 20 customers on a server, each with their own set of Firebird/PHP-FPM/Nginx/custom python servers running as services bound to separate IP addresses – this is (in theory!) to keep each customer isolated from the others whilst sharing most of the resources on the box without too much overhead from containers or full virtualisation.
FB is listening as a tcpsvd service but with a shared lock folder:
exec tcpsvd -c 60 -u firebird:firebird -l $remote_bind_addr $remote_bind_addr $remote_bind_port /usr/sbin/fb_inet_server -i -e "/srv/$CUSTOMER/interbase/" -el "/tmp/firebird/tmpfs"
and with a “localhost” for our own convenience:
exec tcpsvd -c 60 -u firebird:firebird -l 127.0.0.1 127.0.0.1 3050 /usr/sbin/fb_inet_server -i -e "/var/interbase/" -el "/tmp/firebird/tmpfs"
I had assumed that, because they were using the shared lock folder, this is allowed.
Each night we loop through all the customers and run:

for database in /customers/*.gdb
do
    echo "Doing $database:"
    echo -n "gbak "
    gbak -b -g localhost:$database /backupdata/$database.gbak
    echo $?
    echo -n "gfix "
    gfix -sweep localhost:$database
    echo $?
    echo -n "isql "
    echo "EXECUTE PROCEDURE nightlySQL;" | isql localhost:$database
    echo $?
done
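A minimal sketch of the same per-database sequence, restructured so that a failing step is reported and the later steps are skipped. The function names `run_step` and `nightly_for_database` are hypothetical, not part of the original script. The `-bail` switch is included on the assumption that isql otherwise returns 0 even when a statement fails, which would make the "isql 0" line uninformative; that behaviour should be checked against the installed 2.5 build.

```shell
# run_step LABEL CMD [ARGS...]
# Runs CMD, prints "LABEL <exit status>" in the same format as the
# original script, and returns that exit status to the caller.
run_step() {
    step_label=$1
    shift
    "$@"
    step_status=$?
    printf '%s %s\n' "$step_label" "$step_status"
    return "$step_status"
}

# nightly_for_database PATH
# Backup, sweep, then the nightly procedure; stops at the first failing step.
nightly_for_database() {
    db=$1
    echo "Doing $db:"
    run_step gbak gbak -b -g "localhost:$db" "/backupdata/$db.gbak" || return 1
    run_step gfix gfix -sweep "localhost:$db" || return 1
    # Assumption: -bail makes isql stop and exit non-zero on the first SQL error.
    echo "EXECUTE PROCEDURE nightlySQL;" \
        | run_step isql isql -bail "localhost:$db" || return 1
}
```

The nightly loop would then call `nightly_for_database "$database"` for each file matched by /customers/*.gdb.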
Whilst this backup is running we don't disable any external access and have a mixture of python and php clients accessing throughout the night using the $remote_bind_addr:$database connection string rather than localhost:.
Over the weekend a 12GB backup finished successfully, but errors started appearing in the firebird2.5.log file at the same time as the next database was being backed up – suggesting that either the gfix or isql had tripped up before the script moved on. However, the output from the backup was:
Doing <database>:
gbak 0
gfix 0
isql 0
The firebird log has the next customer's sweep starting immediately after the first “internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 239)” on the now corrupt database which suggests the isql line is “to blame” - but the nightlySQL doesn't do a lot, just deletes from a table and re-populates a load of summaries from a big table. The firebird log error coincides with an NGINX request for the data that the nightlySQL is building, but I've repeated this today and it doesn't by itself kill the database. I've got nothing else in either syslog or dmesg.
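To line the consistency-check entries up against the backup script's timing, the relevant log entries can be pulled out together with their timestamp lines. This is a rough sketch that assumes entries in firebird2.5.log are separated by blank lines, with the host and timestamp on the first line of each entry; the function name `fb_log_bugchecks` is hypothetical.

```shell
# fb_log_bugchecks LOGFILE
# Prints every blank-line-separated log entry that mentions an internal
# consistency check, keeping the entry's header (host and timestamp) line.
fb_log_bugchecks() {
    # RS="" puts awk in paragraph mode: each blank-line-separated block
    # is treated as one record, so the whole entry is matched and printed.
    awk 'BEGIN { RS = ""; ORS = "\n\n" }
         /internal Firebird consistency check/' "$1"
}
```

Called with the path to the log file, it prints only the matching entries, which can then be compared with the times at which each database's gbak, gfix and isql steps ran.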
I've run IBSurgeon against the file. It gives me the following output:


03/10/2016 11:23:53 INFO: Open database files: Z:\home\**-bad.gdb
03/10/2016 11:23:53 INFO: Analyzing database low-level structures...
03/10/2016 11:27:42 INFO: Actual PageCount: 1534112 found in database
03/10/2016 11:27:42 ERROR: Found 18 undefined or unrecognized pages.
03/10/2016 11:27:42 INFO: ====== DATABASE IS READY FOR DIAGNOSING AND REPAIRING. ====
03/10/2016 11:27:42 INFO: ====== Now choose "Diagnose" or "Repair". ====

03/10/2016 11:51:31 INFO: ------------------- Starting diagnose
03/10/2016 11:51:31 INFO: Running procedure: Header page check
03/10/2016 11:51:31 INFO: ODS Major = 11 (32779)
03/10/2016 11:51:31 INFO: ODS Minor = 2
03/10/2016 11:51:31 INFO: Next transaction = 13910909
03/10/2016 11:51:31 INFO: Oldest transaction = 13910907
03/10/2016 11:51:31 INFO: Oldest active = 13910908
03/10/2016 11:51:31 INFO: Oldest snapshot = 13910908
03/10/2016 11:51:31 INFO: PageSize is Ok = 8192
03/10/2016 11:51:31 INFO: Running procedure: Checking of RDB$Pages consistency
03/10/2016 11:53:42 INFO: Checking of RDB$Pages consistency: Ok
03/10/2016 11:53:42 INFO: Running procedure: Low-level check of all relations
03/10/2016 11:53:43 INFO: Relation RDB$DATABASE (1) is OK
03/10/2016 11:53:44 INFO: Relation RDB$FIELDS (2) is OK
03/10/2016 11:53:46 INFO: Relation RDB$INDEX_SEGMENTS (3) is OK
03/10/2016 11:53:47 INFO: Relation RDB$INDICES (4) is OK
03/10/2016 11:53:47 INFO: Relation RDB$RELATION_FIELDS (5) is OK
03/10/2016 11:53:48 INFO: Relation RDB$RELATIONS (6) is OK
03/10/2016 11:53:48 INFO: Relation RDB$VIEW_RELATIONS (7) is OK
03/10/2016 11:53:48 INFO: Relation RDB$FORMATS (8) is OK
03/10/2016 11:53:48 INFO: Relation RDB$SECURITY_CLASSES (9) is OK
03/10/2016 11:53:48 ERROR: DP#1533855 has wrong rel#:136
03/10/2016 11:53:48 ERROR: Found 1 record errors on datapage#1533855
03/10/2016 11:53:48 ERROR: Error on data page #1533855
03/10/2016 11:53:48 INFO: Pointer page #24 checking: found 1 errors.
03/10/2016 11:53:48 ERROR: Error in checking relation #10 Found 1 errors.
03/10/2016 11:53:48 ERROR: Relation RDB$FILES (10) is CORRUPT
03/10/2016 11:53:48 INFO: Relation RDB$TYPES (11) is OK


and then everything after that is OK until the end…


03/10/2016 12:21:03 INFO: Relation SYSRELATIONSHIP (487) is OK
03/10/2016 12:21:03 ERROR: All relations check found 1 errors.
03/10/2016 12:21:03 INFO: ------------------- Finished diagnose--------
=============== !!!!!!!!!!! ==================


On the smaller database the reported error was:


03/10/2016 16:05:06 INFO: Relation RDB$SECURITY_CLASSES (9) is OK
03/10/2016 16:05:06 ERROR: DP#23043 has wrong rel#:234
03/10/2016 16:05:06 ERROR: Found 1 record errors on datapage#23043
03/10/2016 16:05:06 ERROR: Error on data page #23043
03/10/2016 16:05:06 INFO: Pointer page #24 checking: found 1 errors.
03/10/2016 16:05:06 ERROR: Error in checking relation #10 Found 1 errors.
03/10/2016 16:05:06 ERROR: Relation RDB$FILES (10) is CORRUPT
03/10/2016 16:05:06 INFO: Relation RDB$TYPES (11) is OK


which looks to me like the same problem.
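One rough way to check that the two reports really have the same shape is to reduce each saved diagnose log to its ERROR lines, with timestamps stripped and page and relation numbers masked, and then compare the results. The sketch below assumes each report has been saved to a text file in the format shown above; the function name `ibsurgeon_error_shapes` is hypothetical.

```shell
# ibsurgeon_error_shapes REPORT
# Prints the ERROR lines of a saved diagnose report with the leading
# "DD/MM/YYYY HH:MM:SS " timestamp removed and every "#<number>" or
# "#:<number>" replaced by "#N", so two reports can be compared with diff.
ibsurgeon_error_shapes() {
    grep 'ERROR:' "$1" \
        | sed -e 's/^[0-9/]* [0-9:]* //' \
              -e 's/#:\{0,1\}[0-9][0-9]*/#N/g'
}
```

If the masked output for the large and small databases matches line for line over the same part of the report, that supports reading both as the same kind of damage on a data page reached through pointer page #24 of RDB$FILES.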
I'm open to the idea that we should always be using the same connection string to access the same database, but I was expecting the shared lock folder to do everything I wanted in a classic world.
Apart from that, does anyone have any idea what might be happening?
Is there any way to get back from the error above without having to reach for the last backup?
Thanks in advance,


Ian
