[firebird-support] internal Firebird consistency check (decompression overran buffer (179), file: sqz.cpp line: 239)

n6auei4hz6ypittow3d2xkak432y2oszb2gth...@yahoo.com [firebird-support] Mon, 03 Oct 2016 11:51:08 -0700
Hi FB Support!
 
 
 We have been running FB since IB4.2 days and have only previously had any 
database corruption when we've had hardware issues...until recently, when I've 
seen this error on 2 different servers that I'm reasonably confident are both 
fine! However, our setup is probably fairly unique. I'm not panicking at the 
moment as we recovered all the data from the backup, but as I've had this twice 
in a month on different machines I'm keen to adjust the setup so it doesn't 
happen again.
 
 
 We are running about 20 customers on a server, each with their own set of 
Firebird/PHP-FPM/Nginx/custom python servers running as services bound to 
separate IP addresses – this is (in theory!) to keep each customer isolated 
from the others whilst sharing most of the resources on the box without too 
much overhead from containers or full virtualisation.
 
 
 FB is listening as a tcpsvd service but with a shared lock folder:
 
 
 exec tcpsvd -c 60 -u firebird:firebird -l $remote_bind_addr $remote_bind_addr \
   $remote_bind_port /usr/sbin/fb_inet_server -i -e "/srv/$CUSTOMER/interbase/" \
   -el "/tmp/firebird/tmpfs"
 
 
 and with a “localhost” for our own convenience:
 
 
 exec tcpsvd -c 60 -u firebird:firebird -l 127.0.0.1 127.0.0.1 3050 \
   /usr/sbin/fb_inet_server -i -e "/var/interbase/" -el "/tmp/firebird/tmpfs"
 
 
 I had assumed that, because both services use the shared lock folder, this is 
allowed. 
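
 One way to check that assumption is to list the lock directories the running servers were actually started with. This is a minimal sketch, assuming the `-el` argument appears in each process command line as in the two commands above and that the paths contain no spaces; `lock_dirs` is a hypothetical helper name, not a Firebird tool:

```shell
# Read process command lines on stdin and print each distinct directory
# passed via -el (the lock directory option used in the tcpsvd commands above).
lock_dirs() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "-el") {
                d = $(i + 1)
                gsub(/"/, "", d)   # drop quotes if the line was copied from a script
                print d
            }
        }
    }' | sort -u
}
```

 Fed from something like `ps -e -o args | grep '[f]b_inet_server' | lock_dirs`, a single line of output would mean every server process found was given the same lock directory. As the servers are started per connection, this only sees processes that have a live connection at that moment.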
 
 
 Each night we loop through all the customers and run:
 
 
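 for database in /customers/*.gdb
 do
   echo "Doing $database:"
   # Back up via the localhost service; -g disables garbage collection during the backup
   echo -n "gbak "
   gbak -b -g "localhost:$database" "/backupdata/$database.gbak"
   echo $?
   # Sweep the database
   echo -n "gfix "
   gfix -sweep "localhost:$database"
   echo $?
   # Rebuild the nightly summaries
   echo -n "isql "
   echo "EXECUTE PROCEDURE nightlySQL;" | isql "localhost:$database"
   echo $?
 done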
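
 The loop above only echoes each exit status and carries on regardless. A minimal sketch of a helper that prints the same "label status" line but also hands the status back, so that later steps for a database can be skipped after a failure, might look like this. `run_step` is a hypothetical name, and this would not have caught the case described below, where all three steps returned 0:

```shell
# run_step LABEL COMMAND [ARGS...]
# Runs the command, prints "LABEL STATUS" in the same shape as the nightly
# output, and returns the command's exit status to the caller.
run_step() {
    label=$1
    shift
    "$@"
    status=$?
    echo "$label $status"
    return "$status"
}

# Hypothetical use inside the loop (left commented out so the block has no side effects):
# run_step gbak gbak -b -g "localhost:$database" "/backupdata/$database.gbak" &&
# run_step gfix gfix -sweep "localhost:$database"
```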
 
 
 Whilst this backup is running we don't disable any external access and have a 
mixture of python and php clients accessing throughout the night using the 
$remote_bind_addr:$database connection string rather than localhost:. 
 
 
 Over the weekend a 12GB backup finished successfully, but errors started 
appearing in the firebird2.5.log file at the same time as the next database was 
being backed up – suggesting that either the gfix or isql had tripped up before 
the script moved on. However, the output from the backup was:
 
 
 Doing <database>: 
 gbak 0
 gfix 0
 isql 0
 
 
 The firebird log has the next customer's sweep starting immediately after the 
first “internal Firebird consistency check (decompression overran buffer (179), 
file: sqz.cpp line: 239)” on the now corrupt database which suggests the isql 
line is “to blame” - but the nightlySQL doesn't do a lot, just deletes from a 
table and re-populates a load of summaries from a big table. The firebird log 
error coincides with an NGINX request for the data that the nightlySQL is 
building, but I've repeated this today and it doesn't by itself kill the 
database. I've got nothing else in either syslog or dmesg.
 
 
 I've run IBSurgeon against the file. It gives me the following output:
 
 
 03/10/2016 11:23:53 INFO: Open database files: Z:\home\**-bad.gdb
 
 
 03/10/2016 11:23:53 INFO: Analyzing database low-level structures...
 03/10/2016 11:27:42 INFO: Actual PageCount: 1534112 found in database
 03/10/2016 11:27:42 ERROR: Found 18 undefined or unrecognized pages.
 03/10/2016 11:27:42 INFO: ====== DATABASE IS READY FOR DIAGNOSING AND REPAIRING. ====
 03/10/2016 11:27:42 INFO: ====== Now choose "Diagnose" or "Repair". ====
 03/10/2016 11:51:31 INFO: ------------------- Starting diagnose
 03/10/2016 11:51:31 INFO: Running procedure: Header page check
 03/10/2016 11:51:31 INFO: ODS Major = 11 (32779)
 03/10/2016 11:51:31 INFO: ODS Minor = 2
 03/10/2016 11:51:31 INFO: Next transaction = 13910909
 03/10/2016 11:51:31 INFO: Oldest transaction = 13910907
 03/10/2016 11:51:31 INFO: Oldest active = 13910908
 03/10/2016 11:51:31 INFO: Oldest snapshot = 13910908
 03/10/2016 11:51:31 INFO: PageSize is Ok = 8192
 03/10/2016 11:51:31 INFO: Running procedure: Checking of RDB$Pages consistency
 03/10/2016 11:53:42 INFO: Checking of RDB$Pages consistency: Ok
 03/10/2016 11:53:42 INFO: Running procedure: Low-level check of all relations
 03/10/2016 11:53:43 INFO: Relation RDB$DATABASE (1) is OK
 03/10/2016 11:53:44 INFO: Relation RDB$FIELDS (2) is OK
 03/10/2016 11:53:46 INFO: Relation RDB$INDEX_SEGMENTS (3) is OK
 03/10/2016 11:53:47 INFO: Relation RDB$INDICES (4) is OK
 03/10/2016 11:53:47 INFO: Relation RDB$RELATION_FIELDS (5) is OK
 03/10/2016 11:53:48 INFO: Relation RDB$RELATIONS (6) is OK
 03/10/2016 11:53:48 INFO: Relation RDB$VIEW_RELATIONS (7) is OK
 03/10/2016 11:53:48 INFO: Relation RDB$FORMATS (8) is OK
 03/10/2016 11:53:48 INFO: Relation RDB$SECURITY_CLASSES (9) is OK
 03/10/2016 11:53:48 ERROR: DP#1533855 has wrong rel#:136
 03/10/2016 11:53:48 ERROR: Found 1 record errors on datapage#1533855
 03/10/2016 11:53:48 ERROR: Error on data page #1533855
 03/10/2016 11:53:48 INFO: Pointer page #24 checking: found 1 errors.
 03/10/2016 11:53:48 ERROR: Error in checking relation #10 Found 1 errors.
 03/10/2016 11:53:48 ERROR: Relation RDB$FILES (10) is CORRUPT
 03/10/2016 11:53:48 INFO: Relation RDB$TYPES (11) is OK
 
 
 and then everything after that is OK until the end…
 
 
 03/10/2016 12:21:03 INFO: Relation SYSRELATIONSHIP (487) is OK
 03/10/2016 12:21:03 ERROR: All relations check found 1 errors.
 03/10/2016 12:21:03 INFO: ------------------- Finished diagnose--------
 =============== !!!!!!!!!!! ==================
 
 
 On the smaller database the reported error was:
 
 
 03/10/2016 16:05:06 INFO: Relation RDB$SECURITY_CLASSES (9) is OK
 03/10/2016 16:05:06 ERROR: DP#23043 has wrong rel#:234
 03/10/2016 16:05:06 ERROR: Found 1 record errors on datapage#23043
 03/10/2016 16:05:06 ERROR: Error on data page #23043
 03/10/2016 16:05:06 INFO: Pointer page #24 checking: found 1 errors.
 03/10/2016 16:05:06 ERROR: Error in checking relation #10 Found 1 errors.
 03/10/2016 16:05:06 ERROR: Relation RDB$FILES (10) is CORRUPT
 03/10/2016 16:05:06 INFO: Relation RDB$TYPES (11) is OK
 
 
 which looks to me like the same problem.
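
 To compare the two diagnose runs side by side, the error lines can be reduced to just the page and relation identifiers. This is a rough sketch that assumes the log lines have exactly the shape shown in the excerpts above; `summarise_errors` is a hypothetical helper, not part of IBSurgeon:

```shell
# Read a diagnose log on stdin and print one short line per damaged item:
#   "DP#<page> rel#<n>"        for "has wrong rel#" lines
#   "<relation> (<id>) CORRUPT" for relations reported as corrupt
summarise_errors() {
    sed -n \
        -e 's/^.*ERROR: \(DP#[0-9]*\) has wrong rel#:\([0-9]*\).*$/\1 rel#\2/p' \
        -e 's/^.*ERROR: Relation \([^ ]*\) (\([0-9]*\)) is CORRUPT.*$/\1 (\2) CORRUPT/p'
}
```

 Run over both logs, this makes it easy to see that the data page number and rel# differ between the two databases, while pointer page #24 and relation RDB$FILES (10) are reported in both.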
 
 
 I'm open to the idea that we should always be using the same connection string 
to access the same database, but I was expecting the shared lock folder to do 
everything I wanted in a classic world. 
 
 
 Apart from that, does anyone have any idea what might be happening? 
 
 
 Is there any way to get back from the error above without having to reach for 
the last backup?
 
 
 Thanks in advance,
 
 
 Ian