I have a client who is experiencing intermitent database problems.
We are not sure of the cause. The symptoms are:
17-Aug 20:22 system100-dir: Start Backup JobId 1352, Job=store10.2007-
08-17_20.20.06
17-Aug 20:22 system100-sd: Spooling data ...
18-Aug 01:58 system100-sd: User specified spool size reached.
18-Aug 01:58 system100-sd: Writing spooled data to Volume. Despooling
88,832,184,162 bytes ...
18-Aug 03:17 system100-sd: Despooling elapsed time = 01:18:31,
Transfer rate = 18.85 M bytes/second
18-Aug 03:17 system100-sd: Spooling data again ...
18-Aug 09:54 system100-sd: User specified spool size reached.
18-Aug 09:54 system100-sd: Writing spooled data to Volume. Despooling
88,832,184,971 bytes ...
18-Aug 10:12 system100-sd: End of Volume "AHP660L3" at 596:9589 on
device "NEO 2000" (/dev/nsa0). Write of 64512 bytes got 0.
18-Aug 10:12 system100-sd: Re-read of last block succeeded.
18-Aug 10:12 system100-sd: End of medium on Volume "AHP660L3"
Bytes=576,881,077,248 Blocks=8,942,228 at 18-Aug-2007 10:12.
18-Aug 10:13 system100-sd: 3307 Issuing autochanger "unload slot 1,
drive 0" command.
18-Aug 10:14 system100-sd: 3304 Issuing autochanger "load slot 16,
drive 0" command.
18-Aug 10:14 system100-sd: 3305 Autochanger "load slot 16, drive 0",
status is OK.
18-Aug 10:14 system100-sd: 3301 Issuing autochanger "loaded? drive 0"
command.
18-Aug 10:14 system100-sd: 3302 Autochanger "loaded? drive 0", result
is Slot 16.
18-Aug 10:15 system100-sd: Volume "AHP675L3" previously written,
moving to end of data.
18-Aug 10:15 system100-sd: Ready to append to end of Volume
"AHP675L3" at file=102.
18-Aug 10:15 system100-sd: New volume "AHP675L3" mounted on device
"NEO 2000" (/dev/nsa0) at 18-Aug-2007 10:15.
18-Aug 11:10 system100-sd: Despooling elapsed time = 01:12:54,
Transfer rate = 20.30 M bytes/second
18-Aug 11:10 system100-sd: Spooling data again ...
18-Aug 11:25 system100-sd: Job write elapsed time = 15:00:07,
Transfer rate = 3.396 M bytes/second
18-Aug 11:25 system100-sd: Committing spooled data to Volume
"AHP675L3". Despooling 6,057,078,796 bytes ...
18-Aug 11:30 system100-sd: Despooling elapsed time = 00:04:55,
Transfer rate = 20.53 M bytes/second
18-Aug 11:30 system100-sd: Sending spooled attrs to the Director.
Despooling 1,325,598,167 bytes ...
18-Aug 15:13 system100-dir: store10.2007-08-17_20.20.06 Fatal error:
sql_create.c:751 sql_create.c:751 insert INSERT INTO File
(FileIndex,JobId,PathId,FilenameId,LS
tat,MD5) VALUES (2170218,1352,189086,44596933,'QT GYA0u IGA B fS fS
Z5/TY CjY EAA Y BGFlEy BGFlEy BGFlE1 A A C','2RFcBjCxy4nhE/kbqDD/dQ')
failed:
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
18-Aug 15:13 system100-dir: sql_create.c:751 INSERT INTO File
(FileIndex,JobId,PathId,FilenameId,LStat,MD5) VALUES
(2170218,1352,189086,44596933,'QT GYA0u IGA B fS
fS Z5/TY CjY EAA Y BGFlEy BGFlEy BGFlE1 A A
C','2RFcBjCxy4nhE/kbqDD/dQ')
18-Aug 15:13 system100-dir: store10.2007-08-17_20.20.06 Fatal error:
sql_create.c:753 Create db File record INSERT INTO File
(FileIndex,JobId,PathId,FilenameId,LSta
t,MD5) VALUES (2170218,1352,189086,44596933,'QT GYA0u IGA B fS fS
Z5/TY CjY EAA Y BGFlEy BGFlEy BGFlE1 A A C','2RFcBjCxy4nhE/kbqDD/dQ')
failed. ERR=server closed th
e connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
I think we'll all agree that 113,000 lines in the log file with
similar error messages is not very useful to anyone. ;)
I'd like to discuss better handling of this situation. At present,
processing does not terminate and all subsequent database
interactions generate another error. The log swells with errors
similar to the above. I think we can do better.
I propose: when this type of error occurs, retry the connection
(perhaps a few times, sleeping in between). When that fails,
terminate the job, don't continue spooling attributes. The backup
data will be there on the Volume and the Catalog will be incomplete.
Notify the user accordingly. If necessary, a bscan can be done.
Points to consider:
1 - Does this propose change make sense? Any objections/suggestions?
2 - is this relatively easy to implement? I am not familiar with
this part of the code, but am willing to implement it.
--
Dan Langille - http://www.langille.org/
Available for hire: http://www.freebsddiary.org/dan_langille.php
-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel