Hi,

12.10.2007 10:32, Henning P. Schmiedehausen wrote:
> Hi,
>
> since a few days I'm trying to migrate an existing backup
> infrastructure to Bacula 2.2.4. I have the pleasure of a fairly
> uniform OS landscape (mainly CentOS 5 with a few Fedora 7 and the odd
> Windows XP box thrown in) and getting it to run took some time but
> basically everything is in order now and chugging along well with TLS
> and everything.
>
> This is a 2.2.4 installation with all patches for 2.2.4 applied, so
> there should be no grave danger to my data.
>
> I am using a DAT tape drive for testing which can hold 12 G and a
> spooling disk with 50 G before it to make sure that I can run nightly
> jobs in parallel. This setup seems to work fine.
>
> I noticed a few things:
>
> The sd is *very* sensitive to disk full errors in its working
> directory.

Indeed.

> I had /var/spool/bacula on a partition that filled up and
> the sd was dead. It still accepted connections (visible with tcpdump)
> but never answered. Even clearing the disk full condition did not
> recover it. This is not the spooling area for the tape but the working
> directory of bacula where it puts its attr files (this is what filled
> up the disk).

Yes, the attribute spool can grow quite big... the problem you found
should probably be clearly noted in the manual, but otherwise, it's
something to be solved administratively, IMO.

> Restarting the sd was a bad idea. It rewound the tape and then told me:
>
> 12-Oct 09:10 bacula-dir: Start Backup JobId 18,
> Job=smalljob.2007-10-12_03.05.01
> 12-Oct 09:10 bacula-dir: Using Device "DAT72Storage"
> 12-Oct 09:10 bacula-sd: Volume "backup-0002" previously written, moving to
> end of data.
> 12-Oct 09:11 bacula-sd: smalljob.2007-10-12_03.05.01 Error: Bacula cannot
> write on tape Volume "backup-0002" because:
> The number of files mismatch! Volume=4 Catalog=3
> 12-Oct 09:11 bacula-sd: Marking Volume "backup-0002" in Error in Catalog.
>
> This error condition left a massive amount of half written files in
> both the working directory and the storage area:
...
> As the corresponding jobs have failed: will the files be cleaned up?

IIRC, on restart of Bacula they are removed. You can safely delete them
manually, too.

> Or better, is there a way to re-start the jobs at the point where they
> broke off?

No.

> Another question is the communication between the fd, dir and
> sd. While the director triggers the actual backups, it just tells the
> fd and sd to connect and then just waits for them to report back,
> doesn't it?

Yes. But it's waiting for vital information, namely the end-of-job
report from both daemons. The SD, by the way, sends the attributes to
the DIR while the job runs - unless they are spooled, too.

> What if the director crashes or gets restarted for any reason?
> Does it "sync up" again by polling the configured sd and fd daemons
> for their state?

No. As it doesn't know about the state of jobs running when it crashed,
it can't. As the FD and SD don't actively connect to the DIR, they can't
inform it of running jobs when they notice the DIR is gone.

> I noticed that if a daemon crashes, dir and fd might be in
> disagreement over job states:

True, that can happen, but is usually not a serious problem.

> * stat dir
> [...]
> Terminated Jobs:
>  JobId  Level      Files      Bytes  Status  Finished         Name
> ====================================================================
>     23  Incr         208    80.90 M  OK      12-Oct-07 03:10  server1
>     28  Incr           0          0  Error   12-Oct-07 03:26  ... another server ...
>     24  Full           0          0  Error   12-Oct-07 09:09  ... another server ...
>     15  Full     120,975    27.59 G  Error   12-Oct-07 09:10  ... another server ...
>     27  Full       8,864    175.2 M  Error   12-Oct-07 09:10  ... another server ...
>     22  Full      87,658    17.44 G  Error   12-Oct-07 09:10  ... another server ...
>     20  Full     477,251    2.573 G  Error   12-Oct-07 09:10  ... another server ...
>     17  Full     185,325    1.709 G  Error   12-Oct-07 09:10  ... another server ...
>     18  Incr           0          0  Cancel  12-Oct-07 09:13  ... another server ...
>     29  Full           1    47.52 M  OK      12-Oct-07 09:16  BackupCatalog
>
> * stat stor
> Terminated Jobs:
>  JobId  Level      Files      Bytes  Status  Finished         Name
> ===================================================================
>      2  Full           1    36.90 K  OK      11-Oct-07 12:24  BackupCatalog
>      3  Full           0          0  Error   11-Oct-07 12:25  BackupCatalog
>      4  Full           1    37.45 K  OK      11-Oct-07 12:26  BackupCatalog
>      5  Full      54,063    582.5 M  OK      11-Oct-07 12:38  ... another server ...
>      7  Full          79    43.16 M  OK      11-Oct-07 12:56  ... another server ...
>      6  Full      44,691    1.519 G  OK      11-Oct-07 13:14  ... another server ...
>      8  Full          79    43.16 M  OK      11-Oct-07 13:14  ... another server ...
>      9  Full          31    88.97 M  OK      11-Oct-07 13:15  ... another server ...
>     18  Incr           0          0  Cancel  12-Oct-07 09:13  ... another server ...
>     29  Full           1    47.52 M  OK      12-Oct-07 09:16  BackupCatalog
> ====
>
> So while the sd knows about jobs 18 and 29, it never got the record
> for 23. Huh? I thought that this might be related to the filled up
> disk, but job 23 was recorded as "OK" at the director.

The "Terminated Jobs" output is only informational and represents what
the daemon stored locally. The information that really matters is in the
catalog: if you query it, you see the data that was actually processed.

> I also ran a job on another server which finished ok:
>
> *stat client=server2
> Connecting to Client server2 at server2:9102
>
> server2 Version: 2.2.4 (14 September 2007) i686-redhat-linux-gnu redhat
> Daemon started 11-Oct-07 12:32, 2 Jobs run since started.
> Heap: heap=2,039,808 smbytes=149,668 max_bytes=562,938 bufs=84 max_bufs=1,529
> Sizeof: boffset_t=8 size_t=4 debug=1 trace=0
>
> Running Jobs:
> Director connected at: 12-Oct-07 10:18
> No Jobs running.
> ====
>
> Terminated Jobs:
>  JobId  Level      Files      Bytes  Status  Finished         Name
> ======================================================================
>     10  Full      43,608    541.6 M  OK      11-Oct-07 13:50  server2
>
> So on Oct 11th at 13:50, this job ran ok to the sd. But it does not
> show up on the sd log? (The director has it no longer in its "last
> jobs" list).

As above, the SD output doesn't matter that much. If the job is marked
as T (terminated OK) in the catalog, the data is safe.

> I like the fact that there is a native Windows client, that I can
> schedule different jobs and that tape schedules etc. are much more
> flexible than our current backup solution (amanda, I'm moving from "a"
> to "b").

Oh... what would be "c"? ;-)

> Getting bacula to run was a bit harder than amanda, mainly because the
> docs are a bit incomplete especially in regards to TLS.
> It took me a while to know that I had to add
>
> TLS Enable = yes
> TLS Require = yes
> TLS CA Certificate Dir = /etc/pki/tls/certs
>
> to the FileDaemon section of each client so that it will talk to the
> SD using TLS, the example on
> http://www.bacula.org/dev-manual/Bacula_TLS_Communication.html does
> not mention this (it is clear, once you figured out how the certificates
> are used).

Hmm... any suggestions for the manual?

> The fact that the sd is very sensitive to disk full conditions is a
> bit alarming to me. Maybe bacula should do disk space checks for the
> spooling and work directories and ensure that it leaves a few percent
> of the disk free in any circumstances.

A few years ago, I discussed that with Kern, and he stated that such a
check a) is difficult to implement, b) is not foolproof, as other
processes might fill the partition independently of Bacula, c) would
require quite a bit of work, and d) addresses something that, provided
the administrator takes care, is not a serious problem. You can guess
his conclusion :-)

Arno

> Best regards
> Henning
>

-- 
Arno Lehmann
IT-Service Lehmann
www.its-lehmann.de

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
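
On handling the disk-full sensitivity administratively, as discussed in
the thread: the following is only a minimal sketch of a free-space check
that could be run (for example from cron) before the nightly jobs. It is
not part of Bacula. The helper names free_fraction and low_space and the
10% threshold are assumptions made for illustration; only the path
/var/spool/bacula comes from the report above, and the data spool
directory is site-specific and would have to be added by hand.

```python
import os
import shutil
import sys


def free_fraction(path):
    """Return the free fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def low_space(paths, min_free=0.10):
    """Return the paths whose filesystem has less than min_free free space.

    min_free=0.10 (10%) is an arbitrary illustrative threshold.
    """
    return [p for p in paths if free_fraction(p) < min_free]


if __name__ == "__main__":
    # Working directory from the report; add the spool directory as needed.
    candidates = ["/var/spool/bacula"]
    existing = [p for p in candidates if os.path.isdir(p)]
    for p in low_space(existing):
        print(f"WARNING: less than 10% free on {p}", file=sys.stderr)
```

As point b) in the thread notes, a check like this cannot stop another
process from filling the partition between the check and the backup run,
so it only reduces the risk.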