Hi,

12.10.2007 10:32, Henning P. Schmiedehausen wrote:
> Hi,
> 
> for a few days I've been trying to migrate an existing backup
> infrastructure to Bacula 2.2.4. I have the pleasure of a fairly
> uniform OS landscape (mainly CentOS 5 with a few Fedora 7 and the odd
> Windows XP box thrown in) and getting it to run took some time but
> basically everything is in order now and chugging along well with TLS
> and everything.
> 
> This is a 2.2.4 installation with all patches for 2.2.4 applied, so
> there should be no grave danger to my data.
> 
> I am using a DAT tape drive for testing which can hold 12 G and a
> spooling disk with 50 G before it to make sure that I can run nightly
> jobs in parallel. This setup seems to work fine.
> 
> I noticed a few things:
> 
> The sd is *very* sensitive to disk full errors in its working
> directory.

Indeed.

> I had /var/spool/bacula on a partition that filled up and
> the sd was dead. It still accepted connections (visible with tcpdump)
> but never answered. Even clearing the disk full condition did not
> recover it. This is not the spooling area for the tape but the working
> directory of bacula where it puts its attr files (this is what filled
> up the disk).

Yes, the attribute spool can grow quite big... the problem you found 
should probably be clearly noted in the manual, but otherwise, it's 
something to be solved administratively, IMO.
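
As a rough sketch of what "administratively" could mean here: a small Python check, run from cron or a monitoring system, can warn before the working directory fills up. The function names and the 10% threshold are my own choices for illustration, not anything Bacula provides; /var/spool/bacula is simply the path from your report.

```python
import shutil


def free_fraction(path):
    """Return the free fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def has_headroom(path, min_free=0.10):
    """True if at least `min_free` of the filesystem is still free.

    The 10% default is an arbitrary assumption; size it to the largest
    attribute spool you expect from your jobs.
    """
    return free_fraction(path) >= min_free


# Hypothetical usage from a cron job:
#     if not has_headroom("/var/spool/bacula"):
#         print("WARNING: Bacula working directory is running out of space")
```

This only warns; it does not stop other processes from filling the partition, so a dedicated filesystem for the working directory is still the safer setup.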

> Restarting the sd was a bad idea. It rewound the tape and then told me:
> 
> 12-Oct 09:10 bacula-dir: Start Backup JobId 18, Job=smalljob.2007-10-12_03.05.01
> 12-Oct 09:10 bacula-dir: Using Device "DAT72Storage"
> 12-Oct 09:10 bacula-sd: Volume "backup-0002" previously written, moving to end of data.
> 12-Oct 09:11 bacula-sd: smalljob.2007-10-12_03.05.01 Error: Bacula cannot write on tape Volume "backup-0002" because:
> The number of files mismatch! Volume=4 Catalog=3
> 12-Oct 09:11 bacula-sd: Marking Volume "backup-0002" in Error in Catalog.
> 
> This error condition left a massive amount of half written files in
> both the working directory and the storage area:
...
> As the corresponding jobs have failed: will the files be cleaned up?

IIRC, on restart of Bacula they are removed. You can safely delete 
them manually, too.

> Or better, is there a way to re-start the jobs at the point where they
> broke off?

No.

> Another question is the communication between the fd, dir and
> sd. While the director triggers the actual backups, it just tells the
> fd and sd to connect and then just waits for them to report back,
> doesn't it?

Yes. But it's waiting for vital information, namely the end-of-job 
report from both daemons. The SD, by the way, sends the attributes to 
the DIR while the job runs - unless they are spooled, too.

> What if the director crashes or gets restarted for any reason? Does it
> "sync up" again by polling the configured sd and fd daemons for their
> state?

No. As it doesn't know about the state of jobs running when it 
crashed, it can't. As the FD and SD don't actively connect to the DIR, 
they can't inform it of running jobs when they notice the DIR is gone.

> I noticed that if a daemon crashes, dir and fd might be in
> disagreement over job states:

True, that can happen, but is usually not a serious problem.

> * stat dir
> [...]
> Terminated Jobs:
>  JobId  Level    Files      Bytes   Status   Finished        Name 
> ====================================================================
>     23  Incr        208    80.90 M  OK       12-Oct-07 03:10 server1
>     28  Incr          0         0   Error    12-Oct-07 03:26 ... another server ...
>     24  Full          0         0   Error    12-Oct-07 09:09 ... another server ...
>     15  Full    120,975    27.59 G  Error    12-Oct-07 09:10 ... another server ...
>     27  Full      8,864    175.2 M  Error    12-Oct-07 09:10 ... another server ...
>     22  Full     87,658    17.44 G  Error    12-Oct-07 09:10 ... another server ...
>     20  Full    477,251    2.573 G  Error    12-Oct-07 09:10 ... another server ...
>     17  Full    185,325    1.709 G  Error    12-Oct-07 09:10 ... another server ...
>     18  Incr          0         0   Cancel   12-Oct-07 09:13 ... another server ...
>     29  Full          1    47.52 M  OK       12-Oct-07 09:16 BackupCatalog
> 
> * stat stor
> Terminated Jobs:
>  JobId  Level    Files      Bytes   Status   Finished        Name 
> ===================================================================
>      2  Full          1    36.90 K  OK       11-Oct-07 12:24 BackupCatalog
>      3  Full          0         0   Error    11-Oct-07 12:25 BackupCatalog
>      4  Full          1    37.45 K  OK       11-Oct-07 12:26 BackupCatalog
>      5  Full     54,063    582.5 M  OK       11-Oct-07 12:38 ... another server ...
>      7  Full         79    43.16 M  OK       11-Oct-07 12:56 ... another server ...
>      6  Full     44,691    1.519 G  OK       11-Oct-07 13:14 ... another server ...
>      8  Full         79    43.16 M  OK       11-Oct-07 13:14 ... another server ...
>      9  Full         31    88.97 M  OK       11-Oct-07 13:15 ... another server ...
>     18  Incr          0         0   Cancel   12-Oct-07 09:13 ... another server ...
>     29  Full          1    47.52 M  OK       12-Oct-07 09:16 BackupCatalog
> ====
> 
> So while the sd knows about jobs 18 and 29, it never got the record
> for 23. Huh? I thought that this might be related to the filled up
> disk, but job 23 was recorded as "OK" at the director. 

The "Terminated Jobs" output is only informational and represents what 
the daemon stored locally.

The more important stuff is in the catalog. If you query it, you get 
the information about the data actually processed, and that is what 
really matters.

> I also ran a job on another server which finished ok:
> 
> *stat client=server2
> Connecting to Client server2 at server2:9102
> 
> server2 Version: 2.2.4 (14 September 2007)  i686-redhat-linux-gnu redhat 
> Daemon started 11-Oct-07 12:32, 2 Jobs run since started.
>  Heap: heap=2,039,808 smbytes=149,668 max_bytes=562,938 bufs=84 max_bufs=1,529
>  Sizeof: boffset_t=8 size_t=4 debug=1 trace=0
> 
> Running Jobs:
> Director connected at: 12-Oct-07 10:18
> No Jobs running.
> ====
> 
> Terminated Jobs:
>  JobId  Level    Files      Bytes   Status   Finished        Name 
> ======================================================================
>     10  Full     43,608    541.6 M  OK       11-Oct-07 13:50 server2
> 
> So on Oct 11th at 13:50, this job ran ok to the sd. But it does not
> show up on the sd log? (The director has it no longer in its "last
> jobs" list).

As above, the SD output doesn't matter that much. If the job is marked 
as T (terminated OK) in the catalog, the data is safe.
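
To illustrate the kind of catalog check I mean, here is a minimal Python sketch. It assumes a `Job` table with `JobId`, `Name` and `JobStatus` columns, and it runs against a throwaway in-memory SQLite table rather than your real catalog, so `demo_catalog` and `job_status` are purely hypothetical helpers. Apart from T, the status letters are listed from memory; please check them against the manual before relying on them.

```python
import sqlite3

# Status letters: "T" is the one discussed above; the rest are assumptions
# from memory and should be verified against the Bacula documentation.
STATUS_TEXT = {
    "T": "terminated OK",
    "E": "terminated with errors",
    "f": "fatal error",
    "A": "canceled",
}


def job_status(conn, job_id):
    """Return (name, status_code, description) for a JobId, or None if unknown."""
    row = conn.execute(
        "SELECT Name, JobStatus FROM Job WHERE JobId = ?", (job_id,)
    ).fetchone()
    if row is None:
        return None
    name, status = row
    return name, status, STATUS_TEXT.get(status, "other")


def demo_catalog():
    """Build an in-memory stand-in for the catalog (not the real schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Job (JobId INTEGER PRIMARY KEY, Name TEXT, JobStatus TEXT)"
    )
    conn.executemany(
        "INSERT INTO Job VALUES (?, ?, ?)",
        [(10, "server2", "T"), (18, "smalljob", "A")],
    )
    return conn
```

With the demo data, `job_status(demo_catalog(), 10)` reports the server2 job as terminated OK, which is the condition that tells you the data is safe regardless of what the SD status output lists.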

> I like the fact that there is a native Windows client, that I can
> schedule different jobs and that tape schedules etc. are much more
> flexible than our current backup solution (amanda, I'm moving from "a"
> to "b").

Oh... what would be "c"? ;-)

> Getting bacula to run was a bit harder than amanda, mainly because the
> docs are a bit incomplete especially in regards to TLS. It took me a
> while to know that I had to add
> 
>   TLS Enable = yes
>   TLS Require = yes
>   TLS CA Certificate Dir = /etc/pki/tls/certs
> 
> to the FileDaemon section of each client so that it will talk to the
> SD using TLS, the example on
> http://www.bacula.org/dev-manual/Bacula_TLS_Communication.html does
> not mention this (it is clear, once you figured out how the certificates
> are used).

Hmm... any suggestions for the manual?

> The fact that the sd is very sensitive to disk full conditions is a
> bit alarming to me. Maybe bacula should do disk space checks for the
> spooling and work directories and ensure that it leaves a few percent
> of the disk free in any circumstances.

A few years ago, I discussed that with Kern, and he stated that it 
a) is difficult to implement, b) is not absolutely foolproof, as other 
processes might fill the partition independently of Bacula, c) would 
require quite a bit of work, and d) is, provided the administrator 
takes care, not a serious problem. You can guess his conclusion :-)

Arno

>    Best regards
>       Henning
> 

-- 
Arno Lehmann
IT-Service Lehmann
www.its-lehmann.de

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
