[Bacula-users] Solution for a strange network error affecting long running backups

2020-08-05 Thread Josh Fisher
Thought I would share my experience debugging a strange problem
affecting long-running backups. A particular job was failing with a
"network send error to SD" somewhere between 90 and 120 minutes in,
after hundreds of GB had been written. Enabling heartbeat on the Dir,
SD, and client had no effect; the problem persisted.
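
For reference, enabling heartbeat here means setting the Heartbeat
Interval directive in each daemon's configuration, roughly like this
(a sketch; the 60-second value is just an example):

# bacula-dir.conf, Director resource
Heartbeat Interval = 60
# bacula-sd.conf, Storage resource
Heartbeat Interval = 60
# bacula-fd.conf, FileDaemon resource
Heartbeat Interval = 60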


Some background. The client was a KVM VM running CentOS 7 and Bacula
9.6.5. Bacula SD and Dir run together on one node of a
Pacemaker-Corosync cluster, also CentOS 7 and Bacula 9.6.5. The Bacula
server daemons can fail over successfully for incremental backups, but
not for full backups (no dual-port backup devices). The cluster uses a
mix of DRBD volumes and iSCSI LUNs. There are three networks involved:
one dedicated to DRBD, one dedicated to iSCSI, and a LAN connecting
everything else. There were no obvious problems with any other VMs or
cluster nodes, and there didn't appear to be any networking issues. On
both the VM and the cluster nodes, the OS is CentOS 7.8 with the stock
CentOS kernel 3.10.0-1127.13.1 and qemu-kvm 1.5.3-173.


I have had issues in the past with Bacula jobs failing due to
intermittent network problems, and they turned out to be either
hardware errors or buggy NIC drivers. Therefore, the first thing I
tried was moving the client VM to run on the same cluster node that
the Bacula daemons were running on. This way the VM's virtual NIC and
the cluster node's physical NIC are attached to the same Linux bridge
interface, so traffic between the two should never go on the wire,
eliminating the possibility of switch, wiring, and other external
hardware problems. No luck. Exactly the same problem.
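
For anyone trying to reproduce this, bridge membership can be checked
with something like the following (bridge and interface names are
examples):

# confirm the VM's vnet interface and the host's physical NIC are
# ports on the same bridge
brctl show br0
# or, with iproute2
bridge link show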


Next I turned on debugging for the SD. This produced a tremendous
amount of logging with no errors or warnings until, after several
hundred GB of data had been received from the client, a bad packet
suddenly arrived, causing the connection to be dropped. The Bacula log
didn't lie. There was indeed a network send error. But why?
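
For anyone who wants to do the same, SD debugging can be turned on
from bconsole with something like this (the level and the storage
resource name are examples):

* setdebug level=200 trace=1 storage=File1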


Not having any knowledge of the internals of the Linux bridge code, I
thought that perhaps the host's physical NIC, which is also attached
to the bridge that bacula-sd listens on, might somehow be causing a
problem. To eliminate that, I swapped the NIC in that host. I didn't
have a different type of NIC to try, so it was replaced with another
Intel i350, which of course uses the same igb driver. That didn't
help, but it does show that a NIC hardware error is unlikely. Could a
bug in the igb driver cause this? Maybe, but the NIC appeared to work
flawlessly for everything else on the cluster node, including a web
server VM connected to it through the same bridge. Or could it be the
virtio_net driver? Again, it appears to work fine for the web server
VM, but let's start with the virtio_net driver, since virtio_net (the
client VM) is the sender and igb (bacula-sd listening on the cluster
node's physical NIC) is the receiver.


So I searched for virtio-net and/or qemu-kvm network problems. I
didn't find anything exactly like this, but I did find reports of VM
network performance and latency issues for which, several qemu-kvm
versions ago, the solution was to disable some TCP offloading
features. I didn't have high expectations, but I disabled segmentation
offload (TCP and UDP), as well as generic receive offload, on all
involved NICs, started the job again, and SURPRISE, it worked! It ran
for almost 3 hours, backed up 700 GB compressed, and had no errors.


Conclusion: There is a bug somewhere! I think maybe the checksum 
calculation is failing when segmentation offload is enabled. It seems 
that checksum offload works so long as segmentation offload is disabled. 
I didn't try disabling checksum offload and re-enabling segmentation 
offload, nor did I try re-enabling generic receive offload.


To disable segmentation offload and generic receive offload I used:

/sbin/ethtool -K ethX tso off gso off gro off

I disabled those on all interfaces involved. It may only be necessary to 
do this on one of the involved interfaces. I don't know. I just don't 
have time to try all permutations, and this seems to work with little or 
no performance degradation, at least in my case.
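
To check what is currently enabled, and to make the change persistent
across reboots on CentOS 7, something along these lines should work
(the ETHTOOL_OPTS line assumes the legacy network-scripts ifcfg files
are in use):

# show the current offload settings
/sbin/ethtool -k ethX | grep -E 'segmentation|generic-receive'
# persist the change in /etc/sysconfig/network-scripts/ifcfg-ethX
ETHTOOL_OPTS="-K ethX tso off gso off gro off"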


Once again, Bacula shows itself to be the most demanding network app 
that I know of, and so able to trigger all of the most obscure and 
intermittent networking problems.







[Bacula-users] Still present

2020-08-05 Thread Crystian Dorabiatto
Good morning. How are you?

On some of my jobs I see this warning when I recycle volumes.

alucard1.provo1.latam-eig.com-sd JobId 5151: Recycled volume "Daily-S3-7" on Cloud device "S3-Cloud-16" (/var/bacula/S3-Cloud), all previous data lost.
alucard1.provo1.latam-eig.com-sd JobId 5151: Warning: truncate_cloud_volume: Daily-S3-7/part.218 is still present
alucard1.provo1.latam-eig.com-sd JobId 5151: Warning: truncate_cloud_volume: Daily-S3-7/part.217 is still present
alucard1.provo1.latam-eig.com-sd JobId 5151: Warning: truncate_cloud_volume: Daily-S3-7/part.216 is still present
alucard1.provo1.latam-eig.com-sd JobId 5151: Warning: truncate_cloud_volume: Daily-S3-7/part.215 is still present
alucard1.provo1.latam-eig.com-sd JobId 5151: Warning: truncate_cloud_volume: Daily-S3-7/part.214 is still present
alucard1.provo1.latam-eig.com-sd JobId 5151: Warning: truncate_cloud_volume: Daily-S3-7/part.213 is still present
alucard1.provo1.latam-eig.com-sd JobId 5151: Warning: truncate_cloud_volume: Daily-S3-7/part.212 is still present
alucard1.provo1.latam-eig.com-sd JobId 5151: Warning: truncate_cloud_volume: Daily-S3-7/part.211 is still present
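
For reference, I can list the parts that remain for the volume with
the bconsole cloud command, something like the following (assuming the
cloud command is available in this build; the storage name is an
example):

* cloud list storage=S3-Cloud volume=Daily-S3-7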

Do I need to worry about anything?

Do I need to do anything?

Thanks.


[Bacula-users] design challenges - file-cloud backup

2020-08-05 Thread Žiga Žvan


Dear all,
I have tested the Bacula software (9.6.5) and I must say I'm quite
happy with the results (e.g. compression, encryption,
configurability). However, I have some configuration/design questions
I hope you can help me with.


Regarding the job schedule, I would like to:
- create a daily incremental backup (retention 1 week)
- create a weekly full backup (retention 1 month)
- create a monthly full backup (retention 1 year)

I am using the dummy cloud driver that writes to local file storage.
A volume is a directory containing fileparts. I would like to have
separate volumes/pools for each client, and I would like to delete the
data on disk after the retention period expires. If possible, I would
like to delete just the fileparts belonging to expired backups.


Questions:
a) At the moment, I'm using two backup job definitions per client and
a central schedule definition for all my clients. I have noticed that
my incremental job gets promoted to full after the monthly backup ("No
prior Full backup Job record found"), because the monthly backup is a
separate job, but Bacula searches for full backups inside the same
job. Could you please suggest a better configuration? If possible, I
would like to keep the central schedule definition (if I manipulate
pools in a schedule resource, I would need to define them per client).
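
To illustrate what I mean by manipulating pools in the schedule
resource, it would look something like the sketch below (pool names
are examples), and because the pools are named in the shared schedule,
I would then have to define one such schedule per client:

Schedule {
  Name = "CentralCycle"
  Run = Level=Full Pool=Monthly-Pool 1st sun at 23:05
  Run = Level=Full Pool=Weekly-Pool 2nd-5th sun at 23:05
  Run = Level=Incremental Pool=Daily-Pool mon-sat at 23:05
}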


b) I would like to delete expired backups on disk (and in the catalog
as well). At the moment I'm using one volume in a daily/weekly/monthly
pool per client. In a volume, there are fileparts belonging to expired
backups (e.g. part.1-23 in the output below). I have tried to solve
this with purge/prune scripts in my BackupCatalog job (as suggested in
the whitepapers), but the data does not get deleted. Is there any way
to delete fileparts? Should I create separate volumes after the
retention period? Please suggest a better configuration.
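
For reference, the kind of bconsole commands I have been trying look
roughly like this (the storage name is an example; the volume name is
taken from the listing below):

* prune volume=cetrtapot-daily-vol-0022 yes
* purge volume=cetrtapot-daily-vol-0022
* truncate volstatus=Purged volume=cetrtapot-daily-vol-0022 storage=CloudStorage yes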


c) Do I need a restore job for each client? I would just like to
restore a backup onto the same client, defaulting to the /restore
folder... When I use the bconsole "restore all" command, the wizard
asks me all the questions (e.g. option 5: the most recent backup for a
client; which client; which fileset...) but at the end it asks for a
restore job, which changes all the previously selected things (e.g.
the client).
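
For reference, the kind of generic restore job I have in mind looks
roughly like this (names are examples):

Job {
  Name = "RestoreFiles"
  Type = Restore
  Client = client1-fd
  Storage = CloudStorage
  FileSet = "Full Set"
  Pool = Default
  Messages = Standard
  Where = /restore
}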


d) At the moment, I have not implemented autochanger functionality.
Clients compress/encrypt the data and send it to the Bacula server,
which writes it to one central storage system. Jobs are processed in
sequential order (one at a time). Do you expect any significant
performance gain if I implement an autochanger so that jobs can run
simultaneously?
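
As I understand it, running jobs simultaneously would also mean
raising Maximum Concurrent Jobs in the relevant resources, roughly
like this (a sketch; the values are examples):

# bacula-dir.conf, Director and Storage resources
Maximum Concurrent Jobs = 5
# bacula-sd.conf, Storage resource
Maximum Concurrent Jobs = 5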


The relevant part of the configuration is attached below.

Looking forward to moving into production...
Kind regards,
Ziga Zvan


*Volume example* (fileparts 1-23 should be deleted):
[root@bacula cetrtapot-daily-vol-0022]# ls -ltr
total 0
-rw-r--r--. 1 bacula disk   262 Jul 28 23:05 part.1
-rw-r--r--. 1 bacula disk 35988 Jul 28 23:06 part.2
-rw-r--r--. 1 bacula disk 35992 Jul 28 23:07 part.3
-rw-r--r--. 1 bacula disk 36000 Jul 28 23:08 part.4
-rw-r--r--. 1 bacula disk 35981 Jul 28 23:09 part.5
-rw-r--r--. 1 bacula disk 328795126 Jul 28 23:10 part.6
-rw-r--r--. 1 bacula disk 35988 Jul 29 23:09 part.7
-rw-r--r--. 1 bacula disk 35995 Jul 29 23:10 part.8
-rw-r--r--. 1 bacula disk 35981 Jul 29 23:11 part.9
-rw-r--r--. 1 bacula disk 35992 Jul 29 23:12 part.10
-rw-r--r--. 1 bacula disk 453070890 Jul 29 23:12 part.11
-rw-r--r--. 1 bacula disk 35995 Jul 30 23:09 part.12
-rw-r--r--. 1 bacula disk 35993 Jul 30 23:10 part.13
-rw-r--r--. 1 bacula disk 36000 Jul 30 23:11 part.14
-rw-r--r--. 1 bacula disk 35984 Jul 30 23:12 part.15
-rw-r--r--. 1 bacula disk 580090514 Jul 30 23:13 part.16
-rw-r--r--. 1 bacula disk 35994 Aug  3 23:09 part.17
-rw-r--r--. 1 bacula disk 35936 Aug  3 23:12 part.18
-rw-r--r--. 1 bacula disk 35971 Aug  3 23:13 part.19
-rw-r--r--. 1 bacula disk 35984 Aug  3 23:14 part.20
-rw-r--r--. 1 bacula disk 35973 Aug  3 23:15 part.21
-rw-r--r--. 1 bacula disk 35977 Aug  3 23:17 part.22
-rw-r--r--. 1 bacula disk 108461297 Aug  3 23:17 part.23
-rw-r--r--. 1 bacula disk 35974 Aug  4 23:09 part.24
-rw-r--r--. 1 bacula disk 35987 Aug  4 23:10 part.25
-rw-r--r--. 1 bacula disk 35971 Aug  4 23:11 part.26
-rw-r--r--. 1 bacula disk 36000 Aug  4 23:12 part.27
-rw-r--r--. 1 bacula disk 398437855 Aug  4 23:12 part.28

*Cache (deleted as expected):*

[root@bacula cetrtapot-daily-vol-0022]# ls -ltr /mnt/backup_bacula/cloudcache/cetrtapot-daily-vol-0022/

total 4
-rw-r-. 1 bacula disk 262 Jul 28 23:05 part.1

*Relevant part of central configuration*

# Backup the catalog database (after the nightly save)
Job {
  Name = "BackupCatalog"
  JobDefs = "CatalogJob"
  Level = Full
  FileSet="Catalog"
  Schedule = "WeeklyCycleAfterBackup"
  RunBeforeJob = "/opt/bacula/scripts/make_catalog_backup.pl MyCatalog"
  # This deletes the copy of the catalog
  RunAfterJob  = "/opt/bacula/scripts/delete_catalog_backup"