[Bacula-users] Solution for a strange network error affecting long running backups
Thought I would share my experience debugging a strange problem affecting long-running backups. A particular job was failing with a "network send error to SD" between 90 and 120 minutes in, after hundreds of GB had been written. Enabling heartbeat on the Dir, SD, and client had no effect; the problem persisted.

Some background. The client was a KVM VM running CentOS 7 and Bacula 9.6.5. The Bacula SD and Dir run together on one node of a Pacemaker/Corosync cluster, also CentOS 7 and Bacula 9.6.5. The Bacula server daemons can fail over successfully for incremental backups, but not for fulls (no dual-port backup devices). The cluster uses a mix of DRBD volumes and iSCSI LUNs. There are three networks involved: one dedicated to DRBD, one dedicated to iSCSI, and a LAN connecting everything else. There were no obvious problems with any other VMs or cluster nodes, and there didn't appear to be any networking issues. On both the VM and the cluster nodes, the OS is CentOS 7.8 with the stock CentOS kernel 3.10.0-1127.13.1 and qemu-kvm 1.5.3-173.

I have had Bacula jobs fail due to intermittent network issues in the past, and they turned out to be either hardware errors or buggy NIC drivers. So the first thing I tried was moving the client VM to run on the same cluster node as the Bacula daemons. That way the VM's virtual NIC and the cluster node's physical NIC are attached to the same Linux bridge interface, so traffic between the two should never go on the wire, eliminating the possibility of switch, wiring, and other external hardware problems. No luck: exactly the same failure.

Next I turned on debugging for the SD. This produced a tremendous amount of logging with no errors or warnings until, after several hundred GB had been received from the client, a bad packet suddenly arrived and the connection was dropped. The Bacula log didn't lie: there was indeed a network send error. But why?

Not knowing anything about the internals of the Linux bridge code, I thought that the host's physical NIC, which is also attached to the bridge that bacula-sd listens on, might somehow be causing the problem. To eliminate that, I swapped the NIC in that host. I didn't have a different type of NIC to try, so it was replaced with another Intel i350, which of course uses the same igb driver. That didn't fix it, but it does make a NIC hardware error unlikely.

Could a bug in the igb driver cause this? Maybe, but the NIC appeared to work flawlessly for everything else on the cluster node, including a web server VM connected through the same bridge. Or could it be the virtio_net driver? Again, it appears to work fine for the web server VM. Still, virtio_net was the place to start, since virtio_net (the client VM) is the sender and igb (bacula-sd listening on the cluster node's physical NIC) is the receiver.

So I searched for virtio-net and qemu-kvm network problems. I didn't find anything exactly like this, but I did find reports of VM network performance and latency problems for which, several qemu-kvm versions ago, the fix was to disable some TCP offloading features. I didn't have high expectations, but I disabled segmentation offload (TCP and UDP), as well as generic receive offload, on all involved NICs and started the job again. SURPRISE: it worked! It ran for almost 3 hours, backed up 700 GB compressed, and had no errors.

Conclusion: there is a bug somewhere! My guess is that the checksum calculation fails when segmentation offload is enabled.
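For anyone who wants to see what their interfaces are currently doing before changing anything, the offload state can be inspected with ethtool's lowercase -k (show features) option. ethX here is just a placeholder for the real interface name:

  # Show the offload features currently in effect on an interface
  /sbin/ethtool -k ethX | grep -E 'segmentation|generic-receive|checksum'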
It seems that checksum offload works as long as segmentation offload is disabled. I didn't try disabling checksum offload and re-enabling segmentation offload, nor did I try re-enabling generic receive offload. To disable segmentation offload I used:

  /sbin/ethtool -K ethX tso off gso off gro off

I disabled those on all interfaces involved. It may only be necessary to do this on one of them; I don't know. I just don't have time to try all the permutations, and this works with little or no performance degradation, at least in my case.

Once again, Bacula shows itself to be the most demanding network application I know of, and so able to trigger the most obscure and intermittent networking problems.
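A note for anyone applying the same workaround: settings made with ethtool -K do not survive a reboot on their own. On CentOS 7 with the classic network service, one way to persist them is an ETHTOOL_OPTS line in the interface's ifcfg file (a sketch; ethX is a placeholder, and interfaces managed by NetworkManager may need a dispatcher script instead):

  # /etc/sysconfig/network-scripts/ifcfg-ethX (excerpt)
  # Applied by the initscripts network service each time the interface comes up
  ETHTOOL_OPTS="-K ethX tso off gso off gro off"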
[Bacula-users] Still present
Good morning. How are you?

On some of my jobs I see these warnings when I recycle volumes:

alucard1.provo1.latam-eig.com-sd JobId 5151: Recycled volume "Daily-S3-7" on Cloud device "S3-Cloud-16" (/var/bacula/S3-Cloud), all previous data lost.
alucard1.provo1.latam-eig.com-sd JobId 5151: Warning: truncate_cloud_volume: Daily-S3-7/part.218 is still present
alucard1.provo1.latam-eig.com-sd JobId 5151: Warning: truncate_cloud_volume: Daily-S3-7/part.217 is still present
alucard1.provo1.latam-eig.com-sd JobId 5151: Warning: truncate_cloud_volume: Daily-S3-7/part.216 is still present
alucard1.provo1.latam-eig.com-sd JobId 5151: Warning: truncate_cloud_volume: Daily-S3-7/part.215 is still present
alucard1.provo1.latam-eig.com-sd JobId 5151: Warning: truncate_cloud_volume: Daily-S3-7/part.214 is still present
alucard1.provo1.latam-eig.com-sd JobId 5151: Warning: truncate_cloud_volume: Daily-S3-7/part.213 is still present
alucard1.provo1.latam-eig.com-sd JobId 5151: Warning: truncate_cloud_volume: Daily-S3-7/part.212 is still present
alucard1.provo1.latam-eig.com-sd JobId 5151: Warning: truncate_cloud_volume: Daily-S3-7/part.211 is still present

Do I need to worry about anything? Do I need to do anything?

Thanks.
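In case it is useful for diagnosing this: as far as I understand the 9.x cloud storage support, the parts still present in the cloud for that volume can be listed from bconsole with the cloud command, something like the line below (a sketch; the storage= value must name the Storage resource, which may differ from the device name S3-Cloud-16 shown in the log):

  *cloud list storage=S3-Cloud-16 volume=Daily-S3-7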
[Bacula-users] design challenges - file-cloud backup
Dear all,

I have tested the Bacula software (9.6.5) and I must say I'm quite happy with the results (e.g. compression, encryption, configurability). However, I have some configuration/design questions I hope you can help me with.

Regarding the job schedule, I would like to:
- create incremental daily backups (retention 1 week)
- create weekly full backups (retention 1 month)
- create monthly full backups (retention 1 year)

I am using the dummy cloud driver that writes to local file storage. A volume is a directory with fileparts. I would like to have separate volumes/pools for each client. I would like to delete the data on disk after the retention period expires. If possible, I would like to delete just the fileparts belonging to expired backups.

Questions:

a) At the moment, I'm using two backup job definitions per client and a central schedule definition for all my clients. I have noticed that my incremental job gets promoted to full after the monthly backup ("No prior Full backup Job record found"), because the monthly backup is a separate job, but Bacula searches for full backups inside the same job. Could you please suggest a better configuration? If possible, I would like to keep the central schedule definition (if I manipulate pools in a schedule resource, I would need to define schedules per client). A sketch of the single-job layout I have in mind follows these questions.

b) I would like to delete expired backups on disk (and in the catalog as well). At the moment I'm using one volume in a daily/weekly/monthly pool per client. In a volume, there are fileparts belonging to expired backups (e.g. parts 1-23 in the output below). I have tried to solve this with purge/prune scripts in my BackupCatalog job (as suggested in the whitepapers), but the data does not get deleted. Is there any way to delete fileparts? Should I create separate volumes after the retention period? Please suggest a better configuration.

c) Do I need a restore job for each client? I would just like to restore a backup on the same client, defaulting to a /restore folder... When I use the bconsole "restore all" command, the wizard asks me all the questions (e.g. option 5 - last backup for a client; which client, fileset...), but at the end it asks for a restore job, which changes all the previously selected things (e.g. the client).

d) At the moment, I have not implemented autochanger functionality. Clients compress/encrypt the data and send it to the Bacula server, which writes it to one central storage system. Jobs are processed in sequential order (one at a time). Do you expect any significant performance gain if I implement an autochanger so that jobs can run simultaneously?
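To make question a) concrete, here is the kind of single-job layout I have in mind: level overrides live in the one central schedule, and each client's job selects its own pools per level, so fulls and incrementals share a single job history. This is only a sketch with hypothetical names (CentralCycle, cetrtapot-*), not a tested configuration, and as far as I can tell it does not separate monthly fulls from weekly fulls into different pools; that would still require Pool= overrides in the Run lines:

  # One central schedule: only the levels are overridden here
  Schedule {
    Name = "CentralCycle"
    Run = Level=Full 1st sun at 23:05           # monthly full
    Run = Level=Full 2nd-5th sun at 23:05       # weekly full
    Run = Level=Incremental mon-sat at 23:05    # daily incremental
  }

  # One backup job per client: pools are chosen per level, per client
  Job {
    Name = "cetrtapot-backup"
    JobDefs = "DefaultJob"    # assumed to supply Pool, FileSet, Storage, Messages
    Client = "cetrtapot-fd"
    Schedule = "CentralCycle"
    Full Backup Pool = cetrtapot-full
    Incremental Backup Pool = cetrtapot-daily
  }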
The relevant part of my configuration is attached below. Looking forward to moving into production...

Kind regards,
Ziga Zvan

*Volume example* (fileparts 1-23 should be deleted):

[root@bacula cetrtapot-daily-vol-0022]# ls -ltr
total 0
-rw-r--r--. 1 bacula disk       262 Jul 28 23:05 part.1
-rw-r--r--. 1 bacula disk     35988 Jul 28 23:06 part.2
-rw-r--r--. 1 bacula disk     35992 Jul 28 23:07 part.3
-rw-r--r--. 1 bacula disk     36000 Jul 28 23:08 part.4
-rw-r--r--. 1 bacula disk     35981 Jul 28 23:09 part.5
-rw-r--r--. 1 bacula disk 328795126 Jul 28 23:10 part.6
-rw-r--r--. 1 bacula disk     35988 Jul 29 23:09 part.7
-rw-r--r--. 1 bacula disk     35995 Jul 29 23:10 part.8
-rw-r--r--. 1 bacula disk     35981 Jul 29 23:11 part.9
-rw-r--r--. 1 bacula disk     35992 Jul 29 23:12 part.10
-rw-r--r--. 1 bacula disk 453070890 Jul 29 23:12 part.11
-rw-r--r--. 1 bacula disk     35995 Jul 30 23:09 part.12
-rw-r--r--. 1 bacula disk     35993 Jul 30 23:10 part.13
-rw-r--r--. 1 bacula disk     36000 Jul 30 23:11 part.14
-rw-r--r--. 1 bacula disk     35984 Jul 30 23:12 part.15
-rw-r--r--. 1 bacula disk 580090514 Jul 30 23:13 part.16
-rw-r--r--. 1 bacula disk     35994 Aug  3 23:09 part.17
-rw-r--r--. 1 bacula disk     35936 Aug  3 23:12 part.18
-rw-r--r--. 1 bacula disk     35971 Aug  3 23:13 part.19
-rw-r--r--. 1 bacula disk     35984 Aug  3 23:14 part.20
-rw-r--r--. 1 bacula disk     35973 Aug  3 23:15 part.21
-rw-r--r--. 1 bacula disk     35977 Aug  3 23:17 part.22
-rw-r--r--. 1 bacula disk 108461297 Aug  3 23:17 part.23
-rw-r--r--. 1 bacula disk     35974 Aug  4 23:09 part.24
-rw-r--r--. 1 bacula disk     35987 Aug  4 23:10 part.25
-rw-r--r--. 1 bacula disk     35971 Aug  4 23:11 part.26
-rw-r--r--. 1 bacula disk     36000 Aug  4 23:12 part.27
-rw-r--r--. 1 bacula disk 398437855 Aug  4 23:12 part.28

*Cache (deleted as expected):*

[root@bacula cetrtapot-daily-vol-0022]# ls -ltr /mnt/backup_bacula/cloudcache/cetrtapot-daily-vol-0022/
total 4
-rw-r--r--. 1 bacula disk 262 Jul 28 23:05 part.1

*Relevant part of central configuration:*

# Backup the catalog database (after the nightly save)
Job {
  Name = "BackupCatalog"
  JobDefs = "CatalogJob"
  Level = Full
  FileSet = "Catalog"
  Schedule = "WeeklyCycleAfterBackup"
  RunBeforeJob = "/opt/bacula/scripts/make_catalog_backup.pl MyCatalog"
  # This deletes the copy of the catalog
  RunAfterJob = "/opt/bacula/scripts/delete_catalog_backup"
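One more note on question b), from my reading of the documentation so far: pruning only removes catalog records once retention expires; the fileparts on disk are removed only when the purged volume is actually truncated, and truncation works on whole volumes, not individual fileparts. So a purge/prune script would need a truncate step as well, roughly like this from bconsole (a sketch only; the storage and pool names are hypothetical and would need to match the real setup):

  *prune expired volume yes
  *truncate volstatus=Purged storage=S3Storage pool=cetrtapot-daily

If this is right, the same two commands could also run from the BackupCatalog job as Console = "..." lines inside a RunScript block.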