Hi,

We are using Bacula to back up our company's data. All storages are ordinary Debian Jessie Linux servers with spinning disks; we don't use tapes. We tried two Bacula versions: 7.0.5+dfsg-4~bpo80+1 and 7.4.3+dfsg-1+sid1~bpo8+1.
We need two copies of each backup, placed in separate datacenters, so we run periodic Copy jobs to mirror data between storages. We want to back up to odd-numbered storages and then copy each backup to an even-numbered storage.

Our current configuration suffers from occasional deadlocks when Bacula tries to read from and write to a single storage. I suspected this was caused by a mistake in our config, where storages have the same Media Type (as documented at http://www.bacula.org/7.4.x-manuals/en/main/Migration_Copy.html#SECTION002830000000000000000 ). For this reason we decided to create a new config in which every storage has a Media Type different from every other storage.

When I tested this new config in a testing environment, jobs got stuck and never finished. "status storage=bacst2-stor" showed:

    Device is BLOCKED waiting to create a volume for:
        Pool:        zdenek-test-pp_old-full-pool-mirror
        Media type:  File-storspec-mirror
    Available Space=5.323 GB

and never made any progress. The device is unusable for all jobs (they simply wait). I tried to mount and label a new volume, but it didn't make any difference. The only thing that helps is restarting the storage daemon, which makes the stuck job fail.

An strace of the storage daemon on bacst2 revealed that the director connects to it, both authenticate to each other, and the storage sends "\0\0\0\0223000 OK Hello 305\n" to the director. The storage then reads from the socket and never gets a reply: the thread blocks in the read() syscall indefinitely. An strace of the director confirms this: a thread connects to the storage, authenticates, reads the Hello, and then never replies. Instead, it opens a connection to bacst1 and starts sending commands. Even after several minutes (the test backups are several KB in size and usually finish in a few seconds), the network socket to bacst2 is still open and no communication is taking place.

I verified this with tcpdump and there is nothing suspicious: the connection works normally, and the last packet sent is the Hello message described above.
Communication on that four-tuple then simply stops: neither side sends anything further, and the connection is never closed. There is no firewall or NAT between the servers; they are all connected to a single internal network. I also tried upgrading our 7.0 install to the latest 7.4 from Debian, and the results are exactly the same.

Configuration and strace output are at:
https://drive.google.com/file/d/0B4bjslETcBa-ZHVkOHU4dlZCZ2s/view?usp=sharing

I can reliably replicate the issue by running (on the director):

    for i in `seq 1 2` ; do
      for job in bacst1_storage-job --bacst1_storage-incremental-job-mirror \
                 --bacst1_storage-full-job-mirror bacdir1_director-job \
                 --bacdir1_director-incremental-job-mirror \
                 --bacdir1_director-full-job-mirror ; do
        echo "run job=$job yes" | bacula-console
      done
    done

Is this a known problem? Is there any workaround?
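For anyone who wants to script the reproduction and detect the stuck state automatically, here is a minimal sketch. The function names gen_run_cmds and is_blocked are hypothetical helpers, not part of Bacula. The only Bacula-specific strings are the "run job=... yes" console syntax from the loop above and the "Device is BLOCKED" text from the status output quoted earlier. The sketch assumes the console binary is bacula-console, as on our Debian systems.

```sh
#!/bin/sh
# Hypothetical helpers for reproducing and detecting the hang.
# Nothing is sent to Bacula until the output is piped to the console.

# gen_run_cmds ROUNDS JOB...
# Prints one "run job=<name> yes" console command per job,
# repeated ROUNDS times (mirrors the nested loop in the post).
gen_run_cmds() {
    rounds=$1
    shift
    i=1
    while [ "$i" -le "$rounds" ]; do
        for job in "$@"; do
            printf 'run job=%s yes\n' "$job"
        done
        i=$((i + 1))
    done
}

# is_blocked
# Reads storage status text on stdin; succeeds (exit 0) if it
# contains the "Device is BLOCKED" line, fails otherwise.
is_blocked() {
    grep -q 'Device is BLOCKED'
}
```

Possible usage on the director, starting one console per command as in the original loop and then checking the storage:

    gen_run_cmds 2 bacst1_storage-job bacdir1_director-job |
      while read -r cmd; do echo "$cmd" | bacula-console; done
    echo "status storage=bacst2-stor" | bacula-console | is_blocked \
      && echo "bacst2-stor is blocked"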