Hello Mike,

Try to upgrade your very old Bacula client to a newer one (at least the same version as bacula-dir).
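For example, you can confirm the version mismatch from bconsole (a sketch; the client name is taken from the job log below, and the output format may vary slightly between releases):

```
# Run in bconsole on the Director. "test.domain.com-fd" is the client
# name from the job log. The first line of the status output reports
# the FD release, which the log below shows as 5.2.6 (21Feb12) while
# the Director is 9.0.6 (20Nov17).
*status client=test.domain.com-fd
```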
Best regards
*Wanderlei Hüttel*
http://www.bacula.com.br

On Wed, May 8, 2019 at 13:31, Mike Benoit <mike.ben...@gmail.com> wrote:

> We have some jobs that back up SQL databases, so immediately after the
> backup is complete we have an "AfterJob" that restores it to a test
> system, then we run a DiskToCatalog verify job. However, the restore
> fails with a block checksum mismatch, and immediately afterwards the
> verify job succeeds.
>
> The backup job seems to succeed without any errors:
> ------------------------------------
> 07-May 21:05 server-dir JobId 206335: Start Backup JobId 206335, Job=test.domain.com-sql.2019-05-07_21.05.01_27
> 07-May 21:05 server-dir JobId 206335: Using Device "FileStorage" to write.
> 07-May 21:05 test.domain.com-fd JobId 206335: shell command: run ClientRunBeforeJob "scripts/postgresql_dump.sh"
> 07-May 21:15 server-sd JobId 206335: Elapsed time=00:04:22, Transfer rate=13.49 M Bytes/second
> 07-May 21:15 server-sd JobId 206335: Sending spooled attrs to the Director. Despooling 4,073 bytes ...
> 07-May 21:15 server-dir JobId 206335: Bacula server-dir 9.0.6 (20Nov17):
>   Build OS:               x86_64-pc-linux-gnu ubuntu 18.04
>   JobId:                  206335
>   Job:                    test.domain.com-sql.2019-05-07_21.05.01_27
>   Backup Level:           Incremental, since=2019-05-06 21:11:28
>   Client:                 "test.domain.com-fd" 5.2.6 (21Feb12) x86_64-pc-linux-gnu,debian,jessie/sid
>   FileSet:                "test-sql" 2017-08-30 21:05:00
>   Pool:                   "File" (From Job resource)
>   Catalog:                "MyCatalog" (From Client resource)
>   Storage:                "File" (From Pool resource)
>   Scheduled time:         07-May-2019 21:05:01
>   Start time:             07-May-2019 21:11:00
>   End time:               07-May-2019 21:15:24
>   Elapsed time:           4 mins 24 secs
>   Priority:               500
>   FD Files Written:       15
>   SD Files Written:       15
>   FD Bytes Written:       3,535,794,319 (3.535 GB)
>   SD Bytes Written:       3,535,796,681 (3.535 GB)
>   Rate:                   13393.2 KB/s
>   Software Compression:   None
>   Comm Line Compression:  None
>   Snapshot/VSS:           no
>   Encryption:             no
>   Accurate:               no
>   Volume name(s):         Vol0043
>   Volume Session Id:      4
>   Volume Session Time:    1557261339
>   Last Volume Bytes:      26,223,400,919 (26.22 GB)
>   Non-fatal FD errors:    0
>   SD Errors:              0
>   FD termination status:  OK
>   SD termination status:  OK
>   Termination:            Backup OK
>
> 07-May 21:15 server-dir JobId 206335: Begin pruning Jobs older than 2 months .
> 07-May 21:15 server-dir JobId 206335: No Jobs found to prune.
> 07-May 21:15 server-dir JobId 206335: Begin pruning Files.
> 07-May 21:15 server-dir JobId 206335: No Files found to prune.
> 07-May 21:15 server-dir JobId 206335: End auto prune.
>
> 07-May 21:15 server-dir JobId 206335: shell command: run AfterJob "scripts/restore_by_jobid test.domain.com-fd 206335"
> ------------------------------------
>
> The restore job soon after fails with this log:
> ------------------------------------
> 08-May 00:13 server-dir JobId 206424: Start Restore Job TestRestoreProductionSQL.2019-05-07_21.15.24_57
> 08-May 00:13 server-dir JobId 206424: Using Device "FileStorage" to read.
> 08-May 00:13 server-sd JobId 206424: Ready to read from volume "Vol0043" on File device "FileStorage" (/home/backup).
> 08-May 00:13 server-sd JobId 206424: Forward spacing Volume "Vol0043" to addr=22681860551
> 08-May 00:13 server-sd JobId 206424: Error: block_util.c:470 Volume data error at 0:0! *Block checksum mismatch* in block=1723 len=64512: calc=24873f8f blk=2deb5777
> 08-May 00:13 server-sd JobId 206424: Elapsed time=00:00:01, Transfer rate=111.0 M Bytes/second
> 08-May 00:13 server-sd JobId 206424: Fatal error: fd_cmds.c:223 Command error with FD msg="rechdr 4 1557261339 2 2 65536", SD hanging up.
> 08-May 00:13 test.domain.com-fd JobId 206424: Error: attribs.c:558 File size of restored file /var/lib/postgresql/tmp/test-db_postgresql_dump.sql.tar.gz not correct. Original 464348732, restored 111017984.
> 08-May 00:13 server-dir JobId 206424: Error: Bacula server-dir 9.0.6 (20Nov17):
>   Build OS:               x86_64-pc-linux-gnu ubuntu 18.04
>   JobId:                  206424
>   Job:                    TestRestoreProductionSQL.2019-05-07_21.15.24_57
>   Restore Client:         test.domain.com-fd
>   Start time:             08-May-2019 00:13:33
>   End time:               08-May-2019 00:13:35
>   Files Expected:         15
>   Files Restored:         2
>   Bytes Restored:         111,018,596
>   Rate:                   55509.3 KB/s
>   FD Errors:              1
>   FD termination status:  Error
>   SD termination status:  Error
>   Termination:            *** Restore Error ***
> ------------------------------------
>
> Here is the log from the successful verify job 3 minutes after the failed restore job (though it only takes 16 seconds, which seems too fast):
> ------------------------------------
> 08-May 00:16 server-dir JobId 206428: Verifying against JobId=206335 Job=test.domain.com-sql.2019-05-07_21.05.01_27
> 08-May 00:16 server-dir JobId 206428: Start Verify JobId=206428 Level=DiskToCatalog Job=verify-test.domain.com-sql.2019-05-07_22.20.00_32
> 08-May 00:17 server-dir JobId 206428: Bacula server-dir 9.0.6 (20Nov17):
>   Build OS:               x86_64-pc-linux-gnu ubuntu 18.04
>   JobId:                  206428
>   Job:                    verify-test.domain.com-sql.2019-05-07_22.20.00_32
>   FileSet:                test-sql
>   Verify Level:           DiskToCatalog
>   Client:                 test.domain.com-fd
>   Verify JobId:           206335
>   Verify Job:
>   Start time:             08-May-2019 00:16:47
>   End time:               08-May-2019 00:17:03
>   Elapsed time:           16 secs
>   Files Examined:         16
>   Non-fatal FD errors:    0
>   FD termination status:  OK
>   Termination:            Verify OK
>
> 08-May 00:17 server-dir JobId 206428: Begin pruning Jobs older than 2 months .
> 08-May 00:17 server-dir JobId 206428: No Jobs found to prune.
> 08-May 00:17 server-dir JobId 206428: Begin pruning Files.
> 08-May 00:17 server-dir JobId 206428: No Files found to prune.
> 08-May 00:17 server-dir JobId 206428: End auto prune.
> ------------------------------------
>
> Clearly the verify job either isn't verifying anything or isn't verifying
> what I expect it to be verifying. I noticed in the v7.4.0 release notes
> there was a new "Level=Data" feature for verification jobs, but I couldn't
> actually find any more information on it in the documentation, just
> information about the "DiskToCatalog" and "VolumeToCatalog" verification
> levels. Does "Level=Data" actually exist, or is there additional
> documentation on it somewhere that I'm missing? This is all critical data,
> so we want to do as much verification as possible on it, which should in
> theory mimic everything a restore would do, so we can be 100% certain the
> data was intact at that time.
>
> The next question is: why is the data being corrupted to begin with? The
> Bacula server uses a RAID1 BTRFS array to store the pool volumes, and a
> BTRFS scrub on the entire block device shows no checksum errors
> whatsoever. There are also no hardware errors appearing in the dmesg
> logs, and SMART monitoring on the drives isn't showing any errors (the
> drives are a few months old).
>
> Any ideas what would cause Bacula to show checksum errors when BTRFS
> doesn't?
>
> The SQL backup jobs are being run across multiple servers in multiple
> countries, and they seem to be the only jobs we have experienced the
> checksum errors with so far, and it's happening almost every night when
> the jobs are scheduled to run. If we run the jobs manually the next
> morning, everything works fine and the restore succeeds without a
> problem.
>
> Any insights would be greatly appreciated.
> _______________________________________________
> Bacula-users mailing list
> Bacula-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/bacula-users
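On the verify question quoted above: as I understand it, a DiskToCatalog verify re-reads the files currently on the *client's* disk and compares their attributes against the catalog; it never touches the backup volume, which would explain why it passes in 16 seconds while the volume itself is corrupt. A VolumeToCatalog verify does re-read the volume from the Storage daemon, so it should trip over the same block checksum error the restore hits. A sketch of such a job (the Client and FileSet names come from the logs above; the other resource names are illustrative and need to match your configuration):

```
# Hypothetical Verify job resource for bacula-dir.conf.
# VolumeToCatalog re-reads the written volume from the SD and compares
# file sizes/checksums against the catalog entries for the verified job.
Job {
  Name = "verify-volume-test.domain.com-sql"
  Type = Verify
  Level = VolumeToCatalog
  Client = test.domain.com-fd
  FileSet = "test-sql"
  Storage = File          # illustrative; use the Storage the backup wrote to
  Pool = File             # illustrative
  Messages = Standard     # illustrative
}
```

By default a Verify job checks the most recent matching backup; from bconsole you can also point it at a specific job with `run ... jobid=206335` style overrides, as your existing DiskToCatalog job appears to do.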
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users