Hello Mike,

Try to upgrade your very old Bacula client to a newer one (at least the same version as bacula-dir).
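For example, you can confirm the version mismatch from bconsole (a sketch; the client name is taken from the job log below, and the output format may vary slightly between releases):

```
# Run in bconsole on the Director. "test.domain.com-fd" is the client
# name from the job log. The first line of the status output reports
# the FD release, which the log below shows as 5.2.6 (21Feb12) while
# the Director is 9.0.6 (20Nov17).
*status client=test.domain.com-fd
```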
Best regards
*Wanderlei Hüttel*
http://www.bacula.com.br

On Wed, May 8, 2019 at 13:31, Mike Benoit <mike.ben...@gmail.com> wrote:

> We have some jobs that back up SQL databases, so immediately after the
> backup is complete we have an "AfterJob" that restores it to a test
> system, then we run a DiskToCatalog verify job. However, the restore
> fails with a block checksum mismatch, and immediately afterwards the
> verify job succeeds.
>
> The backup job seems to succeed without any errors:
> ------------------------------------
> 07-May 21:05 server-dir JobId 206335: Start Backup JobId 206335, Job=test.domain.com-sql.2019-05-07_21.05.01_27
> 07-May 21:05 server-dir JobId 206335: Using Device "FileStorage" to write.
> 07-May 21:05 test.domain.com-fd JobId 206335: shell command: run ClientRunBeforeJob "scripts/postgresql_dump.sh"
> 07-May 21:15 server-sd JobId 206335: Elapsed time=00:04:22, Transfer rate=13.49 M Bytes/second
> 07-May 21:15 server-sd JobId 206335: Sending spooled attrs to the Director. Despooling 4,073 bytes ...
> 07-May 21:15 server-dir JobId 206335: Bacula server-dir 9.0.6 (20Nov17):
>   Build OS:               x86_64-pc-linux-gnu ubuntu 18.04
>   JobId:                  206335
>   Job:                    test.domain.com-sql.2019-05-07_21.05.01_27
>   Backup Level:           Incremental, since=2019-05-06 21:11:28
>   Client:                 "test.domain.com-fd" 5.2.6 (21Feb12) x86_64-pc-linux-gnu,debian,jessie/sid
>   FileSet:                "test-sql" 2017-08-30 21:05:00
>   Pool:                   "File" (From Job resource)
>   Catalog:                "MyCatalog" (From Client resource)
>   Storage:                "File" (From Pool resource)
>   Scheduled time:         07-May-2019 21:05:01
>   Start time:             07-May-2019 21:11:00
>   End time:               07-May-2019 21:15:24
>   Elapsed time:           4 mins 24 secs
>   Priority:               500
>   FD Files Written:       15
>   SD Files Written:       15
>   FD Bytes Written:       3,535,794,319 (3.535 GB)
>   SD Bytes Written:       3,535,796,681 (3.535 GB)
>   Rate:                   13393.2 KB/s
>   Software Compression:   None
>   Comm Line Compression:  None
>   Snapshot/VSS:           no
>   Encryption:             no
>   Accurate:               no
>   Volume name(s):         Vol0043
>   Volume Session Id:      4
>   Volume Session Time:    1557261339
>   Last Volume Bytes:      26,223,400,919 (26.22 GB)
>   Non-fatal FD errors:    0
>   SD Errors:              0
>   FD termination status:  OK
>   SD termination status:  OK
>   Termination:            Backup OK
>
> 07-May 21:15 server-dir JobId 206335: Begin pruning Jobs older than 2 months .
> 07-May 21:15 server-dir JobId 206335: No Jobs found to prune.
> 07-May 21:15 server-dir JobId 206335: Begin pruning Files.
> 07-May 21:15 server-dir JobId 206335: No Files found to prune.
> 07-May 21:15 server-dir JobId 206335: End auto prune.
>
> 07-May 21:15 server-dir JobId 206335: shell command: run AfterJob "scripts/restore_by_jobid test.domain.com-fd 206335"
> ------------------------------------
>
> The restore job soon after fails with this log:
> ------------------------------------
> 08-May 00:13 server-dir JobId 206424: Start Restore Job TestRestoreProductionSQL.2019-05-07_21.15.24_57
> 08-May 00:13 server-dir JobId 206424: Using Device "FileStorage" to read.
> 08-May 00:13 server-sd JobId 206424: Ready to read from volume "Vol0043" on File device "FileStorage" (/home/backup).
> 08-May 00:13 server-sd JobId 206424: Forward spacing Volume "Vol0043" to addr=22681860551
> 08-May 00:13 server-sd JobId 206424: Error: block_util.c:470 Volume data error at 0:0! *Block checksum mismatch* in block=1723 len=64512: calc=24873f8f blk=2deb5777
> 08-May 00:13 server-sd JobId 206424: Elapsed time=00:00:01, Transfer rate=111.0 M Bytes/second
> 08-May 00:13 server-sd JobId 206424: Fatal error: fd_cmds.c:223 Command error with FD msg="rechdr 4 1557261339 2 2 65536", SD hanging up.
> 08-May 00:13 test.domain.com-fd JobId 206424: Error: attribs.c:558 File size of restored file /var/lib/postgresql/tmp/test-db_postgresql_dump.sql.tar.gz not correct. Original 464348732, restored 111017984.
> 08-May 00:13 server-dir JobId 206424: Error: Bacula server-dir 9.0.6 (20Nov17):
>   Build OS:               x86_64-pc-linux-gnu ubuntu 18.04
>   JobId:                  206424
>   Job:                    TestRestoreProductionSQL.2019-05-07_21.15.24_57
>   Restore Client:         test.domain.com-fd
>   Start time:             08-May-2019 00:13:33
>   End time:               08-May-2019 00:13:35
>   Files Expected:         15
>   Files Restored:         2
>   Bytes Restored:         111,018,596
>   Rate:                   55509.3 KB/s
>   FD Errors:              1
>   FD termination status:  Error
>   SD termination status:  Error
>   Termination:            *** Restore Error ***
> ------------------------------------
>
> Here is the log from the successful verify job 3 minutes after the failed restore job (though it only takes 16 seconds, which seems too fast):
> ------------------------------------
> 08-May 00:16 server-dir JobId 206428: Verifying against JobId=206335 Job=test.domain.com-sql.2019-05-07_21.05.01_27
> 08-May 00:16 server-dir JobId 206428: Start Verify JobId=206428 Level=DiskToCatalog Job=verify-test.domain.com-sql.2019-05-07_22.20.00_32
> 08-May 00:17 server-dir JobId 206428: Bacula server-dir 9.0.6 (20Nov17):
>   Build OS:               x86_64-pc-linux-gnu ubuntu 18.04
>   JobId:                  206428
>   Job:                    verify-test.domain.com-sql.2019-05-07_22.20.00_32
>   FileSet:                test-sql
>   Verify Level:           DiskToCatalog
>   Client:                 test.domain.com-fd
>   Verify JobId:           206335
>   Verify Job:
>   Start time:             08-May-2019 00:16:47
>   End time:               08-May-2019 00:17:03
>   Elapsed time:           16 secs
>   Files Examined:         16
>   Non-fatal FD errors:    0
>   FD termination status:  OK
>   Termination:            Verify OK
>
> 08-May 00:17 server-dir JobId 206428: Begin pruning Jobs older than 2 months .
> 08-May 00:17 server-dir JobId 206428: No Jobs found to prune.
> 08-May 00:17 server-dir JobId 206428: Begin pruning Files.
> 08-May 00:17 server-dir JobId 206428: No Files found to prune.
> 08-May 00:17 server-dir JobId 206428: End auto prune.
> ------------------------------------
>
> Clearly the verify job either isn't verifying anything or isn't verifying
> what I expect it to be verifying. I noticed in the v7.4.0 release notes
> there was a new "Level=Data" feature for verification jobs, but I couldn't
> actually find any more information on it in the documentation, just
> information about the "DiskToCatalog" and "VolumeToCatalog" verification
> levels. Does "Level=Data" actually exist, or is there additional
> documentation on it somewhere that I'm missing? This is all critical data,
> so we want to do as much verification as possible on it, which should in
> theory mimic everything a restore would do, so we can be 100% certain the
> data was intact at that time.
>
> The next question is: why is the data being corrupted to begin with? The
> Bacula server uses a RAID1 BTRFS array to store the pool volumes, and a
> BTRFS scrub on the entire block device shows no checksum errors
> whatsoever. There are also no hardware errors appearing in the dmesg
> logs, and SMART monitoring on the drives isn't showing any errors (the
> drives are a few months old).
>
> Any ideas what would cause Bacula to show checksum errors when BTRFS
> doesn't?
>
> The SQL backup jobs are being run across multiple servers in multiple
> countries, and they seem to be the only jobs we have experienced the
> checksum errors with so far, and it's happening almost every night when
> the jobs are scheduled to run. If we run the jobs manually the next
> morning, everything works fine and the restore succeeds without a
> problem.
>
> Any insights would be greatly appreciated.
> _______________________________________________
> Bacula-users mailing list
> Bacula-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/bacula-users
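On the verify question quoted above: as I understand it, a DiskToCatalog verify re-reads the files currently on the *client's* disk and compares their attributes against the catalog; it never touches the backup volume, which would explain why it passes in 16 seconds while the volume itself is corrupt. A VolumeToCatalog verify does re-read the volume from the Storage daemon, so it should trip over the same block checksum error the restore hits. A sketch of such a job (the Client and FileSet names come from the logs above; the other resource names are illustrative and need to match your configuration):

```
# Hypothetical Verify job resource for bacula-dir.conf.
# VolumeToCatalog re-reads the written volume from the SD and compares
# file sizes/checksums against the catalog entries for the verified job.
Job {
  Name = "verify-volume-test.domain.com-sql"
  Type = Verify
  Level = VolumeToCatalog
  Client = test.domain.com-fd
  FileSet = "test-sql"
  Storage = File          # illustrative; use the Storage the backup wrote to
  Pool = File             # illustrative
  Messages = Standard     # illustrative
}
```

By default a Verify job checks the most recent matching backup; from bconsole you can also point it at a specific job with `run ... jobid=206335` style overrides, as your existing DiskToCatalog job appears to do.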
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users