Re: [Bacula-users] Verify job succeeds, but restore fails with checksum error?

2019-05-09 Thread Martin Simmons
> On Wed, 8 May 2019 09:28:05 -0700, Mike Benoit said:
> 
> Clearly the verify job either isn't verifying anything or isn't verifying
> what I expect it to be verifying.

Was the test restore directory empty before you did the restore?


>   I noticed in the v7.4.0 release notes
> there was a new feature of "Level=Data" for verification jobs, but I
> couldn't actually find any more information on that in the documentation,
> just information about "DiskToCatalog" and "VolumeToCatalog" verification
> levels. Does "Level=Data" actually exist, or is there additional
> documentation on it somewhere that I'm missing? This is all critical data,
> so we want to do as much verification as possible on it, which should in
> theory mimic everything a restore would do, so we can be 100% certain the
> data was intact at that time.

There is a little more documentation here:

https://www.bacula.org/9.4.x-manuals/en/main/New_Features_in_7_4_0.html#SECTION00621000

and here:

https://www.bacula.org/9.4.x-manuals/en/main/New_Features_in_9_0_0.html#SECTION005013000
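
For what it's worth, a Level=Data job is configured like any other Verify job,
just with Level = Data.  A minimal, untested sketch, reusing the Client,
FileSet, Storage and Pool names from your job log (the job name and the
Messages resource name are made up):

Job {
  Name     = "VerifyProductionSQL"
  Type     = Verify
  # Data level rereads the job's records from the volume and checks them
  Level    = Data
  Client   = test.domain.com-fd
  FileSet  = "test-sql"
  Storage  = File
  Pool     = File
  Messages = Standard
}

If I read the 7.4.0 notes correctly, you can point it at a particular backup
with the jobid parameter of the run command, e.g. in bconsole:

run job=VerifyProductionSQL jobid=206335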


> The next question is why is the data being corrupted to begin with? The
> Bacula server uses a RAID1 BTRFS array to store the pool volumes on, and
> doing a BTRFS scrub on the entire block device shows no checksum errors
> whatsoever. There are also no hardware errors appearing in the dmesg logs,
> and SMART monitoring on the drives isn't showing any errors (the drives are
> a few months old).
> 
> Any ideas what would cause bacula to show checksum errors but BTRFS isn't?

That suggests the data was corrupted after Bacula computed its block checksum
but before BTRFS computed its own, i.e. the block was already corrupt when it
was handed to the filesystem.  You could try running bls on that volume to see
if it also detects the checksum mismatch.  Something like

bls -j /home/backup/Vol0043

If that does detect it, then try with bls -j -p to see if there are more
errors.  It may be worthwhile testing other volumes as well.
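
For example (assuming the volumes really do live under /home/backup, as in the
restore log):

# -p tells bls to keep reading past bad blocks, so it reports every error on
# the volume rather than stopping at the first one
bls -j -p /home/backup/Vol0043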


> The SQL backup jobs are being run across multiple servers in multiple
> countries, and they seem to be the only jobs that we have experienced the
> checksum errors with so far and its happening almost every night when the
> jobs are scheduled to run. If we run the jobs manually the next morning
> everything works fine and the restore succeeds without a problem.

Does the SD have ECC RAM?
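
If you're not sure, something like this run as root on the SD host should say
(generic Linux, nothing Bacula-specific):

dmidecode --type memory | grep -i 'error correction'

An "Error Correction Type: None" there means no ECC.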

__Martin




Re: [Bacula-users] Verify job succeeds, but restore fails with checksum error?

2019-05-08 Thread Gary R. Schmidt

On 09/05/2019 03:07, Wanderlei Huttel wrote:

Hello Mike

Try to upgrade your very old Bacula client to a newer one (at least to the
same version as bacula-dir).


Slight correction: clients (file daemons) cannot be newer than the
director or storage daemon, so only update to the same version as your
director, if possible.
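
If it helps, the versions at each end can be checked from bconsole before and
after the upgrade: "version" prints the Director's version, and a client
status query asks the FD for its own version string (client name below taken
from the job log in this thread):

*version
*status client=test.domain.com-fd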


Cheers,
Gary B-)




Re: [Bacula-users] Verify job succeeds, but restore fails with checksum error?

2019-05-08 Thread Wanderlei Huttel
Hello Mike

Try to upgrade your very old Bacula client to a newer one (at least to the
same version as bacula-dir).


Best regards

Wanderlei Hüttel
http://www.bacula.com.br


On Wed, 8 May 2019 at 13:31, Mike Benoit wrote:

> We have some jobs that back up SQL databases, so immediately after the
> backup is complete we have an "AfterJob" that restores it to a test system,
> then we run a DiskToCatalog verify job. However, the restore fails with a
> Block checksum mismatch, and immediately afterwards the verify job succeeds.
>
> The backup job seems to succeed without any errors:
> 
> 07-May 21:05 server-dir JobId 206335: Start Backup JobId 206335,
> Job=test.domain.com-sql.2019-05-07_21.05.01_27
> 07-May 21:05 server-dir JobId 206335: Using Device "FileStorage" to write.
> 07-May 21:05 test.domain.com-fd JobId 206335: shell command: run
> ClientRunBeforeJob "scripts/postgresql_dump.sh"
> 07-May 21:15 server-sd JobId 206335: Elapsed time=00:04:22, Transfer
> rate=13.49 M Bytes/second
> 07-May 21:15 server-sd JobId 206335: Sending spooled attrs to the
> Director. Despooling 4,073 bytes ...
> 07-May 21:15 server-dir JobId 206335: Bacula server-dir 9.0.6 (20Nov17):
>   Build OS:   x86_64-pc-linux-gnu ubuntu 18.04
>   JobId:  206335
>   Job:test.domain.com-sql.2019-05-07_21.05.01_27
>   Backup Level:   Incremental, since=2019-05-06 21:11:28
>   Client: "test.domain.com-fd" 5.2.6 (21Feb12)
> x86_64-pc-linux-gnu,debian,jessie/sid
>   FileSet:"test-sql" 2017-08-30 21:05:00
>   Pool:   "File" (From Job resource)
>   Catalog:"MyCatalog" (From Client resource)
>   Storage:"File" (From Pool resource)
>   Scheduled time: 07-May-2019 21:05:01
>   Start time: 07-May-2019 21:11:00
>   End time:   07-May-2019 21:15:24
>   Elapsed time:   4 mins 24 secs
>   Priority:   500
>   FD Files Written:   15
>   SD Files Written:   15
>   FD Bytes Written:   3,535,794,319 (3.535 GB)
>   SD Bytes Written:   3,535,796,681 (3.535 GB)
>   Rate:   13393.2 KB/s
>   Software Compression:   None
>   Comm Line Compression:  None
>   Snapshot/VSS:   no
>   Encryption: no
>   Accurate:   no
>   Volume name(s): Vol0043
>   Volume Session Id:  4
>   Volume Session Time:1557261339
>   Last Volume Bytes:  26,223,400,919 (26.22 GB)
>   Non-fatal FD errors:0
>   SD Errors:  0
>   FD termination status:  OK
>   SD termination status:  OK
>   Termination:Backup OK
>
> 07-May 21:15 server-dir JobId 206335: Begin pruning Jobs older than 2
> months .
> 07-May 21:15 server-dir JobId 206335: No Jobs found to prune.
> 07-May 21:15 server-dir JobId 206335: Begin pruning Files.
> 07-May 21:15 server-dir JobId 206335: No Files found to prune.
> 07-May 21:15 server-dir JobId 206335: End auto prune.
>
> 07-May 21:15 server-dir JobId 206335: shell command: run AfterJob
> "scripts/restore_by_jobid test.domain.com-fd 206335"
> 
>
>
> The restore job soon after fails with this log:
> 
> 08-May 00:13 server-dir JobId 206424: Start Restore Job
> TestRestoreProductionSQL.2019-05-07_21.15.24_57
> 08-May 00:13 server-dir JobId 206424: Using Device "FileStorage" to read.
> 08-May 00:13 server-sd JobId 206424: Ready to read from volume "Vol0043"
> on File device "FileStorage" (/home/backup).
> 08-May 00:13 server-sd JobId 206424: Forward spacing Volume "Vol0043" to
> addr=22681860551
> 08-May 00:13 server-sd JobId 206424: Error: block_util.c:470 Volume data
> error at 0:0!
> *Block checksum mismatch* in block=1723 len=64512: calc=24873f8f
> blk=2deb5777
> 08-May 00:13 server-sd JobId 206424: Elapsed time=00:00:01, Transfer
> rate=111.0 M Bytes/second
> 08-May 00:13 server-sd JobId 206424: Fatal error: fd_cmds.c:223 Command
> error with FD msg="rechdr 4 1557261339 2 2 65536", SD hanging up.
> 08-May 00:13 test.domain.com-fd JobId 206424: Error: attribs.c:558 File
> size of restored file
> /var/lib/postgresql/tmp/test-db_postgresql_dump.sql.tar.gz not correct.
> Original 464348732, restored 111017984.
> 08-May 00:13 server-dir JobId 206424: Error: Bacula server-dir 9.0.6
> (20Nov17):
> 08-May 00:13 server-dir JobId 206424: Error: Bacula server-dir 9.0.6
> (20Nov17):
>   Build OS:   x86_64-pc-linux-gnu ubuntu 18.04
>   JobId:  206424
>   Job:TestRestoreProductionSQL.2019-05-07_21.15.24_57
>   Restore Client: test.domain.com-fd
>   Start time: 08-May-2019 00:13:33
>   End time:   08-May-2019 00:13:35
>   Files Expected: 15
>   Files Restored: 2
>   Bytes Restored: 111,018,596
>   Rate:   55509.3 KB/s
>   FD Errors:  1
>   FD termination status: