Re: [Bacula-users] Restore errors

Mair Wolfgang-awm013 Tue, 24 Jul 2007 03:56:00 -0700

Hello,

This is exactly what I experienced last week. I submitted this under the 
subject: ' Restore Error of linux-install-fdFul'.

However, I haven't yet had the time to track this down as far as Doytchin did. 
Great work! 
This morning (before reading through all this) I also found that if I do a 
single backup the restore runs fine. My setting has concurrent jobs = 3. If I 
back up with this setting I get the same errors as already described. 

To contribute something here, this is my setup and what I have done so far:

Opensuse 10.2 
Bacula 2.0.3

The first setup was in a VMware machine running openSUSE 10.2. Since I could 
not find anything wrong with the OS or file system, I suspected the virtual 
machine. I moved to a different (non-virtual) system, installed the OS fresh 
and compiled Bacula 2.0.3 from scratch (./configure --enable-smartalloc 
--with-mysql), copied the config files and created the needed MySQL tables 
with the supplied scripts.
The first manual backup and restore I did went without problems. I tried two 
big machines. 
Then I let it run the usual backup with 3 concurrent jobs, tried a restore, and 
it failed with the known problems. 

As Doytchin already tracked down, it is a matter of the concurrently running 
backup jobs. I fully agree with this. 

Currently I have also set concurrent jobs = 1 and the backup is still 
running. I don't know whether this will be usable in our environment, since it 
now takes quite a long time to complete. 
Hopefully the restore will work fine with this setting. I'll keep you 
updated.

Regards
Wolfgang

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Doytchin 
Spiridonov
Sent: Tuesday, July 24, 2007 07:29
To: bacula-users
Subject: Re: [Bacula-users] Restore errors

Hello,

Done. I found where the problem is after some more tests (and once again it is 
not in our hardware or OS or broken things). It is where I initially suggested: 
the concurrent jobs.

After the first (and native) configuration we used (concurrent jobs, with 
gzip), we tested the following:

1. concurrent jobs, w/o gzip
- we got similar errors (1 wrong file size across 4 jobs, but 3 of 4 jobs with 
fewer files than expected; the 4th is usually very small - 100 files - and 
never had errors, so I would say 100% of jobs were invalid)

2. no concurrent jobs (Maximum Concurrent Jobs = 1 at dir and sd), w/o gzip
- good news, all restores are OK, no errors, Files Expected and Files Restored 
match!

3. no concurrent jobs WITH gzip
- again OK, all restores are OK, no errors, Files Expected and Files Restored 
match!

So until now we have:
- the problem is not caused by a corrupted file system
- volumes are consistent and bls doesn't show errors
- MySQL is OK (initially 4.1.x now 5.0.37)
- when running concurrent jobs both 2.0.3 and 2.1.28 say backups are OK but 
restores fail with one of the 3 kinds of errors listed below
- when concurrent jobs are turned off everything is OK
- gzip on/off doesn't affect the errors

Once again the 3 types of errors are:

1. some static files (i.e. not log files!) are restored with the wrong (always 
larger) size: the first N bytes match, and the rest is filled with part of 
another file (not sure if this is just a file with a wrong size and some old 
data on the disk appears at the end, or Bacula restores part of another file 
and appends it to the end). The file can be restored correctly if marked alone, 
but then error 3 below is generated (which seems to be just a bogus error). An 
example error is:
---
b0: Restore_b0.d6.int.2007-07-23_22.37.34 Error: attribs.c:410 File size of 
restored file 
/home/bacula/res/b3.2/usr/src/redhat/RPMS/i686/glibc-2.2.5-44.i686.rpm
not correct. Original 3826291, restored 10620921.
---
Whenever this error is present, the second error below (missing files) is 
always present as well, but without additional error messages.

2. a large number of files are missing (while they are present in the catalog 
and selected) - tens of thousands (not sockets or anything else that Bacula 
ignores by default). When this happens, usually an error like this appears (if 
not the first one above):
---
b3: Restore_b3.d6.int.2007-07-23_17.31.47 Fatal error: Record header file index 
42452 not equal record index 0
Storage: Restore_b3.d6.int.2007-07-23_17.31.47 Fatal error: read.c:124 Error 
sending to File daemon. ERR=Connection reset by peer
Storage: Restore_b3.d6.int.2007-07-23_17.31.47 Error: bsock.c:306 Write error 
sending 30 bytes to client:10.2.1.13:36643: ERR=Connection reset by peer
---

3. when a file from error 1 is restored alone it is OK, but another bogus error 
is generated:
---
Storage: Restore_b0.d6.int.2007-07-23_22.57.42 Error: block.c:275 Volume data 
error at 0:3999743252! Wanted ID: "BB02", got "Иnлу".
Buffer discarded.
---
I found that the above number (3999743252) is not present as a block address 
for any block in the volumes, but the same number appears as part of a JobMedia 
record in the database.
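
For anyone who wants to check a suspect file by hand, the symptom in error 1 
(first N bytes match, extra data appended) can be tested with a small script. 
This is only a sketch in Python and not part of Bacula; the function name 
compare_restored and both paths are made up for illustration, and it assumes 
the original file is still available unchanged on the client.

```python
import os


def compare_restored(original_path, restored_path, chunk_size=1 << 20):
    """Return (prefix_matches, extra_bytes) for a restored file.

    prefix_matches is True when the restored file starts with the complete
    content of the original. extra_bytes is restored size minus original
    size, so it is positive when the restored file is larger (error 1).
    Hypothetical helper for illustration, not a Bacula tool.
    """
    original_size = os.path.getsize(original_path)
    restored_size = os.path.getsize(restored_path)

    # A shorter restored file can never contain the whole original.
    prefix_matches = restored_size >= original_size

    with open(original_path, "rb") as orig, open(restored_path, "rb") as rest:
        remaining = original_size
        while prefix_matches and remaining > 0:
            want = min(chunk_size, remaining)
            a = orig.read(want)
            b = rest.read(want)
            if not a or a != b:
                prefix_matches = False
            remaining -= len(a)

    return prefix_matches, restored_size - original_size
```

A result of (True, n) with n > 0 corresponds to the case described in error 1: 
the original content is intact and n bytes of foreign data follow it.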

This is everything in 2.1.28 summarized that popped up as a problem or fact.
(2.0.3 had another bug with bogus errors about sockets' attributes and
2.1.26 had bogus SQL error messages, but those are fixed in 2.1.28).

If anyone wants, feel free to reopen the bug in Mantis (903). I'm not going to 
do so as I am personally disappointed by the attitude "this is not a bug - work 
it out yourself" and the suggestion to send you our servers as a gift to test 
with, plus support fees... nice. Now it's up to you to create better test cases 
to catch more bugs if any.

We will start our backup again w/o concurrent jobs and we will continue to 
monitor restores on a daily basis, as the above tests are just 3 and I agree 
there is a possibility that it was just chance that the latter two tests went 
OK. But it was my suggestion from the beginning that the problem is that Bacula 
damages either database numbers or volume records when concurrent jobs are 
running, and so far the facts proved this.
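
The daily check of Files Expected against Files Restored could be automated 
roughly like this. It is a sketch in Python, assuming the restore job report 
contains one "Files Expected:" and one "Files Restored:" line with the count 
after the colon (possibly with thousands separators). The function names are 
hypothetical, and the exact report layout should be verified against your own 
job output.

```python
import re

# Assumed report layout: "Files Expected:  1,234" and "Files Restored:  1,234",
# each on its own line. Adjust the pattern if your reports differ.
_COUNT_LINE = re.compile(
    r"^[ \t]*Files (Expected|Restored):[ \t]*([\d,]+)[ \t]*$",
    re.MULTILINE,
)


def restore_counts(report_text):
    """Return (expected, restored) as ints, or None for a missing field."""
    counts = {}
    for label, value in _COUNT_LINE.findall(report_text):
        counts[label.lower()] = int(value.replace(",", ""))
    return counts.get("expected"), counts.get("restored")


def restore_is_complete(report_text):
    """True only if both counts were found and they are equal."""
    expected, restored = restore_counts(report_text)
    return expected is not None and expected == restored
```

Matching counts only rule out the missing-files case (error 2); a file restored 
with the wrong size (error 1) still has to be caught from the error messages.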

(!) The workaround for the problem is to switch off concurrent jobs; if you do 
not, the chance that you have invalid backups is high (some 90% in our own 
cases, at least with our servers/OS/configuration; and that is if one does not 
count 100% of backups as wrong, since after differential/incremental backups 
Bacula restores files that were deleted, which is really bad behaviour in many 
cases/services).

Regards

Tuesday, July 24, 2007, 12:15:43 AM:

DL> On 23 Jul 2007 at 21:57, Doytchin Spiridonov wrote:

>> Hello,
>> 
>> I've filed this as a bug, but while Kern couldn't reproduce it he 
>> gave up. So let us find here what could be the problem. There are 
>> actually two problems, they could be linked.

DL> Please.  If anyone can solve the issue given what you supplied, they 
DL> would.  You were asked to supply a reproducible situation.  
DL> Hopefully we can get to that position quickly without further 
DL> unnecessary distractions.

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >>  http://get.splunk.com/ 
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
