Hello,

I filed this as a bug, but Kern couldn't reproduce it and gave up.
So let us try to find out here what the problem could be. There are
actually two problems, and they could be linked.

Here is the history:
Initially we were using 2.0.3. After running backups for several weeks I
wanted to restore a file and was surprised that I couldn't. It
was listed in the catalog, and I could select it and run a restore job,
but the file was not restored. To investigate, I ran a full
restore job and found that several files were missing from the
directory containing that file. Error messages similar to the
one in my first post here were also present. In addition, there was a
big difference between the number of marked files and the number of
files actually restored (and these were not hard links, sockets or
anything else that Bacula ignores: in one of the tests the whole
/home/ directory was missing).
After that we started a week of tests (full/differential/incremental
backups, restores, etc.). Similar errors kept occurring, at random
places/files: sometimes there are errors, sometimes not. We haven't run
enough tests to conclude when exactly this happens. But IT
HAPPENS, and as a result we don't have a reliable backup. I know a lot
of people run backups w/o testing restores, so (if this is
not related to our specific setup) they would see this problem only in
an emergency, which doesn't happen often. Anyway, here
are the hardware and setup details:

*** Bacula: 2.1.28 on all servers.
Yesterday we cleaned everything (Bacula DB and volumes) and
installed the latest beta *2.1.28* everywhere (note that this is not a
problem of the beta, as we first discovered it with 2.0.3). 2.1.28 fixed
2 other problems we discovered with 2.0.3, but this one is still
there.
Director and most of the servers are 64 bit, two of the servers are 32
bit.
*** OS: Linux CentOS 4.5
*** MySQL: 5.0.37
*** Servers (all are almost identical): Supermicro, PDSME - Intel
E7230 (Mukilteo) chipset, Intel Pentium D 930 Dual Core 3.0GHz, 3Ware
IDE RAID Controller Escalade 9550SX. Servers have 4 disks each in RAID
1+0, only the Bacula server has many disks in RAID 5.
*** Some servers are plain CentOS, some have Xen with virtual servers;
the Bacula server itself also has Xen, but Bacula is running in
Dom0 and no other virtual machines are running on it at this time.
*** The servers with Xen also have LVM.
*** We run concurrent jobs (and I guess this is where the Bacula
problem is).
*** GZIP compression is enabled.
*** We save volumes on hard disk; their size is set to 4480MB.

--- How to get an error:
As we initially discovered the error after several weeks of backups,
we guessed that it could be caused by a wrong setting of
Volume Retention or another retention time on our side, so that some
files were purged.

We started everything from zero again, and after 3 days (it happened
that the first backup was Full, the next Differential and the last
Incremental) we performed a test and the error happened again! So we
were sure it is not caused by some files being purged accidentally.

After that we could get the error even after just a full backup,
by trying to restore immediately after it finished.

Yesterday we cleaned everything again and compiled (from SRPMs) the
latest 2.1.28.

We again ran full backups (again all as concurrent jobs), and the errors
described here happen when we try to restore files from every job
(except one that has just 150 files).

So there are two problems:
- sometimes a file is restored with a larger size, while the first
part of the restored file matches the original file exactly (these are
not log files or other dynamic files). This happens rarely (~one case
per 5 jobs).
- sometimes not all files are restored, and tens of thousands are
missing, for example:
  Files Expected:         190,718
  Files Restored:         166,097
This happens more often (~one case per 2 jobs).
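
For anyone who wants to check the first problem on their own restores,
here is a minimal Python sketch (not part of Bacula; the function name
and the returned labels are made up for illustration). It assumes the
original file is still available and unchanged, and reports whether the
restored copy is the original plus extra trailing data:

```python
import os


def classify_restored_file(original_path, restored_path, chunk_size=1 << 20):
    """Compare a restored file against its (unchanged) original.

    Returns one of the illustrative labels:
    "identical", "original_plus_extra", "truncated", "different".
    """
    orig_size = os.path.getsize(original_path)
    rest_size = os.path.getsize(restored_path)

    # Compare only the part both files have in common, chunk by chunk.
    remaining = min(orig_size, rest_size)
    with open(original_path, "rb") as fo, open(restored_path, "rb") as fr:
        while remaining > 0:
            n = min(chunk_size, remaining)
            a = fo.read(n)
            b = fr.read(n)
            if a != b:
                return "different"
            if not a:
                break  # file shrank while reading; stop comparing
            remaining -= len(a)

    # The common prefix matches, so the sizes decide the result.
    if rest_size == orig_size:
        return "identical"
    if rest_size > orig_size:
        return "original_plus_extra"
    return "truncated"
```

A result of "original_plus_extra" corresponds to the first problem
described above: the first N bytes match and something else follows.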

Note that once the error happens we can reproduce it on every restore:
at the same place, for the same file and with the same number of
missing files (i.e. this is not a problem of the restore; it is most
likely a problem of the volumes).
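
A rough way to verify this, assuming the source tree is still unchanged
and two restore runs were written to separate directories, is to list
the paths missing from each restore and compare the lists. The Python
sketch below is only an illustration (the helper names are
hypothetical), and it counts only what os.walk reports as non-directory
entries, which is not necessarily how Bacula counts files:

```python
import os


def relative_paths(root):
    """Return the set of non-directory entries under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def missing_after_restore(source_root, restore_root):
    """Return the sorted paths present in source_root but absent in restore_root."""
    return sorted(relative_paths(source_root) - relative_paths(restore_root))


def same_files_missing(source_root, restore_a, restore_b):
    """True if two restore runs are missing exactly the same paths."""
    return (missing_after_restore(source_root, restore_a)
            == missing_after_restore(source_root, restore_b))
```

If same_files_missing() is true for two restores of the same job, that
is consistent with the data being wrong in the volume rather than with
a random failure during restore.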

Our planned tests:
1. we will do the same (concurrent jobs) but w/o GZIP
2. if it happens again we will set max jobs to 1 so every job runs
alone, because AFAIR we didn't get errors in testing when we ran
just one full backup job. It always happens when we run several at
once (but I am not 100% sure, that's why we will test this)
3. if it still happens we will run with a normal kernel (to exclude
the influence of Xen)
4. last, we will try w/o LVM (which would be harder)

Regards
P.S. sorry for my English :)


Monday, July 23, 2007, 9:03:45 PM:

RN> -----BEGIN PGP SIGNED MESSAGE-----
RN> Hash: SHA1

RN> Doytchin Spiridonov wrote:
>> Hello,
>> 
>> trying to identify a bug in bacula and/or our system setup.
>> 
>> Is there anyone that on restore had errors like this:
>> 
>> Error: attribs.c:410 File size of restored file
>> /home/bacula/res/b3/usr/src/redhat/RPMS/i686/glibc-2.2.5-44.i686.rpm
>> not correct. Original 3826291, restored 10620921.
>> 
>> - the file is not a log file or any file that has changed during the
>> backup (in which cases an error like the one above should be normal)
>> 
>> - the wrong file size is always larger that the original; if we cut
>> the first N bytes, where the N is the correct file size, the original
>> and restored files match; we noted that the appended data is part of
>> another file from the backup, not a garbage data. Note that this other
>> file (from which some part has been appended to the file with wrong
>> size) is restored correctly, so the only problem is wrong file size
>> decision by bacula and reading further than its end (seems this is
>> some internal buffer of Bacula as the data is stored in the volumes
>> using GZIP and just reading further would break everything and the
>> appended data should be garbage, not unzipped data).

RN> This has been brought up several times within the last week, but never
RN> with the explanation and examination. I wonder if some of the others who
RN> have experienced it (I do not know their names -- hopefully they can
RN> chime in) can do the same thing for us. This is potentially serious,
RN> seems like, if it is a widespread problem.

RN> I think if the others can verify it, this should also be copied to
RN> Bacula devel. I think I will try a large restore of my own today to see
RN> what happens.

RN> Please give the rest of the details of your setup, however -- you don't
RN> even include the Bacula version, and that is a very basic piece of
RN> information. Operating system (presumably RedHat Linux from the file you
RN> backed up, but who knows), architecture... all would be useful.


_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users