Greetings,
I just spent the last seven hours digging deep into the internals of
BackupPC to figure out why a backup was failing. I finally managed to
fix the symptom of the problem, but not the root cause. I'm writing to
y'all about it because (a) perhaps my narrative will be enough for
someone who is more familiar with the code to find and fix the root
cause, and (b) the visible manifestation of the problem was the dreaded
"backup failing with SIGPIPE" problem to which there are many
references, but few solutions, on the Web. Since not many people are as
stubborn as I am, I may be the first person who actually debugged this
all the way to the end, so it is possible that other people who are
having the SIGPIPE problem are having it for the same reason as I was
and don't know it.
The backup was being done using rsync over ssh. I ran strace on the
rsync process on the client and discovered that, right before exiting,
rsync was claiming it was unable to allocate memory ("out of memory in
receive_sums"), which was strange because the client machine didn't
appear to be anywhere close to running out of RAM.
I ended up recompiling rsync on the client with debugging symbols and
without optimization, and then attaching to it in gdb while running the
backup, to find out just why it was claiming to be unable to allocate
memory. From this, I discovered that it was receiving a file size from
the other end (i.e., from File::RsyncP) which was so outrageously large
that it didn't even try to allocate the memory to hold the checksums for
the file. In particular, this code in rsync's util.c came into play,
and the "return NULL" statement (flagged with a comment below) was
actually being executed:
    #define MALLOC_MAX 0x40000000

    void *_new_array(unsigned int size, unsigned long num)
    {
        if (num >= MALLOC_MAX/size)
            return NULL;    /* <-- this is the return that was being taken */
        return malloc(size * num);
    }
From this, I went digging into File::RsyncP on the server, and from
there into BackupPC::Xfer::RsyncFileIO, and from there into
BackupPC::Attrib.
What I discovered, to my surprise, was that in the last successful
backup of this host, somehow a single attribute structure in a single
attrib file got corrupted. The symptom of the corruption was that the
sizeDiv4GB value for that file was set to 4294967295, which if I'm not
mistaken is -1 cast to an unsigned 32-bit integer (2^32 - 1). The actual size of
the file in question was only 300 bytes, so obviously sizeDiv4GB should
have been set to 0. This backup was made with BackupPC 2.1.2.
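To connect that back to the rsync failure: if I've read BackupPC::Attrib
correctly, the 64-bit file size is reconstructed from the attrib file as
sizeDiv4GB * 2^32 + sizeMod4GB, so a sizeDiv4GB of 4294967295 made
File::RsyncP believe the file was roughly 1.8 * 10^19 bytes, and the
block-checksum count for a "file" that size is far past the
MALLOC_MAX/size limit in _new_array() above. That is why rsync got NULL
back and hit "out of memory in receive_sums". Just the arithmetic (this
is not BackupPC code, and the reconstruction formula is my reading of
the format, so take it with a grain of salt):

    #!/usr/bin/perl
    # The arithmetic behind the failure -- not BackupPC code.  Assumes the
    # attrib format stores the 64-bit size as sizeDiv4GB * 2^32 + sizeMod4GB.
    use strict;
    use warnings;

    my $sizeDiv4GB = 4294967295;   # the corrupted value, i.e. -1 as a 32-bit unsigned
    my $sizeMod4GB = 300;          # hypothetical remainder; it barely matters here
    my $size = $sizeDiv4GB * 2**32 + $sizeMod4GB;

    printf "claimed size: %.3e bytes (the real file was 300 bytes)\n", $size;

    # rsync's _new_array() (quoted above) refuses any request once
    # num >= MALLOC_MAX/element_size, i.e. once the allocation would reach
    # roughly 1 GB, so a checksum array sized from a ~1.8e19-byte file gets
    # NULL back and receive_sums() reports "out of memory".
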
I wrote a little script to read in the attrib file with
BackupPC::Attrib, fix the broken value, and write the corrected attrib
file back out. After doing this, I was able to run a new full backup to
successful completion.
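In case it helps anyone hitting the same corruption, the script boils
down to something like the sketch below. I'm writing this from memory
rather than pasting the actual script, so the BackupPC::Attrib method
names (new/read/get/set/write), the attribute keys, the library path,
and the compression level are all things to verify against
lib/BackupPC/Attrib.pm in your own installation, and you should of
course copy the attrib file somewhere safe before touching it:

    #!/usr/bin/perl
    # Rough sketch of an attrib-repair script -- verify the BackupPC::Attrib
    # API against your installed Attrib.pm before running anything like this,
    # and keep a copy of the original attrib file.
    use strict;
    use warnings;
    use lib "/usr/share/backuppc/lib";   # adjust to wherever BackupPC's lib lives
    use BackupPC::Attrib;

    # Arguments: the backup directory containing the attrib file, the file
    # name as it appears inside the attrib file, and the file's real size.
    my($dir, $fileName, $realSize) = @ARGV;
    die "usage: $0 dir fileName realSize\n" if !defined($realSize);

    # The compress level has to match how the attrib file was written
    # ($Conf{CompressLevel} in config.pl); 3 here is just an example.
    my $attr = BackupPC::Attrib->new({ compress => 3 });
    $attr->read($dir) or die "can't read attrib file in $dir\n";

    my $entry = $attr->get($fileName)
        or die "no entry for $fileName in this attrib file\n";

    # Replace the bogus size, and recompute the split fields in case this
    # version of Attrib.pm stores them directly instead of deriving them.
    $entry->{size}       = $realSize;
    $entry->{sizeDiv4GB} = int($realSize / 4294967296);   # 0 for my 300-byte file
    $entry->{sizeMod4GB} = $realSize % 4294967296;

    $attr->set($fileName, $entry);
    $attr->write($dir) or die "can't write attrib file back to $dir\n";
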
I checked the logs for the backup in which the attrib structure got
corrupted, and there was no indication in them that anything had gone
wrong.
I hope the seven hours I spent digging into this ends up being useful to
someone :-).
Thanks,
Jonathan Kamens
Operations Manager / Principal Engineer
Tamale Software
201 South Street, Floor 3
Boston, MA 02211
(617) 261-0264 ext. 133