Re: [BackupPC-users] BackupPC_dump hangs with: .: size doesn't match (12288 vs 17592185913344)

2011-04-04 Thread Holger Parplies
Hi,

John Rouillard wrote on 2011-03-31 15:20:23 + [[BackupPC-users] 
BackupPC_dump hangs with: .: size doesn't match (12288 vs 17592185913344)]:
> [...]
> I get a bunch of output (the share being backed up /etc on a centos
> 5.5. box) which ends with:
> 
>   attribSet: dir=f%2fetc exists
>   attribSet(dir=f%2fetc, file=zshrc, size=640, placeholder=1)
>   Starting file 0 (.), blkCnt=134217728, blkSize=131072, remainder=0
>   .: size doesn't match (12288 vs 17592185913344)

at first glance, this would appear to be an indication of something I have
been suspecting for a long time: corruption - caused by whatever - in an
attrib file leading to the SIGALRM abort. If I remember correctly, someone
(presumably File::RsyncP) would ordinarily try to allocate space for the file
(though that doesn't seem to make sense, so I probably remember incorrectly)
and either give up when that fails or refrain from trying in the first
place, because the amount is obviously insane.

The weird thing in this case is that we're seeing a directory. There is
absolutely no reason (unless I am missing something) to worry about the
*size* of a directory. The value is entirely file system dependent and
not even necessarily an indication of the *current* number of entries in
the directory. In any case, you restore the contents of a directory by
restoring the files in it, and you (incrementally) back up a directory by
determining whether any files have changed or been added. The *size* of a
directory will not help with that decision.
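A directory's reported size really is just filesystem bookkeeping. A quick stat illustrates the point (the path is only an example; the value you see depends entirely on the filesystem and the directory's history):

```python
import os
import stat

st = os.stat("/etc")  # any directory will do
assert stat.S_ISDIR(st.st_mode)
print(st.st_size)  # e.g. 12288 on ext3/ext4: the size of the directory's
                   # own entry blocks, which can stay large even after
                   # entries are removed -- it says nothing reliable
                   # about the current contents
```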

Then again, the problematic file (or attrib file entry) may or may not be the
last one reported (maybe it's the first one not reported?).

> [...] I have had similar hanging issues before
> but usually scheduling a full backup or removing a prior backup or two
> in the chain will let things work again. However I would like to
> actually get this fixed this time around as it seems to be occurring
> more often recently (on different backuppc servers and against
> different hosts).

I agree with you there. This is probably one of the most frustrating problems
to be encountered with BackupPC, because there is no obvious cause and nothing
obvious to correct (throwing away part of your backup history for no better
reason than "after that it works again" is somewhat unsatisfactory).

The reason not to investigate this matter any further so far seems to have
been that it is usually "solved" by removing the reference backup (I believe
simply running a full backup will encounter the same problem again), because
people tend to want to get their backups back up and running. There are two
things to think about here:

1.) Why does attrib file corruption cause the backup to hang? Is there no
sane(r) way to deal with the situation?
2.) How does the attrib file get corrupted in the first place?

Presuming, of course, that it *is* attrib file corruption - could you please
send me a copy of the attrib file off-list?

> If I dump the root attrib file (where /etc starts) for either last
> successful or the current (partial) failing backup I see:
> 
>   '/etc' => {
> 'uid' => 0,
> 'mtime' => 1300766985,
> 'mode' => 16877,
> 'size' => 12288,
> 'sizeDiv4GB' => 0,
> 'type' => 5,
> 'gid' => 0,
> 'sizeMod4GB' => 12288
>   },

I would expect the interesting part to be the '.' entry in the attrib file for
'/etc' (f%2fetc/attrib of the last successful backup, that is). And I would be
curious about how the attrib file was decoded, because I'd implement decoding
differently from how BackupPC does, though BackupPC's method does appear to be
well tested ;-).

> [...] the last few lines of strace show:
> 
> [...]
>   19368 15:00:38.199634 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout)
> <59.994597>

I believe this is the result of File::RsyncP having given up on the transfer
because of either a failing malloc() or a suppressed malloc(). I'll have to
find some time to check in more detail. I vaguely remember it was a rather
complicated matter, and there was never really enough evidence to support that
corrupted attrib files were really the cause. But I sure would like to get to
the bottom of this :-).

Regards,
Holger

___
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/


Re: [BackupPC-users] BackupPC_dump hangs with: .: size doesn't match (12288 vs 17592185913344)

2011-04-04 Thread Jeffrey J. Kosowsky
Holger Parplies wrote at about 18:59:00 +0200 on Monday, April 4, 2011:
 > Hi,
 > 
 > John Rouillard wrote on 2011-03-31 15:20:23 + [[BackupPC-users] 
 > BackupPC_dump hangs with: .: size doesn't match (12288 vs 17592185913344)]:
 > > [...]
 > > I get a bunch of output (the share being backed up /etc on a centos
 > > 5.5. box) which ends with:
 > > 
 > >   attribSet: dir=f%2fetc exists
 > >   attribSet(dir=f%2fetc, file=zshrc, size=640, placeholder=1)
 > >   Starting file 0 (.), blkCnt=134217728, blkSize=131072, remainder=0
 > >   .: size doesn't match (12288 vs 17592185913344)
 > 
 > at first glance, this would appear to be an indication of something I have
 > been suspecting for a long time: corruption - caused by whatever - in an
 > attrib file leading to the SIGALRM abort. If I remember correctly, someone
 > (presumably File::RsyncP) would ordinarily try to allocate space for the file
 > (though that doesn't seem to make sense, so I probably remember incorrectly)
 > and either gives up when that fails or refrains from trying in the first
 > place, because the amount is obviously insane.
 > 
 > The weird thing in this case is that we're seeing a directory. There is
 > absolutely no reason (unless I am missing something) to worry about the
 > *size* of a directory. The value is absolutely file system dependent and
 > not even necessarily an indication of the *current* amount of entries in
 > the directory. In any case, you restore the contents of a directory by
 > restoring the files in it, and you (incrementally) backup a directory by
 > determining if any files have changed or been added. The *size* of a
 > directory will not help with that decision.
 > 

I too don't understand why rsync would try to treat a directory '.'
like a file. I mean, block count and block size aren't meaningful when
rsyncing a directory (at least according to my understanding of the
rsync algorithm).

Could it possibly be a character encoding issue, where there is some
file (maybe with spaces, non-printable characters, or some
unrecognized character set) that rsync ends up seeing as '.' rather
than as the true file name? For example, I know Windows and Linux have
different concepts of what characters are allowed in file names, and
that cygwin tries to do its best to make sense of the differences but
doesn't always succeed.
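If you want to rule that out, a quick scan for file names containing control characters or undecodable bytes might help. A rough sketch (the path is only an example):

```python
import os

def suspicious_names(path):
    """Return directory entries whose names contain control characters
    or surrogate escapes (Python's marker for undecodable bytes)."""
    bad = []
    for name in os.listdir(path):
        if any(ord(c) < 32 or 0xDC80 <= ord(c) <= 0xDCFF for c in name):
            bad.append(name)
    return bad

print(suspicious_names("/etc"))  # hopefully an empty list
```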

 > Then again, the problematic file (or attrib file entry) may or may
 > not be the last one reported (maybe it's the first one not
 > reported?).

I would suggest some trial and error.
What if you move the directory somewhere else and try a new backup
(using, say, a new machine alias)?
Can you do a binary search to narrow down the error?
In particular, does the problem occur only when doing an incremental?
Does it work if you force a full? Does it work if you make this the
first backup on a new machine alias?
 > 
 > > [...] I have had similar hanging issues before
 > > but usually scheduling a full backup or removing a prior backup or two
 > > in the chain will let things work again. However I would like to
 > > actually get this fixed this time around as it seems to be occurring
 > > more often recently (on different backuppc servers and against
 > > different hosts).
 > 
 > I agree with you there. This is probably one of the most frustrating problems
 > to be encountered with BackupPC, because there is no obvious cause and nothing
 > obvious to correct (throwing away part of your backup history for no better
 > reason than "after that it works again" is somewhat unsatisfactory).
 > 
 > The reason not to investigate this matter any further so far seems to have
 > been that it is usually "solved" by removing the reference backup (I believe
 > simply running a full backup will encounter the same problem again), because
 > people tend to want to get their backups back up and running. There are two
 > things to think about here:
 > 
 > 1.) Why does attrib file corruption cause the backup to hang? Is there no
 > sane(r) way to deal with the situation?
 > 2.) How does the attrib file get corrupted in the first place?
 > 
 > Presuming it *is* attrib file corruption. Could you please send me a copy of
 > the attrib file off-list?
 > 
 > > If I dump the root attrib file (where /etc starts) for either last
 > > successful or the current (partial) failing backup I see:
 > > 
 > >   '/etc' => {
 > > 'uid' => 0,
 > > 'mtime' => 1300766985,
 > > 'mode' => 16877,
 > > 'size' => 12288,
 > > 'sizeDiv4GB' => 0,
 > > 'type' => 5,
 > > 'gid' => 0,
 > > 'sizeMod4GB' => 12288
 > >   },
 > 
 > I would expect the interesting part to be the '.' entry in the attrib file for
 > '/etc' (f%2fetc/attrib of the last successful backup, that is). And I would be
 > curious about how the attrib file was decoded, because I'd implement decoding
 > differently from how BackupPC does, though BackupPC's method does appear to be
 > well tested ;-).
 > 
 > > [...] the last few lines of strace show:
 > > 
 > > [...]

Re: [BackupPC-users] BackupPC_dump hangs with: .: size doesn't match (12288 vs 17592185913344)

2011-04-06 Thread John Rouillard

Sadly I have some bad news.

I would love to add more info to this, but it appears the
@#^$@&#$#% backup decided to stop failing at 10 PM on April
1st. Backups have been working fine since 8-(. Sort of a
reverse April Fools' joke, I guess.

However, my comments are inline below, as I have attempted
to gather more info for the next time this happens.

On Mon, Apr 04, 2011 at 06:59:00PM +0200, Holger Parplies wrote:
> John Rouillard wrote on 2011-03-31 15:20:23 + [[BackupPC-users] 
> BackupPC_dump hangs with: .: size doesn't match (12288 vs 17592185913344)]:
> > [...]
> > I get a bunch of output (the share being backed up /etc on a centos
> > 5.5. box) which ends with:
> > 
> >   attribSet: dir=f%2fetc exists
> >   attribSet(dir=f%2fetc, file=zshrc, size=640, placeholder=1)
> >   Starting file 0 (.), blkCnt=134217728, blkSize=131072, remainder=0
> >   .: size doesn't match (12288 vs 17592185913344)
> 
> at first glance, this would appear to be an indication of
> something I have been suspecting for a long time:
> corruption - caused by whatever - in an attrib file
> leading to the SIGALRM abort. [...]
>
> The weird thing in this case is that we're seeing a
> directory. There is absolutely no reason (unless I am
> missing something) to worry about the *size* of a
> directory. [...] The
> *size* of a directory will not help with that [how to
> restore a directory] decision.
> 
> Then again, the problematic file (or attrib file entry)
> may or may not be the last one reported (maybe it's the
> first one not reported?).
> 
> > [...] I have had similar hanging issues before
> > but usually scheduling a full backup or removing a prior backup or two
> > in the chain will let things work again. However I would like to
> > actually get this fixed this time around as it seems to be occurring
> > more often recently (on different backuppc servers and against
> > different hosts).
> 
> I agree with you there. This is probably one of the most
> frustrating problems to be encountered with BackupPC,
> because there is no obvious cause and nothing obvious to
> correct (throwing away part of your backup history for no
> better reason than "after that it works again" is somewhat
> unsatisfactory).
> 
> The reason not to investigate this matter any further so
> far seems to have been that it is usually "solved" by
> removing the reference backup (I believe simply running a
> full backup will encounter the same problem again),

That is correct; the last good backup was on:

   2011-03-22 22:00 at level 2

the next should have been a level 3 incremental on
2011-03-23. That incremental ran according to the logs but
failed with SIGALRM and is missing from the completed backup
list (I assume because it failed and was not retained). Then
the full backup started on the 24th, as I expected.

> because people tend to want to get their backups back up
> and running. There are two things to think about here:
> 
> 1.) Why does attrib file corruption cause the backup to
> hang? Is there no sane(r) way to deal with the situation?
> 2.) How does the attrib file get corrupted in the first
> place?

The underlying filesystem is on a hardware RAID 5 (3ware)
with four 1 TB disks that is scrubbed/verified weekly.
Because of the size of the backup partition (1.9 TB), I
haven't gotten an fsck to complete on it, but I have no
indication of any memory or disk/RAID-controller issues that
could scramble the data.

> Presuming it *is* attrib file corruption. Could you please
> send me a copy of the attrib file off-list?

I will send you the attrib file for the top level and the
etc directory from the last completed backup before the
problem started.

> > If I dump the root attrib file (where /etc starts) for either last
> > successful or the current (partial) failing backup I see:
> > 
> >   '/etc' => {
> > 'uid' => 0,
> > 'mtime' => 1300766985,
> > 'mode' => 16877,
> > 'size' => 12288,
> > 'sizeDiv4GB' => 0,
> > 'type' => 5,
> > 'gid' => 0,
> > 'sizeMod4GB' => 12288
> >   },
> 
> I would expect the interesting part to be the '.' entry in
> the attrib file for '/etc' (f%2fetc/attrib of the last
> successful backup, that is).

If I dump .../293/f%2fetc/etc attrib file:

   -rw-r- 2 backup dev 1094 Mar 22 22:00 attrib

for the incremental from 2011-03-22 (the last working one
prior to the full that started working on the 1st of April),
I don't see any entry for '.'. It looks like only
subdirectories and files are listed there. There is one
oddity, though. One of the files is listed in the attrib
file as:

  'vgconfig_backup_VolGroup00' => {
'uid' => 0,
'mtime' => 1300756921,  ## Tue Mar 22 01:22:01 2011
'mode' => 33152,
'size' => 2945,
'sizeDiv4GB' => 0,
'type' => 0,
'gid' => 0,
'sizeMod4GB' => 2945
  },

looks like this in ls -l:

  -rw-r-  2 backup dev 1085 Mar 23 22:00 fvgconfig_backup_VolGroup00

(note the March 23 mod time). With only two links, I am not
sure what is happening there. Tha
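(Editorially, the 1085 vs. 2945 discrepancy is expected when compression is on: ls shows the compressed pool file, while the attrib 'size' is the original size. The link count is the more interesting bit, since BackupPC hardlinks each backup's copy into the pool. A toy illustration of how those counts behave, using temp files rather than real BackupPC paths:)

```python
import os
import tempfile

d = tempfile.mkdtemp()
pool_file = os.path.join(d, "poolfile")
with open(pool_file, "w") as f:
    f.write("data")
print(os.stat(pool_file).st_nlink)  # 1: just the pool entry itself

# BackupPC links each backup's copy to the pool rather than storing the
# data again, so every backup referencing the file bumps st_nlink by one.
backup_copy = os.path.join(d, "f_example")
os.link(pool_file, backup_copy)
print(os.stat(pool_file).st_nlink)  # 2: pool entry + one backup
```

So a link count of 2 would mean the file is referenced by exactly one backup tree.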

Re: [BackupPC-users] BackupPC_dump hangs with: .: size doesn't match (12288 vs 17592185913344)

2011-04-06 Thread John Rouillard
On Mon, Apr 04, 2011 at 09:49:55PM -0400, Jeffrey J. Kosowsky wrote:
> Holger Parplies wrote at about 18:59:00 +0200 on Monday, April 4, 2011:
>  > Hi,
>  > 
>  > John Rouillard wrote on 2011-03-31 15:20:23 +
>  > [[BackupPC-users] BackupPC_dump hangs with: .: size
>  > doesn't match (12288 vs 17592185913344)]: 
>  > > [...]
>  > > I get a bunch of output (the share being backed up
>  > /etc on a centos 5.5. box) which ends with:
>  > > 
>  > >   attribSet: dir=f%2fetc exists
>  > >   attribSet(dir=f%2fetc, file=zshrc, size=640, placeholder=1)
>  > >   Starting file 0 (.), blkCnt=134217728, blkSize=131072, remainder=0
>  > >   .: size doesn't match (12288 vs 17592185913344)
>  > 
>  > at first glance, this would appear to be an indication
>  > of something I have been suspecting for a long time:
>  > corruption - caused by whatever - in an attrib file
>  > leading to the SIGALRM abort. If I remember correctly,
>  > someone (presumably File::RsyncP) would ordinarily try
>  > to allocate space for the file (though that doesn't
>  > seem to make sense, so I probably remember incorrectly)
>  > and either gives up when that fails or refrains from
>  > trying in the first place, because the amount is
>  > obviously insane.
>  > 
>  > The weird thing in this case is that we're seeing a
>  > directory. There is absolutely no reason (unless I am
>  > missing something) to worry about the *size* of a
>  > directory. The value is absolutely file system
>  > dependent and not even necessarily an indication of the
>  > *current* amount of entries in the directory. In any
>  > case, you restore the contents of a directory by
>  > restoring the files in it, and you (incrementally)
>  > backup a directory by determining if any files have
>  > changed or been added. The *size* of a directory will
>  > not help with that decision.
>  > 
> 
> I too don't understand why rsync would try to treat a
> directory '.'  like a file. I mean block count and block
> size aren't meaningful when rsyncing a directory (at least
> according to my understanding of the rsync algorithm).
> Could it possibly be a character encoding issue where
> there is some file (maybe with spaces, non-printable
> characters, or some non-recognized font) that rsync ends
> up seeing as '.' rather than as the true file name. For
> example, I know Windows and Linux have different concepts
> of what characters are allowed in file names and that
> cygwin tries to do its best to make sense of the
> differences but doesn't always succeed.

Perhaps, but this is a bog-standard CentOS (Red Hat) 5.5 system.
 
>  > Then again, the problematic file (or attrib file entry) may or may
>  > not be the last one reported (maybe it's the first one not
>  > reported?).
> 
> I would suggest some trial-and-error.
> What if you move the directory somewhere else and try a new backup
> (using say a new machine alias)?
> Can you do a binary-search to narrow down the error?

Sadly, things started working again. However, I can answer some of
your questions.

> In particular, is the problem only when doing an incremental? Does it
> work if you force a full?

No and no. After backup 293 on March 22, which completed a
level 2 incremental successfully, the level 3 incremental on
March 23rd failed with SIGALRM. Then the full backups that
started on the 24th failed with SIGALRMs until April 1.

Some of the fulls were manually forced fulls via the web
interface and one was an automatically scheduled full.

I believe the traces/debugging from my original post were all from a
full dump using BackupPC_dump.

> Does it work if you make this the first backup on a new
> machine alias?

How would that work? The last valid backup was an incremental that
would have to be turned into a full somehow, right?

-- 
-- rouilj

John Rouillard   System Administrator
Renesys Corporation  603-244-9084 (cell)  603-643-9300 x 111
