On 7/16/19 4:27 PM, Adam Goryachev wrote:
On 17/7/19 4:22 am, David Koski wrote:

Regards,
David Koski
dko...@sutinen.com

On 7/8/19 6:16 PM, Adam Goryachev wrote:
On 9/7/19 10:23 am, David Koski wrote:
I am trying to back up about 24TB of data that has millions of files.  It takes a day or two before it starts backing up and then stops with an error.  I ran a CLI dump, captured the output, and can see the error message:

Can't write 32780 bytes to socket
Read EOF: Connection reset by peer
Tried again: got 0 bytes
finish: removing in-process file Shares/Archives/<path-removed>/COR_2630.png
Child is aborting
Done: 589666 files, 1667429241846 bytes
Got fatal error during xfer (aborted by signal=PIPE)
Backup aborted by user signal
Not saving this as a partial backup since it has fewer files than the prior one (got 589666 and 589666 files versus 4225016)
dump failed: aborted by signal=PIPE

This backup is doing rsync over ssh.  I enabled SSH keepalive, but the failure does not appear to be due to an idle network.  It also does not appear to be a random network interruption, because the time it takes to fail is pretty consistent: about three days.  I'm stumped.
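For what it's worth, the keepalive lives on the ssh command line that BackupPC runs; a minimal sketch, assuming the stock 3.x $Conf{RsyncClientCmd} (not necessarily how it is configured here, and the interval values are just examples):

# Sketch only: ssh-level keepalives added to the stock BackupPC 3.x
# rsync-over-ssh client command in config.pl.  ServerAliveInterval and
# ServerAliveCountMax are standard OpenSSH client options.
$Conf{RsyncClientCmd} = '$sshPath -q -x'
                      . ' -o ServerAliveInterval=60 -o ServerAliveCountMax=10'
                      . ' -l root $host $rsyncPath $argList+';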


Did you check:

$Conf{ClientTimeout} = 72000;

Also, what version of rsync on the client, what version of BackupPC on the server, etc?
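If that is only set globally, it can also be raised for just this host; a rough sketch (the per-host file location and the value are assumptions, BackupPC 3.x layout):

# Sketch: per-host override, e.g. in the host's .pl file under the config
# directory (for 3.x typically <ConfDir>/pc/<hostname>.pl).  Value is an example.
$Conf{ClientTimeout} = 259200;   # 72 hours, in seconds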

I think BPC v4 handles this scenario significantly better; in fact, a server I used to have trouble with on BPC 3.x all the time has since been combined with 4 other servers (so 4x the number of files and total size of data) and BPC4 handles it easily.


Thank you all for your input.  More information:

rsync version on client: 3.0.8 (Windows)
rsync version on server: 3.1.2 (Debian)
BackupPC version: 3.3.1
$Conf{ClientTimeout} = 604800

I just compared the output of two verbose BackupPC_dump runs and it looks like files are reported as backed up even though they are not.  For example, this line appears in the logs of both backup runs:

create   644  4616/545  1085243184 <path-removed>/<name-removed>3412.zip

I checked and the file's timestamp is from 2018.  The log files are full of these.  I checked the real-time clocks on both systems and they are correct.  There are also files that have been backed up that do not appear in the logs.

I suspect there are over ten million files, but I don't have a good way of telling right now.  Oddly, there are about 500,000 files backed up according to the log captured from BackupPC_dump and almost the same number actually backed up and found in pc/<host>/0, but they are different subsets of files.  I have been tracking memory and swap usage on the server and see no issues.
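If it is any use, a rough way to count what actually landed in the server-side tree is a short File::Find walk; a sketch only, with the TopDir path below assumed (Debian default) and <host> left as a placeholder:

#!/usr/bin/perl
# Sketch: count regular files under the last backup tree on the server.
# Note: BackupPC 3.x keeps an "attrib" file per directory, so this slightly
# overcounts the real number of backed-up files.
use strict;
use warnings;
use File::Find;

my $dir = '/var/lib/backuppc/pc/<host>/0';   # assumed TopDir, <host> placeholder
my $count = 0;
find(sub { $count++ if -f }, $dir);
print "$count files under $dir\n";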

Is this a possible bug in BackupPC 3.3.1?

Please don't top-post if you can avoid it, at least not on mailing lists.

I just realised:

Read EOF: Connection reset by peer

This is a networking issue, not BackupPC. In other words, something has broken the network connection (in the middle of transferring a file, so I would presume it isn't due to some idle timeout, dropped NAT entry, etc). BackupPC has been told by the operating system that the connection is no longer valid, and so it has "cleaned up" by removing the in-progress file (partial).

I just completed another backup cycle that failed in the same manner but this time with a continuous ping with captured output.  It didn't miss a beat.


It takes a day to start (presumably reading ALL the files on the client takes this long; you could improve disk performance or add RAM on the client to speed this up).

You might be right.  But it's not a show stopper.


"and then stops with an error" - is that on the first file, or are some files successfully transferred? Is that the first large file? Does it always fail on the same file (seems not, since it previously got many more).

Good points.  Confirmed: not the first file (over 600,000 files transferred first), not a large file (less than 20MB), and it does not always fail on the same file or directory.


I'm thinking you need to check and/or improve network reliability, and make sure both client and server are not running out of RAM (mainly the BackupPC client, where the OOM killer might kill the rsync process).  Check your system logs on both client and server, and/or watch top output on both systems during the backup.

The network did not miss a beat and generally appears responsive; it has been checked.  Client and server RAM usage are tracked in Zabbix and are not close to running out.  The only curious thing is that swap is running out on the client (Windows Server 2016) even with 10GB of RAM available, but it still has about 2GB left before the crash.  Server system logs (kern.log, syslog) show no signs of issues.


Try backing up other systems, try backing up a smaller subset (exclude some large directories, and then add them back in if you complete a backup successfully).

That is a good idea.  I'll try adding incrementally to the data backed up.
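A possible way to stage that, as a sketch only (BackupPC 3.x syntax; the directory names below are made up, and the share is assumed to be "Shares" based on the paths in the logs):

# Sketch: back up only a subset of the share first, then widen it once a
# backup completes.  Directory names are illustrative.
$Conf{BackupFilesOnly} = {
    'Shares' => ['/Users', '/Projects'],
};

# Alternatively, exclude the largest trees first and drop the excludes later:
# $Conf{BackupFilesExclude} = {
#     'Shares' => ['/Archives'],
# };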


Overall, I would advise upgrading to BPC v4.x; it handles backups of systems with a huge number of files much better.

If incrementally adding doesn't solve the problem I'll try an upgrade.

Thank you,
David Koski


This doesn't look like a BPC bug; maybe a network driver, the kernel, or something else, but not BPC (IMHO).

Regards,
Adam




_______________________________________________
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
