Matt:

Thanks for the help.

Upgrading will be done, but not really an option at this point, unless of course, that is the fix. I'm a limited resource and other stuff is backing up.

I made my original conclusion on the ssh issue since:

Server a backs up eight servers, say b through i, with Server b acting as an ssh tunnel for servers c - i. And Server b backs up Server A and that works fine. So, the connection between the two shouldn't be an issue.

All of these worked for the two years I've been here. But about 2 weeks ago, one of those servers failed. Since then about another two fail as well.

I don't think this is an disk space issue. I've checked that, and the wrapper script that does the rsync first deletes off older backups.

Here is an strace from server A.


11242 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
11242 write(3, "2008/12/05 08:04:55 [11242] recv"..., 92) = 92
11242 write(1, "recv_file_name(usr/share/icons/L"..., 64) = 64
11242 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
11242 read(6, "9\0\0\t", 4)             = 4
11242 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
11242 read(6, "make_file(1,usr/share/icons/Loco"..., 57) = 57
11242 time(NULL)                        = 1228482295
11242 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
11242 write(3, "2008/12/05 08:04:55 [11242] make"..., 85) = 85
11242 write(1, "make_file(1,usr/share/icons/Loco"..., 57) = 57
11242 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
11242 read(6, "\f\0\0\7", 4)            = 4
11242 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
11242 read(6, "\276&\5p.pngt\1\0\0", 12) = 12
11242 time(NULL)                        = 1228482295
11242 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
11242 write(3, "2008/12/05 08:04:55 [11242] recv"..., 88) = 88
11242 write(1, "recv_file_name(usr/share/icons/L"..., 60) = 60
11242 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
11242 read(6, ":\0\0\t", 4)             = 4
11242 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
11242 read(6, "make_file(1,usr/share/icons/Loco"..., 58) = 36
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0} <unfinished ...>
11243 <... select resumed> )            = 1 (in [4])
11243 read(4, "", 8192)                 = 0
11243 rt_sigaction(SIGWINCH, NULL, {0x8000b6d0, [], 0}, 8) = 0
11243 rt_sigaction(SIGWINCH, {SIG_DFL}, NULL, 8) = 0
11243 close(5)                          = 0
11243 close(6)                          = 0
11243 close(7)                          = 0
11243 munmap(0xb7df7000, 1150976)       = 0
11243 ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfe9e178) = -1 EINVAL (Invalid argument) 11243 fcntl64(0, F_GETFL) = 0x802 (flags O_RDWR| O_NONBLOCK)
11243 fcntl64(0, F_SETFL, O_RDWR)       = 0
11243 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfe9e178) = -1 EINVAL (Invalid argument) 11243 fcntl64(1, F_GETFL) = 0x802 (flags O_RDWR| O_NONBLOCK)
11243 fcntl64(1, F_SETFL, O_RDWR)       = 0
11243 ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, {B9600 opost isig icanon echo ...}) = 0
11243 write(2, "Connection to localhost closed b"..., 48) = 48
11243 gettimeofday({1228483264, 414090}, NULL) = 0
11243 shutdown(4, 2 /* send and receive */) = 0
11243 close(4)                          = 0
11243 brk(0x81fce000)                   = 0x81fce000
11243 exit_group(-1)                    = ?
11242 <... select resumed> )            = 1 (in [6], left {50, 782000})
11242 --- SIGCHLD (Child exited) @ 0 (0) ---
11242 waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 255}], WNOHANG) = 11243
11242 waitpid(-1, 0xbfc2bab0, WNOHANG)  = -1 ECHILD (No child processes)
11242 sigreturn()                       = ? (mask now [])
11242 read(6, "", 22)                   = 0
11242 time(NULL)                        = 1228483264
11242 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
11242 write(3, "2008/12/05 08:21:04 [11242] rsyn"..., 109) = 109
11242 write(2, "rsync: connection unexpectedly c"..., 80) = 80
11242 write(2, "\n", 1)                 = 1
11242 rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0
11242 rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0
11242 waitpid(11243, 0xbfc2bd98, WNOHANG) = -1 ECHILD (No child processes)
11242 kill(11243, SIGUSR1)              = -1 ESRCH (No such process)
11242 time(NULL)                        = 1228483264
11242 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
11242 write(3, "2008/12/05 08:21:04 [11242] rsyn"..., 100) = 100
11242 write(2, "rsync error: unexplained error ("..., 71) = 71
11242 write(2, "\n", 1)                 = 1
11242 time(NULL)                        = 1228483264
11242 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
11242 write(3, "2008/12/05 08:21:04 [11242] _exi"..., 97) = 97
11242 write(1, "_exit_cleanup(code=12, file=io.c"..., 69) = 69
11242 exit_group(255)                   = ?


The server being backed up doesn't run rsync as a daemon. I've read in the rsync docs that backups can be done initiated from the server with out the client running rsync as a daemon. For this case, I don't think I can get an strace. But, if possible, let me know and I will track down how to do that.

I also tried to back-up another server, and captured an strace from Server A. Here is that trace.

The big difference here was that I had to ^C the process after about an hour. Usually, it looks like ssh shuts down the tunnel after a period of 15 minutes.

4872  <... select resumed> )            = 2 (in [4], out [6])
4891  select(7, [6], [], NULL, {60, 0} <unfinished ...>
4872 write(6, "\374\17\0\7glade2\0\f.xpm\0\347images\0\6/yelp\0"..., 16384 <unfinished ...>
4891  <... select resumed> )            = 1 (in [6], left {60, 0})
4872  <... write resumed> )             = 16384
4891  read(6,  <unfinished ...>
4872  read(4,  <unfinished ...>
4891 <... read resumed> "urrent.desktop\0\5win95.desktop\0\3756"..., 4092) = 4092 4872 <... read resumed> "\243\36c\274aVv\324\16\307z \347\17\274\247\34\331\327\333X\256\177\222\360\246\"\v\301\"pA%"..., 8192) = 8192
4891  select(7, [6], [], NULL, {60, 0} <unfinished ...>
4872  select(8, [4 5], [], NULL, NULL <unfinished ...>
4891  <... select resumed> )            = 1 (in [6], left {60, 0})
4872  <... select resumed> )            = 1 (in [4])
4891  read(6,  <unfinished ...>
4872  read(4,  <unfinished ...>
4891  <... read resumed> "\374\17\0\7", 4) = 4
4872 <... read resumed> "\270\335\204#\373QM\217U\225K \241\22\2750\357\256\2504\215\206L\255M\264\364\362\267\366D \335\350"..., 8192) = 3040
4891  select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891  read(6, "t-preferences.glade2\0\1asciitable"..., 4092) = 4092
4891  select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891  read(6, "\374\17\0\7", 4)         = 4
4891  select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4872  select(8, [4 5], [6], NULL, NULL <unfinished ...>
4891  read(6,  <unfinished ...>
4872  <... select resumed> )            = 1 (out [6])
4891 <... read resumed> "glade2\0\f.xpm\0\347images\0\6/yelp\0\5/ em"..., 4092) = 4092 4872 write(6, "\374\17\0\7ORWARDING_DENIED\0\1TP_DISABLE"..., 16384 <unfinished ...>
4891  select(7, [6], [], NULL, {60, 0} <unfinished ...>
4872  <... write resumed> )             = 16384
4891  <... select resumed> )            = 1 (in [6], left {60, 0})
4872  select(8, [4 5], [], NULL, NULL <unfinished ...>
4891  read(6, "\374\17\0\7", 4)         = 4
4891  select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891  read(6, "t-config-nfs.png\0\371test.py\0\0mainW"..., 4092) = 4092
4891  select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891  read(6, "\374\17\0\7", 4)         = 4
4891  select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891  read(6, "RWARDING_DENIED\0\1TP_DISABLED\0\3FA"..., 4092) = 4092
4891  select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891  read(6, "\374\17\0\7", 4)         = 4
4891  select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891  read(6, "NG_DENIED\0\1TP_DISABLED\0\3FAILURE\0"..., 4092) = 4092
4891  select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891  read(6, "\374\17\0\7", 4)         = 4
4891  select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891  read(6, "ORWARDING_DENIED\0\1TP_DISABLED\0\3F"..., 4092) = 4092
4891  select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891  read(6, "\374\17\0\7", 4)         = 4
4891  select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "elogs\0\377suexec\0\0certwatch\0\0aepdae"..., 4092) = 4092
4891  select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891  read(6, "\374\17\0\7", 4)         = 4
4891  select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "slinux\0\0talk\0\1elnet\0\0ime\0\377wget\0\0"..., 4092) = 4092
4891  select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891  read(6, "\374\17\0\7", 4)         = 4
4891  select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891  read(6, "changethehostname\0\0einit_wrapper"..., 4092) = 4092
4891  select(7, [6], [], NULL, {60, 0} <unfinished ...>
4871  <... select resumed> )            = 0 (Timeout)
4871  select(5, [4], [], NULL, {60, 0} <unfinished ...>
4891  <... select resumed> )            = 0 (Timeout)
4891  select(7, [6], [], NULL, {60, 0} <unfinished ...>
4871  <... select resumed> )            = 0 (Timeout)
4871  select(5, [4], [], NULL, {60, 0} <unfinished ...>
4891  <... select resumed> )            = 0 (Timeout)
4891  select(7, [6], [], NULL, {60, 0} <unfinished ...>
4871  <... select resumed> )            = 0 (Timeout)

.
.
same message another 100 times
.
.

4871  <... select resumed> )            = 0 (Timeout)
4871  select(5, [4], [], NULL, {60, 0} <unfinished ...>
4891  <... select resumed> )            = 0 (Timeout)
4891  select(7, [6], [], NULL, {60, 0} <unfinished ...>
4871  <... select resumed> )            = 0 (Timeout)
4871  select(5, [4], [], NULL, {60, 0} <unfinished ...>
4891  <... select resumed> )            = 0 (Timeout)
4891  select(7, [6], [], NULL, {60, 0} <unfinished ...>
4871  <... select resumed> )            = 0 (Timeout)
4871  select(5, [4], [], NULL, {60, 0} <unfinished ...>
4891  <... select resumed> )            = 0 (Timeout)
4891  select(7, [6], [], NULL, {60, 0} <unfinished ...>
4871  <... select resumed> )            = 0 (Timeout)
4871  select(5, [4], [], NULL, {60, 0} <unfinished ...>
4891  <... select resumed> )            = 0 (Timeout)
4891  select(7, [6], [], NULL, {60, 0} <unfinished ...>
4871  <... select resumed> )            = 0 (Timeout)
4871  select(5, [4], [], NULL, {60, 0} <unfinished ...>
4891  <... select resumed> )            = 0 (Timeout)
4891 select(7, [6], [], NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted) 4872 <... select resumed> ) = ? ERESTARTNOHAND (To be restarted) 4871 <... select resumed> ) = ? ERESTARTNOHAND (To be restarted)
4891  --- SIGINT (Interrupt) @ 0 (0) ---
4872  --- SIGINT (Interrupt) @ 0 (0) ---
4871  --- SIGINT (Interrupt) @ 0 (0) ---
4891  gettimeofday( <unfinished ...>
4871  gettimeofday( <unfinished ...>
4891  <... gettimeofday resumed> {1228436081, 424261}, NULL) = 0
4871  <... gettimeofday resumed> {1228436081, 424282}, NULL) = 0
4891  select(0, NULL, NULL, NULL, {0, 400000} <unfinished ...>
4871  select(0, NULL, NULL, NULL, {0, 400000} <unfinished ...>
4872  sigreturn()                       = ? (mask now [])
4872  rt_sigaction(SIGWINCH, NULL, {0x8000b6d0, [], 0}, 8) = 0
4872  rt_sigaction(SIGWINCH, {SIG_DFL}, NULL, 8) = 0
4872  close(5)                          = 0
4872  close(6)                          = 0
4872  close(7)                          = 0
4872  munmap(0xb7e07000, 1118208)       = 0
4872 ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfccd988) = -1 EINVAL (Invalid argument) 4872 fcntl64(0, F_GETFL) = 0x802 (flags O_RDWR| O_NONBLOCK)
4872  fcntl64(0, F_SETFL, O_RDWR)       = 0
4872 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfccd988) = -1 EINVAL (Invalid argument) 4872 fcntl64(1, F_GETFL) = 0x802 (flags O_RDWR| O_NONBLOCK)
4872  fcntl64(1, F_SETFL, O_RDWR)       = 0
4872 ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, {B9600 opost isig icanon echo ...}) = 0
4872  write(2, "Killed by signal 2.\r\n", 21) = 21
4872  exit_group(255)                   = ?
4871 <... select resumed> ) = ? ERESTARTNOHAND (To be restarted)
4871  --- SIGCHLD (Child exited) @ 0 (0) ---
4871 waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 255}], WNOHANG) = 4872
4871  waitpid(-1, 0xbfa5a610, WNOHANG)  = 0
4871  sigreturn()                       = ? (mask now [INT])
4871  gettimeofday({1228436081, 736123}, NULL) = 0
4871  select(0, NULL, NULL, NULL, {0, 89000} <unfinished ...>
4891  <... select resumed> )            = 0 (Timeout)
4891  gettimeofday({1228436081, 823878}, NULL) = 0
4891  select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
4871  <... select resumed> )            = 0 (Timeout)
4871  gettimeofday({1228436081, 824699}, NULL) = 0
4871  rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0
4891  <... select resumed> )            = 0 (Timeout)
4871  rt_sigaction(SIGUSR2, {SIG_IGN},  <unfinished ...>
4891  gettimeofday( <unfinished ...>
4871  <... rt_sigaction resumed> NULL, 8) = 0
4891  <... gettimeofday resumed> {1228436081, 824877}, NULL) = 0
4891  rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0
4891  rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0
4891  waitpid(4872,  <unfinished ...>
4871  waitpid(4872,  <unfinished ...>
4891 <... waitpid resumed> 0xbfa56ba8, WNOHANG) = -1 ECHILD (No child processes)
4871  <... waitpid resumed> 0xbfa5a948, WNOHANG) = 0
4891  unlink("var/lib/slocate/.slocate.db.s9nY0L" <unfinished ...>
4871  kill(4872, SIGUSR1)               = -1 ESRCH (No such process)
4891  <... unlink resumed> )            = 0
4871  kill(4891, SIGUSR1)               = 0
4891  --- SIGUSR1 (User defined signal 1) @ 0 (0) ---
4891  kill(4872, SIGUSR1)               = -1 ESRCH (No such process)
4871  time(NULL)                        = 1228436081
4871 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
4891  time(NULL)                        = 1228436081
4871 write(3, "2008/12/04 19:14:41 [4871] rsync"..., 120 <unfinished ...> 4891 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0 4891 write(3, "2008/12/04 19:14:41 [4891] rsync"..., 119 <unfinished ...>
4871  <... write resumed> )             = 120
4891  <... write resumed> )             = 119
4891 write(2, "rsync error: received SIGINT, SI"..., 91 <unfinished ...> 4871 write(2, "rsync error: received SIGINT, SI"..., 92 <unfinished ...>
4891  <... write resumed> )             = 91
4891  write(2, "\n", 1)                 = 1
4891  time(NULL)                        = 1228436082
4871  <... write resumed> )             = 92
4891  stat64("/etc/localtime",  <unfinished ...>
4871  write(2, "\n", 1 <unfinished ...>
4891 <... stat64 resumed> {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
4871  <... write resumed> )             = 1
4891 write(3, "2008/12/04 19:14:42 [4891] _exit"..., 98 <unfinished ...>
4871  time( <unfinished ...>
4891  <... write resumed> )             = 98
4871  <... time resumed> NULL)          = 1228436082
4891 write(1, "_exit_cleanup(code=20, file=rsyn"..., 71 <unfinished ...>
4871  stat64("/etc/localtime",  <unfinished ...>
4891  <... write resumed> )             = 71
4871 <... stat64 resumed> {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
4871  write(3, "2008/12/04 19:14:42 [4871] _exit"..., 98) = 98
4871  write(1, "_exit_cleanup(code=20, file=rsyn"..., 71) = 71
4871  exit_group(20)                    = ?
4891  exit_group(20)




On Dec 4, 2008, at 5:50 PM, Matt McCutchen wrote:

On Thu, 2008-12-04 at 16:50 -0500, Shawn Geraghty wrote:
I'm having an issue with rsync in that it recently (11/29/2008)
started to crash during a daily cron job backup.  Below is some info
on how rsync is being used to backup our servers.

Version:
        2.6.9

Please try the latest stable version of rsync, 3.0.4, if at all
possible.

Execution:
        Called daily via a wrapper script, and up until about 2 weeks ago,
ran fine.  The wrapper script just loops through a series servers to
backup.  The first server succeeds but the second fails.
Interesting, this server that does the backing up, is actually backed
up by another server, which is not having any issues (leads me to
believe that this is not an ssh connection issue).

I'm not clear on how you reached this conclusion.

Command Used:

rsync -vvv -a -e "ssh -i /etc/backup/some_id.rsa" --delete -- numeric- ids --exclude-from=/etc/backup/exclude_list --log-file=/usr/local/ bin/ rsync2.log --link-dest=../$LASTBACKUPDIR [EMAIL PROTECTED]:/ $NEWBACKUPDIR

I've added the -vvv and --log-file options while trying to debug this
issue.

Issue:

rsync runs for while, but then it freezes up. Sits in this state for
about 15 minutes until the connection is closed (I'm guessing ssh is
timing out due to inactivity).

Please send the last 50 lines or so of the strace of each of the three
rsync processes (one on the sending side and the other two on the
receiving side) so we can see how they are hanging.

Server B:

2008/12/03 16:08:35 [26912] sender finished //var/log/httpd/
ssl_request_log.4
2008/12/03 16:08:35 [26912] send_files(145596,//var/log/httpd/stage-
rewrite_log)
2008/12/03 16:08:35 [26912] send_files mapped //var/log/httpd/stage-
rewrite_log of size 38039011
2008/12/03 16:08:35 [26912] recv_files(var/log/httpd/stage- rewrite_log)
2008/12/03 16:08:35 [26912] calling match_sums //var/log/httpd/stage-
rewrite_log
2008/12/03 16:25:24 [26912] rsync: connection unexpectedly closed
(218901485 bytes received so far) [receiver]
2008/12/03 16:25:24 [26912] rsync error: error in rsync protocol data
stream (code 12) at io.c(453) [receiver=2.6.9]
2008/12/03 16:25:24 [26912] _exit_cleanup(code=12, file=io.c,
line=453): about to call exit(12)
2008/12/03 16:25:24 [26910] rsync: connection unexpectedly closed
(2929535 bytes received so far) [generator]
2008/12/03 16:25:24 [26910] rsync error: unexplained error (code 255)
at io.c(453) [generator=2.6.9]
2008/12/03 16:25:24 [26910] _exit_cleanup(code=12, file=io.c,
line=453): about to call exit(255)


--
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html

Reply via email to