Matt:
Thanks for the help.
Upgrading will be done, but not really an option at this point, unless
of course, that is the fix. I'm a limited resource and other stuff is
backing up.
I made my original conclusion on the ssh issue since:
Server a backs up eight servers, say b through i, with Server b acting
as an ssh tunnel for servers c - i.
And Server b backs up Server A and that works fine. So, the
connection between the two shouldn't be an issue.
All of these worked for the two years I've been here. But about 2
weeks ago, one of those servers failed. Since then about another two
fail as well.
I don't think this is an disk space issue. I've checked that, and the
wrapper script that does the rsync first deletes off older backups.
Here is an strace from server A.
11242 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=3519, ...}) = 0
11242 write(3, "2008/12/05 08:04:55 [11242] recv"..., 92) = 92
11242 write(1, "recv_file_name(usr/share/icons/L"..., 64) = 64
11242 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
11242 read(6, "9\0\0\t", 4) = 4
11242 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
11242 read(6, "make_file(1,usr/share/icons/Loco"..., 57) = 57
11242 time(NULL) = 1228482295
11242 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=3519, ...}) = 0
11242 write(3, "2008/12/05 08:04:55 [11242] make"..., 85) = 85
11242 write(1, "make_file(1,usr/share/icons/Loco"..., 57) = 57
11242 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
11242 read(6, "\f\0\0\7", 4) = 4
11242 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
11242 read(6, "\276&\5p.pngt\1\0\0", 12) = 12
11242 time(NULL) = 1228482295
11242 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=3519, ...}) = 0
11242 write(3, "2008/12/05 08:04:55 [11242] recv"..., 88) = 88
11242 write(1, "recv_file_name(usr/share/icons/L"..., 60) = 60
11242 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
11242 read(6, ":\0\0\t", 4) = 4
11242 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
11242 read(6, "make_file(1,usr/share/icons/Loco"..., 58) = 36
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0}) = 0 (Timeout)
11242 select(7, [6], [], NULL, {60, 0} <unfinished ...>
11243 <... select resumed> ) = 1 (in [4])
11243 read(4, "", 8192) = 0
11243 rt_sigaction(SIGWINCH, NULL, {0x8000b6d0, [], 0}, 8) = 0
11243 rt_sigaction(SIGWINCH, {SIG_DFL}, NULL, 8) = 0
11243 close(5) = 0
11243 close(6) = 0
11243 close(7) = 0
11243 munmap(0xb7df7000, 1150976) = 0
11243 ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfe9e178) = -1 EINVAL
(Invalid argument)
11243 fcntl64(0, F_GETFL) = 0x802 (flags O_RDWR|
O_NONBLOCK)
11243 fcntl64(0, F_SETFL, O_RDWR) = 0
11243 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfe9e178) = -1 EINVAL
(Invalid argument)
11243 fcntl64(1, F_GETFL) = 0x802 (flags O_RDWR|
O_NONBLOCK)
11243 fcntl64(1, F_SETFL, O_RDWR) = 0
11243 ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, {B9600 opost isig icanon
echo ...}) = 0
11243 write(2, "Connection to localhost closed b"..., 48) = 48
11243 gettimeofday({1228483264, 414090}, NULL) = 0
11243 shutdown(4, 2 /* send and receive */) = 0
11243 close(4) = 0
11243 brk(0x81fce000) = 0x81fce000
11243 exit_group(-1) = ?
11242 <... select resumed> ) = 1 (in [6], left {50, 782000})
11242 --- SIGCHLD (Child exited) @ 0 (0) ---
11242 waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 255}], WNOHANG)
= 11243
11242 waitpid(-1, 0xbfc2bab0, WNOHANG) = -1 ECHILD (No child processes)
11242 sigreturn() = ? (mask now [])
11242 read(6, "", 22) = 0
11242 time(NULL) = 1228483264
11242 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=3519, ...}) = 0
11242 write(3, "2008/12/05 08:21:04 [11242] rsyn"..., 109) = 109
11242 write(2, "rsync: connection unexpectedly c"..., 80) = 80
11242 write(2, "\n", 1) = 1
11242 rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0
11242 rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0
11242 waitpid(11243, 0xbfc2bd98, WNOHANG) = -1 ECHILD (No child
processes)
11242 kill(11243, SIGUSR1) = -1 ESRCH (No such process)
11242 time(NULL) = 1228483264
11242 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=3519, ...}) = 0
11242 write(3, "2008/12/05 08:21:04 [11242] rsyn"..., 100) = 100
11242 write(2, "rsync error: unexplained error ("..., 71) = 71
11242 write(2, "\n", 1) = 1
11242 time(NULL) = 1228483264
11242 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=3519, ...}) = 0
11242 write(3, "2008/12/05 08:21:04 [11242] _exi"..., 97) = 97
11242 write(1, "_exit_cleanup(code=12, file=io.c"..., 69) = 69
11242 exit_group(255) = ?
The server being backed up doesn't run rsync as a daemon. I've read
in the rsync docs that backups can be done initiated from the server
with out the client running rsync as a daemon. For this case, I don't
think I can get an strace. But, if possible, let me know and I will
track down how to do that.
I also tried to back-up another server, and captured an strace from
Server A. Here is that trace.
The big difference here was that I had to ^C the process after about
an hour. Usually, it looks like ssh shuts down the tunnel after a
period of 15 minutes.
4872 <... select resumed> ) = 2 (in [4], out [6])
4891 select(7, [6], [], NULL, {60, 0} <unfinished ...>
4872 write(6, "\374\17\0\7glade2\0\f.xpm\0\347images\0\6/yelp\0"...,
16384 <unfinished ...>
4891 <... select resumed> ) = 1 (in [6], left {60, 0})
4872 <... write resumed> ) = 16384
4891 read(6, <unfinished ...>
4872 read(4, <unfinished ...>
4891 <... read resumed> "urrent.desktop\0\5win95.desktop\0\3756"...,
4092) = 4092
4872 <... read resumed> "\243\36c\274aVv\324\16\307z
\347\17\274\247\34\331\327\333X\256\177\222\360\246\"\v\301\"pA%"...,
8192) = 8192
4891 select(7, [6], [], NULL, {60, 0} <unfinished ...>
4872 select(8, [4 5], [], NULL, NULL <unfinished ...>
4891 <... select resumed> ) = 1 (in [6], left {60, 0})
4872 <... select resumed> ) = 1 (in [4])
4891 read(6, <unfinished ...>
4872 read(4, <unfinished ...>
4891 <... read resumed> "\374\17\0\7", 4) = 4
4872 <... read resumed> "\270\335\204#\373QM\217U\225K
\241\22\2750\357\256\2504\215\206L\255M\264\364\362\267\366D
\335\350"..., 8192) = 3040
4891 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "t-preferences.glade2\0\1asciitable"..., 4092) = 4092
4891 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "\374\17\0\7", 4) = 4
4891 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4872 select(8, [4 5], [6], NULL, NULL <unfinished ...>
4891 read(6, <unfinished ...>
4872 <... select resumed> ) = 1 (out [6])
4891 <... read resumed> "glade2\0\f.xpm\0\347images\0\6/yelp\0\5/
em"..., 4092) = 4092
4872 write(6, "\374\17\0\7ORWARDING_DENIED\0\1TP_DISABLE"..., 16384
<unfinished ...>
4891 select(7, [6], [], NULL, {60, 0} <unfinished ...>
4872 <... write resumed> ) = 16384
4891 <... select resumed> ) = 1 (in [6], left {60, 0})
4872 select(8, [4 5], [], NULL, NULL <unfinished ...>
4891 read(6, "\374\17\0\7", 4) = 4
4891 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "t-config-nfs.png\0\371test.py\0\0mainW"..., 4092) = 4092
4891 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "\374\17\0\7", 4) = 4
4891 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "RWARDING_DENIED\0\1TP_DISABLED\0\3FA"..., 4092) = 4092
4891 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "\374\17\0\7", 4) = 4
4891 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "NG_DENIED\0\1TP_DISABLED\0\3FAILURE\0"..., 4092) = 4092
4891 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "\374\17\0\7", 4) = 4
4891 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "ORWARDING_DENIED\0\1TP_DISABLED\0\3F"..., 4092) = 4092
4891 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "\374\17\0\7", 4) = 4
4891 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "elogs\0\377suexec\0\0certwatch\0\0aepdae"..., 4092) =
4092
4891 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "\374\17\0\7", 4) = 4
4891 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "slinux\0\0talk\0\1elnet\0\0ime\0\377wget\0\0"..., 4092)
= 4092
4891 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "\374\17\0\7", 4) = 4
4891 select(7, [6], [], NULL, {60, 0}) = 1 (in [6], left {60, 0})
4891 read(6, "changethehostname\0\0einit_wrapper"..., 4092) = 4092
4891 select(7, [6], [], NULL, {60, 0} <unfinished ...>
4871 <... select resumed> ) = 0 (Timeout)
4871 select(5, [4], [], NULL, {60, 0} <unfinished ...>
4891 <... select resumed> ) = 0 (Timeout)
4891 select(7, [6], [], NULL, {60, 0} <unfinished ...>
4871 <... select resumed> ) = 0 (Timeout)
4871 select(5, [4], [], NULL, {60, 0} <unfinished ...>
4891 <... select resumed> ) = 0 (Timeout)
4891 select(7, [6], [], NULL, {60, 0} <unfinished ...>
4871 <... select resumed> ) = 0 (Timeout)
.
.
same message another 100 times
.
.
4871 <... select resumed> ) = 0 (Timeout)
4871 select(5, [4], [], NULL, {60, 0} <unfinished ...>
4891 <... select resumed> ) = 0 (Timeout)
4891 select(7, [6], [], NULL, {60, 0} <unfinished ...>
4871 <... select resumed> ) = 0 (Timeout)
4871 select(5, [4], [], NULL, {60, 0} <unfinished ...>
4891 <... select resumed> ) = 0 (Timeout)
4891 select(7, [6], [], NULL, {60, 0} <unfinished ...>
4871 <... select resumed> ) = 0 (Timeout)
4871 select(5, [4], [], NULL, {60, 0} <unfinished ...>
4891 <... select resumed> ) = 0 (Timeout)
4891 select(7, [6], [], NULL, {60, 0} <unfinished ...>
4871 <... select resumed> ) = 0 (Timeout)
4871 select(5, [4], [], NULL, {60, 0} <unfinished ...>
4891 <... select resumed> ) = 0 (Timeout)
4891 select(7, [6], [], NULL, {60, 0} <unfinished ...>
4871 <... select resumed> ) = 0 (Timeout)
4871 select(5, [4], [], NULL, {60, 0} <unfinished ...>
4891 <... select resumed> ) = 0 (Timeout)
4891 select(7, [6], [], NULL, {60, 0}) = ? ERESTARTNOHAND (To be
restarted)
4872 <... select resumed> ) = ? ERESTARTNOHAND (To be
restarted)
4871 <... select resumed> ) = ? ERESTARTNOHAND (To be
restarted)
4891 --- SIGINT (Interrupt) @ 0 (0) ---
4872 --- SIGINT (Interrupt) @ 0 (0) ---
4871 --- SIGINT (Interrupt) @ 0 (0) ---
4891 gettimeofday( <unfinished ...>
4871 gettimeofday( <unfinished ...>
4891 <... gettimeofday resumed> {1228436081, 424261}, NULL) = 0
4871 <... gettimeofday resumed> {1228436081, 424282}, NULL) = 0
4891 select(0, NULL, NULL, NULL, {0, 400000} <unfinished ...>
4871 select(0, NULL, NULL, NULL, {0, 400000} <unfinished ...>
4872 sigreturn() = ? (mask now [])
4872 rt_sigaction(SIGWINCH, NULL, {0x8000b6d0, [], 0}, 8) = 0
4872 rt_sigaction(SIGWINCH, {SIG_DFL}, NULL, 8) = 0
4872 close(5) = 0
4872 close(6) = 0
4872 close(7) = 0
4872 munmap(0xb7e07000, 1118208) = 0
4872 ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfccd988) = -1 EINVAL
(Invalid argument)
4872 fcntl64(0, F_GETFL) = 0x802 (flags O_RDWR|
O_NONBLOCK)
4872 fcntl64(0, F_SETFL, O_RDWR) = 0
4872 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfccd988) = -1 EINVAL
(Invalid argument)
4872 fcntl64(1, F_GETFL) = 0x802 (flags O_RDWR|
O_NONBLOCK)
4872 fcntl64(1, F_SETFL, O_RDWR) = 0
4872 ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, {B9600 opost isig icanon
echo ...}) = 0
4872 write(2, "Killed by signal 2.\r\n", 21) = 21
4872 exit_group(255) = ?
4871 <... select resumed> ) = ? ERESTARTNOHAND (To be
restarted)
4871 --- SIGCHLD (Child exited) @ 0 (0) ---
4871 waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 255}], WNOHANG)
= 4872
4871 waitpid(-1, 0xbfa5a610, WNOHANG) = 0
4871 sigreturn() = ? (mask now [INT])
4871 gettimeofday({1228436081, 736123}, NULL) = 0
4871 select(0, NULL, NULL, NULL, {0, 89000} <unfinished ...>
4891 <... select resumed> ) = 0 (Timeout)
4891 gettimeofday({1228436081, 823878}, NULL) = 0
4891 select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
4871 <... select resumed> ) = 0 (Timeout)
4871 gettimeofday({1228436081, 824699}, NULL) = 0
4871 rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0
4891 <... select resumed> ) = 0 (Timeout)
4871 rt_sigaction(SIGUSR2, {SIG_IGN}, <unfinished ...>
4891 gettimeofday( <unfinished ...>
4871 <... rt_sigaction resumed> NULL, 8) = 0
4891 <... gettimeofday resumed> {1228436081, 824877}, NULL) = 0
4891 rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0
4891 rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0
4891 waitpid(4872, <unfinished ...>
4871 waitpid(4872, <unfinished ...>
4891 <... waitpid resumed> 0xbfa56ba8, WNOHANG) = -1 ECHILD (No child
processes)
4871 <... waitpid resumed> 0xbfa5a948, WNOHANG) = 0
4891 unlink("var/lib/slocate/.slocate.db.s9nY0L" <unfinished ...>
4871 kill(4872, SIGUSR1) = -1 ESRCH (No such process)
4891 <... unlink resumed> ) = 0
4871 kill(4891, SIGUSR1) = 0
4891 --- SIGUSR1 (User defined signal 1) @ 0 (0) ---
4891 kill(4872, SIGUSR1) = -1 ESRCH (No such process)
4871 time(NULL) = 1228436081
4871 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=3519, ...}) = 0
4891 time(NULL) = 1228436081
4871 write(3, "2008/12/04 19:14:41 [4871] rsync"..., 120
<unfinished ...>
4891 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=3519, ...}) = 0
4891 write(3, "2008/12/04 19:14:41 [4891] rsync"..., 119
<unfinished ...>
4871 <... write resumed> ) = 120
4891 <... write resumed> ) = 119
4891 write(2, "rsync error: received SIGINT, SI"..., 91
<unfinished ...>
4871 write(2, "rsync error: received SIGINT, SI"..., 92
<unfinished ...>
4891 <... write resumed> ) = 91
4891 write(2, "\n", 1) = 1
4891 time(NULL) = 1228436082
4871 <... write resumed> ) = 92
4891 stat64("/etc/localtime", <unfinished ...>
4871 write(2, "\n", 1 <unfinished ...>
4891 <... stat64 resumed> {st_mode=S_IFREG|0644, st_size=3519, ...})
= 0
4871 <... write resumed> ) = 1
4891 write(3, "2008/12/04 19:14:42 [4891] _exit"..., 98
<unfinished ...>
4871 time( <unfinished ...>
4891 <... write resumed> ) = 98
4871 <... time resumed> NULL) = 1228436082
4891 write(1, "_exit_cleanup(code=20, file=rsyn"..., 71
<unfinished ...>
4871 stat64("/etc/localtime", <unfinished ...>
4891 <... write resumed> ) = 71
4871 <... stat64 resumed> {st_mode=S_IFREG|0644, st_size=3519, ...})
= 0
4871 write(3, "2008/12/04 19:14:42 [4871] _exit"..., 98) = 98
4871 write(1, "_exit_cleanup(code=20, file=rsyn"..., 71) = 71
4871 exit_group(20) = ?
4891 exit_group(20)
On Dec 4, 2008, at 5:50 PM, Matt McCutchen wrote:
On Thu, 2008-12-04 at 16:50 -0500, Shawn Geraghty wrote:
I'm having an issue with rsync in that it recently (11/29/2008)
started to crash during a daily cron job backup. Below is some info
on how rsync is being used to backup our servers.
Version:
2.6.9
Please try the latest stable version of rsync, 3.0.4, if at all
possible.
Execution:
Called daily via a wrapper script, and up until about 2 weeks ago,
ran fine. The wrapper script just loops through a series servers to
backup. The first server succeeds but the second fails.
Interesting, this server that does the backing up, is actually backed
up by another server, which is not having any issues (leads me to
believe that this is not an ssh connection issue).
I'm not clear on how you reached this conclusion.
Command Used:
rsync -vvv -a -e "ssh -i /etc/backup/some_id.rsa" --delete --
numeric-
ids --exclude-from=/etc/backup/exclude_list --log-file=/usr/local/
bin/
rsync2.log --link-dest=../$LASTBACKUPDIR [EMAIL PROTECTED]:/
$NEWBACKUPDIR
I've added the -vvv and --log-file options while trying to debug this
issue.
Issue:
rsync runs for while, but then it freezes up. Sits in this state
for
about 15 minutes until the connection is closed (I'm guessing ssh is
timing out due to inactivity).
Please send the last 50 lines or so of the strace of each of the three
rsync processes (one on the sending side and the other two on the
receiving side) so we can see how they are hanging.
Server B:
2008/12/03 16:08:35 [26912] sender finished //var/log/httpd/
ssl_request_log.4
2008/12/03 16:08:35 [26912] send_files(145596,//var/log/httpd/stage-
rewrite_log)
2008/12/03 16:08:35 [26912] send_files mapped //var/log/httpd/stage-
rewrite_log of size 38039011
2008/12/03 16:08:35 [26912] recv_files(var/log/httpd/stage-
rewrite_log)
2008/12/03 16:08:35 [26912] calling match_sums //var/log/httpd/stage-
rewrite_log
2008/12/03 16:25:24 [26912] rsync: connection unexpectedly closed
(218901485 bytes received so far) [receiver]
2008/12/03 16:25:24 [26912] rsync error: error in rsync protocol data
stream (code 12) at io.c(453) [receiver=2.6.9]
2008/12/03 16:25:24 [26912] _exit_cleanup(code=12, file=io.c,
line=453): about to call exit(12)
2008/12/03 16:25:24 [26910] rsync: connection unexpectedly closed
(2929535 bytes received so far) [generator]
2008/12/03 16:25:24 [26910] rsync error: unexplained error (code 255)
at io.c(453) [generator=2.6.9]
2008/12/03 16:25:24 [26910] _exit_cleanup(code=12, file=io.c,
line=453): about to call exit(255)
--
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html