Re: FW: selfcheck hangs
Jeremy,

That could very well be the problem I'm having: I just tried a df on the client system and it ground to a halt trying to locate NFS mounts. We rely heavily on NFS here, so I'll need to figure out how to get around this problem in the future. Thanks for the info.

Steve

Jeremy L. Mordkoff wrote:

No, the list was no help. The problem was that the client had NFS-mounted a disk that was no longer on the net, so anything that iterated over mounts (like df) was hanging. That is probably why a reboot solved it. I don't allow key machines to be NFS clients anymore.

JLM

-----Original Message-----
From: Steven M. Wilson [mailto:[EMAIL PROTECTED]]
Sent: Wed 6/18/2003 2:32 PM
To: Jeremy L. Mordkoff
Cc:
Subject: Re: FW: selfcheck hangs

Jeremy,

Did anyone respond off-list to your posting? I have the same problem here from time to time, and the only way I've been able to correct it is by rebooting the offending client system.

Steve

Jeremy L. Mordkoff wrote:

One system has started refusing to run backups. amcheck reports a timeout. A ps on the client shows several orphaned selfcheck processes. I tried killing all the amandad processes and sending xinetd a SIGHUP, and then ran amcheck again, to no avail. I then reinstalled Amanda and repeated the steps. Still no. Here's the debug log. Any ideas would be appreciated.
JLM

-----Original Message-----
From: root [mailto:[EMAIL PROTECTED]]
Sent: Fri 6/13/2003 9:20 AM
To: [EMAIL PROTECTED]
Cc:
Subject:

amandad: debug 1 pid 23823 ruid 527 euid 527: start at Fri Jun 13 09:16:52 2003
amandad: version 2.4.3
amandad: build: VERSION="Amanda-2.4.3"
amandad:        BUILT_DATE="Fri Apr 4 10:37:17 EST 2003"
amandad:        BUILT_MACH="Linux lux1 2.4.18-18.7.xsmp #1 SMP Wed Nov 13 19:01:42 EST 2002 i686 unknown"
amandad:        CC="gcc"
amandad:        CONFIGURE_COMMAND="'./configure' '--with-user=amanda' '--with-group=disk'"
amandad: paths: bindir="/usr/local/bin" sbindir="/usr/local/sbin"
amandad:        libexecdir="/usr/local/libexec" mandir="/usr/local/man"
amandad:        AMANDA_TMPDIR="/tmp/amanda" AMANDA_DBGDIR="/tmp/amanda"
amandad:        CONFIG_DIR="/usr/local/etc/amanda" DEV_PREFIX="/dev/"
amandad:        RDEV_PREFIX="/dev/" DUMP="/sbin/dump"
amandad:        RESTORE="/sbin/restore" SAMBA_CLIENT="/usr/bin/smbclient"
amandad:        GNUTAR="/bin/gtar" COMPRESS_PATH="/bin/gzip"
amandad:        UNCOMPRESS_PATH="/bin/gzip" MAILER="/usr/bin/Mail"
amandad:        listed_incr_dir="/usr/local/var/amanda/gnutar-lists"
amandad: defs:  DEFAULT_SERVER="lux1" DEFAULT_CONFIG="DailySet1"
amandad:        DEFAULT_TAPE_SERVER="lux1" DEFAULT_TAPE_DEVICE="/dev/null"
amandad:        HAVE_MMAP HAVE_SYSVSHM LOCKING=POSIX_FCNTL SETPGRP_VOID
amandad:        DEBUG_CODE AMANDA_DEBUG_DAYS=4 BSD_SECURITY USE_AMANDAHOSTS
amandad:        CLIENT_LOGIN="amanda" FORCE_USERID HAVE_GZIP
amandad:        COMPRESS_SUFFIX=".gz" COMPRESS_FAST_OPT="--fast"
amandad:        COMPRESS_BEST_OPT="--best" UNCOMPRESS_OPT="-dc"
amandad: time 0.000: got packet:
Amanda 2.4 REQ HANDLE 000-58790808 SEQ 1055510212
SECURITY USER amanda
SERVICE selfcheck
OPTIONS features=feff9f00;maxdumps=1;hostname=rel2;
DUMP hda3 0 OPTIONS |;auth=bsd;compress-fast;
DUMP vg01/lv_data 0 OPTIONS |;auth=bsd;compress-fast;
amandad: time 0.000: sending ack:
Amanda 2.4 ACK HANDLE 000-58790808 SEQ 1055510212
amandad: time 0.001: bsd security: remote host lux1 user amanda local user amanda
amandad: time 0.001: amandahosts security check passed
amandad: time 0.001: running service "/usr/local/libexec/selfcheck"
amandad: time 30.526: got packet:
Amanda 2.4 REQ HANDLE 000-58790808 SEQ 1055510212
SECURITY USER amanda
SERVICE selfcheck
OPTIONS features=feff9f00;maxdumps=1;hostname=rel2;
DUMP hda3 0 OPTIONS |;auth=bsd;compress-fast;
DUMP vg01/lv_data 0 OPTIONS |;auth=bsd;compress-fast;
amandad: time 31.146: received dup P_REQ packet, ACKing it
amandad: time 31.146: sending ack:
Amanda 2.4 ACK HANDLE 000-58790808 SEQ 1055510212
amandad: time 61.141: got packet:
Amanda 2.4 REQ HANDLE 000-58790808 SEQ 1055510212
SECURITY USER amanda
SERVICE selfcheck
OPTIONS features=feff9f00;maxdumps=1;hostname=rel2;
DUMP hda3 0 OPTIONS |;auth=bsd;compress-fast;
DUMP vg01/lv_data 0 OPTIONS |;auth=bsd;compress-fast;
amandad: time 61.141: received dup P_REQ packet, ACKing it
amandad: time 61.141: sending ack:
Amanda 2.4 ACK HANDLE 000-58790808 SEQ 1055510212

--
Steven M. Wilson, Systems and Network Manager
Markey Center for Structural Biology
Purdue University
[EMAIL PROTECTED] 765.496.1946
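For anyone who wants to catch this condition before a backup run, the hang described in this thread (anything that walks the mount table blocks on a dead NFS server) can be probed from a script. What follows is only a rough sketch, not something posted by the participants. It assumes a Linux client with /proc/mounts, and the helper names nfs_mount_points and mount_responds are made up for illustration. Each mount point is checked in a child process so that a dead server blocks the child rather than the script itself.

```python
import subprocess
import sys


def nfs_mount_points(mounts_text):
    """Return the mount points of NFS filesystems.

    mounts_text is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            points.append(fields[1])
    return points


def mount_responds(path, timeout=5.0):
    """Return True if statvfs(path) succeeds within timeout seconds.

    The call runs in a child process, so a hung NFS server blocks the
    child instead of the caller.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.statvfs(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Send a kill signal but do not wait for the child: a process blocked
        # on a hard NFS mount may not exit until the server comes back.
        child.kill()
        return False


if __name__ == "__main__":
    # Assumption: Linux, where the kernel exposes the mount table here.
    with open("/proc/mounts") as mounts:
        for mount_point in nfs_mount_points(mounts.read()):
            status = "ok" if mount_responds(mount_point) else "NOT RESPONDING"
            print(f"{mount_point}: {status}")
```

Run on the client before amcheck, a script like this would flag a dead mount without getting stuck the way df does. The five-second timeout is an arbitrary choice and would need tuning for slow servers.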
Re: Holding disks and the disk output driver
Ted,

I've been using a 2 TB disk array for the past month or so and it's been working great for me without a holding disk. Using a holding disk will require more free disk space plus the additional time to transfer from the holding disk to the backup disk.

I recommend getting a patch from Jean-Louis Martineau ([EMAIL PROTECTED]) which prevents a client error from forcing Amanda to go into degraded mode. When Amanda operates in degraded mode, it will only write to the holding disk, which in my case doesn't exist. But the patch prevents client-side errors from putting Amanda into degraded mode, which allows my backups to continue being written to the backup disks.

Steve

Ted Cabeen wrote:

If you're using the disk output driver to run backups to a large disk array, is there any reason to use a holding disk?

--
Steven M. Wilson, Systems and Network Manager
Markey Center for Structural Biology
Purdue University
[EMAIL PROTECTED] 765.496.1946
Re: Holding disks and the disk output driver
Brian,

I don't know much about the rait driver. Hopefully someone more knowledgeable on the list will respond. I was unclear what you meant about moving to "diskless backups"...

Steve

Brian Cuttler wrote:

Ted, Steve, Amanda users,

Stupid question: how smart is the rait driver for disk? If you put the spool area on the output disk, will it juggle the space OK? Will it know to move the file from one directory to another (move the file pointer, # mv perhaps) rather than having to copy all the bits and then remove the original?

I only ask because someone at my site is looking to move to diskless backups...

thanks,
Brian

Ted,

I've been using a 2 TB disk array for the past month or so and it's been working great for me without a holding disk. Using a holding disk will require more free disk space plus the additional time to transfer from the holding disk to the backup disk.

I recommend getting a patch from Jean-Louis Martineau ([EMAIL PROTECTED]) which prevents a client error from forcing Amanda to go into degraded mode. When Amanda operates in degraded mode, it will only write to the holding disk, which in my case doesn't exist. But the patch prevents client-side errors from putting Amanda into degraded mode, which allows my backups to continue being written to the backup disks.

Steve

Ted Cabeen wrote:

If you're using the disk output driver to run backups to a large disk array, is there any reason to use a holding disk?

--
Steven M. Wilson, Systems and Network Manager
Markey Center for Structural Biology
Purdue University
[EMAIL PROTECTED] 765.496.1946
Re: Holding disks and the disk output driver
Ted,

I just noticed that you mentioned using the disk output driver. I believe my problem with Amanda dropping into degraded mode was specific to using the tape output driver (I use disk directories that are seen as virtual tapes). Sorry for confusing the issue...

Steve

Steven M. Wilson wrote:

Ted,

I've been using a 2 TB disk array for the past month or so and it's been working great for me without a holding disk. Using a holding disk will require more free disk space plus the additional time to transfer from the holding disk to the backup disk.

I recommend getting a patch from Jean-Louis Martineau ([EMAIL PROTECTED]) which prevents a client error from forcing Amanda to go into degraded mode. When Amanda operates in degraded mode, it will only write to the holding disk, which in my case doesn't exist. But the patch prevents client-side errors from putting Amanda into degraded mode, which allows my backups to continue being written to the backup disks.

Steve

Ted Cabeen wrote:

If you're using the disk output driver to run backups to a large disk array, is there any reason to use a holding disk?

--
Steven M. Wilson, Systems and Network Manager
Markey Center for Structural Biology
Purdue University
[EMAIL PROTECTED] 765.496.1946
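The interaction described in this thread (a client error pushes Amanda into degraded mode, degraded mode writes only to the holding disk, and no holding disk is configured) can be illustrated with a toy model. This is a simplified sketch of the behaviour as reported in the messages above, not Amanda's actual driver logic. Outcome, simulate_run and the degrade_on_client_error switch are hypothetical names invented for the example.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical per-dump outcomes; these are not Amanda's result codes."""
    DONE = "done"
    CLIENT_ERROR = "client error"    # e.g. a timeout talking to a client
    OUTPUT_ERROR = "output error"    # e.g. an error writing to the backup device


def simulate_run(outcomes, has_holding_disk, degrade_on_client_error):
    """Return (disk, fate) pairs for a run under a simplified degraded-mode model.

    outcomes: list of (disk_name, Outcome) in the order dumps are attempted.
    degrade_on_client_error: True models the behaviour reported in the thread,
    False models the behaviour reported after applying the patch.
    """
    degraded = False
    report = []
    for disk, outcome in outcomes:
        if degraded and not has_holding_disk:
            # Degraded mode writes only to the holding disk, and there is none.
            report.append((disk, "skipped"))
            continue
        if outcome is Outcome.CLIENT_ERROR:
            report.append((disk, "failed"))
            if degrade_on_client_error:
                degraded = True
            continue
        if degraded:
            report.append((disk, "holding disk"))
            continue
        if outcome is Outcome.OUTPUT_ERROR:
            # A problem with the output device is a legitimate reason to degrade.
            report.append((disk, "failed"))
            degraded = True
            continue
        report.append((disk, "backup disk"))
    return report


if __name__ == "__main__":
    run = [("a", Outcome.DONE), ("b", Outcome.CLIENT_ERROR), ("c", Outcome.DONE)]
    print(simulate_run(run, has_holding_disk=False, degrade_on_client_error=True))
    print(simulate_run(run, has_holding_disk=False, degrade_on_client_error=False))
```

In the first call, disk c is skipped because the client error on b switched the run into degraded mode with nowhere to write. In the second call, b still fails but c is written to the backup disk as usual.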
Re: Failed Backups
Chris,

I looked around a little in the Amanda source code and convinced myself that there was a bug there. I sent a note to the amanda-hackers mailing list and received a prompt reply from Jean-Louis Martineau with a patch that fixed the problem for me. I'll attach his message and patch. Hope that helps!

Steve

Chris Gordon wrote:

Steve,

On Wed, Jun 04, 2003 at 02:29:20PM -, smw_purdue wrote:

Chris, I'm having the same problem using a similar configuration of backups to disk without any holding disks. Every time Amanda drops into degraded mode it's because an error occurred with one of the clients (usually a timeout, indicating that a client system was unavailable). I would suspect that there's a bug in the code that puts Amanda into degraded mode on more errors than just a tape error. Notice in your log that you have an unknown response from gilgamesh. This error was probably what kicked Amanda into degraded mode.

That is exactly what appears to be happening. I configured a holding disk in an attempt to eliminate that as a possible cause. In my case, the problem is intermittent, with everything working fine for some time and then a failure. The failure may affect some file systems on a given host or most/all of the backup run. Today, I had two file systems fail again on gilgamesh and I began checking the various logs for issues.
What I found in sendbackup.lotsofnumbers.debug is:

---[ begin ]---
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1496
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1497
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1498
sendbackup: time 0.003: waiting for connect on 1496, then 1497, then 1498
sendbackup: time 29.996: stream_accept: timeout after 30 seconds
sendbackup: time 29.996: timeout on data port 1496
sendbackup: time 59.996: stream_accept: timeout after 30 seconds
sendbackup: time 59.996: timeout on mesg port 1497
sendbackup: time 89.996: stream_accept: timeout after 30 seconds
sendbackup: time 89.996: timeout on index port 1498
sendbackup: time 89.996: pid 5263 finish time Fri Jun 6 00:47:44 2003
---[ end ]---

Anybody out there have time to debug the source? I may take a look at it but time is at a premium right now... (when isn't it???).

Anyone have any ideas? This only happens occasionally and I haven't yet been able to draw a correlation.

Thanks,
Chris

--
Steven M. Wilson, Systems and Network Manager
Markey Center for Structural Biology
Purdue University
[EMAIL PROTECTED] 765.496.1946

--- server-src/driver.c.orig 2003-01-01 18:28:54.0 -0500
+++ server-src/driver.c 2003-06-04 15:54:44.0 -0400
@@ -2242,10 +,10 @@
         error(error [dump to tape DONE result_argc != 5: %d], result_argc);
     }
-    free_serial(result_argv[2]);
-
     if(failed == 1) goto tryagain; /* dump didn't work */
-    else if(failed == 2) goto fatal;
+    else if(failed == 2) goto failed_dumper;
+
+    free_serial(result_argv[2]);
     /* every thing went fine */
     update_info_dumper(dp, origsize, dumpsize, dumptime);
@@ -2259,9 +2239,10 @@
 case TRYAGAIN: /* TRY-AGAIN handle err mess */
 tryagain:
+    headqueue_disk(runq, dp);
+failed_dumper:
     update_failed_dump_to_tape(dp);
     free_serial(result_argv[2]);
-    headqueue_disk(runq, dp);
     tape_left = tape_length;
     break;
@@ -2269,7 +2250,6 @@
 case TAPE_ERROR: /* TAPE-ERROR handle err mess */
 case BOGUS:
 default:
-fatal:
     update_failed_dump_to_tape(dp);
     free_serial(result_argv[2]);
     failed = 2; /* fatal problem */

---BeginMessage---
Hi Steven,

Could you try this patch? It should apply to the latest 2.4.4 snapshot from http://www.iro.umontreal.ca/~martinea/amanda

Jean-Louis

On Wed, Jun 04, 2003 at 02:16:14PM -0500, Steven M. Wilson wrote:

I have a question for the Amanda development experts. I'm using version 2.4.4 and backing up to hard disk directly (no tapes, no holding disks). On several occasions, I've had a client error cause Amanda to go into degraded mode. It appears that the dump_to_tape function (server-src/driver.c) takes any FATAL dumper error and forces Amanda into degraded mode. Shouldn't the code be more discerning as to what caused the error? I would think that Amanda should go into degraded mode only if an error were related to the output device. In my case the error was on the client and unrelated to writing the backup to disk.
Here's some of the related amdump messages:

driver: result time 6754.491 from dumper0: FAILED 01-00368 [data timeout]
taper: reader-side: got label slot024 filenum 184
driver: result time 6754.492 from taper: DONE 00-00367 slot024 184 [sec 2174.408 kb 2061376 kps 948.0 {wr: writers 64419 rdwait 2166.220 wrwait 7.959 filemark 0.021}]
driver: error time 6754.503 serial gen mismatch dump