Re: FW: selfcheck hangs

2003-06-18 Thread Steven M. Wilson




Jeremy,

That could very well be the problem I'm having since I just tried a df on
the client system and it ground to a halt trying to locate NFS mounts.  We
rely heavily on NFS here so I'll need to figure out how to get around this
problem in the future.  Thanks for the info.

Steve

Jeremy L. Mordkoff wrote:

  No, the list was no help. The problem was that the client had NFS-mounted a disk that was no longer on the net, so anything that iterated over mounts (like df) was hanging. That is probably why a reboot solved it. I don't allow key machines to be NFS clients anymore.

JLM
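
A hang like the one described above (anything that walks the mount table blocks on a dead NFS server) can be checked for before running df or amcheck. The following is only a minimal sketch, assuming a Linux client with /proc/mounts and a stat command on the PATH; the function names nfs_mount_points and mount_responds are made up for illustration and are not part of Amanda. Each probe runs in a child process so that a stuck mount blocks the child rather than the checker itself.

```python
import subprocess

def nfs_mount_points(mounts_text):
    """Return the mount points of NFS filesystems in /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            points.append(fields[1])
    return points

def mount_responds(path, timeout=5.0):
    """Return True if a statfs probe of `path` finishes within `timeout` seconds."""
    proc = subprocess.Popen(["stat", "-f", path],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        # Send SIGKILL but do not wait for the child: a process blocked on a
        # hard NFS mount may not exit until the server comes back.
        proc.kill()
        return False

if __name__ == "__main__":
    with open("/proc/mounts") as f:
        for point in nfs_mount_points(f.read()):
            status = "ok" if mount_responds(point) else "NOT RESPONDING"
            print(point, status)
```

Mount points containing spaces appear escaped (as \040) in /proc/mounts; the sketch does not decode them.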



-----Original Message-----
From:	Steven M. Wilson [mailto:[EMAIL PROTECTED]]
Sent:	Wed 6/18/2003 2:32 PM
To:	Jeremy L. Mordkoff
Cc:	
Subject:	Re: FW: selfcheck hangs
Jeremy,

Did anyone respond off-list to your posting?  I have the same problem 
here from time to time and the only way I've been able to correct it is by 
rebooting the offending client system.

Steve

Jeremy L. Mordkoff wrote:

  
  
One system has started refusing to run backups. amcheck reports a timeout. A ps on the client shows several orphaned selfcheck processes. I did try killing all the amandad processes and sending xinetd a SIGHUP, and then I tried amcheck again, to no avail. I then reinstalled amanda and repeated. Still no luck. Here's the debug log.

Any ideas would be appreciated.

JLM

-----Original Message-----
From:	root [mailto:[EMAIL PROTECTED]]
Sent:	Fri 6/13/2003 9:20 AM
To:	[EMAIL PROTECTED]
Cc:	
Subject:	
amandad: debug 1 pid 23823 ruid 527 euid 527: start at Fri Jun 13 09:16:52 2003
amandad: version 2.4.3
amandad: build: VERSION="Amanda-2.4.3"
amandad:        BUILT_DATE="Fri Apr 4 10:37:17 EST 2003"
amandad:        BUILT_MACH="Linux lux1 2.4.18-18.7.xsmp #1 SMP Wed Nov 13 19:01:42 EST 2002 i686 unknown"
amandad:        CC="gcc"
amandad:        CONFIGURE_COMMAND="'./configure' '--with-user=amanda' '--with-group=disk'"
amandad: paths: bindir="/usr/local/bin" sbindir="/usr/local/sbin"
amandad:        libexecdir="/usr/local/libexec" mandir="/usr/local/man"
amandad:        AMANDA_TMPDIR="/tmp/amanda" AMANDA_DBGDIR="/tmp/amanda"
amandad:        CONFIG_DIR="/usr/local/etc/amanda" DEV_PREFIX="/dev/"
amandad:        RDEV_PREFIX="/dev/" DUMP="/sbin/dump"
amandad:        RESTORE="/sbin/restore" SAMBA_CLIENT="/usr/bin/smbclient"
amandad:        GNUTAR="/bin/gtar" COMPRESS_PATH="/bin/gzip"
amandad:        UNCOMPRESS_PATH="/bin/gzip" MAILER="/usr/bin/Mail"
amandad:        listed_incr_dir="/usr/local/var/amanda/gnutar-lists"
amandad: defs:  DEFAULT_SERVER="lux1" DEFAULT_CONFIG="DailySet1"
amandad:        DEFAULT_TAPE_SERVER="lux1" DEFAULT_TAPE_DEVICE="/dev/null"
amandad:        HAVE_MMAP HAVE_SYSVSHM LOCKING=POSIX_FCNTL SETPGRP_VOID
amandad:        DEBUG_CODE AMANDA_DEBUG_DAYS=4 BSD_SECURITY USE_AMANDAHOSTS
amandad:        CLIENT_LOGIN="amanda" FORCE_USERID HAVE_GZIP
amandad:        COMPRESS_SUFFIX=".gz" COMPRESS_FAST_OPT="--fast"
amandad:        COMPRESS_BEST_OPT="--best" UNCOMPRESS_OPT="-dc"
amandad: time 0.000: got packet:

Amanda 2.4 REQ HANDLE 000-58790808 SEQ 1055510212
SECURITY USER amanda
SERVICE selfcheck
OPTIONS features=feff9f00;maxdumps=1;hostname=rel2;
DUMP hda3  0 OPTIONS |;auth=bsd;compress-fast;
DUMP vg01/lv_data  0 OPTIONS |;auth=bsd;compress-fast;


amandad: time 0.000: sending ack:

Amanda 2.4 ACK HANDLE 000-58790808 SEQ 1055510212


amandad: time 0.001: bsd security: remote host lux1 user amanda local user amanda
amandad: time 0.001: amandahosts security check passed
amandad: time 0.001: running service "/usr/local/libexec/selfcheck"
amandad: time 30.526: got packet:

Amanda 2.4 REQ HANDLE 000-58790808 SEQ 1055510212
SECURITY USER amanda
SERVICE selfcheck
OPTIONS features=feff9f00;maxdumps=1;hostname=rel2;
DUMP hda3  0 OPTIONS |;auth=bsd;compress-fast;
DUMP vg01/lv_data  0 OPTIONS |;auth=bsd;compress-fast;


amandad: time 31.146: received dup P_REQ packet, ACKing it
amandad: time 31.146: sending ack:

Amanda 2.4 ACK HANDLE 000-58790808 SEQ 1055510212


amandad: time 61.141: got packet:

Amanda 2.4 REQ HANDLE 000-58790808 SEQ 1055510212
SECURITY USER amanda
SERVICE selfcheck
OPTIONS features=feff9f00;maxdumps=1;hostname=rel2;
DUMP hda3  0 OPTIONS |;auth=bsd;compress-fast;
DUMP vg01/lv_data  0 OPTIONS |;auth=bsd;compress-fast;


amandad: time 61.141: received dup P_REQ packet, ACKing it
amandad: time 61.141: sending ack:

Amanda 2.4 ACK HANDLE 000-58790808 SEQ 1055510212
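
The pattern in the log above is that the service is started once, after which the server keeps resending the same request roughly every 30 seconds and amandad only ACKs the duplicates without ever sending a reply. That is consistent with selfcheck never finishing. A rough sketch for spotting this pattern in an amandad debug file follows; the function name summarize_amandad_debug is hypothetical, and the regular expressions assume the exact line wording shown in this log (Amanda 2.4.3).

```python
import re

# Line formats copied from the debug log above; other Amanda versions may differ.
SERVICE = re.compile(r'^amandad: time (\d+\.\d+): running service "([^"]+)"')
DUP_REQ = re.compile(r"^amandad: time (\d+\.\d+): received dup P_REQ packet")

def summarize_amandad_debug(text):
    """Return (service_path, [times at which a duplicate request was ACKed])."""
    service = None
    dup_times = []
    for line in text.splitlines():
        m = SERVICE.match(line)
        if m:
            service = m.group(2)
            continue
        m = DUP_REQ.match(line)
        if m:
            dup_times.append(float(m.group(1)))
    return service, dup_times
```

Two or more duplicate requests after a "running service" line, with no reply packet in between, suggest the service is hung rather than merely slow.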






 


  
  
  


-- 
Steven M. Wilson, Systems and Network Manager
Markey Center for Structural Biology
Purdue University
[EMAIL PROTECTED]    765.496.1946






Re: Holding disks and the disk output driver

2003-06-10 Thread Steven M. Wilson
Ted,

I've been using a 2 TB disk array for the past month or so and it's been 
working great for me without a holding disk.  Using a holding disk will 
require more free disk space plus the additional time to transfer from 
the holding disk to the backup disk.

I recommend getting a patch from Jean-Louis Martineau 
([EMAIL PROTECTED]) which prevents a client error from forcing 
Amanda to go into degraded mode.  When Amanda operates in degraded mode, 
it will only write to the holding disk, which in my case doesn't exist. 
But the patch prevents client-side errors from putting Amanda into 
degraded mode which allows my backups to continue being written to the 
backup disks.

Steve

Ted Cabeen wrote:

If you're using the disk output driver to run backups to a large disk
array, is there any reason to use a holding disk?
 

--
Steven M. Wilson, Systems and Network Manager
Markey Center for Structural Biology
Purdue University
[EMAIL PROTECTED]    765.496.1946




Re: Holding disks and the disk output driver

2003-06-10 Thread Steven M. Wilson




Brian,

I don't know much about the rait driver. Hopefully someone more knowledgeable
on the list will respond. I was unclear what you meant about moving to "diskless
backups"...

Steve

Brian Cuttler wrote:

  Ted,
Steve,
Amanda users,

Stupid question, how smart is the rait driver for disk ?

If you put the spool area on the output disk will it juggle
the space ok ? Will it know to move the file from one directory
to another (move the file pointer # mv perhaps) rather than 
having to copy all the bits and then remove the original ?

I only ask because someone at my site is looking to move to
diskless backups...

		thanks,

		Brian

  
  
Ted,

I've been using a 2 TB disk array for the past month or so and it's been 
working great for me without a holding disk.  Using a holding disk will 
require more free disk space plus the additional time to transfer from 
the holding disk to the backup disk.

I recommend getting a patch from Jean-Louis Martineau 
([EMAIL PROTECTED]) which prevents a client error from forcing 
Amanda to go into degraded mode.  When Amanda operates in degraded mode, 
it will only write to the holding disk, which in my case doesn't exist. 
 But the patch prevents client-side errors from putting Amanda into 
degraded mode which allows my backups to continue being written to the 
backup disks.

Steve

Ted Cabeen wrote:



  If you're using the disk output driver to run backups to a large disk
array, is there any reason to use a holding disk?

 

  
    
-- 
Steven M. Wilson, Systems and Network Manager
Markey Center for Structural Biology
Purdue University
[EMAIL PROTECTED]    765.496.1946




  
  
  


-- 
Steven M. Wilson, Systems and Network Manager
Markey Center for Structural Biology
Purdue University
[EMAIL PROTECTED]    765.496.1946






Re: Holding disks and the disk output driver

2003-06-10 Thread Steven M. Wilson
Ted,

I just noticed that you mentioned using the disk output driver.  I 
believe my problem with Amanda dropping into degraded mode was specific 
to using the tape output driver (I use disk directories that are seen as 
virtual tapes).  Sorry for confusing the issue...

Steve

Steven M. Wilson wrote:

Ted,

I've been using a 2 TB disk array for the past month or so and it's 
been working great for me without a holding disk.  Using a holding 
disk will require more free disk space plus the additional time to 
transfer from the holding disk to the backup disk.

I recommend getting a patch from Jean-Louis Martineau 
([EMAIL PROTECTED]) which prevents a client error from forcing 
Amanda to go into degraded mode.  When Amanda operates in degraded 
mode, it will only write to the holding disk, which in my case doesn't 
exist. But the patch prevents client-side errors from putting Amanda 
into degraded mode which allows my backups to continue being written 
to the backup disks.

Steve

Ted Cabeen wrote:

If you're using the disk output driver to run backups to a large disk
array, is there any reason to use a holding disk?
 


--
Steven M. Wilson, Systems and Network Manager
Markey Center for Structural Biology
Purdue University
[EMAIL PROTECTED]    765.496.1946




Re: Failed Backups

2003-06-06 Thread Steven M. Wilson
Chris,

I looked around a little in the Amanda source code and convinced myself 
that there was a bug there.  I sent a note to the amanda-hackers 
mailing list and received a prompt reply from Jean-Louis Martineau with 
a patch that fixed the problem for me.  I'll attach his message and patch.

Hope that helps!

Steve

Chris Gordon wrote:

Steve, 

On Wed, Jun 04, 2003 at 02:29:20PM -, smw_purdue wrote:
 

Chris,

I'm having the same problem using a similar configuration of backups
to disk without any holding disks.  Every time Amanda drops into
degraded mode it's because an error occurred with one of the clients
(usually a timeout, indicating that a client system was unavailable).
I would suspect that there's a bug in the code that puts Amanda into
degraded mode on more errors than just a tape error.  Notice in your
log that you have an unknown response from gilgamesh.  This error
was probably what kicked Amanda into degraded mode.
   

That is exactly what appears to be happening.  I configured a holding
disk in an attempt to eliminate that as a possible cause. In my case,
the problem is intermittent, with everything working fine for some time
and then I get a failure.  The failure may affect some file systems on a
given host or most/all of the backup run.
Today, I had two file systems fail again on gilgamesh
and I began checking the various logs for issues.  What I found in
sendbackup.lotsofnumbers.debug is:

---[ begin ]---
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1496
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1497
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1498
sendbackup: time 0.003: waiting for connect on 1496, then 1497, then 1498
sendbackup: time 29.996: stream_accept: timeout after 30 seconds
sendbackup: time 29.996: timeout on data port 1496
sendbackup: time 59.996: stream_accept: timeout after 30 seconds
sendbackup: time 59.996: timeout on mesg port 1497
sendbackup: time 89.996: stream_accept: timeout after 30 seconds
sendbackup: time 89.996: timeout on index port 1498
sendbackup: time 89.996: pid 5263 finish time Fri Jun  6 00:47:44 2003
---[ end ]---
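
In the log above, sendbackup opens three listening ports (data, mesg, index) and waits 30 seconds on each in turn; all three accepts time out, which suggests nothing ever connected to the client for this dump. A small sketch for pulling those timeouts out of a sendbackup debug file is shown here. The function name timed_out_streams is hypothetical and the regular expression assumes the line wording shown in this particular log.

```python
import re

# Matches lines such as: "sendbackup: time 29.996: timeout on data port 1496"
TIMEOUT = re.compile(r"^sendbackup: time (\d+\.\d+): timeout on (\w+) port (\d+)")

def timed_out_streams(text):
    """Return a list of (stream_name, port, time_in_seconds) for each timeout line."""
    results = []
    for line in text.splitlines():
        m = TIMEOUT.match(line)
        if m:
            results.append((m.group(2), int(m.group(3)), float(m.group(1))))
    return results
```

If all three streams appear in the result, the failure happened before any backup data was transferred, so network or server-side causes are worth checking before the client's disks.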
 

Anybody out there have time to debug the source?  I may take a look at
it but time is at a premium right now... (when isn't it???).
   

Anyone have any ideas?  This only happens occasionally and I haven't
yet been able to draw a correlation.
Thanks,
Chris
 

--
Steven M. Wilson, Systems and Network Manager
Markey Center for Structural Biology
Purdue University
[EMAIL PROTECTED]    765.496.1946

--- server-src/driver.c.orig 2003-01-01 18:28:54.0 -0500
+++ server-src/driver.c 2003-06-04 15:54:44.0 -0400
@@ -2242,10 +,10 @@
error("error [dump to tape DONE result_argc != 5: %d]", result_argc);
}
 
-   free_serial(result_argv[2]);
-
if(failed == 1) goto tryagain;  /* dump didn't work */
-   else if(failed == 2) goto fatal;
+   else if(failed == 2) goto failed_dumper;
+
+   free_serial(result_argv[2]);
 
/* every thing went fine */
update_info_dumper(dp, origsize, dumpsize, dumptime);
@@ -2259,9 +2239,10 @@
 
 case TRYAGAIN: /* TRY-AGAIN handle err mess */
 tryagain:
+   headqueue_disk(runq, dp);
+failed_dumper:
update_failed_dump_to_tape(dp);
free_serial(result_argv[2]);
-   headqueue_disk(runq, dp);
tape_left = tape_length;
break;
 
@@ -2269,7 +2250,6 @@
 case TAPE_ERROR: /* TAPE-ERROR handle err mess */
 case BOGUS:
 default:
-fatal:
update_failed_dump_to_tape(dp);
free_serial(result_argv[2]);
failed = 2; /* fatal problem */
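
The effect of the patch above on control flow can be modelled roughly as follows. This is a simplified Python model written for illustration, not the actual driver.c logic: the function name after_taper_done and the returned strings are invented, the meaning of the failed values is taken from the comments in the diff (1 = "dump didn't work", 2 = "fatal problem"), and the assumption that the old fatal path ends in degraded mode comes from the description in this thread.

```python
def after_taper_done(failed, patched):
    """Model what happens to a dump once the taper reports DONE.

    failed:  0 = dumper succeeded, 1 = dumper asked to try again,
             2 = dumper reported a failure (for example a client timeout)
    patched: True to model the behaviour with the patch applied
    """
    if failed == 1:
        # tryagain: the disk is put back at the head of the run queue
        return "requeue"
    if failed == 2:
        if patched:
            # failed_dumper: record the failed dump and carry on with the run
            return "record_failure"
        # fatal: handled like a tape error, which is assumed here to be what
        # pushes the whole run into degraded mode
        return "degraded_mode"
    return "success"
```

Under this model a client-side failure costs only the affected dump once the patch is applied, while an actual TAPE-ERROR result from the taper is still handled by the unchanged error case.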
---BeginMessage---
Hi Steven,

Could you try this patch?  It should apply to the latest 2.4.4
snapshot from http://www.iro.umontreal.ca/~martinea/amanda

Jean-Louis

On Wed, Jun 04, 2003 at 02:16:14PM -0500, Steven M. Wilson wrote:
 
 
 I have a question for the Amanda development experts.
 
 I'm using version 2.4.4 and backing up to hard disk directly (no tapes, no 
 holding disks).  On several occasions, I've had a client error cause Amanda 
 to go into degraded mode.  It appears that the dump_to_tape function 
 (server-src/driver.c) takes any FATAL dumper error and forces Amanda into  
 degraded mode.  Shouldn't the code be more discerning as to what caused the 
 error?  I would think that Amanda should go into degraded mode only if an 
 error were related to the output device.  In my case the error was on the 
 client and unrelated to writing the backup to disk.
 
 Here's some of the related amdump messages:
 
 driver: result time 6754.491 from dumper0: FAILED 01-00368 [data timeout]
 taper: reader-side: got label slot024 filenum 184
 driver: result time 6754.492 from taper: DONE 00-00367 slot024 184 [sec 
 2174.408 kb 2061376 kps 948.0 {wr: writers 64419 rdwait 2166.220 wrwait 
 7.959 filemark 0.021}]
 driver: error time 6754.503 serial gen mismatch dump