Re: FAILED backups on different hosts each night

2006-08-30 Thread Stephen Carter
 Jon LaBadie [EMAIL PROTECTED] 08/29/06 7:50 PM 
If I understand the configuration, srv2 has 4 separate installations
of the amanda client.  To amanda it appears as 4 distinct remote hosts.
As you indicate different logical hosts fail nightly, it sounds like
all have also had successful backups, thus the basic config is ok.

Do the 4 logical hosts also have their own separate disks and network
controllers?  Or is a single network interface serving multiple IP
addresses and the hosts have separate partitions on a shared disk(s)?

I ask from the view that amanda considers them distinct and may be
asking for dumps simultaneously from all 4, possibly overloading
the shared resources on the single physical client, srv2.  This
could trigger some timeout mechanism that daily hits different
logical hosts.

Even if you are only running a single dumper so multiple, simultaneous
dumps do not occur on srv2, perhaps the interval between estimates and
dumps is so long that a network timeout is triggered.

These are total guesses, just seeing if they might fly.

-- 
Jon H. LaBadie  [EMAIL PROTECTED]
 JG Computing
 4455 Province Line Road(609) 252-0159
 Princeton, NJ  08540-4322  (609) 683-7220 (fax)

Thanks for the reply Jon,

Yes, you are right in assuming my setup. All 4 servers (3 XEN guests + host) are 
using the same SATA disks and a single NIC. All servers are very low 
load systems, just running different web servers that aren't hit very regularly.

I think it could be a timing issue also, but am a bit unsure of where to look.

I see that I get all the estimates, and I always get at least 2 dumps in a run 
(1 from my physical backup server and 1 from one of the XEN host/guest 
servers). What files should I be looking at to see any timeout errors? All I 
seem to find is FAILED messages for the dumps but no explanation of why -- 
maybe I need to turn up debugging from default. I've had a look at both client 
and server but there are so many and I'm not clear as to which I should 
concentrate on.

Cheers,

Stephen Carter
Retrac Networking Limited
www: http://www.retnet.co.uk
Ph: +44 (0)7870 218 693
Fax: +44 (0)870 7060 056
CNA, CNE 6, CNS, CCNA, MCSE 2003



Re: FAILED backups on different hosts each night

2006-08-29 Thread Jon LaBadie

As no one has responded, I guess no one else has a clue either. :((

Of course, not having a clue seldom stops me from posting ;)


On Sun, Aug 27, 2006 at 04:56:03PM +0100, Stephen Carter wrote:
 I have 2 physical boxes I'm backing up, one called srv1 and the other called 
 srv2.

 srv1, which also has the tape device and runs the amanda backups, is 
 always backed up correctly.

 srv2 is a SLES 10 server running 3 virtual SLES 10 XEN guests within it, but 
 I'm treating them as separate physical boxes for the purposes of amanda.

 On different nights, different XEN guests fail (including the host, srv2) 
 with a could not connect error in the amanda report.

 amstatus says 'wait for dumping driver: (aborted:could not connect to data 
 port: Connection timed out)'


If I understand the configuration, srv2 has 4 separate installations
of the amanda client.  To amanda it appears as 4 distinct remote hosts.
As you indicate different logical hosts fail nightly, it sounds like
all have also had successful backups, thus the basic config is ok.

Do the 4 logical hosts also have their own separate disks and network
controllers?  Or is a single network interface serving multiple IP
addresses and the hosts have separate partitions on a shared disk(s)?

I ask from the view that amanda considers them distinct and may be
asking for dumps simultaneously from all 4, possibly overloading
the shared resources on the single physical client, srv2.  This
could trigger some timeout mechanism that daily hits different
logical hosts.

Even if you are only running a single dumper so multiple, simultaneous
dumps do not occur on srv2, perhaps the interval between estimates and
dumps is so long that a network timeout is triggered.

These are total guesses, just seeing if they might fly.


-- 
Jon H. LaBadie  [EMAIL PROTECTED]
 JG Computing
 4455 Province Line Road(609) 252-0159
 Princeton, NJ  08540-4322  (609) 683-7220 (fax)


FAILED backups on different hosts each night

2006-08-27 Thread Stephen Carter
I have 2 physical boxes I'm backing up, one called srv1 and the other called 
srv2.

srv1, which also has the tape device and runs the amanda backups, is 
always backed up correctly.

srv2 is a SLES 10 server running 3 virtual SLES 10 XEN guests within it, but 
I'm treating them as separate physical boxes for the purposes of amanda.
 
On different nights, different XEN guests fail (including the host, srv2) with 
a could not connect error in the amanda report.

amstatus says 'wait for dumping driver: (aborted:could not connect to data 
port: Connection timed out)'

amdump.1 reports that all estimates worked, with FAILED QUEUE: empty, and the 
DONE QUEUE: includes all DLEs listed in the disklist.

amdump.1 then reports the dumper activity: 2 dumps work, and my other 4 DLEs 
fail with:
dumper: stream_client: connect to 192.168.0.9:12359 failed: Connection timed out
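
A rough way to see which clients the dumper could not reach is to pull these lines out of the amdump.N file. The following is only a sketch: it assumes the failure lines always have the shape shown above, and the names failed_connects and failures_per_host are made up for this illustration, not part of amanda.

```python
import re
from collections import Counter

# Matches dumper lines of the form quoted in the message above, e.g.
#   dumper: stream_client: connect to 192.168.0.9:12359 failed: Connection timed out
# (assumption: the address is always printed as IPv4 "a.b.c.d:port")
FAILED_CONNECT = re.compile(
    r"stream_client: connect to (?P<host>[\d.]+):(?P<port>\d+) failed: (?P<reason>.+)"
)


def failed_connects(log_text):
    """Return (host, port, reason) for every failed data-port connect."""
    return [
        (m.group("host"), int(m.group("port")), m.group("reason").strip())
        for m in FAILED_CONNECT.finditer(log_text)
    ]


def failures_per_host(log_text):
    """Count failed connects per client address."""
    return Counter(host for host, _port, _reason in failed_connects(log_text))


# Two lines copied from the log excerpts in this thread.
sample = """\
dumper: stream_client: connected to 192.168.0.1.51236
dumper: stream_client: connect to 192.168.0.9:12359 failed: Connection timed out
"""

for host, port, reason in failed_connects(sample):
    print(f"{host} port {port}: {reason}")
```

Running this over several nights of amdump.N files would show whether the failures follow particular addresses or move around at random.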

I allow all traffic between srv1 (my backup server) and all clients, and 
thinking it may have been a throughput problem I reduced parallel dumps to 1 
which hasn't helped.

A copy of the latest amstatus output and a section from my amdump.1 file are 
below.  Any help would be greatly appreciated.


AMSTATUS OUTPUT:
srv1:/var/lib/amanda/DailySet1 # amstatus DailySet1
Using /var/lib/amanda/DailySet1/amdump.1 from Fri Aug 25 01:00:02 BST 2006

srv1.retnet.co.uk:md0          3   352152k finished (1:17:18)
mailscan.retnet.co.uk:hda2     0  1062300k wait for dumping driver: (aborted:could not connect to data port: Connection timed out)
srv2.retnet.co.uk:/srv/install 0 21497250k wait for dumping driver: (aborted:could not connect to data port: Connection timed out)
srv2.retnet.co.uk:md0          0  4242910k wait for dumping driver: (aborted:could not connect to data port: Connection timed out)
web-1.retnet.co.uk:hda2        0   699770k finished (1:33:02)
web-2.retnet.co.uk:hda2        0   906355k wait for dumping driver: (aborted:could not connect to data port: Connection timed out)

SUMMARY          part      real  estimated
                           size       size
partition       :   6
estimated       :   6             28769687k
flush           :   0         0k
failed          :   0                    0k           (  0.00%)
wait for dumping:   4             27708815k           ( 96.31%)
dumping to tape :   0                    0k           (  0.00%)
dumping         :   0         0k         0k (  0.00%) (  0.00%)
dumped          :   2   1051922k   1060872k ( 99.16%) (  3.66%)
wait for writing:   0         0k         0k (  0.00%) (  0.00%)
wait to flush   :   0         0k         0k (100.00%) (  0.00%)
writing to tape :   0         0k         0k (  0.00%) (  0.00%)
failed to tape  :   0         0k         0k (  0.00%) (  0.00%)
taped           :   2   1051922k   1060872k ( 99.16%) (  3.66%)
  tape 1        :   2   1051922k   1060872k (  2.94%) DailySet1-5
1 dumper idle   : not-idle
taper idle
network free kps:  2600
holding space   :  33792000k (100.00%)
 dumper0 busy   :  0:40:08  ( 95.25%)
   taper busy   :  0:06:47  ( 16.10%)
 0 dumpers busy :  0:00:00  (  0.00%)
 1 dumper busy  :  0:42:08  (100.00%)    not-idle:  0:28:40  ( 68.07%)
                                       no-dumpers:  0:13:27  ( 31.93%)
srv1:/var/lib/amanda/DailySet1 #
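
For comparing runs, the per-DLE lines of amstatus output like the above can be reduced to a list of which DLEs were aborted. The sketch below assumes each entry has the form "host:disk level sizek status" and that long lines may have been wrapped by the mailer; parse_amstatus is a hypothetical helper written for this illustration, not an amanda tool.

```python
import re

# host:disk  level  size-in-KB, e.g. "web-2.retnet.co.uk:hda2 0 906355k"
DLE = re.compile(
    r"(?P<host>[\w.-]+):(?P<disk>\S+)\s+(?P<level>\d+)\s+(?P<size>\d+)k\b"
)


def parse_amstatus(text):
    """Return one dict per DLE line, flagging entries whose status says 'aborted'."""
    # Collapse all whitespace so entries wrapped across lines are rejoined.
    flat = " ".join(text.split())
    matches = list(DLE.finditer(flat))
    rows = []
    for i, m in enumerate(matches):
        # The status text runs from the end of this entry to the start of the next.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(flat)
        status = flat[m.end():end]
        rows.append({
            "dle": f"{m.group('host')}:{m.group('disk')}",
            "level": int(m.group("level")),
            "size_kb": int(m.group("size")),
            "aborted": "aborted" in status,
        })
    return rows


# Three entries copied from the amstatus output in this message (one wrapped).
sample = """\
srv1.retnet.co.uk:md0 3   352152k finished (1:17:18)
mailscan.retnet.co.uk:hda2   0  1062300k wait for dumping driver:
(aborted:could not connect to data port: Connection timed out)
web-1.retnet.co.uk:hda2  0   699770k finished (1:33:02)
"""

for row in parse_amstatus(sample):
    print(row["dle"], "ABORTED" if row["aborted"] else "ok", row["size_kb"])
```

The summary section of amstatus is not handled here; the sketch only looks at the per-DLE entries.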




AMDUMP.1 PARTIAL OUTPUT:
driver: adding holding disk 0 dir /mnt/dumps size 33792000
reserving 33792000 out of 33792000 for degraded-mode dumps
driver: flush size 0
driver: start time 812.693 inparallel 1 bandwidth 2600 diskspace 33792000 dir OBSOLETE datestamp 20060825
driver: drain-ends tapeq FIRST big-dumpers ttt
driver: result time 812.693 from taper: TAPER-OK
driver: send-cmd time 812.703 to dumper0: FILE-DUMP 00-1 /mnt/dumps/20060825/srv1.retnet.co.uk.md0.3 srv1.retnet.co.uk feff9ffe0f md0 NODEVICE 3 2006:8:22:0:36:52 1073741824 GNUTAR 356544 |;bsd-auth;compress-best;index;exclude-list=/usr/lib/amanda/exclude.gtar;
driver: state time 812.703 free kps: -2090 space: 33435456 taper: idle idle-dumpers: 0 qlen tapeq: 0 runq: 5 roomq: 0 wakeup: 86400 driver-idle: not-idle
driver: interface-state time 812.703 if : free -3890 if ETH0: free 800 if LOCAL: free 1000
driver: hdisk-state time 812.703 hdisk 0: free 33435456 dumpers 1
dumper: stream_client: connected to 192.168.0.1.51236
dumper: stream_client: our side is 0.0.0.0.51239
dumper: stream_client: connected to 192.168.0.1.51237
dumper: stream_client: our side is 0.0.0.0.51240
dumper: stream_client: connected to 192.168.0.1.51238
dumper: stream_client: our side is 0.0.0.0.51241
driver: result time 901.369 from dumper0: DONE 00-1 441620 352152 89 [sec 88.636 kb 352152 kps 3973.0 orig-kb 441620]
driver: finished-cmd time 901.387 dumper0 dumped srv1.retnet.co.uk:md0
driver: send-cmd time 901.387 to taper: FILE-WRITE 00-2 /mnt/dumps/20060825/srv1.retnet.co.uk.md0.3 srv1.retnet.co.uk feff9ffe0f md0 3 20060825
driver: startaflush: FIRST srv1.retnet.co.uk md0 352185 3584
driver: send-cmd time 901.387 to dumper0: FILE-DUMP 

Re: Old failed backups

2003-10-16 Thread Nicolas Ecarnot
Lucio wrote:
Two problems (maybe related):

1 - I've got two failed backups in the holding disk. I do not want to 
flush them on tape for a number of reasons, one being because they aren't 
useful anymore.
I may be wrong, but what I do in that case is:
$ rm -fr /somewhere/holdingDisk/Dailyset1/*
$ amcleanup Dailyset1
--
Nicolas Ecarnot



Re: Old failed backups

2003-10-16 Thread Lucio
 Lucio wrote:
  1 - I've got two failed backups in the holding disk. I do not want to
  flush them on tape for a number of reasons, one being because they aren't
  useful anymore.

 $ rm -fr /somewhere/holdingDisk/Dailyset1/*
 $ amcleanup Dailyset1

Does amcleanup fix the index as well? Will amflush stop telling me there are 
two backups on the holding disk? I don't want to rm -fr if the index gets 
corrupted because rm -fr is not reversible.



Old failed backups

2003-10-15 Thread Lucio
I've got two very old failed backups in the holding disk. I do not want to 
flush them on tape for a number of reasons, one being because they aren't 
useful anymore.

How do I force amanda to empty the holding disk and to update its index files 
without writing to a tape?

Lucio.



Re: Failed Backups

2003-06-06 Thread Chris Gordon
Steve, 


On Wed, Jun 04, 2003 at 02:29:20PM -, smw_purdue wrote:
 Chris,
 
 I'm having the same problem using a similar configuration of backups
 to disk without any holding disks.  Every time Amanda drops into
 degraded mode it's because an error occurred with one of the clients
 (usually a timeout, indicating that a client system was unavailable).
  I would suspect that there's a bug in the code that puts Amanda into
 degraded mode on more errors than just a tape error.  Notice in your
 log that you have an unknown response from gilgamesh.  This error
 was probably what kicked Amanda into degraded mode.

That is exactly what appears to be happening.  I configured a holding
disk in an attempt to eliminate that as a possible cause. In my case,
the problem is intermittent with everything working fine for some time
and then a failure.  The failure may affect some file systems on a given
host or most/all of the backup run.

Today, I had two file systems fail again on gilgamesh
and I began checking the various logs for issues.  What I found in
sendbackup.lotsofnumbers.debug is:

---[ begin ]---
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1496
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1497
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1498
sendbackup: time 0.003: waiting for connect on 1496, then 1497, then 1498
sendbackup: time 29.996: stream_accept: timeout after 30 seconds
sendbackup: time 29.996: timeout on data port 1496
sendbackup: time 59.996: stream_accept: timeout after 30 seconds
sendbackup: time 59.996: timeout on mesg port 1497
sendbackup: time 89.996: stream_accept: timeout after 30 seconds
sendbackup: time 89.996: timeout on index port 1498
sendbackup: time 89.996: pid 5263 finish time Fri Jun  6 00:47:44 2003
---[ end ]---
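
The excerpt shows the client opening three ports (data, mesg, index) and giving up on each after 30 seconds because nothing connected. To check many sendbackup.*.debug files for the same pattern, a small filter along these lines may help. It is a sketch that assumes the "timeout on <stream> port <n>" wording shown above; stream_timeouts is a name invented for this example.

```python
import re

# Matches lines such as:
#   sendbackup: time 29.996: timeout on data port 1496
TIMEOUT = re.compile(
    r"sendbackup: time (?P<secs>\d+\.\d+): "
    r"timeout on (?P<stream>\w+) port (?P<port>\d+)"
)


def stream_timeouts(debug_text):
    """Return (seconds, stream, port) for each stream the client gave up on."""
    return [
        (float(m.group("secs")), m.group("stream"), int(m.group("port")))
        for m in TIMEOUT.finditer(debug_text)
    ]


# Lines copied from the debug excerpt above, with mail wrapping undone.
sample = """\
sendbackup: time 0.003: waiting for connect on 1496, then 1497, then 1498
sendbackup: time 29.996: stream_accept: timeout after 30 seconds
sendbackup: time 29.996: timeout on data port 1496
sendbackup: time 59.996: stream_accept: timeout after 30 seconds
sendbackup: time 59.996: timeout on mesg port 1497
sendbackup: time 89.996: stream_accept: timeout after 30 seconds
sendbackup: time 89.996: timeout on index port 1498
"""

for secs, stream, port in stream_timeouts(sample):
    print(f"{stream:5s} port {port} timed out at {secs:.3f}s")
```

A debug file with no matches means the server did connect to all three ports, so any failure for that DLE would have to be looked for elsewhere.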

 Anybody out there have time to debug the source?  I may take a look at
 it but time is at a premium right now... (when isn't it???).

Anyone have any ideas?  This only happens occasionally and I haven't
yet been able to draw a correlation.

Thanks,
Chris


Re: Failed Backups

2003-06-06 Thread Steven M. Wilson
Chris,

I looked around a little in the Amanda source code and convinced myself 
that there was a bug there.  I sent a note to the amanda-hackers 
mailing list and received a prompt reply from Jean-Louis Martineau with 
a patch that fixed the problem for me.  I'll attach his message and patch.

Hope that helps!

Steve

Chris Gordon wrote:

Steve, 

On Wed, Jun 04, 2003 at 02:29:20PM -, smw_purdue wrote:
 

Chris,

I'm having the same problem using a similar configuration of backups
to disk without any holding disks.  Every time Amanda drops into
degraded mode it's because an error occurred with one of the clients
(usually a timeout, indicating that a client system was unavailable).
I would suspect that there's a bug in the code that puts Amanda into
degraded mode on more errors than just a tape error.  Notice in your
log that you have an unknown response from gilgamesh.  This error
was probably what kicked Amanda into degraded mode.
   

That is exactly what appears to be happening.  I configured a holding
disk in an attempt to eliminate that as a possible cause. In my case,
the problem is intermittent with everything working fine for some time
and then a failure.  The failure may affect some file systems on a given
host or most/all of the backup run.
Today, I had two file systems fail again on gilgamesh
and I began checking the various logs for issues.  What I found in
sendbackup.lotsofnumbers.debug is:

---[ begin ]---
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1496
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1497
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1498
sendbackup: time 0.003: waiting for connect on 1496, then 1497, then 1498
sendbackup: time 29.996: stream_accept: timeout after 30 seconds
sendbackup: time 29.996: timeout on data port 1496
sendbackup: time 59.996: stream_accept: timeout after 30 seconds
sendbackup: time 59.996: timeout on mesg port 1497
sendbackup: time 89.996: stream_accept: timeout after 30 seconds
sendbackup: time 89.996: timeout on index port 1498
sendbackup: time 89.996: pid 5263 finish time Fri Jun  6 00:47:44 2003
---[ end ]---
 

Anybody out there have time to debug the source?  I may take a look at
it but time is at a premium right now... (when isn't it???).
   

Anyone have any ideas?  This only happens occasionally and I haven't
yet been able to draw a correlation.
Thanks,
Chris
 

--
Steven M. Wilson, Systems and Network Manager
Markey Center for Structural Biology
Purdue University
[EMAIL PROTECTED]765.496.1946

--- server-src/driver.c.orig2003-01-01 18:28:54.0 -0500
+++ server-src/driver.c 2003-06-04 15:54:44.0 -0400
@@ -2242,10 +,10 @@
error("error [dump to tape DONE result_argc != 5: %d]", result_argc);
}
 
-   free_serial(result_argv[2]);
-
if(failed == 1) goto tryagain;  /* dump didn't work */
-   else if(failed == 2) goto fatal;
+   else if(failed == 2) goto failed_dumper;
+
+   free_serial(result_argv[2]);
 
/* every thing went fine */
update_info_dumper(dp, origsize, dumpsize, dumptime);
@@ -2259,9 +2239,10 @@
 
 case TRYAGAIN: /* TRY-AGAIN handle err mess */
 tryagain:
+   headqueue_disk(runq, dp);
+failed_dumper:
update_failed_dump_to_tape(dp);
free_serial(result_argv[2]);
-   headqueue_disk(runq, dp);
tape_left = tape_length;
break;
 
@@ -2269,7 +2250,6 @@
 case TAPE_ERROR: /* TAPE-ERROR handle err mess */
 case BOGUS:
 default:
-fatal:
update_failed_dump_to_tape(dp);
free_serial(result_argv[2]);
failed = 2; /* fatal problem */
---BeginMessage---
Hi Steven,

Could you try this patch?  It should apply to the latest 2.4.4
snapshot from http://www.iro.umontreal.ca/~martinea/amanda

Jean-Louis

On Wed, Jun 04, 2003 at 02:16:14PM -0500, Steven M. Wilson wrote:
 
 
 I have a question for the Amanda development experts.
 
 I'm using version 2.4.4 and backing up to hard disk directly (no tapes, no 
 holding disks).  On several occasions, I've had a client error cause Amanda 
 to go into degraded mode.  It appears that the dump_to_tape function 
 (server-src/driver.c) takes any FATAL dumper error and forces Amanda into  
 degraded mode.  Shouldn't the code be more discerning as to what caused the 
 error?  I would think that Amanda should go into degraded mode only if an 
 error were related to the output device.  In my case the error was on the 
 client and unrelated to writing the backup to disk.
 
 Here's some of the related amdump messages:
 
 driver: result time 6754.491 from dumper0: FAILED 01-00368 [data timeout]
 taper: reader-side: got label slot024 filenum 184
 driver: result time 6754.492 from taper: DONE 00-00367 slot024 184 [sec 
 2174.408 kb 2061376 kps 948.0 {wr: writers 64419 rdwait 2166.220 wrwait 
 7.959 filemark 0.021}]
 driver: error time 6754.503 serial gen mismatch dump of 

Re: Failed Backups

2003-06-01 Thread Chris Gordon
Jon,

Thanks for looking at this for me.

On Sat, May 31, 2003 at 03:37:18AM -0400, Jon LaBadie wrote:
  
  -- AMANDA MAIL REPORT --
  These dumps were to tape standard14.
  The next 7 tapes Amanda expects to used are:
  standard16, standard17, standard18, standard19, standard20, standard21, standard22.
 
 Interesting that standard15 is not mentioned.
 It may have bearing on my guesses.

That tape has been used before -- I have been running amanda long
enough for it to cycle through all of my tapes and to have used
standard15 before.  I have rechecked everything to make sure it is set up
like all of my other tapes (I used a script to initially create them
all to minimize chance of errors.).
 
  FAILURE AND STRANGE DUMP SUMMARY:
gilgamesh. / lev 1 FAILED [unknown response: 0;]
gilgamesh. / lev 1 FAILED [dump to tape failed]
goblin.the /var lev 1 FAILED [can't dump no-hold disk in degraded mode]
 
 A scan of the source shows that message only coming in one place.
 At that time the backup has entered degraded mode.  Further, the message
 is only printed if the backup is not using a holding disk.
 
 So first I presume you are not using a holding disk.

I have added a holding disk to see if that helps.
 
 So I'm guessing you have a size limit on your disk tapes and standard14
 reached that limit.

Yes, I have them set to 5 GB.  I can't find any reference in the man
pages, but would setting the length to 0 let the tape be infinitely
long?

 When the changer script went to switch to standard15,
 an error occurred.  That put you into degraded mode, and without a holding
 disk, backups of all subsequent DLE's failed.
 
 A place to start looking at least.

I've read over all of the man pages and the limited data I've found on
the net.  From your comments, it seems that reading the source is the
only really good source of detailed documentation and troubleshooting.
Is that true and if so, is there a specific place I should start reading
to get details of error messages, etc?

Thanks,
Chris


Failed Backups

2003-05-31 Thread Chris Gordon
I've been running amanda for several months with backups to disk
(amanda version 2.4.3).  Recently I've had backups failing and can't
figure out what the problem may be.

Some details:
 - Clients and backup server are Linux (RedHat 8)
 - backup disk has plenty of free space (80 GB drive with only 35% in
   use)

Below is an example report from one of the failed dumps.  Nothing has
recently changed that should affect backups.  I haven't found
anything to help point me in the right direction and would appreciate
any pointers.

Thanks,
Chris

-- AMANDA MAIL REPORT --
These dumps were to tape standard14.
The next 7 tapes Amanda expects to used are:
standard16, standard17, standard18, standard19, standard20, standard21, standard22.

FAILURE AND STRANGE DUMP SUMMARY:
  gilgamesh. / lev 1 FAILED [unknown response: 0;]
  gilgamesh. / lev 1 FAILED [dump to tape failed]
  goblin.the /var lev 1 FAILED [can't dump no-hold disk in degraded mode]
  gilgamesh. /home lev 1 FAILED [can't dump no-hold disk in degraded mode]
  gilgamesh. /usr lev 1 FAILED [can't dump no-hold disk in degraded mode]
  hades.theo /var lev 1 FAILED [can't dump no-hold disk in degraded mode]
  psyche.the /usr lev 1 FAILED [can't dump no-hold disk in degraded mode]
  psyche.the /var lev 3 FAILED [can't dump no-hold disk in degraded mode]
  hades.theo / lev 1 FAILED [can't dump no-hold disk in degraded mode]
  goblin.the /usr lev 2 FAILED [can't dump no-hold disk in degraded mode]


STATISTICS:
                          Total       Full      Daily
Estimate Time (hrs:min)    0:02
Run Time (hrs:min)         0:02
Dump Time (hrs:min)        0:00       0:00       0:00
Output Size (meg)           3.0        0.0        3.0
Original Size (meg)         9.4        0.0        9.4
Avg Compressed Size (%)    31.5        --        31.5   (level:#disks ...)
Filesystems Dumped            9          0          9   (1:9)
Avg Dump Rate (k/s)       298.7        --       298.7

Tape Time (hrs:min)        0:00       0:00       0:00
Tape Size (meg)             3.5        0.0        3.5
Tape Used (%)               0.3        0.0        0.3   (level:#disks ...)
Filesystems Taped            10          0         10   (1:10)
Avg Tp Write Rate (k/s)   300.0        --       300.0


FAILED AND STRANGE DUMP DETAILS:

/-- gilgamesh. / lev 1 FAILED [unknown response: 0;]
\

NOTES:
  planner: Incremental of psyche.theory14.net:/var bumped to level 3.
  planner: Full dump of psyche.theory14.net:/var promoted from 2 days ahead.
  taper: tape standard14 kb 3552 fm 10 [OK]

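DUMP SUMMARY:
                                DUMPER STATS             TAPER STATS
HOSTNAME     DISK  L ORIG-KB OUT-KB COMP% MMM:SS   KB/s MMM:SS   KB/s
---------------------------------------------------------------------
gilgamesh.th /     1 FAILED -----------------------------------------
gilgamesh.th /boot 1      10     64 640.0   0:00    0.0   0:00  218.8
gilgamesh.th /home 1 FAILED -----------------------------------------
gilgamesh.th /usr  1 FAILED -----------------------------------------
gilgamesh.th /var  1    2370    448  18.9   0:02  180.5   0:03  131.1
goblin.theor /     1    1320    160  12.1   0:01   76.9   0:01  121.3
goblin.theor /boot 1      10     64 640.0   0:00    0.0   0:00 1008.4
goblin.theor /home 1    4710   2496  53.0   0:03  963.0   0:03  974.2
goblin.theor /usr  2 FAILED -----------------------------------------
goblin.theor /var  1 FAILED -----------------------------------------
hades.theory /     1 FAILED -----------------------------------------
hades.theory /boot 1      10     64 640.0   0:00    0.0   0:00  171.2
hades.theory /var  1 FAILED -----------------------------------------
psyche.theor /     1    1080    128  11.9   0:03   26.7   0:03   37.6
psyche.theor /boot 1      10     64 640.0   0:00    0.0   0:00  603.0
psyche.theor /home 1      90     64  71.1   0:00   15.5   0:00  243.9
psyche.theor /usr  1 FAILED -----------------------------------------
psyche.theor /var  3 FAILED -----------------------------------------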
  
(brought to you by Amanda version 2.4.3)



Re: Failed Backups

2003-05-31 Thread Jon LaBadie
On Fri, May 30, 2003 at 11:32:54PM -0400, Chris Gordon wrote:
 I've been running amanda for several months with backups to disk
 (amanda version 2.4.3).  Recently I've had backups failing and can't
 figure out what the problem may be.  
 
 Some details:
  - Clients and backup server are Linux (RedHat 8)
  - backup disk has plenty of free space (80 GB drive with only 35% in use)
 
 Below is an example report from one of the failed dumps.  Nothing has
 recently changed that should affect backups.  I haven't found
 anything to help point me in the right direction and would appreciate
 any pointers.
 
 -- AMANDA MAIL REPORT --
 These dumps were to tape standard14.
 The next 7 tapes Amanda expects to used are:
 standard16, standard17, standard18, standard19, standard20, standard21, standard22.


Interesting that standard15 is not mentioned.
It may have bearing on my guesses.


 FAILURE AND STRANGE DUMP SUMMARY:
   gilgamesh. / lev 1 FAILED [unknown response: 0;]
   gilgamesh. / lev 1 FAILED [dump to tape failed]
   goblin.the /var lev 1 FAILED [can't dump no-hold disk in degraded mode]


A scan of the source shows that message only coming in one place.
At that time the backup has entered degraded mode.  Further, the message
is only printed if the backup is not using a holding disk.

So first I presume you are not using a holding disk.

Second, looking at the source (quickly, so I might have missed something)
degraded mode is only entered after a dump starts if a tape error occurs.

So I'm guessing you have a size limit on your disk tapes and standard14
reached that limit.  When the changer script went to switch to standard15,
an error occurred.  That put you into degraded mode, and without a holding
disk, backups of all subsequent DLE's failed.

A place to start looking at least.

-- 
Jon H. LaBadie  [EMAIL PROTECTED]
 JG Computing
 4455 Province Line Road(609) 252-0159
 Princeton, NJ  08540-4322  (609) 683-7220 (fax)


Problem: Failed backups overwrite good backups

2003-03-11 Thread Jonathan Swaby
 
 I am using Amanda 2.4.3b3 on a Linux RH 7.2 box to dump several
 windows clients to disk. I discovered a problem yesterday with my
 process. I run all of the backup jobs from a script. Each backup is a
 full backup. When one job completes, the next job runs. This all works
 correctly if the backup server is able to access the machine. If it is
 not able to connect to the machine, perhaps because the machine is off, the
 existing backup files are overwritten. Does anyone know of a way to
 prevent this from happening? If it fails, I want it to leave the
 existing backup files.
 
 The dumpcycles are set to 0 and the number of tapes is 1. I am just
 getting the system going, and I did not have a good feel for how much
 drive space was going to be consumed by the backups.
 
 If any one cares, this is how the system works.
 
 I created a couple of web pages to allow the user 
 to add their machines to the backup list. The web pages are restricted
 via ip address. The user is informed that this is experimental and
 that they should also back up their data to zip disk or CD. The user is
 also instructed to contact the support person to make changes to the
 computers to allow the backups to happen. We only back up My
 Documents and Eudora. The user enters some basic information into a
 form and some information like IP address is collected behind the
 scenes. The data from the form is added to a mysql
 database. Initially, my plan was to have the users submit a date and
 time to run the backup. The users did not like this idea. I guess too
 much work for them. Good thing I guess, as I read some time
 thereafter that I could not run concurrent amdump jobs.
 
 I wrote a C program to construct everything else. Before doing this,
 I had to construct some template files for amanda.conf, changer.conf,
 and the disklist. I use sed to create the useable files. Anyways, the
 program checks for any new additions to the list. I have 2 servers to
 backup 80 machines in 5 different buildings. Each server runs the
 program and only looks for certain subnets. If a new machine has been
 added, it creates all of the directories, files, and tapes. Next the program
 looks through the list for its machines and creates a shell script to
 perform the backups and uses at to schedule the script. As I said
 this all works fine. The only problems I have run into have been with
 ZoneAlarm and the users' PCs not being set up correctly.
 
 Thanks 
 Jonathan Swaby


Re: Problem: Failed backups overwrite good backups

2003-03-11 Thread Jonathan Swaby
 
 On Tue, Mar 11, 2003 at 09:15:10AM -0500, Jonathan Swaby wrote:
   
   I am using Amanda 2.4.3b3 on a Linux RH 7.2 box to dump several
   windows clients to disk. I discovered a problem yesterday with my
   process. I run all of the backup jobs from a script. Each backup is a
   full backup. When one job completes, the next job runs. This all works
   correctly if the backup server is able to access the machine. If it is
   not able to connect to the machine, perhaps because the machine is off, the
   existing backup files are overwritten. Does anyone know of a way to
   prevent this from happening? If it fails, I want it to leave the
   existing backup files.
 
 Which files are overwritten?
 Is it the files in holding disk? that's normal if you run more than
 one amdump by day for the same disk.
Essentially it is overwriting the tape. In my case the tape is a
directory on disk. I assumed it would only do this if it had data to
write, but that does not appear to be the case.


 
 Jean-Louis
 -- 
 Jean-Louis Martineau email: [EMAIL PROTECTED] 
 Departement IRO, Universite de Montreal
 C.P. 6128, Succ. CENTRE-VILLETel: (514) 343-6111 ext. 3529
 Montreal, Canada, H3C 3J7Fax: (514) 343-5834
 



Re: Problem: Failed backups overwrite good backups

2003-03-11 Thread Jean-Louis Martineau
On Tue, Mar 11, 2003 at 01:35:48PM -0500, Jonathan Swaby wrote:
  
  On Tue, Mar 11, 2003 at 09:15:10AM -0500, Jonathan Swaby wrote:

I am using Amanda 2.4.3b3 on a Linux RH 7.2 box to dump several
windows clients to disk. I discovered a problem yesterday with my
process. I run all of the backup jobs from a script. Each backup is a
full backup. When one job completes, the next job runs. This all works
correctly if the backup server is able to access the machine. If it is
not able to connect to the machine, perhaps because the machine is off, the
existing backup files are overwritten. Does anyone know of a way to
prevent this from happening? If it fails, I want it to leave the
existing backup files.
  
  Which files are overwritten?
  Is it the files in holding disk? that's normal if you run more than
  one amdump by day for the same disk.
 Essentially it is overwriting the tape. In my case the tape is a
 directory on disk. I assumed it would only do this if it had data to
 write, but that does not appear to be the case.

It's a tape, it is overwritten at every run, that's the way it works,
that's the way it should work (like a tape).

Jean-Louis
-- 
Jean-Louis Martineau email: [EMAIL PROTECTED] 
Departement IRO, Universite de Montreal
C.P. 6128, Succ. CENTRE-VILLETel: (514) 343-6111 ext. 3529
Montreal, Canada, H3C 3J7Fax: (514) 343-5834


Re: Problem: Failed backups overwrite good backups

2003-03-11 Thread Jonathan Swaby
 
 On Tue, Mar 11, 2003 at 01:35:48PM -0500, Jonathan Swaby wrote:
   
   On Tue, Mar 11, 2003 at 09:15:10AM -0500, Jonathan Swaby wrote:
 
 I am using Amanda 2.4.3b3 on a Linux RH 7.2 box to dump several
 windows clients to disk. I discovered a problem yesterday with my
 process. I run all of the backup jobs from a script. Each backup is a
 full backup. When one job completes, the next job runs. This all works
 correctly if the backup server is able to access the machine. If it is
 not able to connect to the machine, perhaps because the machine is off, the
 existing backup files are overwritten. Does anyone know of a way to
 prevent this from happening? If it fails, I want it to leave the
 existing backup files.
   
   Which files are overwritten?
   Is it the files in holding disk? that's normal if you run more than
   one amdump by day for the same disk.
  Essentially it is overwriting the tape. In my case the tape is a
  directory on disk. I assumed it would only do this if it had data to
  write, but that does not appear to be the case.
 
 It's a tape, it is overwritten at every run, that's the way it works,
 that's the way it should work (like a tape).
I thought that it would erase the file only if it had something to
write. It seems that it erases then checks to see if there is
something to write.

In any event my problem is solved, I think. I wrote a small C program
that takes its input from amcheck. If it sees the word ERROR or
WARNING, it will return a value of 10. If amcheck works, it will
return a value of 0. So, my script looks like this:

su -c "amcheck amachine1 | backup_test" operator && su -c "amdump
machine1" operator

If backup_test returns a 0, it will do the dump.
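
The C program itself was not posted. The gate it is described as performing can be sketched in Python as follows; this is only an illustration of the stated behaviour (return 10 if the text contains ERROR or WARNING, otherwise 0), and the sample strings are made up rather than real amcheck output.

```python
import sys


def backup_test(amcheck_output):
    """Return 10 if the text mentions ERROR or WARNING, otherwise 0."""
    if "ERROR" in amcheck_output or "WARNING" in amcheck_output:
        return 10
    return 0


def main(stream=sys.stdin):
    """Read the piped-in amcheck report and return the gate's exit status."""
    return backup_test(stream.read())


# Hypothetical inputs, for illustration only (not real amcheck messages).
print(backup_test("all checks passed"))
print(backup_test("WARNING: client did not answer"))
```

To use it as a filter in a pipeline like the one above, the script would end with sys.exit(main()) so that the shell only goes on to run amdump when the status is 0.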

Thanks 
Jonathan Swaby


 
 Jean-Louis
 -- 
 Jean-Louis Martineau email: [EMAIL PROTECTED] 
 Departement IRO, Universite de Montreal
 C.P. 6128, Succ. CENTRE-VILLETel: (514) 343-6111 ext. 3529
 Montreal, Canada, H3C 3J7Fax: (514) 343-5834
 



Failed backups

2002-06-23 Thread Lee Fellows


Hi,

  Since setting amanda up, I have constantly run into failed backups
with timeouts reported.  If I run amdump on the configuration, it works
fine, but if I let the cron job call it during the night, it fails.

  We have a local network with a vpn to our remote servers.  The tape
server is only tasked with backing up some files on its hard drive and
a share from an NT box.  Both machines are local.  The total size of
the backups being requested is less than 5 Gig. and the tape capacity
is 40 Gig.  The NT box is also the dns server of first resort.

  Apparently, if/when the vpn goes down, sendsize gets lost and times
out in reporting to amandad (if I understand correctly).  If the vpn is
up, sendsize has no difficulty whatsoever.  I found this by seeing
amandad and sendsize still running on the tape server at 7:30 AM when
the cron job started at 1:00 AM.  When I discovered that the vpn was
down and restarted it, amandad and sendsize happily finished, reporting
the timeout error.  Unfortunately, the vpn goes down most nights,
although not by design.

  Any ideas why sendsize would (mis)behave in this manner?
  Any ideas what I can do to work around this?

Thank you.

  Lee




Failed backups

2002-05-15 Thread Lee Fellows


Hi,

  Since setting amanda up, I have constantly run into failed backups
with timeouts reported.  If I run amdump on the configuration, it works
fine, but if I let the cron job call it during the night, it fails.

  We have a local network with a vpn to our remote servers.  The tape
server is only tasked with backing up some files on its hard drive and
a share from an NT box.  Both machines are local.  The total size of
the backups being requested is less than 5 Gig. and the tape capacity
is 40 Gig.  The NT box is also the dns server of first resort.

  Apparently, if/when the vpn goes down, sendsize gets lost and times
out in reporting to amandad (if I understand correctly).  If the vpn is
up, sendsize has no difficulty whatsoever.  I found this by seeing
amandad and sendsize still running on the tape server at 7:30 AM when
the cron job started at 1:00 AM.  When I discovered that the vpn was
down and restarted it, amandad and sendsize happily finished, reporting
the timeout error.  Unfortunately, the vpn goes down most nights,
although not by design.

  Any ideas why sendsize would (mis)behave in this manner?
  Any ideas what I can do to work around this?

Thank you.

  Lee





RE: Failed backups

2002-05-15 Thread James Kelty

Sounds a lot like what I am going through, but I know what my problem is, I
just haven't fixed it yet. Basically, the client tries to open a UDP
connection to the server on a random port in the 1-1024 range. For security
reasons, it uses a 'trusted' port range. You can set the port range when you
compile Amanda, but that isn't the issue. The issue seems to be that the
client MUST be able to contact the server's address on that range in order
to work. This means that if the server is sitting behind a NAT device, the
client must be able to reach the 'reverse NAT' address.


Hope this makes sense, or helps a little. Sorry if it doesn't!

-James







RE: Failed backups

2002-05-15 Thread Lee Fellows

James,

  Yes, it does make sense.  Fortunately, both of these machines reside
on the same end of the vpn, and neither use a NAT'd address.  My
suspicion is that sendsize could not resolve its hostname due to network
problems caused by the downed vpn.  What puzzles me is why the vpn's
being up or down would cause such problems.  I have put the server's and
NT's info in the hosts file on the tape server.  Will see tonight if
that corrects this problem.  
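
One way to confirm that hosts file entries like these are being picked up, before the nightly run, is to check that every name still resolves. The following is only a sketch under stated assumptions: unresolved_hosts is an illustrative helper, not part of Amanda, and any host names passed to it are placeholders. By default it uses socket.gethostbyname, which goes through the system resolver (normally configured to consult the hosts file).

```python
import socket
from typing import Callable, Iterable, List


def unresolved_hosts(
    hosts: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> List[str]:
    """Return the names from `hosts` that the resolver cannot look up."""
    failed = []
    for name in hosts:
        try:
            resolve(name)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError.
            failed.append(name)
    return failed
```

For example, unresolved_hosts(["tapeserver", "ntbox"]) (both names hypothetical) returns an empty list when both names resolve. Note that a lookup can itself block for a while when the DNS server is unreachable, so this is a diagnostic aid rather than a fix.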

  Thank you for your response!

 
