Re: FAILED backups on different hosts each night

2006-08-30 Thread Stephen Carter
>>> Jon LaBadie <[EMAIL PROTECTED]> 08/29/06 7:50 PM >>>
>If I understand the configuration, srv2 has 4 separate installations
>of the amanda client.  To amanda it appears as 4 distinct remote hosts.
>As you indicate different logical hosts fail nightly, it sounds like
>all have also had successful backups, thus the basic config is ok.
>
>Do the 4 logical hosts also have their own separate disks and network
>controllers?  Or is a single network interface serving multiple IP
>addresses and the hosts have separate partitions on a shared disk(s)?
>
>I ask from the view that amanda considers them distinct and may be
>asking for dumps simultaneously from all 4, possibly overloading
>the shared resources on the single physical client, srv2.  This
>could trigger some timeout mechanism that daily hits different
>logical hosts.
>
>Even if you are only running a single dumper so multiple, simultaneous
>dumps do not occur on srv2, perhaps the interval between estimates and
>dumps is so long that a network timeout is triggered.
>
>These are total guesses, just seeing if they might fly.
>
>-- 
>Jon H. LaBadie  [EMAIL PROTECTED]
> JG Computing
> 4455 Province Line Road    (609) 252-0159
> Princeton, NJ  08540-4322  (609) 683-7220 (fax)

Thanks for the reply Jon,

Yes, you are right in assuming my setup. All 4 servers (3 XEN guests + host) are
using the same SATA disks and a single NIC. All servers are very low-load
systems, just running different web servers that aren't hit very regularly.

I think it could be a timing issue also, but am a bit unsure of where to look.

I see that I get all the estimates, and I always get at least 2 dumps in a run 
(1 from my physical backup server and 1 from one of the XEN host/guest 
servers). What files should I be looking at to see any timeout errors? All I 
seem to find is FAILED messages for the dumps but no explanation of why -- 
maybe I need to turn up debugging from the default. I've looked at the logs on
both client and server, but there are so many that I'm not clear which ones I
should concentrate on.
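
One way to narrow down which debug files deserve a read is to search all of them for timeout wording first. Below is a minimal Python sketch. It assumes the log directory is supplied by hand (no Amanda path is assumed here), and the function name and the list of search strings are made up for illustration:

```python
from pathlib import Path

# Search strings are an assumption; extend them to match whatever wording
# shows up in your own logs.
NEEDLES = ("timed out", "timeout", "could not connect")


def files_mentioning(directory, needles=NEEDLES):
    """Return (path, line_number, line) for every line containing a needle."""
    hits = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        try:
            with path.open(errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    lowered = line.lower()
                    if any(needle in lowered for needle in needles):
                        hits.append((path, number, line.rstrip("\n")))
        except OSError:
            # Unreadable file (permissions, vanished mid-scan): skip it.
            continue
    return hits
```

Running it once against the server's log directory and once against each client's debug directory should at least show which files mention a timeout at all, and on which line.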

Cheers,

Stephen Carter
Retrac Networking Limited
www: http://www.retnet.co.uk
Ph: +44 (0)7870 218 693
Fax: +44 (0)870 7060 056
CNA, CNE 6, CNS, CCNA, MCSE 2003



Re: FAILED backups on different hosts each night

2006-08-29 Thread Jon LaBadie

As no one has responded, I guess no one else has a clue either. :((

Of course, not having a clue seldom stops me from posting ;)


On Sun, Aug 27, 2006 at 04:56:03PM +0100, Stephen Carter wrote:
> I have 2 physical boxes I'm backing up, one called srv1 and the other called 
> srv2.
>
> srv1, which also has the tape device and runs the amanda backups, is always
> backed up correctly.
>
> srv2 is a SLES 10 server running 3 virtual SLES 10 XEN guests within it, but 
> I'm treating them as separate physical boxes for the purposes of amanda.
>
> On different nights, different XEN guests fail (including the host, srv2) 
> with a "could not connect" error in the amanda report.
>
> amstatus says 'wait for dumping driver: (aborted:could not connect to data
> port: Connection timed out)'.


If I understand the configuration, srv2 has 4 separate installations
of the amanda client.  To amanda it appears as 4 distinct remote hosts.
As you indicate different logical hosts fail nightly, it sounds like
all have also had successful backups, thus the basic config is ok.

Do the 4 logical hosts also have their own separate disks and network
controllers?  Or is a single network interface serving multiple IP
addresses and the hosts have separate partitions on a shared disk(s)?

I ask from the view that amanda considers them distinct and may be
asking for dumps simultaneously from all 4, possibly overloading
the shared resources on the single physical client, srv2.  This
could trigger some timeout mechanism that daily hits different
logical hosts.

Even if you are only running a single dumper so multiple, simultaneous
dumps do not occur on srv2, perhaps the interval between estimates and
dumps is so long that a network timeout is triggered.

These are total guesses, just seeing if they might fly.


-- 
Jon H. LaBadie  [EMAIL PROTECTED]
 JG Computing
 4455 Province Line Road    (609) 252-0159
 Princeton, NJ  08540-4322  (609) 683-7220 (fax)


FAILED backups on different hosts each night

2006-08-27 Thread Stephen Carter
I have 2 physical boxes I'm backing up, one called srv1 and the other called 
srv2.

srv1, which also has the tape device and runs the amanda backups, is always
backed up correctly.

srv2 is a SLES 10 server running 3 virtual SLES 10 XEN guests within it, but 
I'm treating them as separate physical boxes for the purposes of amanda.
 
On different nights, different XEN guests fail (including the host, srv2) with 
a "could not connect" error in the amanda report.

amstatus says 'wait for dumping driver: (aborted:could not connect to data
port: Connection timed out)'.

amdump.1 reports that all estimates worked, with "FAILED QUEUE: empty" and a
DONE QUEUE that includes all DLEs listed in the disklist.

amdump.1 then shows the dumper runs: 2 DLEs are dumped successfully and my
other 4 DLEs fail with:
dumper: stream_client: connect to 192.168.0.9:12359 failed: Connection timed out
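
Lines of this shape can be pulled out of the amdump.N files and tallied per client address, to see whether one IP fails more often than the others. The following is a rough Python sketch. The regex is modelled only on the single dumper line quoted above, so other Amanda versions may word the message differently, and the function names are hypothetical:

```python
import re
from collections import Counter

# Pattern based on the one failure line shown in this post (assumption:
# the format is "connect to HOST:PORT failed: REASON").
CONNECT_FAILED = re.compile(
    r"stream_client: connect to (?P<host>[\w.\-]+):(?P<port>\d+) "
    r"failed: (?P<reason>.+)"
)


def find_connect_failures(lines):
    """Return (host, port, reason) for each failed data-port connection."""
    failures = []
    for line in lines:
        match = CONNECT_FAILED.search(line)
        if match:
            failures.append(
                (
                    match.group("host"),
                    int(match.group("port")),
                    match.group("reason").strip(),
                )
            )
    return failures


def failures_per_host(lines):
    """Count failed connections per host address."""
    return Counter(host for host, _port, _reason in find_connect_failures(lines))
```

Feeding it the lines of several nights of amdump.N files (for example via `open(path)`) would show whether the failures cluster on particular addresses or move around evenly.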

I allow all traffic between srv1 (my backup server) and all clients. Thinking
it might be a throughput problem, I reduced parallel dumps to 1, which hasn't
helped.
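
A "Connection timed out" (as opposed to "Connection refused") generally means the connection attempt got no answer at all, which is more typical of dropped or lost packets than of a closed port. A small Python probe run from the backup server can tell the two apart. This is only a sketch: `probe` is a made-up helper, and because the data port in the dumper line (12359) appears to be chosen per dump, the probe is only meaningful against a port where something is known to be listening on the client at that moment.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Try one TCP connection; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        return False, time.monotonic() - start, str(exc)
```

A failure that returns almost immediately points at a refused connection, while one that takes the full timeout matches the behaviour in the dumper log.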

A copy of the latest amstatus & a section from my amdump.1 files are below.  
Any help would be greatly appreciated.


AMSTATUS OUTPUT:
srv1:/var/lib/amanda/DailySet1 # amstatus DailySet1
Using /var/lib/amanda/DailySet1/amdump.1 from Fri Aug 25 01:00:02 BST 2006

srv1.retnet.co.uk:md0           3    352152k finished (1:17:18)
mailscan.retnet.co.uk:hda2      0   1062300k wait for dumping driver: (aborted:could not connect to data port: Connection timed out)
srv2.retnet.co.uk:/srv/install  0  21497250k wait for dumping driver: (aborted:could not connect to data port: Connection timed out)
srv2.retnet.co.uk:md0           0   4242910k wait for dumping driver: (aborted:could not connect to data port: Connection timed out)
web-1.retnet.co.uk:hda2         0    699770k finished (1:33:02)
web-2.retnet.co.uk:hda2         0    906355k wait for dumping driver: (aborted:could not connect to data port: Connection timed out)

SUMMARY          part       real  estimated
                            size       size
partition       :   6
estimated       :   6             28769687k
flush           :   0         0k
failed          :   0                    0k           (  0.00%)
wait for dumping:   4             27708815k           ( 96.31%)
dumping to tape :   0                    0k           (  0.00%)
dumping         :   0         0k         0k (  0.00%) (  0.00%)
dumped          :   2   1051922k   1060872k ( 99.16%) (  3.66%)
wait for writing:   0         0k         0k (  0.00%) (  0.00%)
wait to flush   :   0         0k         0k (100.00%) (  0.00%)
writing to tape :   0         0k         0k (  0.00%) (  0.00%)
failed to tape  :   0         0k         0k (  0.00%) (  0.00%)
taped           :   2   1051922k   1060872k ( 99.16%) (  3.66%)
  tape 1        :   2   1051922k   1060872k (  2.94%) DailySet1-5
1 dumper idle   : not-idle
taper idle
network free kps:  2600
holding space   :  33792000k (100.00%)
 dumper0 busy   :  0:40:08  ( 95.25%)
   taper busy   :  0:06:47  ( 16.10%)
 0 dumpers busy :  0:00:00  (  0.00%)
 1 dumper busy  :  0:42:08  (100.00%)            not-idle:  0:28:40  ( 68.07%)
                                               no-dumpers:  0:13:27  ( 31.93%)
srv1:/var/lib/amanda/DailySet1 #




AMDUMP.1 PARTIAL OUTPUT:
driver: adding holding disk 0 dir /mnt/dumps size 33792000
reserving 33792000 out of 33792000 for degraded-mode dumps
driver: flush size 0
driver: start time 812.693 inparallel 1 bandwidth 2600 diskspace 33792000 dir OBSOLETE datestamp 20060825 driver: drain-ends tapeq FIRST big-dumpers ttt
driver: result time 812.693 from taper: TAPER-OK
driver: send-cmd time 812.703 to dumper0: FILE-DUMP 00-1 /mnt/dumps/20060825/srv1.retnet.co.uk.md0.3 srv1.retnet.co.uk feff9ffe0f md0 NODEVICE 3 2006:8:22:0:36:52 1073741824 GNUTAR 356544 |;bsd-auth;compress-best;index;exclude-list=/usr/lib/amanda/exclude.gtar;
driver: state time 812.703 free kps: -2090 space: 33435456 taper: idle idle-dumpers: 0 qlen tapeq: 0 runq: 5 roomq: 0 wakeup: 86400 driver-idle: not-idle
driver: interface-state time 812.703 if : free -3890 if ETH0: free 800 if LOCAL: free 1000
driver: hdisk-state time 812.703 hdisk 0: free 33435456 dumpers 1
dumper: stream_client: connected to 192.168.0.1.51236
dumper: stream_client: our side is 0.0.0.0.51239
dumper: stream_client: connected to 192.168.0.1.51237
dumper: stream_client: our side is 0.0.0.0.51240
dumper: stream_client: connected to 192.168.0.1.51238
dumper: stream_client: our side is 0.0.0.0.51241
driver: result time 901.369 from dumper0: DONE 00-1 441620 352152 89 [sec 88.636 kb 352152 kps 3973.0 orig-kb 441620]
driver: finished-cmd time 901.387 dumper0 dumped srv1.retnet.co.uk:md0
driver: send-cmd time 901.387 to taper: FILE-WRITE 00-2 /mnt/dumps/20060825/srv1.retnet.co.uk.md0.3 srv1.retnet.co.uk feff9ffe0f md0 3 20060825
driver: startaflush: FIRST srv1.retnet.co.uk md0 352185 3584
driver: send-cmd time 901.387 to dumper0: FILE-DU