Re: FAILED backups on different hosts each night

2006-08-30 Thread Stephen Carter
 Jon LaBadie [EMAIL PROTECTED] 08/29/06 7:50 PM 
If I understand the configuration, srv2 has 4 separate installations
of the amanda client.  To amanda it appears as 4 distinct remote hosts.
As you indicate different logical hosts fail nightly, it sounds like
all have also had successful backups, thus the basic config is ok.

Do the 4 logical hosts also have their own separate disks and network
controllers?  Or is a single network interface serving multiple IP
addresses and the hosts have separate partitions on a shared disk(s)?

I ask from the view that amanda considers them distinct and may be
asking for dumps simultaneously from all 4, possibly overloading
the shared resources on the single physical client, srv2.  This
could trigger some timeout mechanism that daily hits different
logical hosts.

Even if you are only running a single dumper so multiple, simultaneous
dumps do not occur on srv2, perhaps the interval between estimates and
dumps is so long that a network timeout is triggered.

These are total guesses, just seeing if they might fly.

-- 
Jon H. LaBadie  [EMAIL PROTECTED]
 JG Computing
 4455 Province Line Road        (609) 252-0159
 Princeton, NJ  08540-4322  (609) 683-7220 (fax)

Thanks for the reply Jon,

Yes, you are right in assuming my setup. All 4 servers (3 XEN guests + host) are 
using the same SATA disks and single NIC interface. All servers are very low 
load systems, just running different web servers that aren't hit very regularly.

I think it could be a timing issue also, but am a bit unsure of where to look.

I see that I get all the estimates, and I always get at least 2 dumps in a run 
(1 from my physical backup server and 1 from one of the XEN host/guest 
servers). What files should I be looking at to see any timeout errors? All I 
seem to find is FAILED messages for the dumps but no explanation of why -- 
maybe I need to turn up debugging from the default. I've had a look at the logs 
on both client and server, but there are so many files that I'm not clear which 
ones I should concentrate on.
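
One way to narrow this down is to scan the Amanda debug files on both the client and the server for timeout messages. The sketch below is only an illustration: the `*.debug` naming follows the `sendbackup.<numbers>.debug` files mentioned elsewhere on this list, the keywords are a guess at how a timeout is worded, and the debug directory is an assumption that depends on how Amanda was built.

```python
from pathlib import Path

# Assumed keywords; adjust to whatever the debug files actually contain.
KEYWORDS = ("timeout", "timed out")


def find_timeout_lines(debug_dir, pattern="*.debug"):
    """Return (file name, line number, text) for each line mentioning a timeout."""
    hits = []
    for path in sorted(Path(debug_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a file has odd bytes in it.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(word in line.lower() for word in KEYWORDS):
                    hits.append((path.name, number, line.rstrip()))
    return hits
```

Calling `find_timeout_lines("/tmp/amanda")` on each machine (the path is a hypothetical example, not taken from this thread) would list every matching line with its file name, which shows which program (amandad, sendsize, sendbackup) hit the timeout.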

Cheers,

Stephen Carter
Retrac Networking Limited
www: http://www.retnet.co.uk
Ph: +44 (0)7870 218 693
Fax: +44 (0)870 7060 056
CNA, CNE 6, CNS, CCNA, MCSE 2003



Re: FAILED backups on different hosts each night

2006-08-29 Thread Jon LaBadie

As no one has responded, I guess no one else has a clue either. :((

Of course, not having a clue seldom stops me from posting ;)


On Sun, Aug 27, 2006 at 04:56:03PM +0100, Stephen Carter wrote:
 I have 2 physical boxes I'm backing up, one called srv1 and the other called 
 srv2.

 srv1 is always backed up correctly, which also has the tape device and runs 
 the amanda backups.

 srv2 is a SLES 10 server running 3 virtual SLES 10 XEN guests within it, but 
 I'm treating them as separate physical boxes for the purposes of amanda.

 On different nights, different XEN guests fail (including the host, srv2) 
 with a could not connect error in the amanda report.

 amstatus says 'wait for dumping driver: (aborted:could not connect to data 
 port: Connection timed out)'


If I understand the configuration, srv2 has 4 separate installations
of the amanda client.  To amanda it appears as 4 distinct remote hosts.
As you indicate different logical hosts fail nightly, it sounds like
all have also had successful backups, thus the basic config is ok.

Do the 4 logical hosts also have their own separate disks and network
controllers?  Or is a single network interface serving multiple IP
addresses and the hosts have separate partitions on a shared disk(s)?

I ask from the view that amanda considers them distinct and may be
asking for dumps simultaneously from all 4, possibly overloading
the shared resources on the single physical client, srv2.  This
could trigger some timeout mechanism that daily hits different
logical hosts.

Even if you are only running a single dumper so multiple, simultaneous
dumps do not occur on srv2, perhaps the interval between estimates and
dumps is so long that a network timeout is triggered.

These are total guesses, just seeing if they might fly.


-- 
Jon H. LaBadie  [EMAIL PROTECTED]
 JG Computing
 4455 Province Line Road        (609) 252-0159
 Princeton, NJ  08540-4322  (609) 683-7220 (fax)


Re: Failed Backups

2003-06-06 Thread Chris Gordon
Steve, 


On Wed, Jun 04, 2003 at 02:29:20PM -, smw_purdue wrote:
 Chris,
 
 I'm having the same problem using a similar configuration of backups
 to disk without any holding disks.  Every time Amanda drops into
 degraded mode it's because an error occurred with one of the clients
 (usually a timeout, indicating that a client system was unavailable).
  I would suspect that there's a bug in the code that puts Amanda into
 degraded mode on more errors than just a tape error.  Notice in your
 log that you have an unknown response from gilgamesh.  This error
 was probably what kicked Amanda into degraded mode.

That is exactly what appears to be happening.  I configured a holding
disk in an attempt to eliminate that as a possible cause. In my case,
the problem is intermittent with everything working fine for some time
and then a failure.  The failure may affect some file systems on a given
host or most/all of the backup run.

Today, I had two file systems fail again on gilgamesh 
and I began checking the various logs for issues.  What I found in
sendbackup.lotsofnumbers.debug is:

---[ begin ]---
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1496
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1497
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1498
sendbackup: time 0.003: waiting for connect on 1496, then 1497, then 1498
sendbackup: time 29.996: stream_accept: timeout after 30 seconds
sendbackup: time 29.996: timeout on data port 1496
sendbackup: time 59.996: stream_accept: timeout after 30 seconds
sendbackup: time 59.996: timeout on mesg port 1497
sendbackup: time 89.996: stream_accept: timeout after 30 seconds
sendbackup: time 89.996: timeout on index port 1498
sendbackup: time 89.996: pid 5263 finish time Fri Jun  6 00:47:44 2003
---[ end ]---
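
Read literally, the excerpt says the client opened three listening ports and then waited 30 seconds on each of the data, mesg and index ports without anything connecting to them. To look for a correlation across many nights, the timeout lines can be pulled out of each debug file mechanically. This is a minimal sketch that assumes the line format is exactly as in the excerpt above; the function name is made up for illustration.

```python
import re

# Matches lines such as "sendbackup: time 29.996: timeout on data port 1496".
# The "stream_accept: timeout after 30 seconds" lines are deliberately skipped.
TIMEOUT_RE = re.compile(r"time\s+(\d+(?:\.\d+)?):\s+timeout on (\w+) port (\d+)")


def parse_port_timeouts(log_text):
    """Return a list of (elapsed_seconds, stream_name, port) tuples."""
    results = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            elapsed, stream, port = match.groups()
            results.append((float(elapsed), stream, int(port)))
    return results
```

Collecting these tuples together with each file's finish time would show whether the failures cluster at a particular time of night or on particular ports.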

 Anybody out there have time to debug the source?  I may take a look at
 it but time is at a premium right now... (when isn't it???).

Anyone have any ideas?  This only happens occasionally and I haven't
yet been able to draw a correlation.

Thanks,
Chris


Re: Failed Backups

2003-06-06 Thread Steven M. Wilson
Chris,

I looked around a little in the Amanda source code and convinced myself 
that there was a bug there.  I sent a note to the amanda-hackers 
mailing list and received a prompt reply from Jean-Louis Martineau with 
a patch that fixed the problem for me.  I'll attach his message and patch.

Hope that helps!

Steve

--
Steven M. Wilson, Systems and Network Manager
Markey Center for Structural Biology
Purdue University
[EMAIL PROTECTED]  765.496.1946

--- server-src/driver.c.orig2003-01-01 18:28:54.0 -0500
+++ server-src/driver.c 2003-06-04 15:54:44.0 -0400
@@ -2242,10 +,10 @@
error(error [dump to tape DONE result_argc != 5: %d], result_argc);
}
 
-   free_serial(result_argv[2]);
-
if(failed == 1) goto tryagain;  /* dump didn't work */
-   else if(failed == 2) goto fatal;
+   else if(failed == 2) goto failed_dumper;
+
+   free_serial(result_argv[2]);
 
/* every thing went fine */
update_info_dumper(dp, origsize, dumpsize, dumptime);
@@ -2259,9 +2239,10 @@
 
 case TRYAGAIN: /* TRY-AGAIN handle err mess */
 tryagain:
+   headqueue_disk(runq, dp);
+failed_dumper:
update_failed_dump_to_tape(dp);
free_serial(result_argv[2]);
-   headqueue_disk(runq, dp);
tape_left = tape_length;
break;
 
@@ -2269,7 +2250,6 @@
 case TAPE_ERROR: /* TAPE-ERROR handle err mess */
 case BOGUS:
 default:
-fatal:
update_failed_dump_to_tape(dp);
free_serial(result_argv[2]);
failed = 2; /* fatal problem */
---BeginMessage---
Hi Steven,

Could you try this patch?  It should apply to the latest 2.4.4
snapshot from http://www.iro.umontreal.ca/~martinea/amanda

Jean-Louis

On Wed, Jun 04, 2003 at 02:16:14PM -0500, Steven M. Wilson wrote:
 
 
 I have a question for the Amanda development experts.
 
 I'm using version 2.4.4 and backing up to hard disk directly (no tapes, no 
 holding disks).  On several occasions, I've had a client error cause Amanda 
 to go into degraded mode.  It appears that the dump_to_tape function 
 (server-src/driver.c) takes any FATAL dumper error and forces Amanda into  
 degraded mode.  Shouldn't the code be more discerning as to what caused the 
 error?  I would think that Amanda should go into degraded mode only if an 
 error were related to the output device.  In my case the error was on the 
 client and unrelated to writing the backup to disk.
 
 Here's some of the related amdump messages:
 
 driver: result time 6754.491 from dumper0: FAILED 01-00368 [data timeout]
 taper: reader-side: got label slot024 filenum 184
 driver: result time 6754.492 from taper: DONE 00-00367 slot024 184 [sec 
 2174.408 kb 2061376 kps 948.0 {wr: writers 64419 rdwait 2166.220 wrwait 
 7.959 filemark 0.021}]
 driver: error time 6754.503 serial gen mismatch dump of 
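
For readers who do not want to trace the diff, the behaviour change it appears to make can be modelled roughly as follows. This is a simplified sketch, not the code in server-src/driver.c: the function name and the string labels are invented, and the assumption that the `fatal:` path ends in degraded mode comes from Steven's description above rather than from the lines shown in the patch.

```python
def handle_taper_result(taper_result, dumper_failed, patched=True):
    """Simplified model of how dump_to_tape reacts to one taper result.

    taper_result  -- "DONE", "TRYAGAIN", or anything else (treated as a tape error)
    dumper_failed -- 0 (dump ok), 1 (retryable dumper failure), 2 (fatal dumper failure)
    Returns one of "ok", "requeue", "fail_dump", "degraded".
    """
    if taper_result == "DONE":
        if dumper_failed == 0:
            return "ok"          # every thing went fine
        if dumper_failed == 1:
            return "requeue"     # goto tryagain: put the disk back on the run queue
        # Fatal dumper error while the taper itself was fine.
        # Before the patch: goto fatal.  After: goto failed_dumper.
        return "fail_dump" if patched else "degraded"
    if taper_result == "TRYAGAIN":
        return "requeue"
    # TAPE-ERROR, BOGUS and anything unexpected still take the fatal path.
    return "degraded"
```

In this model a client-side failure such as `[data timeout]` only fails that one dump once the patch is applied, while a genuine output-device error still switches the run into degraded mode.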

Re: Failed Backups

2003-06-01 Thread Chris Gordon
Jon,

Thanks for looking at this for me.

On Sat, May 31, 2003 at 03:37:18AM -0400, Jon LaBadie wrote:
  
  -- AMANDA MAIL REPORT --
  These dumps were to tape standard14.
  The next 7 tapes Amanda expects to used are:
  standard16, standard17,
  standard18,
  +standard19, standard20, standard21, standard22.
 
 Interesting that standard15 is not mentioned.
 It may have bearing on my guesses.

That tape has been used before -- I have been running amanda long
enough for it to cycle through all of my tapes and to have used
standard15 before.  I have rechecked everything to make sure it is setup
like all of my other tapes (I used a script to initially create them
all to minimize chance of errors.).
 
  FAILURE AND STRANGE DUMP SUMMARY:
gilgamesh. / lev 1 FAILED [unknown response: 0;]
gilgamesh. / lev 1 FAILED [dump to tape failed]
goblin.the /var lev 1 FAILED [can't dump no-hold disk in degraded mode]
 
 A scan of the source shows that message only coming in one place.
 At that time the backup has entered degraded mode.  Further, the message
 is only printed if the backup is not using a holding disk.
 
 So first I presume you are not using a holding disk.

I have added a holding disk to see if that helps.
 
 So I'm guessing you have a size limit on your disk tapes and standard14
 reached that limit.

Yes, I have them set to 5 GB.  I can't find any reference in the man
pages, but would setting the length to 0 let the tape be infinitely
long?

 When the changer script went to switch to standard15,
 an error occurred.  That put you into degraded mode, and without a holding
 disk, backups of all subsequent DLE's failed.
 
 A place to start looking at least.

I've read over all of the man pages and the limited data I've found on
the net.  From your comments, it seems that reading the source is the
only really good source of detailed documentation and troubleshooting.
Is that true and if so, is there a specific place I should start reading
to get details of error messages, etc?

Thanks,
Chris


Re: Failed Backups

2003-05-31 Thread Jon LaBadie
On Fri, May 30, 2003 at 11:32:54PM -0400, Chris Gordon wrote:
 I've been running amanda for several months with backups to disk
 (amanda version 2.4.3).  Recently I've had backups failing and can't
 figure out what the problem may be.  
 
 Some details:
  - Clients and backup server are Linux (RedHat 8)
  - backup disk has plenty of free space (80 GB drive
with only 35% in use)
 
 Below is an example report from one of the failed dumps.  Nothing has
 recently changed that should affect backups.  I haven't found
 anything to help point me in the right direction and would appreciate
 any pointers.
 
 -- AMANDA MAIL REPORT --
 These dumps were to tape standard14.
 The next 7 tapes Amanda expects to used are:
 standard16, standard17,
 standard18,
 +standard19, standard20, standard21, standard22.


Interesting that standard15 is not mentioned.
It may have bearing on my guesses.


 FAILURE AND STRANGE DUMP SUMMARY:
   gilgamesh. / lev 1 FAILED [unknown response: 0;]
   gilgamesh. / lev 1 FAILED [dump to tape failed]
   goblin.the /var lev 1 FAILED [can't dump no-hold disk in degraded mode]


A scan of the source shows that message only coming in one place.
At that time the backup has entered degraded mode.  Further, the message
is only printed if the backup is not using a holding disk.

So first I presume you are not using a holding disk.

Second, looking at the source (quickly, so I might have missed something)
degraded mode is only entered after a dump starts if a tape error occurs.

So I'm guessing you have a size limit on your disk tapes and standard14
reached that limit.  When the changer script went to switch to standard15,
an error occurred.  That put you into degraded mode, and without a holding
disk, backups of all subsequent DLE's failed.

A place to start looking at least.

-- 
Jon H. LaBadie  [EMAIL PROTECTED]
 JG Computing
 4455 Province Line Road        (609) 252-0159
 Princeton, NJ  08540-4322  (609) 683-7220 (fax)


RE: Failed backups

2002-05-15 Thread James Kelty

Sounds a lot like what I am going through, but I know what my problem is, I
just haven't fixed it yet. Basically the client tries to open a UDP
connection to the server on a random port in the 1-1024 range. For security
reasons, it uses a 'trusted' port range. You can set the port range when you
compile Amanda, but that isn't the issue. The issue seems to be that the
client MUST be able to contact the server's address on that range in order
to work. This means that if the server is sitting behind a NAT device, the
client must be able to reach the 'reverse NAT' address.


Hope this makes sense, or helps a little. Sorry if it doesn't!

-James


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Lee Fellows
Sent: Thursday, May 16, 2002 6:04 AM
To: [EMAIL PROTECTED]
Subject: Failed backups



Hi,

  Since setting amanda up, I have constantly run into failed backups
with timeouts reported.  If I run amdump on the configuration, it works
fine, but if I let the cron job call it during the night, it fails.

  We have a local network with a vpn to our remote servers.  The tape
server is only tasked with backing up some files on its hard drive and
a share from an NT box.  Both machines are local.  The total size of
the backups being requested are less than 5 Gig. and the tape capacity
is 40 Gig.  The NT box is also the dns server of first resort.

  Apparently, if/when the vpn goes down, sendsize gets lost and times
out in reporting to amandad (if I understand correctly).  If the vpn is
up, sendsize has no difficulty whatsoever.  I found this by seeing
amandad and sendsize still running on the tape server at 7:30 AM when
the cron job started at 1:00 AM.  When I discovered that the vpn was
down and restarted it, amandad and sendsize happily finished, reporting
the timeout error.  Unfortunately, the vpn goes down most nights,
although not by design.

  Any ideas why sendsize would (mis)behave in this manner?
  Any ideas what I can do to work around this?

Thank you.

  Lee





RE: Failed backups

2002-05-15 Thread Lee Fellows

James,

  Yes, it does make sense.  Fortunately, both of these machines reside
on the same end of the vpn, and neither use a NAT'd address.  My
suspicion is that sendsize could not resolve its hostname due to network
problems caused by the downed vpn.  What puzzles me is why the vpn's
being up or down would cause such problems.  I have put the server's and
NT's info in the hosts file on the tape server.  Will see tonight if
that corrects this problem.  

  Thank you for your response!
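
A quick way to test the name-resolution suspicion is to time a lookup from the tape server while the vpn is up and again while it is down. This is a minimal sketch using only the Python standard library; the function name is made up, and the host name to pass in is whatever the disklist uses for the machine.

```python
import socket
import time


def timed_resolve(hostname):
    """Resolve hostname; return (IPv4 address or None, seconds spent resolving)."""
    start = time.monotonic()
    try:
        address = socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        address = None
    return address, time.monotonic() - start
```

If the lookup takes many seconds, or returns None, only while the vpn is down, that points at DNS rather than at sendsize itself, and the hosts file entries should make the delay disappear.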

 
