RE: Solaris 8 Server hangs during backup

2001-09-04 Thread Eva Freer

John,
Thanks for the reply. Although we back up the firewalls, we do not pass any
Amanda traffic through from one segment to another. The systems are all
up-to-date with patches (mid-August). We have done a lot more investigating.
ufsdump runs fine. We also tried the Arkeia backup software and it has
similar problems to Amanda. The systems just seem to run out of resources
(i.e. CPU cycles). It happens more quickly on a single processor system, but
also happens on some of the dual-processor systems. Everything points to a
change in settings (probably network or system) when we ran Titan for
servers on the systems. The backups were fine before then and began to
intermittently fail afterwards. If you (or anyone else) have any info on
this we would appreciate it.

Thanks,
Eva Freer

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of John R. Jackson
Sent: Friday, August 31, 2001 9:56 PM
To: Eva Freer
Cc: [EMAIL PROTECTED]
Subject: Re: Solaris 8 Server hangs during backup


>I didn't get much response from amanda-users so I am trying this list.

I responded to your first letter but your mail server refused to accept
the letter and it eventually bounced.  I've appended my original response
in case this one gets through.

>Further investigation indicates that the problem occurs when sendbackup is
>running. We have tried /usr/bin/sed, /usr/xpg4/bin/sed, and GNU sed since
>sendbackup appears to be doing ufsdump | sed ... | ufsrestore.  ...

Just for testing, you might try setting "index no" in amanda.conf for
that dumptype.  That's what's inserting the sed and ufsrestore stuff
in the pipeline.

However I'm betting you have a hardware problem and the I/O ufsdump
does is causing the system to hang.  I'd start by doing some ufsdump's
just like Amanda does (see the /tmp/amanda/sendbackup*debug files),
**but without the 'u' option**, to /dev/null.
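
A minimal sketch of preparing that test command, assuming the debug file
contains a ufsdump command line shaped like the hypothetical one below (the
option string, block count and device name are illustrative, not taken from
this thread). It only drops the 'u' letter, which takes no argument of its
own, so the remaining arguments keep their order.

```python
import shlex


def without_update_flag(command):
    """Return a ufsdump command line with 'u' removed from its option letters."""
    args = shlex.split(command)
    if len(args) < 2:
        raise ValueError("expected: ufsdump OPTIONS [ARGS...] FILESYSTEM")
    # ufsdump takes its option letters as the first argument after the
    # program name; 'u' (update /etc/dumpdates) consumes no extra argument.
    args[1] = args[1].replace("u", "")
    return " ".join(shlex.quote(a) for a in args)


# Hypothetical command line; copy the real one from sendbackup*debug instead.
logged = "/usr/sbin/ufsdump 0usf 1048576 - /dev/rdsk/c0t0d0s0"
print(without_update_flag(logged) + " > /dev/null")
```

The printed line can then be run by hand on the client while watching the
system, with no Amanda processes involved.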

>Eva Freer

John R. Jackson, Technical Software Specialist, [EMAIL PROTECTED]

>We have a highly subnetted configuration of Solaris 8 and 2.6 boxes, mostly
>E220R's. The subnets are connected via firewalls. Each subnet has its own
>Amanda server with an Exabyte Mammoth tape drive.  ...

Do the servers reach across the firewalls to back up clients "on the
other side"?  Or is that the point of having a tape drive in each subnet,
so backups stay inside a given firewall?

>We use hardware compression only. The Amanda is 2.4.2p1 on most nodes.
>...
>Originally, we seemed to have a problem with only one subnet, with a Solaris
>2.6 server, 2 Solaris clients, and 1 Solaris 8 client. The server would hang
>during the backup and required a poweroff reboot.  ...

Please believe me that I'm not just trying to pass the buck :-), but
Amanda cannot be the root of this problem.  Put another way, anything
you do to Amanda that gets this going is, at best, a workaround and
the real problem will still be there, waiting to bite you at the worst
possible time.

Amanda is pure application level code.  Any program that generates the
same set of circumstances (e.g. high network load, particular data
patterns, etc) would trigger the same problem.  If you have systems
crashing or hanging, something else (hardware or OS) is wrong.

>... Messages in the logs (not from amanda) indicate
>that the system is very busy (e.g. sendmail won't run the queue because the
>load average is too high.)  ...

How high is the load average getting?  Amanda is I/O bound, especially
on the server.  It should not be generating significant load (w.r.t.
"load average").  Are you certain nothing else was going on at the time?
Do you have "top" to see what the heavy hitters are when it starts to
go wrong?  Or there are other tools (even just a "ps") that do roughly
the same thing.
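
If "top" is not installed, something along these lines could capture the
same information. This is only a sketch: it assumes a Python interpreter is
available on the box and that "ps" accepts "-eo pcpu,pid,comm"; the function
names and the sample data are made up for illustration.

```python
import os
import subprocess


def top_cpu(ps_output, count=5):
    """Parse `ps -eo pcpu,pid,comm` output; return the busiest processes first."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        try:
            rows.append((float(fields[0]), int(fields[1]), fields[2]))
        except ValueError:
            continue  # ignore lines that do not look like process rows
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:count]


def snapshot(count=5):
    """Return the 1-minute load average and the current heaviest processes."""
    load1, _load5, _load15 = os.getloadavg()
    ps = subprocess.run(["ps", "-eo", "pcpu,pid,comm"],
                        capture_output=True, text=True, check=True)
    return load1, top_cpu(ps.stdout, count)


# Invented sample output, only to show the parsing step.
sample = """%CPU   PID COMMAND
 0.1   312 procA
42.0  1021 procB
 1.9  1019 procC
"""
for pcpu, pid, name in top_cpu(sample, count=2):
    print(pcpu, pid, name)
```

Calling snapshot() every few seconds from a loop and appending the result to
a file on local disk would leave a record of what was running just before
the machine stopped responding.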

What kind of netstat numbers are you seeing during the bad times?  Any
high error/collision counts or excessive packets?

Are all your systems up to reasonably recent Solaris patch levels?

Have you tried doing several ftp's of roughly dump image size from
the client to the server (they can go to /dev/null on the server as an
initial test)?

What is maxdumps set to?  That would control how many backups were
running at one time on the client, which, in turn, would control how
many data streams were coming into the server.

How about inparallel?  That will also throttle how many dumpers are
active.
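
As a rough illustration of how the two settings interact, the sketch below
estimates an upper bound on simultaneous data streams. It is a
simplification that ignores every other limit Amanda applies (network
bandwidth, spindle numbers and so on), and the host names and disk counts
are hypothetical.

```python
def max_concurrent_dumps(inparallel, maxdumps, disks_per_client):
    """Upper bound on dumps running at once, given the two settings.

    inparallel       -- limit on dumpers active on the server
    maxdumps         -- limit on simultaneous backups per client
    disks_per_client -- mapping of client name to number of disks to back up
    """
    # A client can never run more dumps than it has disks to dump.
    per_client = [min(maxdumps, disks) for disks in disks_per_client.values()]
    return min(inparallel, sum(per_client))


# Hypothetical subnet: the server backs up itself plus three clients.
disks = {"server": 3, "client1": 2, "client2": 1, "client3": 1}
print(max_concurrent_dumps(inparallel=10, maxdumps=1, disks_per_client=disks))
print(max_concurrent_dumps(inparallel=10, maxdumps=2, disks_per_client=disks))
print(max_concurrent_dumps(inparallel=2, maxdumps=2, disks_per_client=disks))
```

Lowering either value to 1 for a test run is a cheap way to see whether the
hang depends on how many streams are active at once.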

Is anything special about the two subnets with the problem?
Any particular type of network card, connection, media or topology?

>Eva Freer

John R. Jackson, Technical Software Specialist, [EMAIL PROTECTED]




Re: Solaris 8 Server hangs during backup

2001-09-01 Thread Gerhard den Hollander

* Eva Freer <[EMAIL PROTECTED]> (Mon, Aug 13, 2001 at 11:52:48AM -0400)
> Amanda Users:
> 
> We have a highly subnetted configuration of Solaris 8 and 2.6 boxes, mostly
> E220R's. The subnets are connected via firewalls. Each subnet has its own
> Amanda server with an Exabyte Mammoth tape drive. We use hardware
> compression only. The Amanda is 2.4.2p1 on most nodes.
[snip]

The usual concerns apply:
1) did you check the scsi chains on all machines to ensure proper
   termination, proper cables &c &c 
2)  What is in the syslog on the hanging machine ?
3) You say the machine gets slower and slower (which means it's
   progressive).
   Did you try running top (or similar) on the machine to see what was
   happening (e.g. was the machine running out of memory ? out of swap
   space ?)
   Was there anything special in the /tmp/amanda logfiles ?
   [you will have to make sure /tmp/amanda is not mounted on tmpfs]
4) Are you using software compression ?

> but the problem persists. Messages in the logs (not from amanda) indicate
> that the system is very busy (e.g. sendmail won't run the queue because the
> load average is too high.) Amanda is the only thing really happening other
> than the usual OS stuff.

Could you run top/ps -ef or whatever to see what exactly is running, and
what is hogging the CPU ?
Are you using ufsdump or tar dump ?

Kind regards,
 --
Gerhard den Hollander          Phone +31-10.280.1515
Global Technical Support       Fax   +31-10.280.1511
Jason Geosystems BV            (When calling please note: we are in GMT+1)

[EMAIL PROTECTED]              POBox 1573
visit us at http://www.jasongeo.com    3000 BN Rotterdam
JASON...#1 in Reservoir Characterization    The Netherlands




RE: Solaris 8 Server hangs during backup

2001-08-22 Thread Eva Freer

Bill,
Thanks for your reply. We have done some more investigation and have
determined that the problem is with sendbackup. It does ufsdump | sed |
ufsrestore. When this starts it takes the CPU to 100% and stays there. The
performance monitoring soon quits updating. Log messages indicate that
sendmail sees the load average too high and quits processing the queue. The
only recovery is to turn the machine off and back on.

The data on the largest partition was slightly greater than 1 GB. We had 2
holding partitions, each slightly less than 1 GB. We tried combining the 2
partitions with DiskSuite to get a larger volume, but this did not fix the
problem.

The only patch on the web site for 2.4.2p2 seems to be for IRIX and Tru64,
not Solaris.

Eva Freer

-----Original Message-----
From: Bill Carlson [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, August 15, 2001 10:16 AM
To: Eva Freer
Cc: [EMAIL PROTECTED]
Subject: Re: Solaris 8 Server hangs during backup


On Tue, 14 Aug 2001, Eva Freer wrote:

> We have a highly subnetted configuration of Solaris 8 and 2.6 boxes, mostly
> E220R's. The subnets are connected via firewalls. Each subnet has its own
> Amanda server with an Exabyte Mammoth tape drive. We use hardware
> compression only. The Amanda is 2.4.2p1 on most nodes.
>
> Originally, we seemed to have a problem with only one subnet, with a Solaris
> 2.6 server, 2 Solaris clients, and 1 Solaris 8 client. The server would hang
> during the backup and required a poweroff reboot. Part of the backup would

!?!
I've never seen anything with amanda that actually killed the machine. A
heavily overloaded machine will seem dead, but should eventually respond.

> The problem now affects at least 2 of the subnets. In both cases, the Amanda
> server is Solaris 8 with 1 Solaris 8 client and 2 Solaris 2.6 clients. One
> server hangs every night while the other is intermittent. Both are
> configured to use 2 ~1 GB holding partitions. Eliminating the holding
> partitions did not prevent the hangup. The largest disk backed up contains
> slightly more than the capacity of 1 of the holding partitions. The server

How full is the largest partition? For holding disk purposes, the
important part is how much actual data you have, not the size of the
filesystem.
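
One way to get that number is sketched below. It assumes a Python
interpreter on the client and uses "/" only as a stand-in mount point; the
space actually in use gives a rough idea of how large a full dump of that
filesystem could be, which is the figure to compare against the holding
disk.

```python
import os


def used_bytes(mount_point):
    """Bytes actually occupied on the filesystem mounted at mount_point."""
    st = os.statvfs(mount_point)
    return (st.f_blocks - st.f_bfree) * st.f_frsize


def total_bytes(mount_point):
    """Total size of the filesystem mounted at mount_point."""
    st = os.statvfs(mount_point)
    return st.f_blocks * st.f_frsize


# "/" is only a placeholder; substitute the partition being backed up.
mount = "/"
print("used  MB:", used_bytes(mount) // (1024 * 1024))
print("total MB:", total_bytes(mount) // (1024 * 1024))
```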

> than the usual OS stuff. The 2.6 clients are dual processor Sun E220R
> webservers with no activity during the backup period. The 8 client and
> server are single processor E220R LDAP servers with no activity during the
> backup period. Perfmeter analysis indicates that the CPU usage goes to 100%
> shortly after the backup starts and stays there.

Do you have debug turned on for all clients and servers? The first thing
I'd want to see is the debug output and then the actual logs. When the CPU
starts spinning at 100%, what process is the culprit? We need more info
here. Are you using ufsdump or tar? Any patches to amanda?

Bill Carlson
--
Systems Programmer    [EMAIL PROTECTED]      | Anything is possible,
Virtual Hospital      http://www.vh.org/     | given time and money.
University of Iowa Hospitals and Clinics     |
Opinions are mine, not my employer's.        |





Re: Solaris 8 Server hangs during backup

2001-08-15 Thread Paul Haldane


On Tue, 14 Aug 2001, Eva Freer wrote:

> We have a highly subnetted configuration of Solaris 8 and 2.6 boxes, mostly
> E220R's. The subnets are connected via firewalls. Each subnet has its own
> Amanda server with an Exabyte Mammoth tape drive. We use hardware
> compression only. The Amanda is 2.4.2p1 on most nodes.
...

We very occasionally (two times in months of running Amanda) see
something which _may_ be related to your problem.  We're running a
mixture of Solaris 7 and 8 (Amanda server is 7) [as well as some
RedHat Linux and MacOS X clients].

Twice one of the Solaris 7 Amanda clients (same one both times) has
locked up during the estimate phase of the backup run (this is using
ufsdump).  When this happens access to one or more filesystems blocks
and the system clogs up with jammed processes.  This is a mail server
and sendmail stops accepting new mail once the load gets too high so
I've managed to recover both times by killing off the amanda
processes.  Next time this happens I plan to be less flustered :-> and
hopefully will have better data about what's causing the blockage.

Paul
-- 
Paul Haldane
Computing Service
University of Newcastle





Re: Solaris 8 Server hangs during backup

2001-08-15 Thread John R. Jackson

>We have a highly subnetted configuration of Solaris 8 and 2.6 boxes, mostly
>E220R's. The subnets are connected via firewalls. Each subnet has its own
>Amanda server with an Exabyte Mammoth tape drive.  ...

Do the servers reach across the firewalls to back up clients "on the
other side"?  Or is that the point of having a tape drive in each subnet,
so backups stay inside a given firewall?

>We use hardware compression only. The Amanda is 2.4.2p1 on most nodes.
>...
>Originally, we seemed to have a problem with only one subnet, with a Solaris
>2.6 server, 2 Solaris clients, and 1 Solaris 8 client. The server would hang
>during the backup and required a poweroff reboot.  ...

Please believe me that I'm not just trying to pass the buck :-), but
Amanda cannot be the root of this problem.  Put another way, anything
you do to Amanda that gets this going is, at best, a workaround and
the real problem will still be there, waiting to bite you at the worst
possible time.

Amanda is pure application level code.  Any program that generates the
same set of circumstances (e.g. high network load, particular data
patterns, etc) would trigger the same problem.  If you have systems
crashing or hanging, something else (hardware or OS) is wrong.

>... Messages in the logs (not from amanda) indicate
>that the system is very busy (e.g. sendmail won't run the queue because the
>load average is too high.)  ...

How high is the load average getting?  Amanda is I/O bound, especially
on the server.  It should not be generating significant load (w.r.t.
"load average").  Are you certain nothing else was going on at the time?
Do you have "top" to see what the heavy hitters are when it starts to
go wrong?  Or there are other tools (even just a "ps") that do roughly
the same thing.

What kind of netstat numbers are you seeing during the bad times?  Any
high error/collision counts or excessive packets?

Are all your systems up to reasonably recent Solaris patch levels?

Have you tried doing several ftp's of roughly dump image size from
the client to the server (they can go to /dev/null on the server as an
initial test)?

What is maxdumps set to?  That would control how many backups were
running at one time on the client, which, in turn, would control how
many data streams were coming into the server.

How about inparallel?  That will also throttle how many dumpers are
active.

Is anything special about the two subnets with the problem?
Any particular type of network card, connection, media or topology?

>Eva Freer

John R. Jackson, Technical Software Specialist, [EMAIL PROTECTED]