Re: DLT 8000 on Linux

2002-05-05 Thread Elmar Kolkman

> On Wed, 1 May 2002, Elmar Kolkman wrote:
>
>> Hi,
>>
>> We are trying here to make backups of a Linux File Server on a DLT
>> 8000 tapedrive, but we can not get it to streaming mode. We tried
>> everything we could think of, but it only streams when running 'dd
>> if=/dev/zero of=/dev/st0 bs=64k'. It also streams with other block
>> sizes, but as soon as we replace /dev/zero by anything else, it will
>> not stream. Strange thing is that if we dd to disk and then to tape,
>> it streams... But if we dd /dev/urandom to disk and then to tape, it
>> doesn't. It's really baffling. Since I see a lot of you are running
>> Amanda with DLT drives,
>> does someone have any idea how to get the drive streaming with real
>> data ?
>
> Have you tried looking at some of the module parameters for st?
>
> Try adding something like this to /etc/modules.conf (or pass on command
>  line when loading module):
>
> options st buffer_kbs=128 write_threshold_kbs=126 max_buffers=32
> max_sg_segs=16

We tried this, and a lot more. Last Thursday, when I was out of the
office, my colleague found the solution: swapping the tape drive for an
as yet unused one. We were lucky to have a spare, because we are setting
up the first of three Linux servers, which should all have identical
hardware.

After swapping the drives, all problems were solved, and the backups run
smoothly. I haven't seen it for myself, but the hardware in the machine is
fast enough to keep the drive streaming, even without the holding disk.
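
Before suspecting the drive, one way to rule out the data source is to
measure how fast it can be read in the same 64k blocks used with dd. The
sketch below is only an illustration: the function names are made up for
this example, and the 6 MB/s figure is the commonly quoted native rate for
a DLT 8000, which should be checked against the drive's own specification.

```python
import sys
import time

# Assumption: commonly quoted native (uncompressed) rate for a DLT 8000.
DLT8000_NATIVE_MB_S = 6.0


def measure_read_rate(path, block_size=64 * 1024, max_bytes=256 * 1024 * 1024):
    """Read up to max_bytes from path in block_size chunks and return MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as src:
        while total < max_bytes:
            chunk = src.read(min(block_size, max_bytes - total))
            if not chunk:  # end of file
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        return float("inf")
    return total / (1024 * 1024) / elapsed


def can_feed_drive(rate_mb_s, needed_mb_s=DLT8000_NATIVE_MB_S):
    """True if the source is at least as fast as the drive's native rate."""
    return rate_mb_s >= needed_mb_s


if __name__ == "__main__" and len(sys.argv) > 1:
    rate = measure_read_rate(sys.argv[1])
    print("%s: %.1f MB/s, fast enough: %s" % (sys.argv[1], rate, can_feed_drive(rate)))
```

If the source reads comfortably faster than the native rate and the drive
still does not stream, the source is unlikely to be the bottleneck.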

>
> See the st docs for info on what all that does and tune to taste.
>

We read the manuals, but they didn't help, because in the end the problem
wasn't software related.

Thanx,
> Later,
>
> Bill Carlson


 Elmar Kolkman





Re: DLT 8000 on Linux

2002-05-05 Thread Elmar Kolkman

> * Elmar Kolkman <[EMAIL PROTECTED]> (Wed, May 01, 2002 at 04:43:49PM +0200)
>
>> We are trying here to make backups of a Linux File Server on a DLT
>> 8000 tapedrive, but we can not get it to streaming mode. We tried
>> everything we
>
>> Does anyone have a clue ?
>
> to get the obvious out of the way,
> are you using a spooldisk or are you dumping directly to tape ?
> If the latter, I'm not surprised you do not get streaming mode.

It doesn't matter if we use a holding disk or not...

But the problem is solved by using another DLT drive (we were lucky enough
to have one). The problem seemed to be in the drive, probably a memory
problem or something like that. The backup is now running without a glitch.

>
>>

Thanx

Elmar





Re: problems with some failing backups

2002-05-05 Thread Michael Richardson

-BEGIN PGP SIGNED MESSAGE-


> "Ulrik" == Ulrik Sandberg <[EMAIL PROTECTED]> writes:
Ulrik> On Sun, 5 May 2002, Niall O Broin wrote:

>> On Sat, May 04, 2002 at 08:02:58PM -0400, Michael Richardson wrote:
>> 
>> > The three behind the firewall fail frequently, but not 100% of the
>> > time. I setup backups for just those hosts, and watch with tcpdump.
>> > I've built with the appropriate port ranges, but I never seen
>> > firewall failures, yet I get failures.
>> 
>> Speak to me brother ! I've been posting about a similar problem here
>> but I've got no responses. Do you get messages like these in the
>> report:
>> 
>>   serv1  /boot lev 0 FAILED [Request to serv1 timed out.]
>>   serv1  / lev 0 FAILED [Request to serv1 timed out.]

Ulrik> I had similar failures (using tar) and the reason was slow
Ulrik> estimates. I changed to dump and it worked. Then I split the file
Ulrik> system into smaller parts and went back to tar. It continued to
Ulrik> work.

  I use dump everywhere that I can, and often plan my file system sizes so
that I can always get a level 0 on a single tape.

Ulrik> I also increased the timeouts for estimates in amanda.conf
Ulrik> (etimeout) to 1200 seconds per file system.

  That's the kind of thing I was looking for...
  Still, 5 minutes per file system ought to have been enough.

Ulrik> Another thing to check is the network bandwidth limit (netusage),
Ulrik> which I believe can cause trouble if it is incorrectly set. Mine
Ulrik> is set to 8000.

  Hmm. My backup server actually has a 10Mb/s NIC on a 100Mb/s hub. I
can easily change this, but I'm skeptical that it matters.  Mine is set to 600KB.

]   ON HUMILITY: to err is human. To moo, bovine.   |  firewalls  [
]   Michael Richardson, Sandelman Software Works, Ottawa, ON|net architect[
] [EMAIL PROTECTED] http://www.sandelman.ottawa.on.ca/ |device driver[
] panic("Just another NetBSD/notebook using, kernel hacking, security guy");  [




-BEGIN PGP SIGNATURE-
Version: 2.6.3ia
Charset: latin1
Comment: Finger me for keys

iQCVAwUBPNWKdoqHRg3pndX9AQF/zgQAxQ85NrWJUOkwvge7bd2p/aEr99DYMOTn
dT36e4hGE4qZ4gZ1NUoyK7ZryVeB9qf4HlPba5ZPG1EkvJ1tHeoRERNopJPZhTs3
nbZXlz6Jfe16Lj01KtToj1ljh5KJUuRar4424Hhe00T8IJQW92+I46A7gFpBsQ5u
breITkMW+Es=
=cxkh
-END PGP SIGNATURE-



Re: problems with some failing backups

2002-05-05 Thread Ulrik Sandberg

On Sun, 5 May 2002, Niall O Broin wrote:

> On Sat, May 04, 2002 at 08:02:58PM -0400, Michael Richardson wrote:
>
> >   The three behind the firewall fail frequently, but not 100% of the time.
> > I setup backups for just those hosts, and watch with tcpdump. I've built with
> > the appropriate port ranges, but I never seen firewall failures, yet I get
> > failures.
>
> Speak to me brother ! I've been posting about a similar problem here but
> I've got no responses. Do you get messages like these in the report:
>
>   serv1  /boot lev 0 FAILED [Request to serv1 timed out.]
>   serv1  / lev 0 FAILED [Request to serv1 timed out.]

I had similar failures (using tar) and the reason was slow estimates. I
changed to dump and it worked. Then I split the file system into smaller
parts and went back to tar. It continued to work.

I also increased the timeouts for estimates in amanda.conf (etimeout) to
1200 seconds per file system.
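
As amanda.conf(5) describes it, a positive etimeout is a per-disk value
that the planner multiplies by the number of disks on the client, while a
negative value is taken as the total wait for that client. Check this
against the man page for your version. Under that reading, the effective
wait can be sketched as follows (the function name is made up for this
example):

```python
def estimate_wait_seconds(etimeout, disks_on_client):
    """Seconds the planner waits for all estimates from one client.

    Assumes the amanda.conf(5) semantics: a positive etimeout is per disk,
    a negative etimeout is the total for the whole client.
    """
    if etimeout < 0:
        return -etimeout
    return etimeout * disks_on_client


# A client with two file systems, / and /boot:
print(estimate_wait_seconds(300, 2))    # default of 5 minutes per disk
print(estimate_wait_seconds(1200, 2))   # the 1200 seconds used above
```

With two file systems, raising etimeout from 300 to 1200 extends the wait
for that client from 10 minutes to 40 minutes.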

Another thing to check is the network bandwidth limit (netusage), which
I believe can cause trouble if it is incorrectly set. Mine is set to 8000.

--
Ulrik Sandberg





Re: problems with some failing backups

2002-05-05 Thread Michael Richardson

-BEGIN PGP SIGNED MESSAGE-


> "Niall" == Niall O Broin <[EMAIL PROTECTED]> writes:
>> I backup 7 local systems with Amanda.
>> 
>> Three Linux boxes (1 Debian/i386, 1 RH/i386, 1 RH/Netwinder), and four
>> NetBSD/i386 boxes. There is a NetBSD/ipf firewall between the backup
>> server (NetBSD/i386) and some of the boxes. Some of the backups also
>> occur over IPsec (yes, even though they are all "local").
>> 
>> Two boxes on the same wire as backup server (plus the server itself)
>> work flawlessly. The IPsec connected ones work fine.
>> 
>> The three behind the firewall fail frequently, but not 100% of the
>> time.  I setup backups for just those hosts, and watch with
>> tcpdump. I've built with the appropriate port ranges, but I never seen
>> firewall failures, yet I get failures.

Niall> Speak to me brother ! I've been posting about a similar problem
Niall> here but I've got no responses. Do you get messages like these in
Niall> the report:

Niall>   serv1  /boot lev 0 FAILED [Request to serv1 timed out.]
Niall>   serv1  / lev 0 FAILED [Request to serv1 timed out.]

  Bingo. What is the firewall?
  
Niall> I've a different situation - my failing machines are 2 X 1.2 GHz
Niall> and 1 x 250MHz. However, my firewall is quite a slow box - I can't
Niall> reach it now to say exactly. I suspect that the firewall can't
Niall> handle the load, although I have clients using NFS accessing
Niall> servers through it. However, NFS as a protocol is good at error
Niall> recovery so that's probably the answer.

  The firewall is a 233MHz PII. The load on it is negligible. It has a 3Mb
bridged ethernet ADSL in front of it which is pretty much busy all the time.

>> My impression is that the failures are because the backup time
>> estimates take too long and the backup server gives up on them. On
>> the clients, I don't see any errors in the /tmp/amanda output - it
>> looks normal to me.

Niall> At the end of amandad.debug on a failing client I see

Niall> amandad: sending REP packet:

Niall> Amanda 2.4 REP HANDLE 002-F8B30708 SEQ 1020382783
Niall> OPTIONS maxdumps=1;
Niall> / 0 SIZE 6929200
Niall> /boot 0 SIZE 3600

Niall> amandad: dgram_recv: timeout after 10 seconds
Niall> amandad: waiting for ack: timeout, retrying
Niall> amandad: dgram_recv: timeout after 10
  Yeah... I get that as well. But not always.

  One possibility is that the firewall's state for the UDP connection is
failing. I would expect to see something in the firewall logs on this, and
I'd expect to see the port 10080 packet on one side of the firewall and not
on the other.

  I will test with turning off stateful inspection on the UDP stream and see
what happens.

  If this is the case, then Amanda perhaps needs to do keepalives.
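
  The reasoning behind the state hypothesis can be put as a toy model: the
firewall keeps UDP state only while packets keep flowing, so a reply that
arrives after a long, silent estimate may no longer match any state. The
sketch below is only that model. The timeout values are placeholders, not
ipf's actual defaults, and the keepalive is hypothetical, since nothing in
this thread says Amanda sends one.

```python
def reply_passes(estimate_seconds, state_timeout, keepalive_interval=None):
    """Return True if the client's reply would still match firewall state.

    estimate_seconds   -- silence between the request and the reply
    state_timeout      -- idle time after which the firewall drops UDP state
    keepalive_interval -- hypothetical keepalive period, or None for none
    """
    if keepalive_interval is None:
        longest_gap = estimate_seconds
    else:
        # With keepalives, the flow is never idle for longer than one interval.
        longest_gap = min(estimate_seconds, keepalive_interval)
    return longest_gap <= state_timeout


# Placeholder numbers: a 60 second state timeout and a 5 minute estimate.
print(reply_passes(30, 60))                          # quick estimate
print(reply_passes(300, 60))                         # slow estimate, no keepalive
print(reply_passes(300, 60, keepalive_interval=20))  # slow estimate, keepalives
```

  In this model a slow estimate fails only when nothing crosses the
firewall in the meantime, which would fit failures that show up on large
file systems but not on small ones.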

Niall> Like you, I've RTFM and STFW but to no avail. I didn't get to the
Niall> tcpdump stage yet, mind you.

  Thank you for the reply.

]   ON HUMILITY: to err is human. To moo, bovine.   |  firewalls  [
]   Michael Richardson, Sandelman Software Works, Ottawa, ON|net architect[
] [EMAIL PROTECTED] http://www.sandelman.ottawa.on.ca/ |device driver[
] panic("Just another NetBSD/notebook using, kernel hacking, security guy");  [

-BEGIN PGP SIGNATURE-
Version: 2.6.3ia
Charset: latin1
Comment: Finger me for keys

iQCVAwUBPNVtjoqHRg3pndX9AQGllgP/S+j0m0tDguBmF2mXQo4CIZB3Lgr2/4r9
Cqagt4YVlQ0P1QJsxvfPGoulHk06nKjl01lZl85IHokVQ5jtIbAy/b92WuEXXFuN
SO7R5Oq1t2LonlKUYG3oMRuCGp+4a2dkK3//o9ZzWWBakJwX0Ei3/BswjlImZpxB
b1VITDBrfFk=
=VyLf
-END PGP SIGNATURE-



Re: problems with some failing backups

2002-05-05 Thread Niall O Broin

On Sat, May 04, 2002 at 08:02:58PM -0400, Michael Richardson wrote:

>   I backup 7 local systems with Amanda.
> 
>   Three Linux boxes (1 Debian/i386, 1 RH/i386, 1 RH/Netwinder), and 
> four NetBSD/i386 boxes. There is a NetBSD/ipf firewall between the backup
> server (NetBSD/i386) and some of the boxes. Some of the backups also occur
> over IPsec (yes, even though they are all "local").
> 
>   Two boxes on the same wire as backup server (plus the server itself) 
> work flawlessly. The IPsec connected ones work fine.
> 
>   The three behind the firewall fail frequently, but not 100% of the time.
> I setup backups for just those hosts, and watch with tcpdump. I've built with 
> the appropriate port ranges, but I never seen firewall failures, yet I get
> failures. 

Speak to me brother ! I've been posting about a similar problem here but
I've got no responses. Do you get messages like these in the report:

  serv1  /boot lev 0 FAILED [Request to serv1 timed out.]
  serv1  / lev 0 FAILED [Request to serv1 timed out.]
  
My remote (to describe the machines on the other side of the firewall)
backups fail nearly all the time. My boxes are all Linux with large / and
small /boot partitions. Sometimes L0 backups of /boot work, and once or
twice I got an L0 of / to work (of one client) but generally all that works
is when I get L1 of /boot, which is of course tiny.

>   Coincidentally, the machines that fail are all less than 300Mhz systems,
> (233Mhz, 350Mhz, 200Mhz), while the machines that work are 650Mhz+. The
> backup server itself, however is a K5-133 running NetBSD/i386, and a lot of
> SCSI spindles. (Yeah, it needs to be replaced)

I've a different situation - my failing machines are 2 X 1.2 GHz and 1 x
250MHz. However, my firewall is quite a slow box - I can't reach it now to
say exactly. I suspect that the firewall can't handle the load, although I
have clients using NFS accessing servers through it. However, NFS as a
protocol is good at error recovery so that's probably the answer.

>   My impression is that the failures are because the backup time estimates 
> take too long and the backup server gives up on them. On the clients, I
> don't see any errors in the /tmp/amanda output - it looks normal to me.

At the end of amandad.debug on a failing client I see

amandad: sending REP packet:

Amanda 2.4 REP HANDLE 002-F8B30708 SEQ 1020382783
OPTIONS maxdumps=1;
/ 0 SIZE 6929200
/boot 0 SIZE 3600


amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, giving up!

which is presumably related to the timeout in the mail reports.
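
To compare several clients, the retry pattern above can be counted rather
than read by eye. This is a rough sketch that matches only the two message
forms shown in this log excerpt; other Amanda versions may word them
differently, and the function name is made up for this example.

```python
import re

# Patterns taken from the amandad.debug excerpt above.
TIMEOUT_RE = re.compile(r"^amandad: dgram_recv: timeout after (\d+) seconds")
GIVE_UP_RE = re.compile(r"^amandad: waiting for ack: timeout, giving up!")

SAMPLE = """\
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, retrying
amandad: dgram_recv: timeout after 10 seconds
amandad: waiting for ack: timeout, giving up!
"""


def summarize_ack_timeouts(lines):
    """Count ack timeouts, total seconds waited, and whether amandad gave up."""
    timeouts = 0
    seconds_waited = 0
    gave_up = False
    for line in lines:
        line = line.strip()
        match = TIMEOUT_RE.match(line)
        if match:
            timeouts += 1
            seconds_waited += int(match.group(1))
        elif GIVE_UP_RE.match(line):
            gave_up = True
    return {"timeouts": timeouts, "seconds_waited": seconds_waited, "gave_up": gave_up}


print(summarize_ack_timeouts(SAMPLE.splitlines()))
```

For the excerpt above this reports five timeouts, 50 seconds waited, and
that amandad gave up, so the client stops waiting for the ACK after
roughly 50 seconds.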

>   I've been through the documentation and the FAQs, and I've watched
> tcpdump's of the traffic going through... nothing obvious.

Like you, I've RTFM and STFW but to no avail. I didn't get to the
tcpdump stage yet, mind you.


Kindest regards,


Niall  O Broin