Re: amdump inconsistancy.

2000-12-13 Thread Chris Karakas

"John R. Jackson" wrote:
 
 Thanks John, I think you are absolutely right to question the dump
 program.  ...
 It's 0.4b19 version ...
 
 Well, I'd be a lot happier if it was ancient so I could blame it :-).
 
I've read in this list that there is a Linux dump 0.4b20 out there, so
there might be hope for you, John :-)

-- 
Regards

Chris Karakas
Dont waste your cpu time - crack rc5: http://www.distributed.net



Re: amdump inconsistancy.

2000-12-12 Thread Hien Viet Lieu

Thanks John, I think you are absolutely right to question the dump
program. The inconsistency may be due to the dump command, but it's more
likely that the processes it depends on are the main cause, and I don't know
how to find out what these processes are. 
 
 
 Anytime a filesystem failed to back up, the tape drive seemed to be idle forever
 until the READ_TIMEOUT period lapsed, i.e. no activity was shown.
 
 That could be normal.  Looking at the amdump.NN file, are the file
 systems that time out being done with PORT-DUMP (direct to tape) or
 FILE-DUMP (through the holding disk)?  If they are going through the
 holding disk, it could be normal for the tape to be idle waiting on
 something to do.
 
It's being done with PORT-DUMP as I don't use the holding disk. However, I
did try using the holding disk, but the result was no different.


 What happens if you do something like this:
 
   /sbin/dump 0f - > /dev/null /

This is what I've got when it's done successfully:

  DUMP: Date of this level 0 dump: Tue Dec 12 16:02:22 2000
  DUMP: Date of last level 0 dump: the epoch
  DUMP: Dumping /dev/sda1 (/) to standard output
  DUMP: Label: none
  DUMP: mapping (Pass I) [regular files]
  DUMP: mapping (Pass II) [directories]
  DUMP: estimated 1990332 tape blocks.
  DUMP: Volume 1 started at: Tue Dec 12 16:02:29 2000
  DUMP: dumping (Pass III) [directories]
  DUMP: dumping (Pass IV) [regular files]
  DUMP: Volume 1 completed at: Tue Dec 12 16:05:27 2000
  DUMP: Volume 1 took 0:02:58
  DUMP: Volume 1 transfer rate: 12022 KB/s
  DUMP: 2139918 tape blocks (2089.76MB)
  DUMP: finished in 178 seconds, throughput 12022 KBytes/sec
  DUMP: Date of this level 0 dump: Tue Dec 12 16:02:22 2000
  DUMP: Date this dump completed:  Tue Dec 12 16:05:27 2000
  DUMP: Average transfer rate: 12022 KB/s
  DUMP: DUMP IS DONE

but when it wasn't successful, it got stuck after the "Pass IV" step and
waited for something indefinitely.

 
 or this:
 
   /sbin/dump 0f - / | rsh localhost "cat > /dev/null"
 
 What version of dump are you using?  You really, really, really want
 to get the latest stuff from sourceforge.  It was reportedly pretty bad
 for a while but has gotten a good maintainer now and is much better.
 
It's version 0.4b19, which came with RedHat 7.0, so I guess it's pretty current.


 
 Clearly the dumps are getting started and moving some data, so something
 must be freezing up.  I'm not sure how to track this down other than to
 try and catch it in the act and run gcore on the various programs and
 then a debugger to see what they are waiting on.
 
Yes, it would be great to know exactly how to track it down. What are the
various programs you are talking about here, John?

 John R. Jackson, Technical Software Specialist, [EMAIL PROTECTED]

Thanks,
Hien.



Re: amdump inconsistancy.

2000-12-12 Thread John R. Jackson

Thanks John, I think you are absolutely right to question the dump
program.  ...
It's 0.4b19 version ...

Well, I'd be a lot happier if it was ancient so I could blame it :-).
You're right that that seems pretty recent, so it may not be the culprit.

The inconsistency may be due to the dump command, but it's more
likely that the processes it depends on are the main cause, and I don't know
how to find out what these processes are. 

"ps" is your friend.  Find all the "dump" processes, use "ps -fp PID"
to display their parent, and keep working your way up the process tree
(there may be some program that does this on your OS, but I don't know
what it is).  Draw a picture as you go.

As I recall, there will be three+ dump processes at the lowest layer
with no children.  Above them will be one or more single dump processes.
The parent of the last one will be sendbackup.  Sendbackup may have
other children (e.g. gzip).  You do not need to go back any further than
sendbackup -- it's the one doing the network I/O.

Here's a sample using a program I have called "pstree":

 \-+- 01191 root /usr/sbin/inetd -s
   \-+- 02071 backup amandad
     \-+- 02072 backup /opt/amanda-2.4.2/libexec/sendbackup
       \-+- 05481 backup ufsdump 0sf 1048576 - /dev/rdsk/c0t0d0s0
         \-+- 05488 backup ufsdump 0sf 1048576 - /dev/rdsk/c0t0d0s0
           |--- 05489 backup ufsdump 0sf 1048576 - /dev/rdsk/c0t0d0s0
           |--- 05490 backup ufsdump 0sf 1048576 - /dev/rdsk/c0t0d0s0
           |--- 05492 backup ufsdump 0sf 1048576 - /dev/rdsk/c0t0d0s0
           \--- 05491 backup ufsdump 0sf 1048576 - /dev/rdsk/c0t0d0s0
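
If no pstree-like tool is at hand, the manual walk up the process tree can be
roughly sketched as a small shell function. This is only an illustration: the
name "ancestors" is made up here, and it assumes the input is a plain
"PID PPID COMMAND" listing, one process per line.

```shell
# ancestors START_PID
# Reads "PID PPID COMMAND..." lines on stdin and prints the chain from
# START_PID up to the oldest ancestor present in the input, one per line.
# (Hypothetical helper, not part of Amanda or of any ps package.)
ancestors() {
    awk -v start="$1" '
        {
            pid = $1 + 0                 # "+ 0" drops leading zeros
            parent[pid] = $2 + 0
            $1 = ""; $2 = ""             # keep only the command text
            sub(/^ +/, "")
            cmd[pid] = $0
        }
        END {
            pid = start + 0
            # Stop at a PID that is not in the listing; the NR bound
            # guards against a listing that loops back on itself.
            for (n = 0; (pid in parent) && n < NR; n++) {
                printf "%d %s\n", pid, cmd[pid]
                pid = parent[pid]
            }
        }'
}
```

On a system whose ps accepts the POSIX -e and -o options, it could be fed
like this (5489 being one of the leaf dump processes in the sample above):

   ps -e -o pid= -o ppid= -o args= | ancestors 5489

The first line printed is the process itself and the last is the oldest
ancestor ps reported; sendbackup should show up somewhere in between.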

On the server side you need to find which dumper process is getting the
data (look at the amdump file or use something like lsof) and then the
two taper processes.

Run truss or the equivalent on each process and see if it is making any
progress.  If none of them are, try gcore-ing the Amanda ones (sendbackup,
dumper and the tapers) then run gdb on the binary and core file to see
where they are stopped.  If you didn't do so, you might want to rebuild
Amanda first with -g and not -O (and maybe without shared libraries)
so you can get good traceback data.
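
The gcore-then-gdb step might look something like the sketch below. The
function name "stack_of" and the "stuck" core file prefix are invented for
illustration, and it assumes a gcore that accepts "-o PREFIX" and writes
PREFIX.PID, plus a gdb that understands -batch and -ex. Check your local man
pages before relying on it.

```shell
# stack_of BINARY PID
# Snapshot a running process with gcore, then print a backtrace from
# the resulting core file. GCORE and GDB may be set to override the
# programs used (for example with full paths).
# (Hypothetical helper; the file names are placeholders.)
stack_of() {
    binary=$1
    pid=$2
    prefix=${TMPDIR:-/tmp}/stuck
    # gcore writes PREFIX.PID and leaves the process running.
    "${GCORE:-gcore}" -o "$prefix" "$pid" || return 1
    # -batch makes gdb exit after the -ex commands; bt prints the stack.
    "${GDB:-gdb}" -batch -ex bt "$binary" "$prefix.$pid"
}
```

For the sendbackup process in the sample tree, this would be called as:

   stack_of /opt/amanda-2.4.2/libexec/sendbackup 2072

It has to run as a user allowed to attach to the process (root or the
backup user), and the traceback is only as good as the debug information
in the binary.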

 What happens if you do something like this:
 
   /sbin/dump 0f - > /dev/null /

This is what I've got when it's done successfully:

OK, that tells us it isn't something fatally flawed, a.k.a. obvious :-).
It's probably some network congestion issue someplace.  Sigh.

Hien.

John R. Jackson, Technical Software Specialist, [EMAIL PROTECTED]



Re: amdump inconsistancy.

2000-12-11 Thread John R. Jackson

Anytime a filesystem failed to back up, the tape drive seemed to be idle forever
until the READ_TIMEOUT period lapsed, i.e. no activity was shown.

That could be normal.  Looking at the amdump.NN file, are the file
systems that time out being done with PORT-DUMP (direct to tape) or
FILE-DUMP (through the holding disk)?  If they are going through the
holding disk, it could be normal for the tape to be idle waiting on
something to do.

What happens if you do something like this:

  /sbin/dump 0f - > /dev/null /

or this:

  /sbin/dump 0f - / | rsh localhost "cat > /dev/null"

What version of dump are you using?  You really, really, really want
to get the latest stuff from sourceforge.  It was reportedly pretty bad
for a while but has gotten a good maintainer now and is much better.

These are some typical errors from the log:
...

Clearly the dumps are getting started and moving some data, so something
must be freezing up.  I'm not sure how to track this down other than to
try and catch it in the act and run gcore on the various programs and
then a debugger to see what they are waiting on.

Hien Viet Lieu

John R. Jackson, Technical Software Specialist, [EMAIL PROTECTED]