Re: problem backing up a host with more than 171 disklist entries of root-tar

2001-05-14 Thread Bernhard R. Erdmann

> Please let me know if this solves your problem so I can get it into the
> source tree.

Amanda rocks! (No problem after applying your patch)

Thank you very much for your quick help - what's the number of your
PayPal account? ;-)

Regards,
Bernie




Re: problem backing up a host with more than 171 disklist entries of root-tar

2001-05-13 Thread John R. Jackson

I was able to reproduce the problem.  It was what I thought it was, but
the original patch was the wrong solution (sort of -- it gets complicated
with the multiple OS's we have to support).  The following patch forks
a separate amandad process just to write the packet to the service child
and appears to work.
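
For readers without the attachment, here is a rough Python sketch of the
general idea: a separate forked process does nothing but write the request
to the service child, so the parent stays free to read the reply. This is
not the code in amandad.diff. The function name, the stand-in child (a tiny
script that copies stdin to stdout) and the request data are made up for
illustration, and it assumes a POSIX system where os.fork is available.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for the service child: copies stdin to stdout in chunks.
CHILD = "import shutil, sys; shutil.copyfileobj(sys.stdin.buffer, sys.stdout.buffer)"


def exchange_with_forked_writer(request: bytes) -> bytes:
    """Send request to a child and collect its reply, writing from a helper process."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    writer_pid = os.fork()
    if writer_pid == 0:
        # Writer process: its only job is to push the request down the pipe.
        # It may block for as long as it likes without stalling the parent.
        status = 0
        try:
            proc.stdout.close()
            proc.stdin.write(request)
            proc.stdin.close()
        except Exception:
            status = 1
        finally:
            os._exit(status)
    # Parent: drop its copy of the write end so the child sees EOF
    # as soon as the writer process has finished.
    proc.stdin.close()
    reply = proc.stdout.read()  # safe to block here; the writer runs on its own
    proc.stdout.close()
    os.waitpid(writer_pid, 0)
    proc.wait()
    return reply


if __name__ == "__main__":
    payload = b"/home/User/xx\n" * 30000  # far more than one pipe buffer
    print(len(exchange_with_forked_writer(payload)), "bytes came back")
```

The extra process costs a fork per request, but it keeps the parent's
existing read loop untouched.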

It also fixes a buffer overflow bug I happened to notice.

Please let me know if this solves your problem so I can get it into the
source tree.

John R. Jackson, Technical Software Specialist, [EMAIL PROTECTED]

 amandad.diff


Re: problem backing up a host with more than 171 disklist entries of root-tar

2001-05-12 Thread Bernhard R. Erdmann

> I compiled the whole suite (without amandad.diff, but with make
> CFLAGS="-g") and copied client-src/.libs/{amandad,selfcheck} to
> /usr/libexec/amanda/.

I forgot to mention that advfs.diff is applied.



Re: problem backing up a host with more than 171 disklist entries of root-tar

2001-05-12 Thread Bernhard R. Erdmann

Hi,

> So the next step is to make sure amandad and sendsize were compiled
> with -g, get them hung and attach a debugger to them, then get a stack
> traceback ("where") so we can see where they are stopped.

I compiled the whole suite (without amandad.diff, but with make
CFLAGS="-g") and copied client-src/.libs/{amandad,selfcheck} to
/usr/libexec/amanda/.

Using 172 disklist entries of type root-tar:

$ ps x
  PID TTY      STAT   TIME COMMAND
26808 pts/2    S      0:00 -bash
26840 ?        S      0:00 amandad
26847 pts/1    S      0:00 -bash
26842 ?        S      0:00 /usr/libexec/amanda/selfcheck
26874 pts/1    R      0:00 ps x

$ gdb /usr/libexec/amanda/amandad 26840
GNU gdb 19991004
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...

/var/lib/amanda/26840: No such file or directory.
Attaching to program: /usr/libexec/amanda/amandad, Pid 26840
Reading symbols from /usr/lib/amanda/libamanda-2.4.2p2.so...done.
Reading symbols from /lib/libm.so.6...done.
Reading symbols from /usr/lib/libreadline.so.3...done.
Reading symbols from /lib/libtermcap.so.2...done.
Reading symbols from /lib/libnsl.so.1...done.
Reading symbols from /usr/lib/amanda/libamclient-2.4.2p2.so...done.
Reading symbols from /lib/libc.so.6...done.
Reading symbols from /lib/ld-linux.so.2...done.
Reading symbols from /lib/libnss_files.so.2...done.
0x40136af4 in __libc_write () from /lib/libc.so.6
(gdb) where
#0  0x40136af4 in __libc_write () from /lib/libc.so.6
#1  0x401801cc in ?? () from /lib/libc.so.6
#2  0x400a89cb in __libc_start_main (main=0x8048ff0 <main>, argc=1,
    argv=0xbe74, init=0x8048c34 <_init>, fini=0x804ad9c <_fini>,
    rtld_fini=0x4000aea0 <_dl_fini>, stack_end=0xbe6c)
    at ../sysdeps/generic/libc-start.c:92
(gdb) quit
The program is running.  Quit anyway (and detach it)? (y or n) y
Detaching from program: /usr/libexec/amanda/amandad, Pid 26840

$ gdb /usr/libexec/amanda/selfcheck 26842
GNU gdb 19991004
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...

/var/lib/amanda/26842: No such file or directory.
Attaching to program: /usr/libexec/amanda/selfcheck, Pid 26842
Reading symbols from /usr/lib/amanda/libamanda-2.4.2p2.so...done.
Reading symbols from /lib/libm.so.6...done.
Reading symbols from /usr/lib/libreadline.so.3...done.
Reading symbols from /lib/libtermcap.so.2...done.
Reading symbols from /lib/libnsl.so.1...done.
Reading symbols from /usr/lib/amanda/libamclient-2.4.2p2.so...done.
Reading symbols from /lib/libc.so.6...done.
Reading symbols from /lib/ld-linux.so.2...done.
Reading symbols from /lib/libnss_files.so.2...done.
0x40136af4 in __libc_write () from /lib/libc.so.6
(gdb) where
#0  0x40136af4 in __libc_write () from /lib/libc.so.6
#1  0x401801cc in ?? () from /lib/libc.so.6
#2  0x400e68a4 in new_do_write (fp=0x4017e960,
    data=0x4002d000 " access /home/User/info (/home/User/info): Permission denied]\nERROR [could not access /home/User/ilnu (/home/User/ilnu): Permission denied]\nERROR [could not access /home/User/ijk (/home/Us"...,
    to_do=4096) at fileops.c:328
#3  0x400e6360 in _IO_new_do_write (fp=0x4017e960,
    data=0x4002d000 " access /home/User/info (/home/User/info): Permission denied]\nERROR [could not access /home/User/ilnu (/home/User/ilnu): Permission denied]\nERROR [could not access /home/User/ijk (/home/Us"...,
    to_do=4096) at fileops.c:301
#4  0x400e5a1e in _IO_new_file_overflow (f=0x4017e960, ch=-1) at fileops.c:441
#5  0x400e71a7 in __overflow (f=0x4017e960, ch=-1) at genops.c:197
#6  0x400e60a0 in _IO_new_file_xsputn (f=0x4017e960, data=0x804ce78, n=69)
    at fileops.c:803
#7  0x400d752c in _IO_vfprintf (s=0x4017e960, format=0x804a9b7 "ERROR [%s]\n",
    ap=0xbe10) at vfprintf.c:1259
#8  0x400de050 in printf (format=0x804a9b7 "ERROR [%s]\n") at printf.c:31
#9  0x804a016 in check_disk (program=0x804de30 "GNUTAR",
    disk=0x804de37 "/home/User/cn", level=0) at selfcheck.c:462
#10 0x8049380 in main (argc=1, argv=0xbf24) at selfcheck.c:157
(gdb) quit
The program is running.  Quit anyway (and detach it)? (y or n) y
Detaching from program: /usr/libexec/amanda/selfcheck, Pid 26842



Re: problem backing up a host with more than 171 disklist entries of root-tar

2001-05-12 Thread John R. Jackson

>the problem persists: selfcheck checked the last 100 lines of the
>disklist.

Well, nuts.  I was pretty sure that patch was involved.

In the first letter you said:

  After adding one or more lines to the disklist file, only the last 100
  lines get checked, then an amandad and a selfcheck process are left
  hanging around: ...

So the next step is to make sure amandad and sendsize were compiled
with -g, get them hung and attach a debugger to them, then get a stack
traceback ("where") so we can see where they are stopped.

If your OS has gcore, you might use it instead of attaching the debugger.
That way, if there are other questions (e.g. "what is in variable X"),
you'll be able to answer them right away without rerunning the test case.

If you need more explicit instructions on attaching a debugger to a
process or running gcore, just ask.

You can either do this with the patch or without, but let us know which
way it was so we can get line numbers matched up.

>Bernie

John R. Jackson, Technical Software Specialist, [EMAIL PROTECTED]



Re: problem backing up a host with more than 171 disklist entries of root-tar

2001-05-12 Thread Bernhard R. Erdmann

Hi,

> Please give the following patch a try and let me know if it solves the
> problem.

the problem persists: selfcheck checked the last 100 lines of the
disklist.

Yes, the patched amandad has been started:

amandad: debug 1 pid 26880 ruid 37 euid 37 start time Sat May 12 12:06:57 2001
amandad: version 2.4.2p2
amandad: build: VERSION="Amanda-2.4.2p2"
amandad:        BUILT_DATE="Sat May 12 12:00:17 CEST 2001"

Regards,
Bernie



Re: problem backing up a host with more than 171 disklist entries of root-tar

2001-05-12 Thread Bernhard R. Erdmann

Hi,

> This sounds like a classic case of running out of file descriptors --
> either on a per-process basis, or on a system-wide basis (more likely
> per-process, as you seem to be able to reproduce it at will with the
> same number of disklist entries on that "host").

probably not on a system-wide basis:
(on that host)
# cat /proc/sys/fs/file-max 
4096
# cat /proc/sys/fs/file-nr 
1071    650     4096

Regards,
Bernie



Re: problem backing up a host with more than 171 disklist entries of root-tar

2001-05-11 Thread John R. Jackson

>Up to and including 171 disklist entries of type root-tar, everything is
>ok.  ...
>If I add some more disklist entries of the same type, amcheck hangs for
>a minute (ctimeout 60) and then reports "selfcheck request timed out. 
>Host down?"

Wow.  If it's what I think it is, that bug has been around forever.
Sheesh!  :-)

Please give the following patch a try and let me know if it solves the
problem.

Basically, there is a deadlock between amandad and the child process
it starts (selfcheck, in your case).  Amandad gets the request packet,
creates one pipe to write the request and one to read the result, then
forks the child.  It then writes the whole packet to the child before
reading anything back, and that's where the problem lies.  If the request
is bigger than the pipe can buffer, the write loop blocks.  Meanwhile
amandad never drains the result pipe, so once the child has filled it,
the child blocks too and stops reading the request.
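
To get a feel for how little data it takes, the following Python sketch
measures how many bytes a fresh pipe accepts before a writer would block.
pipe_capacity is a hypothetical helper, not part of Amanda, and the figure
it reports depends on the OS and kernel.

```python
import os


def pipe_capacity() -> int:
    """Return how many bytes a fresh pipe accepts before write() would block."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # a full pipe now raises instead of hanging
    total = 0
    try:
        while True:
            try:
                total += os.write(write_fd, b"x" * 1024)
            except BlockingIOError:
                return total  # nobody is reading, so the pipe is full
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("pipe holds", pipe_capacity(), "bytes before a writer would block")
```

Once both the request pipe and the result pipe are full and neither side
is reading, both processes sit in write() forever.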

This patch moves the write loop into the select loop that was already set
up to read the child result.  I did a minimal test and it didn't seem to
break anything.  Well, it didn't after I got it to stop dropping core :-).
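
The approach can be sketched in Python roughly as follows. This only
illustrates interleaving the write with the read in one select loop; it is
not the contents of amandad.diff. exchange, CHILD and the 4096-byte chunk
size are assumptions, and the stand-in child merely wraps each input line
in ERROR [...], similar to the selfcheck output seen in this thread. It
assumes a POSIX system where select works on pipes.

```python
import os
import select
import subprocess
import sys

# Hypothetical stand-in for the service child: answers each request line as
# it goes, so its output pipe fills while the request is still being sent.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write('ERROR [' + line.rstrip('\\n') + ']\\n')\n"
)


def exchange(request: bytes, chunk: int = 4096) -> bytes:
    """Write request to the child and read its reply from a single select loop."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    req_fd = proc.stdin.fileno()
    rep_fd = proc.stdout.fileno()
    os.set_blocking(req_fd, False)  # never let a write stall the loop
    pending = memoryview(request)
    reply = bytearray()
    sending = True
    while True:
        readable, writable, _ = select.select(
            [rep_fd], [req_fd] if sending else [], []
        )
        if writable:
            try:
                sent = os.write(req_fd, pending[:chunk])
            except BlockingIOError:
                sent = 0  # pipe filled up again; retry after the next select
            pending = pending[sent:]
            if len(pending) == 0:
                proc.stdin.close()  # EOF tells the child the request is complete
                sending = False
        if readable:
            data = os.read(rep_fd, chunk)
            if not data:  # child closed its end: the reply is complete
                break
            reply += data
    proc.stdout.close()
    proc.wait()
    return bytes(reply)


if __name__ == "__main__":
    request = b"".join(b"/home/User/u%05d\n" % i for i in range(20000))
    print(exchange(request).count(b"\n"), "reply lines")
```

Because the loop drains the reply whenever data is available, neither pipe
can stay full long enough to wedge both processes.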

I haven't looked at this w.r.t. 2.5 yet.  I suspect things are much
different there.

John R. Jackson, Technical Software Specialist, [EMAIL PROTECTED]

 amandad.diff


Re: problem backing up a host with more than 171 disklist entries of root-tar

2001-05-11 Thread Marty Shannon, RHCE

This sounds like a classic case of running out of file descriptors --
either on a per-process basis, or on a system-wide basis (more likely
per-process, as you seem to be able to reproduce it at will with the
same number of disklist entries on that "host").

It seems to me that Amanda should specifically check for the
open/socket/whatever system call that is returning with errno set to
EMFILE (or, on some brain damaged systems, EAGAIN).  When that happens,
Amanda should wait for some of the existing connections to be taken down
(i.e., closed).
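
A minimal Python sketch of that suggestion, assuming the caller can supply
some way to wait for (or force) a descriptor to be freed.
retry_on_fd_exhaustion and its parameters are hypothetical names, not
Amanda functions, and treating ENFILE like EMFILE is my own addition.

```python
import errno
import os

# errno values taken to mean "out of descriptors"; ENFILE is an assumption
# added here for the system-wide case.
RETRYABLE = (errno.EMFILE, errno.ENFILE, errno.EAGAIN)


def retry_on_fd_exhaustion(action, wait_for_free_slot, attempts=5):
    """Run action(); on descriptor exhaustion, wait for a free slot and retry."""
    for _ in range(attempts):
        try:
            return action()
        except OSError as exc:
            if exc.errno not in RETRYABLE:
                raise  # unrelated failure: report it as usual
            wait_for_free_slot()  # e.g. block until an existing connection closes
    return action()  # last try; any error now propagates to the caller


if __name__ == "__main__":
    fd = retry_on_fd_exhaustion(
        lambda: os.open(os.devnull, os.O_RDONLY),
        lambda: None,
    )
    print("opened descriptor", fd)
    os.close(fd)
```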

Cheers,
Marty

--
Marty Shannon, RHCE, Independent Computing Consultant
mailto:[EMAIL PROTECTED]



problem backing up a host with more than 171 disklist entries of root-tar

2001-05-11 Thread Bernhard R. Erdmann

Hi,

I'm using Amanda 2.4.2p2 on/for a Linux box (RH 6.2, 2.2.19, GNU tar
1.13.17) to back up home directories on a NetApp Filer mounted with NFS.

Up to and including 171 disklist entries of type root-tar, everything is
ok. amcheck complains about the home directories not being accessible
(amanda has uid 37), but runtar gets them since it runs with euid 0 (NFS
export with no root squashing). It takes about 3 secs for amcheck to
check these lines.

If I add some more disklist entries of the same type, amcheck hangs for
a minute (ctimeout 60) and then reports "selfcheck request timed out. 
Host down?"

/tmp/amanda gets three more files: amanda..debug, amcheck...
and selfcheck...
With up to 171 entries, selfcheck..debug grows to 28387 Bytes
containing 171 lines "could not access". Using 172 entries, it stops at
16427 Bytes and contains only 100 lines "could not access" (o.k. because
of NFS permissions). The last line of the disklist is checked first.
/tmp/amanda/selfcheck... ends with:
selfcheck: checking disk /home/User/cb
selfcheck: device /home/User/cb
selfcheck: could not access /home/User/cb (/home/User/cb): Permission denied
selfcheck: checking disk /home/User/ca
selfcheck: device /home/User/ca

After adding one or more lines to the disklist file, only the last 100
lines get checked, then an amandad and a selfcheck process are left
hanging around:
$ ps x
  PID TTY      STAT   TIME COMMAND
28833 pts/2    S      0:00 -bash
28854 pts/2    S      0:00 emacs -nw disklist
29000 pts/1    S      0:00 -bash
29149 ?        S      0:00 amandad
29151 ?        S      0:00 /usr/libexec/amanda/selfcheck
29182 pts/3    S      0:00 -bash
29227 pts/3    S      0:00 less selfcheck.20010511233745.debug
29230 pts/1    R      0:00 ps x

Killing selfcheck spawns another selfcheck process, and this one's debug
file stops after having checked the last 100 disklist lines, too.
$ kill 29151
$ ps x
  PID TTY      STAT   TIME COMMAND
28833 pts/2    S      0:00 -bash
28854 pts/2    S      0:00 emacs -nw disklist
29000 pts/1    S      0:00 -bash
29182 pts/3    S      0:00 -bash
29231 ?        S      0:00 amandad
29233 ?        S      0:00 /usr/libexec/amanda/selfcheck
29234 pts/1    R      0:00 ps x
$ kill 29233
$ ps x
  PID TTY      STAT   TIME COMMAND
28833 pts/2    S      0:00 -bash
28854 pts/2    S      0:00 emacs -nw disklist
29000 pts/1    S      0:00 -bash
29182 pts/3    S      0:00 -bash
29238 ?        S      0:00 amandad
29240 ?        S      0:00 /usr/libexec/amanda/selfcheck
29241 pts/1    R      0:00 ps x
$ kill 29240
$ ps x
  PID TTY      STAT   TIME COMMAND
28833 pts/2    S      0:00 -bash
28854 pts/2    S      0:00 emacs -nw disklist
29000 pts/1    S      0:00 -bash
29182 pts/3    S      0:00 -bash
29244 ?        S      0:00 amandad
29246 ?        D      0:00 /usr/libexec/amanda/selfcheck
29247 pts/1    R      0:00 ps x
$ kill 29246
$ ps x
  PID TTY      STAT   TIME COMMAND
28833 pts/2    S      0:00 -bash
28854 pts/2    S      0:00 emacs -nw disklist
29000 pts/1    S      0:00 -bash
29182 pts/3    S      0:00 -bash
29251 pts/1    R      0:00 ps x

Now it's got killed...


Any ideas?