On 3/18/19 12:20 PM, Ryan wrote: > Amanda 3.5.1 (RHEL7 community RPM) has been terribly buggy for us, and > most of it shows itself in the planner. > > Issue #1: > > Quick test, when ANY DLE entry is deleted, the planner will error out > with the message: > >> planner: assert: ep->estimate[0].level != -1 is false: file planner.c, >> line 1313 >> Aborted > > Just remove any entries from a healthy set and run this command to test: > >> /usr/libexec/amanda/planner $SET --starttime now --log-filename >> /tmp/planner.log
FWIW, I cannot reproduce this. Are there any other error messages? > > Issue #2 > In other, less well-defined situations, the planner will seemingly get > to the end of the estimates, then get stuck in some > read/close/fstat/futex/mmap loop: > >> [pid 2463] close(120)                 = -1 EBADF (Bad >> file descriptor) >> [pid 2461] read(4, <unfinished ...> >> [pid 2463] close(121 <unfinished ...> >> [pid 2461] <... read resumed> " 636/udp                  >>   "..., >> 4096) = 4096 >> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file >> descriptor) >> [pid 2463] close(122 <unfinished ...> >> [pid 2461] close(4 <unfinished ...> >> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file >> descriptor) >> [pid 2461] <... close resumed> )      = 0 >> [pid 2463] close(123 <unfinished ...> >> [pid 2461] munmap(0x7fd5e73d5000, 4096 <unfinished ...> >> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file >> descriptor) >> [pid 2463] close(124 <unfinished ...> >> [pid 2461] <... munmap resumed> )     = 0 >> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file >> descriptor) >> [pid 2461] futex(0x7fd5e0cb9240, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> >> [pid 2463] close(125 <unfinished ...> >> [pid 2461] <... futex resumed> )      = 1 >> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file >> descriptor) >> [pid 2463] close(126 <unfinished ...> >> [pid 2461] write(10, "Mon Mar 18 10:43:07.568314681 20"..., 120 >> <unfinished ...> >> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file >> descriptor) >> [pid 2461] <... write resumed> )      = 120 >> [pid 2463] close(127 <unfinished ...> >> [pid 2461] open("/etc/services", O_RDONLY|O_CLOEXEC <unfinished ...> >> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file >> descriptor) >> [pid 2461] <... open resumed> )       = 4 >> [pid 2463] close(128 <unfinished ...> >> [pid 2461] fstat(4, <unfinished ...> >> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file >> descriptor) >> [pid 2463] close(129 <unfinished ...> >> [pid 2461] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=670330, >> ...}) = 0 >> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file >> descriptor) >> [pid 2461] mmap(NULL, 4096, PROT_READ|PROT_WRITE, >> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...> >> [pid 2463] close(130 <unfinished ...> > > While this is going, the planner grows to consume over 100GBytes(!!!) of > memory before the OOM killer stops it. The problem I'm able to > reproduce right now is only logging a bit about ambind, which drew my > attention but seems nonsensical as this was working until days ago with > no server changes, and another set on the same system is still working > just fine: > >> Mon Mar 18 10:43:02.454918586 2019: pid 2453: thd-0x14ff600: planner: >> security_stream_close(0x7fd5cc0017e0) >> Mon Mar 18 10:43:02.499910629 2019: pid 2453: thd-0x1682000: planner: >> ambind failed: ambind: bind failed A: Permission denied >> Mon Mar 18 10:43:02.500014522 2019: pid 2453: thd-0x1682000: planner: >> stream_client: Could not bind to port in range 512-1023. >> Mon Mar 18 10:43:02.500092361 2019: pid 2453: thd-0x1682000: planner: >> security_seterror(handle=0x7fd5dc001500, driver=0x7fd5e6f32680 >> (BSDTCP) error=ambind: bind failed A: Permission denied) >> Mon Mar 18 10:43:02.500372813 2019: pid 2453: thd-0x14ff600: planner: >> security_close(handle=0x7fd5dc001500, driver=0x7fd5e6f32680 (BSDTCP)) >> Mon Mar 18 10:43:02.500754962 2019: pid 2453: thd-0x14ff600: planner: >> security_stream_close(0x7fd5dc001eb0) >> Mon Mar 18 10:43:02.519951961 2019: pid 2453: thd-0x1682050: planner: >> ambind failed: ambind: bind failed A: Permission denied >> Mon Mar 18 10:43:02.520019407 2019: pid 2453: thd-0x1682050: planner: >> stream_client: Could not bind to port in range 512-1023. > > > By doing nothing else but calling the planner from 3.5 instead of 3.5.1 > (which needs a handful of 3.5 libs as well), both these issues are > resolved. I tracked it down even more, and making the old planner.c use > newer libs gives it the same problems. The single problematic lib is > libamserver-3.5.1.so... Changing just that lib back to 3.5 fixes the > issue, for whatever that's worth to anyone here. > > I'm thinking a complete downgrade from 3.5.1 to 3.5 is called for... > I haven't seen the memory issue either. -- Orion Poplawski Manager of NWRA Technical Systems 720-772-5637 NWRA, Boulder/CoRA Office FAX: 303-415-9702 3380 Mitchell Lane or...@nwra.com Boulder, CO 80301 https://www.nwra.com/
smime.p7s
Description: S/MIME Cryptographic Signature