On 3/18/19 12:20 PM, Ryan wrote:
> Amanda 3.5.1 (RHEL7 community RPM) has been terribly buggy for us, and 
> most of it shows itself in the planner.
> 
> Issue #1:
> 
> Quick test, when ANY DLE entry is deleted, the planner will error out 
> with the message:
> 
>> planner: assert: ep->estimate[0].level != -1 is false: file planner.c, 
>> line 1313
>> Aborted
> 
> Just remove any entries from a healthy set and run this command to test:
> 
>> /usr/libexec/amanda/planner $SET --starttime now --log-filename 
>> /tmp/planner.log

FWIW, I cannot reproduce this.  Are there any other error messages?

> 
> Issue #2
> In other, less well-defined situations, the planner will seemingly get 
> to the end of the estimates, then get stuck in some 
> read/close/fstat/futex/mmap loop:
> 
>> [pid  2463] close(120)                  = -1 EBADF (Bad 
>> file descriptor)
>> [pid  2461] read(4,  <unfinished ...>
>> [pid  2463] close(121 <unfinished ...>
>> [pid  2461] <... read resumed> " 636/udp                  
>>    "..., 
>> 4096) = 4096
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2463] close(122 <unfinished ...>
>> [pid  2461] close(4 <unfinished ...>
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2461] <... close resumed> )       = 0
>> [pid  2463] close(123 <unfinished ...>
>> [pid  2461] munmap(0x7fd5e73d5000, 4096 <unfinished ...>
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2463] close(124 <unfinished ...>
>> [pid  2461] <... munmap resumed> )      = 0
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2461] futex(0x7fd5e0cb9240, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>> [pid  2463] close(125 <unfinished ...>
>> [pid  2461] <... futex resumed> )       = 1
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2463] close(126 <unfinished ...>
>> [pid  2461] write(10, "Mon Mar 18 10:43:07.568314681 20"..., 120 
>> <unfinished ...>
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2461] <... write resumed> )       = 120
>> [pid  2463] close(127 <unfinished ...>
>> [pid  2461] open("/etc/services", O_RDONLY|O_CLOEXEC <unfinished ...>
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2461] <... open resumed> )        = 4
>> [pid  2463] close(128 <unfinished ...>
>> [pid  2461] fstat(4,  <unfinished ...>
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2463] close(129 <unfinished ...>
>> [pid  2461] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=670330, 
>> ...}) = 0
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2461] mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
>> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
>> [pid  2463] close(130 <unfinished ...>
> 
> While this is going, the planner grows to consume over 100GBytes(!!!) of 
> memory before the OOM killer stops it.  The problem I'm able to 
> reproduce right now is only logging a bit about ambind, which drew my 
> attention but seems nonsensical as this was working until days ago with 
> no server changes, and another set on the same system is still working 
> just fine:
> 
>> Mon Mar 18 10:43:02.454918586 2019: pid 2453: thd-0x14ff600: planner: 
>> security_stream_close(0x7fd5cc0017e0)
>> Mon Mar 18 10:43:02.499910629 2019: pid 2453: thd-0x1682000: planner: 
>> ambind failed: ambind: bind failed A: Permission denied
>> Mon Mar 18 10:43:02.500014522 2019: pid 2453: thd-0x1682000: planner: 
>> stream_client: Could not bind to port in range 512-1023.
>> Mon Mar 18 10:43:02.500092361 2019: pid 2453: thd-0x1682000: planner: 
>> security_seterror(handle=0x7fd5dc001500, driver=0x7fd5e6f32680 
>> (BSDTCP) error=ambind: bind failed A: Permission denied)
>> Mon Mar 18 10:43:02.500372813 2019: pid 2453: thd-0x14ff600: planner: 
>> security_close(handle=0x7fd5dc001500, driver=0x7fd5e6f32680 (BSDTCP))
>> Mon Mar 18 10:43:02.500754962 2019: pid 2453: thd-0x14ff600: planner: 
>> security_stream_close(0x7fd5dc001eb0)
>> Mon Mar 18 10:43:02.519951961 2019: pid 2453: thd-0x1682050: planner: 
>> ambind failed: ambind: bind failed A: Permission denied
>> Mon Mar 18 10:43:02.520019407 2019: pid 2453: thd-0x1682050: planner: 
>> stream_client: Could not bind to port in range 512-1023.
> 
> 
> By doing nothing else but calling the planner from 3.5 instead of 3.5.1 
> (which needs a handful of 3.5 libs as well), both these issues are 
> resolved.  I tracked it down even more, and making the old planner.c use 
> newer libs gives it the same problems.  The single problematic lib is 
> libamserver-3.5.1.so... Changing just that lib back to 3.5 fixes the 
> issue, for whatever that's worth to anyone here.
> 
> I'm thinking a complete downgrade from 3.5.1 to 3.5 is called for...
> 

I haven't seen the memory issue either.

-- 
Orion Poplawski
Manager of NWRA Technical Systems          720-772-5637
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       or...@nwra.com
Boulder, CO 80301                 https://www.nwra.com/

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature

Reply via email to