Re: High load on BIND DNS and query timeouts after RPZ XFR retrieve

2019-05-20 Thread Mukund Sivaraman
On Sun, May 19, 2019 at 10:55:53PM +0200, Peter V wrote:
> Hi all,
> 
> I would like to get opinion on issue I was involved over weekend.
> Customer utilizes RPZ feed from spamhaus and worked pretty OK for some
> months after initial deployment.
> They reported issue with wrong performance of BIND DNS; 
> BIND version: 9.10.8-P1 

BIND 9.11 and below can't sometimes keep up with Spamhaus's feeds (their
rate of change) without significant tuning. RPZ in BIND 9.11
(non-subscription open source version) and below updates its summary
datastructures synchronously along with policy zone updates that causes
severe lock contention with the query path. With Spamhaus feeds, updates
can be almost continuous with no relief.

BIND 9.12+ mitigates this somewhat by refactoring the RPZ summary
datastructure update path so it doesn't happen synchronously with the
RPZ zone updates, albeit with some differences (esp. for the typical
Spamhaus feeds' users - changes from RPZ feeds are visible every 60s in
the default configuration). You may want to try BIND 9.12+ to see if it
helps your case.

(An alternative on BIND 9.10 is to try if forcing AXFR by using
"request-ixfr no;" helps. This uses different codepaths within named
that could reduce some lock contention - however, it would behave poorly
with Spamhaus's feeds which are quite large. At least the transfer rate
would have to be limited somehow, and I know that it hasn't helped for
some users.)

This is an elaborate topic more than just RPZ.

Mukund
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


High load on BIND DNS and query timeouts after RPZ XFR retrieve

2019-05-19 Thread Peter V
Hi all,

I would like to get opinion on issue I was involved over weekend.
Customer utilizes RPZ feed from spamhaus and worked pretty OK for some
months after initial deployment.
They reported issue with wrong performance of BIND DNS; 
BIND version: 9.10.8-P1 

I observed BIND CPU usage went from 5% to ~200% CPU non stop; noticing
slow processing of UDP with timeout for dig output;
netstat confirmed that udp4 queue is not processed on time.

ZERES STATE   C   TIMEWCPU COMMAND
1050 named8  200  4058M  3261M sigwai  0 114.7H 182.67%
named

After different looks and restart of BIND which temporary resolve issue
for few minutes, I notice journal error on BIND

_May 17 22:11:46 DNS named[60244]: zone dbl.rpz.spamhaus.org/IN: journal
file is out of date: removing journal file_

So I started to suspect some weird state with this RPZ

Look on last 3 transfer; 2 looks OK, last one is weird with 5 record and
slow processing - I see also change for (re)loading policy name which
was new for me.

_May 17 20:47:10 DNS named[1050]: zone dbl.rpz.spamhaus.org/IN: Transfer
started._
_ May 17 20:47:11 DNS named[1050]: transfer of 'dbl.rpz.spamhaus.org/IN'
from 199.168.90.51#53: connected using 10.192.176.53#28493_
_ May 17 20:47:11 DNS named[1050]: zone dbl.rpz.spamhaus.org/IN:
transferred serial 1558140361_
_ May 17 20:47:11 DNS named[1050]: transfer of 'dbl.rpz.spamhaus.org/IN'
from 199.168.90.51#53: Transfer status: success_
_ May 17 20:47:11 DNS named[1050]: transfer of 'dbl.rpz.spamhaus.org/IN'
from 199.168.90.51#53: Transfer completed: 1 messages, 154 records, 3234
bytes, 0.017 secs (190235 bytes/sec)_

_ May 17 20:51:38 DNS named[1050]: zone dbl.rpz.spamhaus.org/IN:
Transfer started._
_ May 17 20:51:38 DNS named[1050]: transfer of 'dbl.rpz.spamhaus.org/IN'
from 199.168.90.51#53: connected using 10.192.176.53#56279_
_ May 17 20:51:38 DNS named[1050]: zone dbl.rpz.spamhaus.org/IN:
transferred serial 1558140601_
_ May 17 20:51:38 DNS named[1050]: transfer of 'dbl.rpz.spamhaus.org/IN'
from 199.168.90.51#53: Transfer status: success_
_ May 17 20:51:38 DNS named[1050]: transfer of 'dbl.rpz.spamhaus.org/IN'
from 199.168.90.51#53: Transfer completed: 3 messages, 1244 records,
25623 bytes, 0.037 secs (692513 bytes/sec)_

_ May 17 20:55:51 DNS named[1050]: zone dbl.rpz.spamhaus.org/IN:
Transfer started._
_ May 17 20:55:51 DNS named[1050]: transfer of 'dbl.rpz.spamhaus.org/IN'
from 199.168.90.51#53: connected using 10.192.176.53#25939_
_ May 17 20:55:52 DNS named[1050]: (re)loading policy zone
'dbl.rpz.spamhaus.org' changed from 5940230 to 5 qname, 0 to 0 nsdname,
886 to 874 IP, 0 to 0 NSIP, 0 to 0 CLIENTIP entries_
_ May 17 20:55:53 DNS named[1050]: zone dbl.rpz.spamhaus.org/IN:
transferred serial 1558140721_
_ May 17 20:55:53 DNS named[1050]: transfer of 'dbl.rpz.spamhaus.org/IN'
from 199.168.90.51#53: Transfer status: success_
_ May 17 20:55:53 DNS named[1050]: transfer of 'dbl.rpz.spamhaus.org/IN'
from 199.168.90.51#53: Transfer completed: 1 messages, 5 records, 310
bytes, 2.251 secs (137 bytes/sec)_

 

Above would be just interesting event on one DNS server, but I got
another DNS report issue from completely different customer same day
later. 
Common: 

* same RPZ feed configured
* issue happened after transferring same serial - _1558140721_

__On other side, it was just one server from 2 for each customer, so I
cannot say it impacted directly each server used by this feed. 

=== 

In both cases, we deconfigure RPZ feed which resolved issue immediately
without any restart, we did just rndc reload after removing RPZ
statements for this feed. 

Due production state, was not easy to take more valuable data and that's
first part I would like to ask audience for guideline. 

Best Regards,
Peter___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users