Re: rpz testing -> shut down hung fetch while resolving

2023-01-28 Thread Havard Eidnes via bind-users
>> I recently upgraded BIND to version 9.18.11 on our resolver
>> cluster, following the recent announcement.  Shortly thereafter
>> I received reports that the validation lookups of "known
>> entries" in our quite small RPZ feed (it's around 1MB on-disk)
>> no longer succeed as expected, but instead take a long time and
>> finally give SERVFAIL to the client.  Associated with this we
>> get this log message:
>>
>> Jan 26 18:41:27 xxx-res named[6179]: shut down hung fetch while resolving 'known-rpz-entry.no/A'
>
> This usually means there's a circular dependency somewhere in the
> resolution or validation process. For example, we can't resolve a name
> without looking up the address of a name server, but that lookup can't
> succeed until the original name is resolved. The two lookups will wait on
> each other for ten seconds, and then the whole query times out and issues
> that log message.
>
> The log message is new in 9.18, but the 10-second delay and SERVFAIL
> response would probably have happened in earlier releases as well.

This turned out to be related to the fact that we had configured
query forwarding from two of our nodes to two of the others with
the intention to build a larger central cache, and improve query
response time for the resolvers which did that forwarding.

Once I commented out the query forwarding, this problem no longer
occurred. Our forwarding config was of this form:

  forwarders {
      128.39.x.y;
      158.38.z.r;
  };
  // But if both are dead (unlikely), do resolution ourselves
  forward first;

This part is now commented out and I've done "rndc reconfig", and
the SERVFAIL responses to the "known rpz-blocked entries" no
longer occur.  But ... the two resolvers will now have to build a
cache of their own, and do not benefit from the cache built on
the two more "central" nodes.
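
For readers less familiar with "forward first", here is a rough
Python model of the behaviour we were relying on: try each
forwarder in turn, and fall back to doing the resolution locally
only if none of them gives an answer.  This is a sketch of the
documented semantics, not BIND's implementation; the function
names, the callables and the 192.0.2.x addresses are all
hypothetical stand-ins for illustration.

```python
from typing import Callable, Optional, Sequence


def resolve_forward_first(
    name: str,
    forwarders: Sequence[str],
    ask_forwarder: Callable[[str, str], Optional[str]],
    recurse: Callable[[str], str],
) -> str:
    """Model of 'forward first': forwarders first, own recursion as fallback."""
    for addr in forwarders:
        # ask_forwarder returns None when the forwarder gives no usable answer
        answer = ask_forwarder(addr, name)
        if answer is not None:
            return answer
    # Every forwarder failed, so resolve the name ourselves.
    return recurse(name)


def dead_forwarder(addr: str, name: str) -> Optional[str]:
    """Hypothetical forwarder that never answers."""
    return None


def local_recursion(name: str) -> str:
    """Hypothetical stand-in for full recursive resolution."""
    return f"{name} resolved locally"


print(resolve_forward_first(
    "example.test", ["192.0.2.1", "192.0.2.2"], dead_forwarder, local_recursion
))
# prints: example.test resolved locally
```

The model does not capture how long each step may take, which is
where the trouble with the hung fetch showed up for us.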

Regards,

- Håvard
-- 
Visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from 
this list

ISC funds the development of this software with paid support subscriptions. 
Contact us at https://www.isc.org/contact/ for more information.


bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: rpz testing -> shut down hung fetch while resolving

2023-01-26 Thread Evan Hunt
On Thu, Jan 26, 2023 at 07:03:37PM +0100, Havard Eidnes via bind-users wrote:
> Hi,
> 
> I recently upgraded BIND to version 9.18.11 on our resolver
> cluster, following the recent announcement.  Shortly thereafter
> I received reports that the validation lookups of "known
> entries" in our quite small RPZ feed (it's around 1MB on-disk)
> no longer succeed as expected, but instead take a long time and
> finally give SERVFAIL to the client.  Associated with this we
> get this log message:
> 
> Jan 26 18:41:27 xxx-res named[6179]: shut down hung fetch while resolving 'known-rpz-entry.no/A'

This usually means there's a circular dependency somewhere in the
resolution or validation process. For example, we can't resolve a name
without looking up the address of a name server, but that lookup can't
succeed until the original name is resolved. The two lookups will wait on
each other for ten seconds, and then the whole query times out and issues
that log message.

The log message is new in 9.18, but the 10-second delay and SERVFAIL
response would probably have happened in earlier releases as well.
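
To make the idea of a circular dependency concrete, here is a small
Python sketch that finds such a loop in a "lookup A waits on lookup B"
graph. It is only an illustration of the concept, not how named tracks
fetches internally; the function name, the graph layout and the
example.test names are all assumptions made up for this example.

```python
from typing import Dict, List, Optional


def find_cycle(waits_on: Dict[str, List[str]], start: str) -> Optional[List[str]]:
    """Return the chain of lookups that loops back on itself, or None."""
    path: List[str] = []   # lookups on the current dependency chain, in order
    on_path = set()        # same lookups, for fast membership tests
    done = set()           # lookups already fully explored without a loop

    def visit(node: str) -> Optional[List[str]]:
        if node in on_path:
            # We reached a lookup that is already waiting on us: a loop.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        on_path.add(node)
        path.append(node)
        for dep in waits_on.get(node, []):
            cycle = visit(dep)
            if cycle is not None:
                return cycle
        path.pop()
        on_path.discard(node)
        done.add(node)
        return None

    return visit(start)


# Hypothetical example: resolving the name needs the name server's
# address, and that address lookup needs the original name.
waits_on = {
    "www.example.test/A": ["ns1.example.test/A"],
    "ns1.example.test/A": ["www.example.test/A"],
}
cycle = find_cycle(waits_on, "www.example.test/A")
print(" -> ".join(cycle) if cycle else "no cycle")
# prints: www.example.test/A -> ns1.example.test/A -> www.example.test/A
```

In the real server nothing detects the loop up front; the two fetches
simply wait on each other until the timeout fires.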

-- 
Evan Hunt -- e...@isc.org
Internet Systems Consortium, Inc.


rpz testing -> shut down hung fetch while resolving

2023-01-26 Thread Havard Eidnes via bind-users
Hi,

I recently upgraded BIND to version 9.18.11 on our resolver
cluster, following the recent announcement.  Shortly thereafter
I received reports that the validation lookups of "known
entries" in our quite small RPZ feed (it's around 1MB on-disk)
no longer succeed as expected, but instead take a long time and
finally give SERVFAIL to the client.  Associated with this we
get this log message:

Jan 26 18:41:27 xxx-res named[6179]: shut down hung fetch while resolving 'known-rpz-entry.no/A'

Initially I thought that this was new behaviour between BIND
9.18.10 and 9.18.11, but after downgrading to 9.18.10 on one of
the affected nodes, this problem is still observable there.
Also, only a subset of our 4 nodes exhibit this behaviour,
despite the unaffected ones running 9.18.11, which is quite
strange.  None of the name servers are under severe strain by any
measure -- one affected sees around 200qps, another around 50qps
at the time of writing.

I want to ask if this sort of issue is already known (I briefly
searched the issues on ISC's gitlab and came up empty), and also
to ask if there is any particular sort of information I should
collect to narrow this down if it is a new issue.

Regards,

- Håvard