On 24/12/2021 09:24, Hartmut Birr wrote:
> It looks like
>
> Commit: 1ce1c6beae9f683bec54cba4c0d375f85b209b95
> Caching cleanup. Use cached NXDOMAIN to answer queries of any type.
>
> introduces the error.
>
> The commits before it are fine:
>
> Commit: 51d56df7a3a125e117b3278cab16281c85500287
> Add RFC 4833 DHCP options "posix-timezone" and "tzdb-timezone".
>
> Commit: cac9ca38f62437c65464f58fc54342c7f294c40b
> Treat ANY queries the same as CNAME queries WRT to DNSSEC on CNAME targets.
>
> Regards,
> Hartmut
>
Nice work finding that.
My hypothesis goes like this.
1) The "internal error" is triggered during cache insertion when the
cache is full and a record has to be deleted. cache_scan_free() gets
called with the contents of the least recently used record in the cache,
and it deletes all records that match it (for example, all A records
for that name).
2) Since there's at least one record which should have been deleted by
this (the least recently used record that started the process) then
after this process there should be at least one free cache record and
the insertion can be retried and should succeed. If nothing gets deleted
by cache_scan_free then there will again be no free records, and rather
than going into an infinite loop, the internal error gets logged and
insertion is abandoned.
3) The commit you found changes the way NXDOMAIN records are stored.
These used to be stored with a type: if a query for an A record returned
NXDOMAIN then a cache record would be stored with F_NXDOMAIN and F_IPV4
set in the flags. This is a historical anachronism. If the domain
doesn't exist, it doesn't exist for any query type. The code therefore
now stores a cache entry with only F_NXDOMAIN set, and that's good to
answer a query of any type.
4) The problem is that cache_scan_free() fails to delete a cache record
with only F_NXDOMAIN set, so if such a record falls to the end of the LRU
list and then needs to be deleted, the deletion will fail and the
internal error is triggered.
Given the above, I found a way to reproduce the bug: start dnsmasq with
a small cache, then make more queries which have NXDOMAIN answers than
the size of the cache. The cache_size+1'th query triggers the bug.
The fix is tiny, and fixes the problem for me, at least for my method of
reproduction.
Please see
https://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=ea33a0130366d316f01be4c891e4f5b247f97171
Reassurance that the bug is fixed for you too would be appreciated.
Cheers, and Happy Christmas.
Simon.
> ___
> Dnsmasq-discuss mailing list
> Dnsmasq-discuss@lists.thekelleys.org.uk
> https://lists.thekelleys.org.uk/cgi-bin/mailman/listinfo/dnsmasq-discuss
>