That looks sensible except for one thing. I wasn't sure about the logging in the first place, and having to add Yet Another Config Option to control it is the last straw; I think it's better just removed.
Will apply the others and test tomorrow. Cheers, Simon. On 25/02/2021 18:48, Petr Menšík wrote: > I have found important issue in first patch, sending a new set. > This time, it should really fix all issues I reproduced using my test. > I dropped NETLINK_NO_ENOBUFS setting. It only prevents kernel sending > userspace notice some messages were dropped. Which I think is important > information. Instead I made restarting interface reading, including > resetting not found interfaces. If ENOBUFS arrives, it would force > re-reading both IPv6 and IPv4 interfaces. Should reduce attempts to > listen on interfaces removed in the mean time. > > It reduces number of resynchronizations, because it always enqueue only > one event per all netlinks events. It reads events in loop inside > netlink_multicast() until EAGAIN is received. It should help draining > asynchronous messages much more quickly. > > I made just one big patch for core fix, patch 0001. I am leaving broken > down commits at github [1], if a better work steps are visible. I > rebased several times, it does not show my exactl workflow. > > Following patches are supposed to be improvements. > - Patch 0002 removes fnctl in netlink_multicast, reduces number of > kernel syscalls per message. > - Patch 0003 adds a new --log-listen option and removes debug logging by > default. It was me who added that and I still find it quite useful. But > I think logging can be quite slow, slowing significantly on many address > changes. Might be better with --log-async. User would have to use > --log-listen if interested in watching listeners changes. > - Patch 0004 - Obtain MTU only in case it would be used. Attempt to > reduce innecessary syscall inside iface_enumerate loop in some cases. > > 1. https://github.com/InfrastructureServices/dnsmasq/tree/netlink-tests > > On 2/25/21 1:19 AM, Simon Kelley wrote: >> On 18/02/2021 11:56, Petr Menšík wrote: >>> Hi Simon and others, >>> >>> I have started checking behaviour of dnsmasq on fast netlink changes, >>> reported originally on RHEL7 bug[1]. Found commit 1627d577[2] helps a >>> lot on RHEL 7, which is already in current version. But for some reason, >>> even latest dnsmasq 2.84 does not pass my test[3] on RHEL8 (bug #1927973 >>> [4]). >>> >>> The core of the test is repeating few times this: >>> one_retry() { >>> local MAX=254 >>> local DOWN_DELAY=0 >>> for X in `seq 1 $MAX`; do >>> ifconfig eth0:${X} 100.123.1.${X} netmask 255.255.255.0 >>> #ip address add 100.123.1.${X} dev eth0 label eth0:${X} >>> done && sleep $DOWN_DELAY && for X in `seq 1 $MAX`; do >>> ifconfig eth0:${X} down >>> #ip address del 100.123.1.${X} dev eth0 label eth0:${X} >>> done >>> } >>> >>> Then checking dnsmasq stops listening on them again. >>> >>> I found by analysing strace of test, it re-reads current addresses >>> innecessary often. It enqueues re-reading every time multicast netlink >>> arrives. That means a lot of CPU cycles, when it just sends newaddress() >>> event request every time. It seems to me, just once per multicast >>> messages row is enough. It needs to know just current state to run >>> newaddress() just last time. >>> >>> I have tested my patches with my test and it helps, works also on RHEL8. >>> I think it takes a lot less netlink reading without affecting accuracy. >>> Patches 1 and 2 are fixing things. Patches 3 and 4 are just attempt to >>> optimize speed of netlink handling a bit. Changed netlink_multicast to >>> read in a row, until EAGAIN appears. Should help to drain netlink socket >>> faster, avoiding ENOBUFFS. >> >> I need to stare at this for a bit longer, but it looks as if it's on the >> right lines. >> >> One thing to consider: the actions following a NEWROUTE event are to >> resend a DNS packet when the first DNS packet triggers a modem >> dial-on-demand. Modems doing dial-on-demand is pretty quaint, and >> something that could probably be removed without inconveniencing anyone >> in the 2020s. > > Depends whether it isn't used also in other circumstances, like > automatic restart after lost link on uplink interface. Number of route > messages is not small, maybe manual option for requesting route watching > should be used. I think it could be removed, unless it has hidden benefits. >>> >>> I think it should handle errors better in create_listeners. If tcp fails >>> to bind on address, just UDP received would listen. Problem is >>> create_bound_listeners does not know it is incomplete, it would be fixed >>> only by deletion of interface address and re-adding it again. But that >>> is related but different problem, I have no fix candidate yet. >> >> In newaddress, check for this error after calling >> create_bound_listeners() and loop back to enumerate_interfaces() until >> it gets a clean run? It that in danger of looping forever? > It could be repeated only few time, let's say 5 times. >> >> The nasty logic that inhibits interface enumeration more than once per >> select loop in enumerate_interfaces() would have to be finessed. > But I modified enumerate_interfaces() to restart always both ipv4 and > ipv6 on ENOBUFS. It improved a lot and at least my test haven't hit this > issue again. >> >>> >>> When it fails, lines similar to this appear in log: >>> failed to create listening socket for 100.123.1.250: Cannot assign >>> requested address >>> >>> What do you think? Have you ever got dnsmasq --bind-dynamic into state, >>> where it required restart to continue serving correct addresses? >> >> I'm not aware of that happening, but I suspect I don't do the sort of >> things which make it happen. > I admit it is rare and requires frequent change in addresses together > with limited computing power when reading netlink. >> >> >> >> Cheers, >> >> Simon. >> >>> >>> Also opinions welcome. >>> >>> Thanks, >>> Petr >>> >>> 1. https://bugzilla.redhat.com/show_bug.cgi?id=1887649 >>> 2. >>> http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=1627d577af03cdf747285e79fa747b6aaae8033f >>> 3. >>> https://github.com/InfrastructureServices/dnsmasq-tests/tree/master/Regression/netlink-interface-fast-changes >>> 4. https://bugzilla.redhat.com/show_bug.cgi?id=1927973 >>> _______________________________________________ Dnsmasq-discuss mailing list Dnsmasq-discuss@lists.thekelleys.org.uk https://lists.thekelleys.org.uk/cgi-bin/mailman/listinfo/dnsmasq-discuss