Re: f33: systemd-resolved hang on ip query
On Mon, Dec 14, 2020 at 11:01:36AM +, Dridi Boukelmoune wrote: > > It looks like resolved tries to resolve the name on two scopes (global and > > one specific to some interface). This will happen if the name lookup > > priority > > is the same for both the two scopes. > > Interesting, I'll search the docs. I don't recall seeing anything > about priority and that's definitely something I would want to tweak. > > > Maybe you're hitting https://github.com/systemd/systemd/issues/17040? > > One of patches being prepared is > > https://github.com/systemd/systemd/pull/17535/commits/1e5eb07b34bf3ee5420ed6e290ad524f8e26eebf. > > I'll subscribe to this issue, it definitely looks like the main > problem I'm running into. > > > There'll be quite a number of patches for resolved in the upcoming > > systemd-248 > > release. It'd probably make sense to wait and test if the issue is still > > reproducible with 248-rc1. > > I had found https://github.com/systemd/systemd/pull/17535 but didn't > find anything conclusive regarding my case. Maybe I should wait until > the RC1 is available to reassess the situation? Would this RC1 land in > updates-testing for f33? Probably not in F33. I think we may backport some/many of those patches, but it's too early to say. Zbyszek ___ devel mailing list -- devel@lists.fedoraproject.org To unsubscribe send an email to devel-le...@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Re: f33: systemd-resolved hang on ip query
> It looks like resolved tries to resolve the name on two scopes (global and
> one specific to some interface). This will happen if the name lookup priority
> is the same for both the two scopes.

Interesting, I'll search the docs. I don't recall seeing anything
about priority and that's definitely something I would want to tweak.

> Maybe you're hitting https://github.com/systemd/systemd/issues/17040?
> One of patches being prepared is
> https://github.com/systemd/systemd/pull/17535/commits/1e5eb07b34bf3ee5420ed6e290ad524f8e26eebf.

I'll subscribe to this issue, it definitely looks like the main
problem I'm running into.

> There'll be quite a number of patches for resolved in the upcoming systemd-248
> release. It'd probably make sense to wait and test if the issue is still
> reproducible with 248-rc1.

I had found https://github.com/systemd/systemd/pull/17535 but didn't
find anything conclusive regarding my case. Maybe I should wait until
the RC1 is available to reassess the situation? Would this RC1 land in
updates-testing for f33?

Thanks a lot
Re: f33: systemd-resolved hang on ip query
On Sun, Dec 13, 2020 at 03:58:06PM +, Dridi Boukelmoune wrote: > > That's totally on me, I read the resolved.conf(5) manual but didn't > > pay attention to the systemd-resolved.service(8) manual in the SEE > > ALSO section: that would have led me to resolvectl. Thanks for the > > pointer, I'll do some reading, probably figure what I mis-configured > > and otherwise file a BZ. > > I think it's a bit of both mis-configuration and (potential) bugs (plural). > > I have my global configuration: > > DNS=address1 > FallbackDNS=address2 address3 > > And with `resolvectl status` I can see my network's configuration: > > DNS=192.168.something > FallbackDNS=address4 > > With debug logs enabled I can see 4 attempts at resolving com.example: > > - A via address1 > - via address1 > - A via 192.168.something > - via 192.168.something It looks like resolved tries to resolve the name on two scopes (global and one specific to some interface). This will happen if the name lookup priority is the same for both the two scopes. > First (maybe) bug? all 4 attempts advertise cache misses. > > > Cache miss for net.example IN A > > Cache miss for net.example IN > > Cache miss for net.example IN A > > Cache miss for net.example IN > > Working on a cache myself, my intuition would tell me to check the cache once > and dispatch queries on a miss, not the other way around. Your mileage may > vary, I will not argue that point. The lookup process for the two scopes and the protocols is independent, so the cache is checked for each of the four possibilities. (Or if you will, the two per-scope caches are checked for each of the two protocols.) > Then following the transactions by number I quickly see this: > > > Processing incoming packet on transaction 42493 (rcode=NXDOMAIN). > > Added NXDOMAIN cache entry for net.example IN ANY 1388s > > Transaction 42493 for on scope dns on */* now > > complete with from network (unsigned). > > Processing incoming packet on transaction 49810 (rcode=NXDOMAIN). 
> > Added NXDOMAIN cache entry for net.example IN ANY 1388s > > Transaction 49810 for on scope dns on */* now complete > > with from network (unsigned). > > The answers came quickly from address1, and were allegedly cached. > > Second bug? subsequent queries will still be cache misses. > > Meanwhile, the two queries forwarded to 192.168.something are failing in a > loop. That is expected since this host is currently down. And since this is > UDP and there are retries, that's where it takes forever and turns an innocent > query into a DoS for blocking clients like getaddrinfo(). Maybe you're hitting https://github.com/systemd/systemd/issues/17040? One of patches being prepared is https://github.com/systemd/systemd/pull/17535/commits/1e5eb07b34bf3ee5420ed6e290ad524f8e26eebf. > Very soon, it tries to fall back to address4: > > > Switching to DNS server address4 for interface wlp2s0. > > Cache miss for net.example IN A > > Transaction 44836 for scope dns on wlp2s0/*. > > Using feature level TLS+EDNS0 for transaction 44836. > > Using DNS server address4 for transaction 44836. > > Sending query via TCP since UDP isn't supported. > > Using feature level TLS+EDNS0 for transaction 44836. > > Third bug? systemd-resolved seems to have wrongfully recorded that UDP didn't > work for address4, where prior to that it failed for 192.168.something and > address4 was attempted for the very first time at this point. > > It looks like during startup the primary DNS was probed and that led to the > following logs in a loop: > > > Using degraded feature set UDP instead of UDP+EDNS0 for DNS server > > 192.168.something. > > Using degraded feature set TCP instead of UDP for DNS server > > 192.168.something. > > Using degraded feature set UDP instead of TCP for DNS server > > 192.168.something. > > Using degraded feature set TCP instead of UDP for DNS server > > 192.168.something. > > Using degraded feature set UDP instead of TCP for DNS server > > 192.168.something. 
> > Using degraded feature set TCP instead of UDP for DNS server > > 192.168.something. > > Using degraded feature set UDP instead of TCP for DNS server > > 192.168.something. > > It might have resulted in a coin flip between UDP and TCP for the whole link > instead of the primary DNS since neither worked. > > The fallback DNS for my network will also fail numerous times since only TCP > is attempted and only UDP is supported, but it fails faster thanks to TCP > being TCP. > > Eventually the lookup fails: > > > Transaction 44836 for on scope dns on wlp2s0/* now > > complete with from none (unsigned). > > It's hard to tell which logs belong to what, because there doesn't seem to be > a parent transaction of the 4 started at the begining. It's difficult without > a correlation id to tell when exactly it finishes, but once both 44836 and its > counterpart are freed, I see this: > > > Sent message type=error sender=n/a destination=:1.71134
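The four cache misses discussed in this message follow from two scopes multiplied by two record types, each checked independently. A minimal sketch of that fan-out, assuming a plain dictionary per scope: the scope names, the `AAAA` record type (the second type is elided in the quoted logs) and the `lookup` helper are all hypothetical and do not reflect resolved's real data structures.

```python
from itertools import product

# Hypothetical model: one cache per scope, keyed by (name, record type).
scopes = {"global": {}, "wlp2s0": {}}
record_types = ["A", "AAAA"]  # "AAAA" is an assumption; the logs elide it


def lookup(name):
    """Check every (scope, type) pair independently and return the misses."""
    misses = []
    for scope, rtype in product(scopes, record_types):
        if (name, rtype) not in scopes[scope]:
            misses.append((scope, rtype))
    return misses


for scope, rtype in lookup("net.example"):
    print(f"miss: scope={scope} type={rtype}")
```

With empty caches every pair misses, which is consistent with seeing one miss per scope and protocol rather than a single miss per query.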
Re: f33: systemd-resolved hang on ip query
> The answers came quickly from address1, and were allegedly cached.
>
> Second bug? subsequent queries will still be cache misses.

Before anyone asks, I have Cache=yes in resolved.conf, I was at least
up to speed with the contents of the resolved.conf(5) manual.

Dridi
Re: f33: systemd-resolved hang on ip query
> That's totally on me, I read the resolved.conf(5) manual but didn't > pay attention to the systemd-resolved.service(8) manual in the SEE > ALSO section: that would have led me to resolvectl. Thanks for the > pointer, I'll do some reading, probably figure what I mis-configured > and otherwise file a BZ. I think it's a bit of both mis-configuration and (potential) bugs (plural). I have my global configuration: DNS=address1 FallbackDNS=address2 address3 And with `resolvectl status` I can see my network's configuration: DNS=192.168.something FallbackDNS=address4 With debug logs enabled I can see 4 attempts at resolving com.example: - A via address1 - via address1 - A via 192.168.something - via 192.168.something First (maybe) bug? all 4 attempts advertise cache misses. > Cache miss for net.example IN A > Cache miss for net.example IN > Cache miss for net.example IN A > Cache miss for net.example IN Working on a cache myself, my intuition would tell me to check the cache once and dispatch queries on a miss, not the other way around. Your mileage may vary, I will not argue that point. Then following the transactions by number I quickly see this: > Processing incoming packet on transaction 42493 (rcode=NXDOMAIN). > Added NXDOMAIN cache entry for net.example IN ANY 1388s > Transaction 42493 for on scope dns on */* now complete > with from network (unsigned). > Processing incoming packet on transaction 49810 (rcode=NXDOMAIN). > Added NXDOMAIN cache entry for net.example IN ANY 1388s > Transaction 49810 for on scope dns on */* now complete > with from network (unsigned). The answers came quickly from address1, and were allegedly cached. Second bug? subsequent queries will still be cache misses. Meanwhile, the two queries forwarded to 192.168.something are failing in a loop. That is expected since this host is currently down. 
And since this is UDP and there are retries, that's where it takes forever and turns an innocent query into a DoS for blocking clients like getaddrinfo(). Very soon, it tries to fall back to address4: > Switching to DNS server address4 for interface wlp2s0. > Cache miss for net.example IN A > Transaction 44836 for scope dns on wlp2s0/*. > Using feature level TLS+EDNS0 for transaction 44836. > Using DNS server address4 for transaction 44836. > Sending query via TCP since UDP isn't supported. > Using feature level TLS+EDNS0 for transaction 44836. Third bug? systemd-resolved seems to have wrongfully recorded that UDP didn't work for address4, where prior to that it failed for 192.168.something and address4 was attempted for the very first time at this point. It looks like during startup the primary DNS was probed and that led to the following logs in a loop: > Using degraded feature set UDP instead of UDP+EDNS0 for DNS server > 192.168.something. > Using degraded feature set TCP instead of UDP for DNS server > 192.168.something. > Using degraded feature set UDP instead of TCP for DNS server > 192.168.something. > Using degraded feature set TCP instead of UDP for DNS server > 192.168.something. > Using degraded feature set UDP instead of TCP for DNS server > 192.168.something. > Using degraded feature set TCP instead of UDP for DNS server > 192.168.something. > Using degraded feature set UDP instead of TCP for DNS server > 192.168.something. It might have resulted in a coin flip between UDP and TCP for the whole link instead of the primary DNS since neither worked. The fallback DNS for my network will also fail numerous times since only TCP is attempted and only UDP is supported, but it fails faster thanks to TCP being TCP. Eventually the lookup fails: > Transaction 44836 for on scope dns on wlp2s0/* now > complete with from none (unsigned). 
It's hard to tell which logs belong to what, because there doesn't seem to be a parent transaction of the 4 started at the beginning. It's difficult without a correlation id to tell when exactly it finishes, but once both 44836 and its counterpart are freed, I see this:

> Sent message type=error sender=n/a destination=:1.71134 path=n/a
> interface=n/a member=n/a cookie=100466 reply_cookie=2 signature=s
> error-name=org.freedesktop.DBus.Error.Timeout error-message=All attempts to
> contact name servers or networks failed
> Sent message type=method_call sender=n/a destination=org.freedesktop.DBus
> path=/org/freedesktop/DBus interface=org.freedesktop.DBus member=RemoveMatch
> cookie=100467 reply_cookie=0 signature=s error-name=n/a error-message=n/a

I can only assume that this is the final error of my attempt to resolve a non-existent domain.

I think that this error triggers the second bug: since the resolution itself failed because of one missing DNS server, nothing was actually inserted in the cache. The cache entries were created, but not added.

I tried to have another resolution for the same domain while the first one was busy waiting for UDP packets and it also blocked. This at least
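Since the debug output carries no parent correlation id, one way to follow a lookup is to group log lines by the transaction number they mention. A minimal sketch, assuming the lines look like the ones quoted in this thread: the regular expression and the `group_by_transaction` helper are hypothetical, and the sample lines are copied from the logs above.

```python
import re
from collections import defaultdict

# Matches "transaction 42493" or "Transaction 42493" anywhere in a line.
TRANSACTION_RE = re.compile(r"[Tt]ransaction (\d+)")


def group_by_transaction(lines):
    """Map each transaction id to the log lines that mention it."""
    groups = defaultdict(list)
    for line in lines:
        match = TRANSACTION_RE.search(line)
        if match:
            groups[int(match.group(1))].append(line)
    return dict(groups)


sample = [
    "Processing incoming packet on transaction 42493 (rcode=NXDOMAIN).",
    "Processing incoming packet on transaction 49810 (rcode=NXDOMAIN).",
    "Using feature level TLS+EDNS0 for transaction 44836.",
    "Using DNS server address4 for transaction 44836.",
]

for txid, entries in sorted(group_by_transaction(sample).items()):
    print(txid, len(entries))
```

Lines that do not name a transaction (cache misses, feature-set changes) are dropped by this sketch, so it only narrows the search and does not replace a real correlation id.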
Re: f33: systemd-resolved hang on ip query
> An unfinished dbus call should only happen when the server fails
> internally and aborts the connection. If there's an error or timeout in
> the query resolution, it should still terminate the connection with
> some response.
>
> Please enable debug logs with 'resolvectl log-level debug' and do the
> reproducer and show the logs. I think it'd be better to do this in BZ
> though, it seems too complex for the mailing list.
> Please also include rpm versions and 'resolvectl status' output.

$ resolvectl query example.com
example.com: 93.184.216.34
             2606:2800:220:1:248:1893:25c8:1946

-- Information acquired via protocol DNS in 1.6ms.
-- Data is authenticated: no

$ resolvectl query com.example
com.example: resolve call failed: All attempts to contact name servers or networks failed

That's totally on me, I read the resolved.conf(5) manual but didn't
pay attention to the systemd-resolved.service(8) manual in the SEE
ALSO section: that would have led me to resolvectl. Thanks for the
pointer, I'll do some reading, probably figure what I mis-configured
and otherwise file a BZ.

Cheers
Re: f33: systemd-resolved hang on ip query
On Tue, Dec 08, 2020 at 05:01:52PM +, Dridi Boukelmoune wrote: > Greetings, > > I'm not sure whether I am doing something wrong so I'd rather get > someone's opinion before submitting a bug report. > > Since the upgrade to f33 I replaced my stubby setup with > systemd-resolved since it is now the default. I was OK with that > change since I didn't lose functionality compared to my previous > setup. But it is breaking getaddrinfo() and IP address resolution in > general, and that's an annoying regression. > > With varnish we use getaddrinfo() for both IP addresses and domain > names, optionally we may set the numeric flag but otherwise it used to > work out of the box. Now if I try to resolve an IP address without the > numeric flag it hangs, never receiving a response from > systemd-resolved: > > > #0 0x7f011ed8690e in ppoll () from /lib64/libc.so.6 > > #1 0x7f011c8604f6 in bus_poll.lto_priv () from > > /lib64/libnss_resolve.so.2 > > #2 0x7f011c860f86 in sd_bus_call () from /lib64/libnss_resolve.so.2 > > #3 0x7f011c85b249 in _nss_resolve_gethostbyname4_r () from > > /lib64/libnss_resolve.so.2 > > #4 0x7f011ed7a397 in gaih_inet.constprop () from /lib64/libc.so.6 > > #5 0x7f011ed7b269 in getaddrinfo () from /lib64/libc.so.6 An unfinished dbus call should only happen when the server fails internally and aborts the connection. If there's an error or timeout in the query resolution, it should still terminate the connection with some response. Please enable debug logs with 'resolvectl log-level debug' and do the reproducer and show the logs. I think it'd be better to do this in BZ though, it seems too complex for the mailing list. Please also include rpm versions and 'resolvectl status' output. 
Zbyszek
Re: f33: systemd-resolved hang on ip query
On 12/10/20 7:16 PM, Paul Wouters wrote:
> On Wed, 9 Dec 2020, Dridi Boukelmoune wrote:
>
> This again leads to a required architecture change. We really need to
> have a captive portal namespace, that handles all of this while the
> applications still consider the network is down. Once the captive portal
> has passed and our internet link is "clean", should this be bridged
> into the regular network namespace so applications see the network as
> "active". Any state of DNS/browser that was used inside the captive
> portal namespace is then destroyed (it is untrusted and unverifiable data)

That is how Android manages captive portals.

Whoever created this captive portals concept should be slapped each day for ever, but that's where the world has gone.

-- 
Roberto Ragusa    mail at robertoragusa.it
Re: f33: systemd-resolved hang on ip query
> Instead, we have gnome, NM, systemd-resolved, firefox et al. fighting
> over who and how to handle captive portal authentication.

What bothers me first and foremost is that I'm not connecting through a captive portal, and somehow I can't fully trust systemd-resolved to DoTheRightThing(tm).

On paper I would love to stick to systemd-resolved as my system-wide stub resolver, but now I'm considering going back to stubby, losing turnkey caching and DNSSEC among other interesting properties. I like the opinionated nature of systemd-resolved but the lack of NXDOMAIN makes several (mis)use cases unbearable, like the one described in this thread or even simple typos. There are test cases I can no longer run during $DAYJOB because of this specific opinion (although I still haven't ruled out a mistake on my end).

I generally agree that there seems to be a lack of cohesion in the network stack, but have nothing constructive to propose in that area. Between the aforementioned applications and the nsswitch configuration, we are in flexibility hell :) [1]

With Fedora 33 I'm trying to understand whether the regression on my system is a bug or a misconfiguration of systemd-resolved, and of course with the current worldwide situation I only have limited networks I can connect to to try different scenarios. None of them involve captive networks.

I'll keep searching sporadically until I run out of spare time, at which point I'll have to locally undo this change and go back to my old setup.

Dridi

[1] exaggerating on purpose
Re: f33: systemd-resolved hang on ip query
On Wed, 9 Dec 2020, Dridi Boukelmoune wrote:

> So it looks like my initial intuition that there could be a mitigation
> of sorts is starting to hold water. The problem now is that clients on
> my system using getaddrinfo in a way that was legit until now are now
> being DoS'd by systemd-resolved, waiting forever for a reply that is
> not coming.

This again leads to a required architecture change. We really need to have a captive portal namespace that handles all of this while the applications still consider the network to be down. Once the captive portal has passed and our internet link is "clean", this should be bridged into the regular network namespace so applications see the network as "active". Any state of DNS/browser that was used inside the captive portal namespace is then destroyed (it is untrusted and unverifiable data).

That is, only the captive portal handling code sees these bogus DNS messages, and no regular applications see them. This would also keep applications from throwing SSL certificate errors because they connect too quickly while the network is still in captive mode and your SSL cert is replaced with the portal SSL cert. Pidgin is specifically bad with this; firefox has built-in logic to prevent all its tabs from reloading in captive portal page clones.

Instead, we have gnome, NM, systemd-resolved, firefox et al. fighting over who and how to handle captive portal authentication.

Paul
Re: f33: systemd-resolved hang on ip query
> I wouldn't mind the mitigation, if only I could disable it. Does
> anyone know any better? I'm still suspecting I configured something
> wrong but at the same time systemd seems to have a history with
> NXDOMAIN handling.

I found several things, including this related to NXDOMAIN:

https://github.com/systemd/systemd/pull/17535/commits/4f9bcde3c3acadffc298a53fb60f7caf9f7bee20

The plot thickens :(
Re: f33: systemd-resolved hang on ip query
On Tue, Dec 8, 2020 at 8:34 PM Marius Schwarz wrote: > > Am 08.12.20 um 19:32 schrieb Dridi Boukelmoune: > > > >> Petr was so nice to supply a test procedure, i suggest that you use it > >> also. > > I'll try to strace stuff to to see what's going on, but I can only > > assume that this BZ is not trying to resolve ip addresses through > > systemd-resolved. > > > > > > No, they didn't . An pretimed bind-libs update, caused apps not to be > able to resolve hostnames . they crashed. > All tools which did it themself, worked "in a way". they first tried > local resolving with /etc/hosts, thats where libc crashed, which took time, > and then used root dns to do theire jobs. > > It could have the same underlying issue: not matching sys libs. I > suggest to update them. Actually, it looks like this is happening for all NXDOMAIN replies. $ dig @1.1.1.1 com.example | grep -e SERVER -e HEADER ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 29880 ;; SERVER: 1.1.1.1#53(1.1.1.1) $ dig +timeout=1 com.example ; <<>> DiG 9.11.25-RedHat-9.11.25-2.fc33 <<>> +timeout=1 com.example ;; global options: +cmd ;; connection timed out; no servers could be reached A quick search for systemd-resolved nxdomain yields many results with a syslog I do not see on my system: > Server returned error NXDOMAIN, mitigating potential DNS violation > DVE-2018-0001 So it looks like my initial intuition that there could be a mitigation of sorts is starting to hold water. The problem now is that clients on my system using getaddrinfo in a way that was legit until now are now being DoS'd by systemd-resolved, waiting forever for a reply that is not coming. I wouldn't mind the mitigation, if only I could disable it. Does anyone know any better? I'm still suspecting I configured something wrong but at the same time systemd seems to have a history with NXDOMAIN handling. 
Thanks, Dridi
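The dig comparison in this message (NXDOMAIN from 1.1.1.1, timeout from the local stub) can also be scripted. A minimal sketch, assuming a hand-built UDP query is enough to tell the two outcomes apart: `build_query`, `parse_rcode` and `probe` are hypothetical helper names, and the packet layout follows the standard DNS header (12 bytes, response code in the low four bits of the flags word).

```python
import socket
import struct

RCODE_NAMES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (qtype 1 = A, class IN)."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def parse_rcode(response):
    """Return the response code stored in the low 4 bits of the flags."""
    (flags,) = struct.unpack("!H", response[2:4])
    return flags & 0x000F


def probe(name, server, timeout=1.0):
    """Send one UDP query; return the rcode name, or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        try:
            response, _ = sock.recvfrom(512)
        except socket.timeout:
            return None
    rcode = parse_rcode(response)
    return RCODE_NAMES.get(rcode, str(rcode))
```

Calling `probe("com.example", "1.1.1.1")` and `probe("com.example", "127.0.0.53")` side by side would be expected to mirror the dig runs above, with `None` standing in for "connection timed out", though that depends on the local setup.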
Re: f33: systemd-resolved hang on ip query
On 08.12.20 at 19:32, Dridi Boukelmoune wrote:
>> Petr was so nice to supply a test procedure, i suggest that you use it
>> also.
> I'll try to strace stuff to see what's going on, but I can only
> assume that this BZ is not trying to resolve ip addresses through
> systemd-resolved.

No, they didn't. A badly timed bind-libs update caused apps to be unable to resolve hostnames; they crashed. All tools that did the resolving themselves worked "in a way": they first tried local resolving with /etc/hosts, that's where libc crashed, which took time, and then used root DNS to do their jobs.

It could have the same underlying issue: mismatched system libs. I suggest updating them.

best regards,
Marius
Re: f33: systemd-resolved hang on ip query
> Can you pls get another stackframe and compare it ( it won't match 100% > as differen apps go different way) with this bugreport: It won't matter much, the next frame is in Varnish code. Here is the pstack dump of dig: > Thread 4 (Thread 0x7fee8a3e4640 (LWP 1768516) "isc-socket"): > #0 0x7fee8bf72c4e in epoll_wait () from /lib64/libc.so.6 > #1 0x7fee8c0bb4cc in watcher () from /lib64/libisc.so.1107 > #2 0x7fee8b9ae3f9 in start_thread () from /lib64/libpthread.so.0 > #3 0x7fee8bf72903 in clone () from /lib64/libc.so.6 > Thread 3 (Thread 0x7fee8abe5640 (LWP 1768515) "isc-timer"): > #0 0x7fee8b9b49e8 in pthread_cond_timedwait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7fee8c0b3908 in isc_condition_waituntil () from > /lib64/libisc.so.1107 > #2 0x7fee8c0a525f in run.lto_priv () from /lib64/libisc.so.1107 > #3 0x7fee8b9ae3f9 in start_thread () from /lib64/libpthread.so.0 > #4 0x7fee8bf72903 in clone () from /lib64/libc.so.6 > Thread 2 (Thread 0x7fee8b3e6640 (LWP 1768514) "isc-worker"): > #0 0x7fee8b9b46c2 in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x7fee8c0a22ea in run.lto_priv () from /lib64/libisc.so.1107 > #2 0x7fee8b9ae3f9 in start_thread () from /lib64/libpthread.so.0 > #3 0x7fee8bf72903 in clone () from /lib64/libc.so.6 > Thread 1 (Thread 0x7fee8b428c80 (LWP 1768513) "dig"): > #0 0x7fee8beaed8a in sigsuspend () from /lib64/libc.so.6 > #1 0x7fee8c0a62eb in isc.app_ctxrun () from /lib64/libisc.so.1107 > #2 0x7fee8c0a6f1f in isc_app_run () from /lib64/libisc.so.1107 > #3 0x55dc25b87127 in main () > https://bugzilla.redhat.com/show_bug.cgi?id=1904415 > > I see similarities there. I case of the BR, bind-libs and glic releases > did not match as it looks ( a thesis so far, no hard facts ). This looks like a different problem. > Petr was so nice to supply a test procedure, i suggest that you use it also. 
I'll try to strace stuff to see what's going on, but I can only assume that this BZ is not trying to resolve ip addresses through systemd-resolved.

Thanks for the pointers,
Dridi
Re: f33: systemd-resolved hang on ip query
On 08.12.20 at 18:01, Dridi Boukelmoune wrote:
> Since the upgrade to f33 I replaced my stubby setup with
> systemd-resolved since it is now the default. I was OK with that
> change since I didn't lose functionality compared to my previous
> setup. But it is breaking getaddrinfo() and IP address resolution in
> general, and that's an annoying regression.
>
> With varnish we use getaddrinfo() for both IP addresses and domain
> names, optionally we may set the numeric flag but otherwise it used to
> work out of the box. Now if I try to resolve an IP address without the
> numeric flag it hangs, never receiving a response from
> systemd-resolved:
>
> #0 0x7f011ed8690e in ppoll () from /lib64/libc.so.6
> #1 0x7f011c8604f6 in bus_poll.lto_priv () from /lib64/libnss_resolve.so.2
> #2 0x7f011c860f86 in sd_bus_call () from /lib64/libnss_resolve.so.2
> #3 0x7f011c85b249 in _nss_resolve_gethostbyname4_r () from /lib64/libnss_resolve.so.2
> #4 0x7f011ed7a397 in gaih_inet.constprop () from /lib64/libc.so.6
> #5 0x7f011ed7b269 in getaddrinfo () from /lib64/libc.so.6

Can you please get another stack trace and compare it (it won't match 100%, as different apps go different ways) with this bug report:

https://bugzilla.redhat.com/show_bug.cgi?id=1904415

I see similarities there. In the case of the BR, the bind-libs and glibc releases did not match, as it looks (a thesis so far, no hard facts).

Petr was so nice to supply a test procedure, I suggest that you use it also.

Best regards,
Marius Schwarz
f33: systemd-resolved hang on ip query
Greetings, I'm not sure whether I am doing something wrong so I'd rather get someone's opinion before submitting a bug report. Since the upgrade to f33 I replaced my stubby setup with systemd-resolved since it is now the default. I was OK with that change since I didn't lose functionality compared to my previous setup. But it is breaking getaddrinfo() and IP address resolution in general, and that's an annoying regression. With varnish we use getaddrinfo() for both IP addresses and domain names, optionally we may set the numeric flag but otherwise it used to work out of the box. Now if I try to resolve an IP address without the numeric flag it hangs, never receiving a response from systemd-resolved: > #0 0x7f011ed8690e in ppoll () from /lib64/libc.so.6 > #1 0x7f011c8604f6 in bus_poll.lto_priv () from /lib64/libnss_resolve.so.2 > #2 0x7f011c860f86 in sd_bus_call () from /lib64/libnss_resolve.so.2 > #3 0x7f011c85b249 in _nss_resolve_gethostbyname4_r () from > /lib64/libnss_resolve.so.2 > #4 0x7f011ed7a397 in gaih_inet.constprop () from /lib64/libc.so.6 > #5 0x7f011ed7b269 in getaddrinfo () from /lib64/libc.so.6 I checked with dig(1) and got the same behavior, so it happens regardless of the method, be it via the DBUS/libnss_resolve route or straight UDP: $ dig getfedora.org | grep -e HEADER -e SERVER ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6462 ;; SERVER: 127.0.0.53#53(127.0.0.53) $ dig +timeout=1 1.1.1.1 ; <<>> DiG 9.11.24-RedHat-9.11.24-2.fc33 <<>> +timeout=1 1.1.1.1 ;; global options: +cmd ;; connection timed out; no servers could be reached $ dig +timeout=1 @1.1.1.1 1.1.1.1 | grep -e HEADER -e SERVER ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51616 ;; SERVER: 1.1.1.1#53(1.1.1.1) $ dig +timeout=1 @8.8.8.8 1.1.1.1 | grep -e HEADER -e SERVER ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 40077 ;; SERVER: 8.8.8.8#53(8.8.8.8) I'm not getting an answer from systemd-resolved when I try to query an IP address, despite recursive resolvers 
replying with NXDOMAIN. This is the case for my network's resolver, not just the 1.1.1.1 and 8.8.8.8 examples I gave above.

The resolved.conf(5) manual is rather short, and I'm not seeing anything obvious that could explain this behavior. At best, I could assume a DoS mitigation, refusing to resolve blatantly invalid domains, but that's breaking the automatic getaddrinfo() fallback to resolving the numeric IP. In particular, when my recursive resolver doesn't make a big deal about it, I'd rather get a timely NXDOMAIN.

Any ideas?

Thanks,
Dridi
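For readers hitting the same hang, one possible workaround is to try a numeric-only lookup first and only fall back to name resolution when the host is not an address literal. A minimal sketch in Python, whose `socket.getaddrinfo` wraps the libc call: the `resolve` helper is hypothetical and is not what Varnish does, and `AI_NUMERICHOST` is assumed to be the "numeric flag" mentioned above.

```python
import socket


def resolve(host, port):
    """Try a numeric-only lookup first, then fall back to name resolution."""
    try:
        # AI_NUMERICHOST disables name resolution: the call succeeds only
        # when host is already an IPv4 or IPv6 literal.
        return socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM, flags=socket.AI_NUMERICHOST
        )
    except socket.gaierror:
        # Not a literal address, so this call may block on the resolver.
        return socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)


for family, _, _, _, sockaddr in resolve("1.1.1.1", 80):
    print(family.name, sockaddr)
```

This avoids sending an IP literal to the stub resolver at all, at the cost of one extra call for real host names; it does not address the missing NXDOMAIN for names such as com.example.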