Re: f33: systemd-resolved hang on ip query

2020-12-14 Thread Zbigniew Jędrzejewski-Szmek
On Mon, Dec 14, 2020 at 11:01:36AM +, Dridi Boukelmoune wrote:
> > It looks like resolved tries to resolve the name on two scopes (global and
> > one specific to some interface). This will happen if the name lookup priority
> > is the same for both the two scopes.
> 
> Interesting, I'll search the docs. I don't recall seeing anything
> about priority and that's definitely something I would want to tweak.
> 
> > Maybe you're hitting https://github.com/systemd/systemd/issues/17040?
> > One of patches being prepared is
> > https://github.com/systemd/systemd/pull/17535/commits/1e5eb07b34bf3ee5420ed6e290ad524f8e26eebf.
> 
> I'll subscribe to this issue, it definitely looks like the main
> problem I'm running into.
> 
> > There'll be quite a number of patches for resolved in the upcoming systemd-248
> > release. It'd probably make sense to wait and test if the issue is still
> > reproducible with 248-rc1.
> 
> I had found https://github.com/systemd/systemd/pull/17535 but didn't
> find anything conclusive regarding my case. Maybe I should wait until
> the RC1 is available to reassess the situation? Would this RC1 land in
> updates-testing for f33?

Probably not in F33. I think we may backport some/many of those
patches, but it's too early to say.

Zbyszek
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org


Re: f33: systemd-resolved hang on ip query

2020-12-14 Thread Dridi Boukelmoune
> It looks like resolved tries to resolve the name on two scopes (global and
> one specific to some interface). This will happen if the name lookup priority
> is the same for both the two scopes.

Interesting, I'll search the docs. I don't recall seeing anything
about priority and that's definitely something I would want to tweak.

> Maybe you're hitting https://github.com/systemd/systemd/issues/17040?
> One of patches being prepared is
> https://github.com/systemd/systemd/pull/17535/commits/1e5eb07b34bf3ee5420ed6e290ad524f8e26eebf.

I'll subscribe to this issue, it definitely looks like the main
problem I'm running into.

> There'll be quite a number of patches for resolved in the upcoming systemd-248
> release. It'd probably make sense to wait and test if the issue is still
> reproducible with 248-rc1.

I had found https://github.com/systemd/systemd/pull/17535 but didn't
find anything conclusive regarding my case. Maybe I should wait until
the RC1 is available to reassess the situation? Would this RC1 land in
updates-testing for f33?

Thanks a lot


Re: f33: systemd-resolved hang on ip query

2020-12-14 Thread Zbigniew Jędrzejewski-Szmek
On Sun, Dec 13, 2020 at 03:58:06PM +, Dridi Boukelmoune wrote:
> > That's totally on me, I read the resolved.conf(5) manual but didn't
> > pay attention to the systemd-resolved.service(8) manual in the SEE
> > ALSO section: that would have led me to resolvectl. Thanks for the
> > pointer, I'll do some reading, probably figure what I mis-configured
> > and otherwise file a BZ.
> 
> I think it's a bit of both mis-configuration and (potential) bugs (plural).
> 
> I have my global configuration:
> 
> DNS=address1
> FallbackDNS=address2 address3
> 
> And with `resolvectl status` I can see my network's configuration:
> 
> DNS=192.168.something
> FallbackDNS=address4
> 
> With debug logs enabled I can see 4 attempts at resolving com.example:
> 
> - A via address1
> - AAAA via address1
> - A via 192.168.something
> - AAAA via 192.168.something

It looks like resolved tries to resolve the name on two scopes (global and
one specific to some interface). This will happen if the name lookup priority
is the same for both the two scopes.

> First (maybe) bug? all 4 attempts advertise cache misses.
> 
> > Cache miss for net.example IN A
> > Cache miss for net.example IN AAAA
> > Cache miss for net.example IN A
> > Cache miss for net.example IN AAAA
> 
> Working on a cache myself, my intuition would tell me to check the cache once
> and dispatch queries on a miss, not the other way around. Your mileage may
> vary, I will not argue that point.

The lookup process for the two scopes and the protocols is independent,
so the cache is checked for each of the four possibilities. (Or if you
will, the two per-scope caches are checked for each of the two protocols.)
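
To make the "four possibilities" concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not resolved's actual code: the `Scope` class, `lookup_all()` and the scope names are made up for illustration, and the record types A and AAAA are assumed.

```python
from itertools import product

class Scope:
    """Hypothetical stand-in for a lookup scope that owns its own cache."""
    def __init__(self, name):
        self.name = name
        self.cache = {}  # (domain, rrtype) -> cached answer

    def lookup(self, domain, rrtype):
        # Each scope only ever consults its own cache
        return "hit" if (domain, rrtype) in self.cache else "miss"

def lookup_all(scopes, domain, rrtypes=("A", "AAAA")):
    """Check every scope's cache for every record type, independently."""
    return [(s.name, t, s.lookup(domain, t)) for s, t in product(scopes, rrtypes)]

scopes = [Scope("global"), Scope("wlp2s0")]
results = lookup_all(scopes, "net.example")
```

With two scopes and two record types this yields four independent cache checks, which would match the four "Cache miss" lines in the log.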

> Then following the transactions by number I quickly see this:
> 
> > Processing incoming packet on transaction 42493 (rcode=NXDOMAIN).
> > Added NXDOMAIN cache entry for net.example IN ANY 1388s
> > Transaction 42493 for  on scope dns on */* now 
> > complete with  from network (unsigned).
> > Processing incoming packet on transaction 49810 (rcode=NXDOMAIN).
> > Added NXDOMAIN cache entry for net.example IN ANY 1388s
> > Transaction 49810 for  on scope dns on */* now complete 
> > with  from network (unsigned).
> 
> The answers came quickly from address1, and were allegedly cached.
> 
> Second bug? subsequent queries will still be cache misses.
> 
> Meanwhile, the two queries forwarded to 192.168.something are failing in a
> loop. That is expected since this host is currently down. And since this is
> UDP and there are retries, that's where it takes forever and turns an innocent
> query into a DoS for blocking clients like getaddrinfo().

Maybe you're hitting https://github.com/systemd/systemd/issues/17040?
One of patches being prepared is
https://github.com/systemd/systemd/pull/17535/commits/1e5eb07b34bf3ee5420ed6e290ad524f8e26eebf.

> Very soon, it tries to fall back to address4:
> 
> > Switching to DNS server address4 for interface wlp2s0.
> > Cache miss for net.example IN A
> > Transaction 44836 for  scope dns on wlp2s0/*.
> > Using feature level TLS+EDNS0 for transaction 44836.
> > Using DNS server address4 for transaction 44836.
> > Sending query via TCP since UDP isn't supported.
> > Using feature level TLS+EDNS0 for transaction 44836.
> 
> Third bug? systemd-resolved seems to have wrongfully recorded that UDP didn't
> work for address4, where prior to that it failed for 192.168.something and
> address4 was attempted for the very first time at this point.
> 
> It looks like during startup the primary DNS was probed and that led to the
> following logs in a loop:
> 
> > > Using degraded feature set UDP instead of UDP+EDNS0 for DNS server 192.168.something.
> > > Using degraded feature set TCP instead of UDP for DNS server 192.168.something.
> > > Using degraded feature set UDP instead of TCP for DNS server 192.168.something.
> > > Using degraded feature set TCP instead of UDP for DNS server 192.168.something.
> > > Using degraded feature set UDP instead of TCP for DNS server 192.168.something.
> > > Using degraded feature set TCP instead of UDP for DNS server 192.168.something.
> > > Using degraded feature set UDP instead of TCP for DNS server 192.168.something.
> 
> It might have resulted in a coin flip between UDP and TCP for the whole link
> instead of the primary DNS since neither worked.
> 
> The fallback DNS for my network will also fail numerous times since only TCP
> is attempted and only UDP is supported, but it fails faster thanks to TCP
> being TCP.
> 
> Eventually the lookup fails:
> 
> > Transaction 44836 for  on scope dns on wlp2s0/* now 
> > complete with  from none (unsigned).
> 
> It's hard to tell which logs belong to what, because there doesn't seem to be
> a parent transaction of the 4 started at the beginning. It's difficult without
> a correlation id to tell when exactly it finishes, but once both 44836 and its
> AAAA counterpart are freed, I see this:
> 
> > Sent message type=error sender=n/a destination=:1.71134 

Re: f33: systemd-resolved hang on ip query

2020-12-13 Thread Dridi Boukelmoune
> The answers came quickly from address1, and were allegedly cached.
>
> Second bug? subsequent queries will still be cache misses.

Before anyone asks, I have Cache=yes in resolved.conf, I was at least
up to speed with the contents of the resolved.conf(5) manual.

Dridi


Re: f33: systemd-resolved hang on ip query

2020-12-13 Thread Dridi Boukelmoune
> That's totally on me, I read the resolved.conf(5) manual but didn't
> pay attention to the systemd-resolved.service(8) manual in the SEE
> ALSO section: that would have led me to resolvectl. Thanks for the
> pointer, I'll do some reading, probably figure what I mis-configured
> and otherwise file a BZ.

I think it's a bit of both mis-configuration and (potential) bugs (plural).

I have my global configuration:

DNS=address1
FallbackDNS=address2 address3

And with `resolvectl status` I can see my network's configuration:

DNS=192.168.something
FallbackDNS=address4

With debug logs enabled I can see 4 attempts at resolving com.example:

- A via address1
- AAAA via address1
- A via 192.168.something
- AAAA via 192.168.something

First (maybe) bug? all 4 attempts advertise cache misses.

> Cache miss for net.example IN A
> Cache miss for net.example IN AAAA
> Cache miss for net.example IN A
> Cache miss for net.example IN AAAA

Working on a cache myself, my intuition would tell me to check the cache once
and dispatch queries on a miss, not the other way around. Your mileage may
vary, I will not argue that point.

Then following the transactions by number I quickly see this:

> Processing incoming packet on transaction 42493 (rcode=NXDOMAIN).
> Added NXDOMAIN cache entry for net.example IN ANY 1388s
> Transaction 42493 for  on scope dns on */* now complete 
> with  from network (unsigned).
> Processing incoming packet on transaction 49810 (rcode=NXDOMAIN).
> Added NXDOMAIN cache entry for net.example IN ANY 1388s
> Transaction 49810 for  on scope dns on */* now complete 
> with  from network (unsigned).

The answers came quickly from address1, and were allegedly cached.

Second bug? subsequent queries will still be cache misses.

Meanwhile, the two queries forwarded to 192.168.something are failing in a
loop. That is expected since this host is currently down. And since this is
UDP and there are retries, that's where it takes forever and turns an innocent
query into a DoS for blocking clients like getaddrinfo().
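
As a client-side stopgap, a blocking lookup can be given a deadline by running it in a worker thread. The following Python sketch assumes nothing about resolved; `resolve_with_deadline()` and its parameters are hypothetical names, and the `resolver` argument exists only so the helper can be exercised without a network.

```python
import socket
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Small shared pool; a stuck lookup occupies one worker until it returns
_pool = ThreadPoolExecutor(max_workers=4)

def resolve_with_deadline(host, port, deadline=2.0, resolver=socket.getaddrinfo):
    """Run a blocking resolver in a worker thread, give up after `deadline` seconds."""
    future = _pool.submit(resolver, host, port)
    try:
        return future.result(timeout=deadline)
    except FutureTimeout:
        # The worker keeps waiting in the background; only the caller is released
        return None
```

This does not fix anything on the resolver side, and stuck lookups still tie up workers, but the caller no longer waits for the full retry cycle.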

Very soon, it tries to fall back to address4:

> Switching to DNS server address4 for interface wlp2s0.
> Cache miss for net.example IN A
> Transaction 44836 for  scope dns on wlp2s0/*.
> Using feature level TLS+EDNS0 for transaction 44836.
> Using DNS server address4 for transaction 44836.
> Sending query via TCP since UDP isn't supported.
> Using feature level TLS+EDNS0 for transaction 44836.

Third bug? systemd-resolved seems to have wrongfully recorded that UDP didn't
work for address4, where prior to that it failed for 192.168.something and
address4 was attempted for the very first time at this point.

It looks like during startup the primary DNS was probed and that led to the
following logs in a loop:

> Using degraded feature set UDP instead of UDP+EDNS0 for DNS server 192.168.something.
> Using degraded feature set TCP instead of UDP for DNS server 192.168.something.
> Using degraded feature set UDP instead of TCP for DNS server 192.168.something.
> Using degraded feature set TCP instead of UDP for DNS server 192.168.something.
> Using degraded feature set UDP instead of TCP for DNS server 192.168.something.
> Using degraded feature set TCP instead of UDP for DNS server 192.168.something.
> Using degraded feature set UDP instead of TCP for DNS server 192.168.something.

It might have resulted in a coin flip between UDP and TCP for the whole link
instead of the primary DNS since neither worked.
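
To see how often the feature level flips, the debug lines can be tallied with a few lines of Python. This is a rough sketch that only relies on the message format quoted above; `count_transitions()` is a made-up helper, and it expects each log message on a single line (wrapped lines have to be joined first).

```python
import re
from collections import Counter

DEGRADE_RE = re.compile(
    r"Using degraded feature set (\S+) instead of (\S+) for DNS server (\S+?)\.?$"
)

def count_transitions(lines):
    """Count (server, old_level, new_level) transitions seen in debug log lines."""
    counts = Counter()
    for line in lines:
        match = DEGRADE_RE.search(line.strip())
        if match:
            new, old, server = match.groups()
            counts[(server, old, new)] += 1
    return counts
```

A server that keeps bouncing between two levels shows up as two keys with roughly equal counts.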

The fallback DNS for my network will also fail numerous times since only TCP
is attempted and only UDP is supported, but it fails faster thanks to TCP
being TCP.

Eventually the lookup fails:

> Transaction 44836 for  on scope dns on wlp2s0/* now 
> complete with  from none (unsigned).

It's hard to tell which logs belong to what, because there doesn't seem to be
a parent transaction of the 4 started at the beginning. It's difficult without
a correlation id to tell when exactly it finishes, but once both 44836 and its
AAAA counterpart are freed, I see this:

> Sent message type=error sender=n/a destination=:1.71134 path=n/a 
> interface=n/a member=n/a cookie=100466 reply_cookie=2 signature=s 
> error-name=org.freedesktop.DBus.Error.Timeout error-message=All attempts to 
> contact name servers or networks failed
> Sent message type=method_call sender=n/a destination=org.freedesktop.DBus 
> path=/org/freedesktop/DBus interface=org.freedesktop.DBus member=RemoveMatch 
> cookie=100467 reply_cookie=0 signature=s error-name=n/a error-message=n/a

I can only assume that this is the final error of my attempt to resolve a
nonexistent domain. I think that this error triggers the second bug: since
the resolution itself failed because of one missing DNS server, nothing was
actually inserted in the cache. The cache entries were created, but not added.

I tried to have another resolution for the same domain while the first one
was busy waiting for UDP packets and it also blocked. This at least 

Re: f33: systemd-resolved hang on ip query

2020-12-13 Thread Dridi Boukelmoune
> An unfinished dbus call should only happen when the server fails
> internally and aborts the connection. If there's an error or timeout in
> the query resolution, it should still terminate the connection with
> some response.
>
> Please enable debug logs with 'resolvectl log-level debug' and do the
> reproducer and show the logs. I think it'd be better to do this in BZ
> though, it seems too complex for the mailing list.
> Please also include rpm versions and 'resolvectl status' output.

$ resolvectl query example.com
example.com:
93.184.216.34
2606:2800:220:1:248:1893:25c8:1946
-- Information acquired via protocol DNS in 1.6ms.
-- Data is authenticated: no

$ resolvectl query com.example
com.example: resolve call failed: All attempts to contact name servers
or networks failed

That's totally on me, I read the resolved.conf(5) manual but didn't
pay attention to the systemd-resolved.service(8) manual in the SEE
ALSO section: that would have led me to resolvectl. Thanks for the
pointer, I'll do some reading, probably figure what I mis-configured
and otherwise file a BZ.

Cheers


Re: f33: systemd-resolved hang on ip query

2020-12-13 Thread Zbigniew Jędrzejewski-Szmek
On Tue, Dec 08, 2020 at 05:01:52PM +, Dridi Boukelmoune wrote:
> Greetings,
> 
> I'm not sure whether I am doing something wrong so I'd rather get
> someone's opinion before submitting a bug report.
> 
> Since the upgrade to f33 I replaced my stubby setup with
> systemd-resolved since it is now the default. I was OK with that
> change since I didn't lose functionality compared to my previous
> setup. But it is breaking getaddrinfo() and IP address resolution in
> general, and that's an annoying regression.
> 
> With varnish we use getaddrinfo() for both IP addresses and domain
> names, optionally we may set the numeric flag but otherwise it used to
> work out of the box. Now if I try to resolve an IP address without the
> numeric flag it hangs, never receiving a response from
> systemd-resolved:
> 
> > #0  0x7f011ed8690e in ppoll () from /lib64/libc.so.6
> > #1  0x7f011c8604f6 in bus_poll.lto_priv () from /lib64/libnss_resolve.so.2
> > #2  0x7f011c860f86 in sd_bus_call () from /lib64/libnss_resolve.so.2
> > #3  0x7f011c85b249 in _nss_resolve_gethostbyname4_r () from /lib64/libnss_resolve.so.2
> > #4  0x7f011ed7a397 in gaih_inet.constprop () from /lib64/libc.so.6
> > #5  0x7f011ed7b269 in getaddrinfo () from /lib64/libc.so.6

An unfinished dbus call should only happen when the server fails
internally and aborts the connection. If there's an error or timeout in
the query resolution, it should still terminate the connection with
some response.

Please enable debug logs with 'resolvectl log-level debug' and do the
reproducer and show the logs. I think it'd be better to do this in BZ
though, it seems too complex for the mailing list. 
Please also include rpm versions and 'resolvectl status' output.

Zbyszek


Re: f33: systemd-resolved hang on ip query

2020-12-11 Thread Roberto Ragusa

On 12/10/20 7:16 PM, Paul Wouters wrote:
> On Wed, 9 Dec 2020, Dridi Boukelmoune wrote:
>
> This again leads to a required architecture change. We really need to
> have a captive portal namespace, that handles all of this while the
> applications still consider the network is down. Once the captive
> portal has passed and our internet link is "clean", this should be
> bridged into the regular network namespace so applications see the
> network as "active". Any state of DNS/browser that was used inside
> the captive portal namespace is then destroyed (it is untrusted and
> unverifiable data)


That is how Android manages captive portals.
Whoever created the captive portal concept should be slapped
each day forever, but that's where the world has gone.



--
   Roberto Ragusa    mail at robertoragusa.it


Re: f33: systemd-resolved hang on ip query

2020-12-10 Thread Dridi Boukelmoune
> Instead, we have gnome, NM, systemd-resolved, firefox et al. fighting
> over who and how to handle captive portal authentication.

What bothers me first and foremost is that I'm not connecting through
a captive portal, and somehow I can't fully trust systemd-resolved to
DoTheRightThing(tm).

On paper I would love to stick to systemd-resolved as my system-wide
stub resolver, but now I'm considering going back to stubby, losing
turnkey caching and DNSSEC among other interesting properties. I like
the opinionated nature of systemd-resolved but the lack of NXDOMAIN
makes several (mis)use cases unbearable, like the one described in
this thread or even simple typos. There are test cases I can no longer
run during $DAYJOB because of this specific opinion (although I still
haven't ruled out a mistake on my end).

I generally agree that there seems to be a lack of cohesion in the
network stack, but have nothing constructive to propose in that area.
Between the aforementioned applications and the nsswitch
configuration, we are in flexibility hell :) [1]

With Fedora 33 I'm trying to understand whether the regression on my
system is a bug or a misconfiguration of systemd-resolved, and of
course with the current worldwide situation I only have limited
networks I can connect to to try different scenarios. None of them
involve captive networks. I'll keep searching sporadically until I run
out of spare time, at which point I'll have to locally undo this
change and go back to my old setup.

Dridi

[1] exaggerating on purpose


Re: f33: systemd-resolved hang on ip query

2020-12-10 Thread Paul Wouters

On Wed, 9 Dec 2020, Dridi Boukelmoune wrote:
> So it looks like my initial intuition that there could be a mitigation
> of sorts is starting to hold water. The problem now is that clients on
> my system using getaddrinfo in a way that was legit until now are now
> being DoS'd by systemd-resolved, waiting forever for a reply that is
> not coming.

This again leads to a required architecture change. We really need to
have a captive portal namespace, that handles all of this while the
applications still consider the network is down. Once the captive
portal has passed and our internet link is "clean", this should be
bridged into the regular network namespace so applications see the
network as "active". Any state of DNS/browser that was used inside
the captive portal namespace is then destroyed (it is untrusted and
unverifiable data)

That is, only the captive portal handling code sees these bogus DNS
messages, and no regular application sees them. This would also keep
applications from throwing SSL certificate errors because they are
connecting to the network too quickly, while the network is still
in captive mode and your SSL cert is replaced with the portal SSL cert.
Pidgin is specifically bad with this; firefox has builtin logic to prevent
all its tabs from reloading in captive portal page clones.

Instead, we have gnome, NM, systemd-resolved, firefox et al. fighting
over who and how to handle captive portal authentication.

Paul


Re: f33: systemd-resolved hang on ip query

2020-12-10 Thread Dridi Boukelmoune
> I wouldn't mind the mitigation, if only I could disable it. Does
> anyone know any better? I'm still suspecting I configured something
> wrong but at the same time systemd seems to have a history with
> NXDOMAIN handling.

I found several things, including this related to NXDOMAIN:

https://github.com/systemd/systemd/pull/17535/commits/4f9bcde3c3acadffc298a53fb60f7caf9f7bee20

The plot thickens :(


Re: f33: systemd-resolved hang on ip query

2020-12-09 Thread Dridi Boukelmoune
On Tue, Dec 8, 2020 at 8:34 PM Marius Schwarz wrote:
>
> On 08.12.20 at 19:32, Dridi Boukelmoune wrote:
> >
> >> Petr was so nice to supply a test procedure, I suggest that you use it also.
> > I'll try to strace stuff to see what's going on, but I can only
> > assume that this BZ is not trying to resolve ip addresses through
> > systemd-resolved.
> >
> >
>
> No, they didn't. A premature bind-libs update caused apps to be unable
> to resolve hostnames. They crashed.
> All tools that did the resolving themselves worked, "in a way": they first
> tried local resolution with /etc/hosts, which is where libc crashed and
> which took time, and then used the root DNS servers to do their jobs.
>
> It could be the same underlying issue: mismatched system libraries. I
> suggest updating them.

Actually, it looks like this is happening for all NXDOMAIN replies.

$ dig @1.1.1.1 com.example | grep -e SERVER -e HEADER
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 29880
;; SERVER: 1.1.1.1#53(1.1.1.1)

$ dig +timeout=1 com.example
; <<>> DiG 9.11.25-RedHat-9.11.25-2.fc33 <<>> +timeout=1 com.example
;; global options: +cmd
;; connection timed out; no servers could be reached

A quick search for systemd-resolved nxdomain yields many results with
a syslog message I do not see on my system:

> Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001

So it looks like my initial intuition that there could be a mitigation
of sorts is starting to hold water. The problem now is that clients on
my system using getaddrinfo in a way that was legit until now are now
being DoS'd by systemd-resolved, waiting forever for a reply that is
not coming.

I wouldn't mind the mitigation, if only I could disable it. Does
anyone know any better? I'm still suspecting I configured something
wrong but at the same time systemd seems to have a history with
NXDOMAIN handling.

Thanks,
Dridi


Re: f33: systemd-resolved hang on ip query

2020-12-08 Thread Marius Schwarz

On 08.12.20 at 19:32, Dridi Boukelmoune wrote:
>> Petr was so nice to supply a test procedure, I suggest that you use it also.
> I'll try to strace stuff to see what's going on, but I can only
> assume that this BZ is not trying to resolve ip addresses through
> systemd-resolved.

No, they didn't. A premature bind-libs update caused apps to be unable
to resolve hostnames. They crashed.
All tools that did the resolving themselves worked, "in a way": they first
tried local resolution with /etc/hosts, which is where libc crashed and
which took time, and then used the root DNS servers to do their jobs.

It could be the same underlying issue: mismatched system libraries. I
suggest updating them.


best regards,
Marius


Re: f33: systemd-resolved hang on ip query

2020-12-08 Thread Dridi Boukelmoune
> Can you please get another stack trace and compare it (it won't match 100%,
> as different apps take different paths) with this bug report:

It won't matter much, the next frame is in Varnish code.

Here is the pstack dump of dig:

> Thread 4 (Thread 0x7fee8a3e4640 (LWP 1768516) "isc-socket"):
> #0  0x7fee8bf72c4e in epoll_wait () from /lib64/libc.so.6
> #1  0x7fee8c0bb4cc in watcher () from /lib64/libisc.so.1107
> #2  0x7fee8b9ae3f9 in start_thread () from /lib64/libpthread.so.0
> #3  0x7fee8bf72903 in clone () from /lib64/libc.so.6
> Thread 3 (Thread 0x7fee8abe5640 (LWP 1768515) "isc-timer"):
> #0  0x7fee8b9b49e8 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x7fee8c0b3908 in isc_condition_waituntil () from /lib64/libisc.so.1107
> #2  0x7fee8c0a525f in run.lto_priv () from /lib64/libisc.so.1107
> #3  0x7fee8b9ae3f9 in start_thread () from /lib64/libpthread.so.0
> #4  0x7fee8bf72903 in clone () from /lib64/libc.so.6
> Thread 2 (Thread 0x7fee8b3e6640 (LWP 1768514) "isc-worker"):
> #0  0x7fee8b9b46c2 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x7fee8c0a22ea in run.lto_priv () from /lib64/libisc.so.1107
> #2  0x7fee8b9ae3f9 in start_thread () from /lib64/libpthread.so.0
> #3  0x7fee8bf72903 in clone () from /lib64/libc.so.6
> Thread 1 (Thread 0x7fee8b428c80 (LWP 1768513) "dig"):
> #0  0x7fee8beaed8a in sigsuspend () from /lib64/libc.so.6
> #1  0x7fee8c0a62eb in isc.app_ctxrun () from /lib64/libisc.so.1107
> #2  0x7fee8c0a6f1f in isc_app_run () from /lib64/libisc.so.1107
> #3  0x55dc25b87127 in main ()

> https://bugzilla.redhat.com/show_bug.cgi?id=1904415
>
> I see similarities there. In the case of the BR, the bind-libs and glibc
> releases did not match, it seems (a hypothesis so far, no hard facts).

This looks like a different problem.

> Petr was so nice to supply a test procedure, I suggest that you use it also.

I'll try to strace stuff to see what's going on, but I can only
assume that this BZ is not trying to resolve ip addresses through
systemd-resolved.

Thanks for the pointers,
Dridi


Re: f33: systemd-resolved hang on ip query

2020-12-08 Thread Marius Schwarz

On 08.12.20 at 18:01, Dridi Boukelmoune wrote:
> Since the upgrade to f33 I replaced my stubby setup with
> systemd-resolved since it is now the default. I was OK with that
> change since I didn't lose functionality compared to my previous
> setup. But it is breaking getaddrinfo() and IP address resolution in
> general, and that's an annoying regression.
>
> With varnish we use getaddrinfo() for both IP addresses and domain
> names, optionally we may set the numeric flag but otherwise it used to
> work out of the box. Now if I try to resolve an IP address without the
> numeric flag it hangs, never receiving a response from
> systemd-resolved:
>
>> #0  0x7f011ed8690e in ppoll () from /lib64/libc.so.6
>> #1  0x7f011c8604f6 in bus_poll.lto_priv () from /lib64/libnss_resolve.so.2
>> #2  0x7f011c860f86 in sd_bus_call () from /lib64/libnss_resolve.so.2
>> #3  0x7f011c85b249 in _nss_resolve_gethostbyname4_r () from /lib64/libnss_resolve.so.2
>> #4  0x7f011ed7a397 in gaih_inet.constprop () from /lib64/libc.so.6
>> #5  0x7f011ed7b269 in getaddrinfo () from /lib64/libc.so.6

Can you please get another stack trace and compare it (it won't match 100%,
as different apps take different paths) with this bug report:

https://bugzilla.redhat.com/show_bug.cgi?id=1904415

I see similarities there. In the case of the BR, the bind-libs and glibc
releases did not match, it seems (a hypothesis so far, no hard facts).

Petr was so nice to supply a test procedure, I suggest that you use it also.


Best regards,
Marius Schwarz


f33: systemd-resolved hang on ip query

2020-12-08 Thread Dridi Boukelmoune
Greetings,

I'm not sure whether I am doing something wrong so I'd rather get
someone's opinion before submitting a bug report.

Since the upgrade to f33 I replaced my stubby setup with
systemd-resolved since it is now the default. I was OK with that
change since I didn't lose functionality compared to my previous
setup. But it is breaking getaddrinfo() and IP address resolution in
general, and that's an annoying regression.

With varnish we use getaddrinfo() for both IP addresses and domain
names, optionally we may set the numeric flag but otherwise it used to
work out of the box. Now if I try to resolve an IP address without the
numeric flag it hangs, never receiving a response from
systemd-resolved:

> #0  0x7f011ed8690e in ppoll () from /lib64/libc.so.6
> #1  0x7f011c8604f6 in bus_poll.lto_priv () from /lib64/libnss_resolve.so.2
> #2  0x7f011c860f86 in sd_bus_call () from /lib64/libnss_resolve.so.2
> #3  0x7f011c85b249 in _nss_resolve_gethostbyname4_r () from /lib64/libnss_resolve.so.2
> #4  0x7f011ed7a397 in gaih_inet.constprop () from /lib64/libc.so.6
> #5  0x7f011ed7b269 in getaddrinfo () from /lib64/libc.so.6
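
For comparison, here is what the numeric flag looks like from Python, which wraps the same libc call. With AI_NUMERICHOST the address is only parsed, so no name service lookup should take place. This is an illustration rather than Varnish code (Varnish is written in C), and the helper name `parse_numeric()` and the port are arbitrary.

```python
import socket

def parse_numeric(host, port=80):
    """Return socket addresses for an IP literal without any name lookup."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM,
                                   flags=socket.AI_NUMERICHOST)
    except socket.gaierror:
        # Not an IP literal: with AI_NUMERICHOST the call fails right away
        return None
    return [info[4] for info in infos]
```

A caller could try this first and only fall back to a regular lookup when it returns None, which would sidestep the hang for IP literals.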

I checked with dig(1) and got the same behavior, so it happens
regardless of the method, be it via the DBUS/libnss_resolve route or
straight UDP:

$ dig getfedora.org | grep -e HEADER -e SERVER
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6462
;; SERVER: 127.0.0.53#53(127.0.0.53)

$ dig +timeout=1 1.1.1.1
; <<>> DiG 9.11.24-RedHat-9.11.24-2.fc33 <<>> +timeout=1 1.1.1.1
;; global options: +cmd
;; connection timed out; no servers could be reached

$ dig +timeout=1 @1.1.1.1 1.1.1.1 | grep -e HEADER -e SERVER
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51616
;; SERVER: 1.1.1.1#53(1.1.1.1)

$ dig +timeout=1 @8.8.8.8 1.1.1.1 | grep -e HEADER -e SERVER
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 40077
;; SERVER: 8.8.8.8#53(8.8.8.8)
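
To compare resolvers in a script, the status can be pulled out of dig's HEADER line, so that "answered with NXDOMAIN" and "no reply at all" are easy to tell apart. A small Python sketch that relies only on the output format shown above; `dig_status()` is a made-up helper name.

```python
import re

HEADER_RE = re.compile(r"status: (\w+), id: (\d+)")

def dig_status(output):
    """Return the rcode name from dig's HEADER line, or None if no reply came."""
    for line in output.splitlines():
        if "->>HEADER<<-" in line:
            match = HEADER_RE.search(line)
            if match:
                return match.group(1)
    return None

answered = (";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51616\n"
            ";; SERVER: 1.1.1.1#53(1.1.1.1)")
timed_out = (";; global options: +cmd\n"
             ";; connection timed out; no servers could be reached")
```

Here `dig_status(answered)` gives "NXDOMAIN" while `dig_status(timed_out)` gives None.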

I'm not getting an answer from systemd-resolved when I try to query an
IP address, despite recursive resolvers replying with NXDOMAIN. This
is the case for my network's resolver, not just the 1.1.1.1 and
8.8.8.8 examples I gave above. The resolved.conf(5) manual is rather
short, and I'm not seeing anything obvious that could explain this
behavior. At best, I could assume a DoS mitigation, refusing to
resolve blatantly invalid domains, but that's breaking the automatic
getaddrinfo() fallback to resolving the numeric IP. In particular,
when my recursive resolver doesn't make a big deal about it, I'd
rather get a timely NXDOMAIN.

Any ideas?

Thanks,
Dridi