Re: Source IP incorrect on multi homed systems

2023-02-19 Thread Peter Linder
Indeed, this is how you typically set up a multihomed service: put the
addresses on lo and then announce them using BGP or something similar.


If you bind the service to one of the network links directly and that 
link's network goes down (it may not even be in your AS, so you might not 
even know about it), then the service is offline.


Use a route-map in your BGP config to set the source address of routes to 
the address on lo; that works for wg :)
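
One way to express that, assuming FRR as the routing daemon (the route-map
name and the address below are illustrative, not taken from this thread):

route-map SET-SRC permit 10
 set src 192.0.2.1
!
ip protocol bgp route-map SET-SRC

With the preferred source set on BGP-learned routes, locally originated
traffic following them, including wg's encrypted UDP, leaves with the lo
address as its source. Other routing daemons have equivalent mechanisms
under different names.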


/Peter


On 2023-02-19 13:10, Nico Schottelius wrote:

Aside from the reference point that nginx + ICMP are handled correctly,
I want to elaborate further on this case to show that something is
really wrong with the current behaviour:

A typical scenario for routers is to have a lot of globally reachable IP
addresses (IPv6, IPv4) assigned to the loopback interface, such as on this
system:

[13:11] router2.place6:~# ip a sh dev lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 2a0a:e5c0:1e:a::b/128 scope global
       valid_lft forever preferred_lft forever
    inet6 2a0a:e5c0:1e:a::a/128 scope global
       valid_lft forever preferred_lft forever
    inet6 2a0a:e5c0:2:a::b/128 scope global
       valid_lft forever preferred_lft forever
    inet6 2a0a:e5c0:2:a::a/128 scope global
       valid_lft forever preferred_lft forever
    inet6 2a0a:e5c0:2:1::7/128 scope global
       valid_lft forever preferred_lft forever
    inet6 2a0a:e5c0:2:1::6/128 scope global
       valid_lft forever preferred_lft forever
    inet6 2a0a:e5c0:2:1::5/128 scope global
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

The motivation behind that is that, independently of the actual routing
interface, these IP addresses are always reachable.
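
For completeness, adding such an address is a one-liner; the prefix below is
a documentation address, not one of this router's real ones:

ip addr add 2001:db8::a/128 dev lo

The address is then announced via BGP (for instance redistributed as a
connected route), so the rest of the network can reach it regardless of which
physical link happens to be up.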

Now in the case of wireguard selecting the source IP based on the
outgoing interface, this is never going to work, as lo cannot send
packets to the outside world.


Nico Schottelius  writes:


Let me rephrase the problem statement:

 - ping and http calls to the multi homed machine work correctly:
   I can ping 147.78.195.254 and the reply contains the same address.
   I can ping 195.141.200.73 and the reply contains the same address.
   I can curl 147.78.195.254 and the reply contains the same address.
   I can curl 195.141.200.73 and the reply contains the same address.

 - wireguard does NOT work, because it changes the reply address:
   A packet sent to 147.78.195.254 is replied to from 195.141.200.73

In general, processes reply from the IP address that was used to contact
them, not from the outgoing interface's address; replying from the interface
address would also break the practice of adding IP addresses to the loopback
interface.
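
For comparison, this is roughly how an ordinary userspace UDP daemon keeps
its reply source stable: a minimal sketch using IPV6_RECVPKTINFO, not code
from WireGuard or from this thread; the port and buffer sizes are arbitrary.

/* Minimal sketch (not WireGuard code): a UDP echo responder that replies
 * from whatever local address it was contacted on, via IPV6_RECVPKTINFO.
 * Port 5000 and buffer sizes are arbitrary. */
#define _GNU_SOURCE           /* for struct in6_pktinfo on glibc */
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
	int fd = socket(AF_INET6, SOCK_DGRAM, 0);
	int on = 1;
	struct sockaddr_in6 local = { .sin6_family = AF_INET6,
	                              .sin6_port = htons(5000) };

	setsockopt(fd, IPPROTO_IPV6, IPV6_RECVPKTINFO, &on, sizeof(on));
	bind(fd, (struct sockaddr *)&local, sizeof(local));

	for (;;) {
		char buf[2048];
		union { /* aligned control buffer, as in cmsg(3) */
			char raw[CMSG_SPACE(sizeof(struct in6_pktinfo))];
			struct cmsghdr align;
		} cbuf;
		struct sockaddr_in6 peer;
		struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
		struct msghdr msg = {
			.msg_name = &peer, .msg_namelen = sizeof(peer),
			.msg_iov = &iov, .msg_iovlen = 1,
			.msg_control = cbuf.raw, .msg_controllen = sizeof(cbuf.raw),
		};
		ssize_t len = recvmsg(fd, &msg, 0);
		if (len < 0)
			break;

		/* The IPV6_PKTINFO cmsg carries the local address the request was
		 * sent to; passing it back to sendmsg() pins the reply's source. */
		for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
			if (c->cmsg_level == IPPROTO_IPV6 && c->cmsg_type == IPV6_PKTINFO) {
				struct in6_pktinfo *pi = (struct in6_pktinfo *)CMSG_DATA(c);
				pi->ipi6_ifindex = 0; /* 0 = let routing pick the interface */
			}
		}
		iov.iov_len = len;
		sendmsg(fd, &msg, 0); /* reply from the address we were contacted on */
	}
	close(fd);
	return 0;
}

The point of the comparison: regular daemons pin the reply to the contacted
address, while the wg behaviour described above does not.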

For full detail, see ip addresses [0] and routing below [1] and tests
executed [2].

I believe that this is a bug in wireguard.



[2]

Let's see what it looks like in detail:

1) ping to 147.78.195.254: works

[9:14] nb3:~% ping -c2 147.78.195.254
PING 147.78.195.254 (147.78.195.254) 56(84) bytes of data.
64 bytes from 147.78.195.254: icmp_seq=1 ttl=53 time=7.27 ms
64 bytes from 147.78.195.254: icmp_seq=2 ttl=53 time=6.30 ms

--- 147.78.195.254 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 6.296/6.781/7.267/0.485 ms

/ # tcpdump -ni any host 194.5.220.43
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 
262144 bytes
08:14:48.379618 net1  In  IP 194.5.220.43 > 147.78.195.254: ICMP echo request, 
id 89, seq 1, length 64
08:14:48.379651 net2  Out IP 147.78.195.254 > 194.5.220.43: ICMP echo reply, id 
89, seq 1, length 64
08:14:49.380340 net1  In  IP 194.5.220.43 > 147.78.195.254: ICMP echo request, 
id 89, seq 2, length 64
08:14:49.380392 net2  Out IP 147.78.195.254 > 194.5.220.43: ICMP echo reply, id 
89, seq 2, length 64

2) ping to 195.141.200.73

[9:14] nb3:~% ping -c2 195.141.200.73
PING 195.141.200.73 (195.141.200.73) 56(84) bytes of data.
64 bytes from 195.141.200.73: icmp_seq=1 ttl=53 time=11.3 ms
64 bytes from 195.141.200.73: icmp_seq=2 ttl=53 time=6.81 ms

--- 195.141.200.73 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 6.813/9.057/11.301/2.244 ms
[9:15] nb3:~%
/ # tcpdump -ni any host 194.5.220.43
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 
262144 bytes
08:16:19.257697 net2  In  IP 194.5.220.43 > 195.141.200.73: ICMP echo request, 
id 91, seq 1, length 64
08:16:19.257730 net2  Out IP 195.141.200.73 > 194.5.220.43: ICMP echo reply, id 
91,

Re: potentially disallowing IP fragmentation on wg packets, and handling routing loops better

2021-06-07 Thread Peter Linder

This is indeed the case for me, spot on.

On 2021-06-07 13:46, Roman Mamedov wrote:

So this same host that just generated the 1574-byte encapsulated VXLAN packet
with something it received via its eth0 port, now needs to send it further to
its WG peer(s). For this to succeed, the in-tunnel WG MTU needs to be 1574 or
more, not 1412 or 1420, as VXLAN itself can't be fragmented[1]; or even if it
could, that would mean a much worse overhead ratio than currently.
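
(For reference, one plausible accounting for the 1574-byte figure, assuming an
IPv6 outer header and a VLAN-tagged inner frame: 1500-byte payload + 14 bytes
inner Ethernet + 4 bytes 802.1Q + 8 bytes VXLAN + 8 bytes UDP + 40 bytes IPv6
= 1574 bytes.)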


Re: potentially disallowing IP fragmentation on wg packets, and handling routing loops better

2021-06-06 Thread Peter Linder
This would break things for me. We're doing a lot of L2-over-L3 site-to-site 
stuff, and we are using wireguard as the outer layer. The inner layer is 
vxlan or l2tpv3.


In particular, people connect lots of stuff with no regard for MTU. For 
some things it's also very hard to change, so we just assume people won't. 
Since the L3 network typically has the same MTU as the inner L2 network, we 
need fragmentation. There is no practical way to tell hosts on the L2 
network about the limited MTU; for all we know, they don't even run IP.
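
A rough sketch of that kind of setup; the interface names, keys and addresses
below are placeholders, not our real config:

# outer layer: wireguard, with the tunnel MTU kept large enough for full L2 frames
ip link add wg0 type wireguard
wg set wg0 private-key /etc/wireguard/wg0.key \
    peer <peer-public-key> endpoint 198.51.100.2:51820 allowed-ips 10.99.0.2/32
ip addr add 10.99.0.1/30 dev wg0
ip link set wg0 mtu 1570 up    # roughly 1550+ so a VXLAN-wrapped 1500-byte frame
                               # fits; the encrypted UDP then fragments on the underlay

# inner layer: vxlan carrying the L2 segment across the tunnel
ip link add vxlan100 type vxlan id 100 local 10.99.0.1 remote 10.99.0.2 dstport 4789
ip link set vxlan100 mtu 1500
ip link set vxlan100 master br0 up   # br0: the bridge with the local L2 ports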


It really does work without hassle, and it is not unbearably slow. 
Performance is down perhaps by a factor of 3 compared to setting a smaller 
MTU/MSS, but we can still push 350 Mbit/s with a 2 GHz Atom CPU, and around 
800 Mbit/s with a Xeon CPU, with fragmentation for most packets. This is one 
case where wireguard really works well!


IMHO, having wireguard generate fragmentable packets adds a lot to its 
usefulness. With that said, it's not the end of the world for me, as I can 
just compile my own, but I'd rather not :-)



On 2021-06-06 11:13, Jason A. Donenfeld wrote:

Hi,

WireGuard is an encrypted point-to-multipoint tunnel, where onion
layering of packets via a single interface or multiple is a useful
feature. This makes handling routing loops very hard to manage and
detect. I'm considering changing and simplifying loop mitigation to a
different strategy, but not without some discussion of its
implications.

Specifically the change would be to not allow IP fragmentation of the
encrypted UDP packets. This way, in the case of a loop, eventually the
packet size exceeds MTU, and it gets dropped: dumb and effective.
Depending on how this discussion goes, a compromise would be to not
allow fragmentation, but only for forwarded and kernel-generated
packets, not for locally generated userspace packets. That's more
complex, and I don't like it as much as just disallowing IP
fragmentation altogether.
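
To make the loop argument concrete with rough numbers (the overhead figures
here are the usual WireGuard ones, not quoted from this mail): each pass
through a WireGuard interface adds about 60 bytes on IPv4 (80 on IPv6) for
the outer IP header, UDP header, WireGuard data header and auth tag. A packet
that starts near the usual 1420-byte tunnel MTU therefore exceeds a 1500-byte
link MTU after one or two nested encapsulations, and with fragmentation
disallowed it is dropped at that point instead of circulating indefinitely.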

Pros:
- It solves the routing loop problem very simply.
- Usually when people are fragmenting packets like that, things become
very, very slow anyway, and it'd be better to just stop working
entirely, so that people adjust their MTU.
- Is anybody actually relying on this?

Cons:
- Maybe people are running
wireguard-over-gre-over-vxlan-over-l2tp-over-pppoe-over-god-knows-what-else,
and this reduces the MTU to below 1280, yet they still want to put
IPv6 through wireguard, and are willing to accept the performance
implications.
- Some people don't know how to fix their MTUs, and breaking rather
than just becoming really slow isn't the best outcome there, maybe.
- Maybe people are relying on this?

Before anybody asks: we're not going to add a knob for this, simply by
virtue of this being a decision with pros and cons. Please don't bring
that up.

I'd be very interested in opinions about this. Are there additional
pros and cons? I know the matter has come up a few times on the list,
mostly with people _wanting_ fragmentation (I've CCed a few people from
those threads - Roman, I expect you to vigorously argue the
pro-fragmentation stance ;-)), but I'm not convinced the outcome of
those threads was correct, other than, "yeah, that's easy enough to
enable." But on the other hand, maybe the cons are real enough that we
should rethink this.

Please let me know thoughts and ideas.

Thanks,
Jason