BTW there are more routecache related patches that I think we can bring in 
from Spirent fork which I documented here 
- https://github.com/cloudius-systems/osv/issues/1197.
This one looks like a bug fix  
- 
https://github.com/SpirentOrion/osv/commit/38df7da9fbe2f171297f41ecae58edc9bd436617
This one more of an ehnancement 
- 
https://github.com/SpirentOrion/osv/commit/a20c112bf47a4a36b1fff17313f9d1b2e3ce919c.
What do you think about those above?

The patch we are discussing here addresses "real" bug I have come accross 
myself.

On Sunday, July 24, 2022 at 6:02:22 AM UTC-4 n...@scylladb.com wrote:

> Hi, this code is probably correct given that both you and the Spirent guys 
> gave it a lot of thought, but there's something I'm not sure I understand:
>
> If I understand correctly (but please correct me if I'm wrong), the 
> problem is that OSv has somewhere an "optimization" that
> internal connections to the node's external IP address are handled 
> internally (loopback) instead of being sent to the (virtual) network
> card, so if we cache this knowledge, then next time if the connection is 
> of the "other" type (external connection after we cached the
> internal route, or internal connection after the external route) the 
> result will be either not working (passing external connection
> through loopback) or inefficient (passing internal connection through the 
> real device).
>

Right, that is exactly what I am thinking as well. I am not an expert here, 
but I think the if_simloop() function called from ether_output() is where
that optimization happens. For illustration see this:

"local call (from itself on 192.168.122.15)"
0xffff80000119b040 /httpclient.so   0        15.368111615 net_packet_in     
   b'IP 192.168.122.15.46890 > 192.168.122.15.8000: Flags ÆSÅ, seq 
744996567, win 65535, options Æmss 1460,nop,wscale 6,sackOK,TS val 15368 
ecr 0Å, length 0'
  log_packet_in(mbuf*, int) core/net_trace.cc:143
  netisr_queue_workstream(netisr_workstream*, unsigned int, netisr_work*, 
mbuf*, int*) Æclone .constprop.0Å bsd/sys/net/netisr.cc:633
  netisr_queue_src bsd/sys/net/netisr.cc:684
  netisr_queue_src bsd/sys/net/netisr.cc:712
  *if_simloop bsd/sys/net/if_loop.cc:291*
  *ether_output bsd/sys/net/if_ethersubr.cc:256*
 * ip_output(mbuf*, mbuf*, route*, int, ip_moptions*, inpcb*) 
bsd/sys/netinet/ip_output.cc:621*
  tcp_output bsd/sys/netinet/tcp_output.cc:1385
  tcp_usr_connect bsd/sys/netinet/tcp_usrreq.cc:465
  tcp_usr_connect bsd/sys/netinet/tcp_usrreq.cc:436
  kern_connect bsd/sys/kern/uipc_syscalls.cc:374

"external call"
0xffff8000007e0040 virtio-net-rx    0         4.284705130 net_packet_out   
    b'IP 192.168.122.15.8000 > 192.168.122.1.45818: Flags Æ.Å, ack 512095, 
win 65534, length 0'
  log_packet_out(mbuf*, int) core/net_trace.cc:148
  *ether_output bsd/sys/net/if_ethersubr.cc:397 //This should really 
be ether_output_frame*
  *ether_output bsd/sys/net/if_ethersubr.cc:366*
  *ip_output(mbuf*, mbuf*, route*, int, ip_moptions*, inpcb*) 
bsd/sys/netinet/ip_output.cc:621*
  tcp_twrespond bsd/sys/netinet/tcp_timewait.cc:553
  tcp_twstart bsd/sys/netinet/tcp_timewait.cc:281
  tcp_do_segment(mbuf*, tcphdr*, socket*, tcpcb*, int, int, unsigned char, 
int, bool&) bsd/sys/netinet/tcp_input.cc:2511
  tcp_input bsd/sys/netinet/tcp_input.cc:956
  ip_input bsd/sys/netinet/ip_input.cc:774
  netisr_dispatch_src bsd/sys/net/netisr.cc:769

BTW in case of the local call the qualification "net_packet_in" vs 
"net_packet_out" might be confusing. So this if_simloop() might be what is 
used to handle this kind of "loops" to itself.


> If I understand correctly your patch works because external connections 
> happen to look up the gateway (*.1)
> instead of the actual IP address, so it makes it easy to tell these two 
> cases apart. Is this true?
>
I think these are actually the "responses" back from 192.168.122.15 to the 
"world outside" (pointed by the gateway 192.168.122.1) when
the routecache is used as well. 

>
> I'm not sure this is the *best* way to tell these two cases apart, but I 
> probably agree it's a "good enough" way. As you can
> see in the comments I wrote in routecache.hh (which I should have 
> copy-edited more carefully, I see they have a lot of
> silly mistakes...), routecache.hh is all about the art of the "good 
> enough", so it makes sense to continue this.
>
 
In nutshell, this new version of the logic checks the device associated 
with the cached (present) route entry to determine what kind of match it 
should do - just netmask if non-loopback or
more involved in a loopback one.

>
>
>
> --
> Nadav Har'El
> n...@scylladb.com
>
>
> On Sat, Jul 23, 2022 at 9:15 PM Waldemar Kozaczuk <jwkoz...@gmail.com> 
> wrote:
>
>> This patch modifies the OSv routecache to handle the case when a route
>> entry gets added for a non-loopback address and loopback device before
>> packets start coming from outside of the local network. The practical
>> example is an app that on startup calls itself on its HTTP endpoint but
>> using non-loopback address like 192.168.122.15 before any external
>> requests from outside of the guest. This is exactly what would happen
>> when running SeaWeedFS (see #1188). This bug can be also replicated
>> with a golang-httpclient and httpserver-api apps:
>>
>> ./scripts/build image=golang-httpclient,httpserver-monitoring-api
>> .script/run.py -api -e '/httpclient.so 
>> http://192.168.122.15:8000/os/version'
>>
>> After it starts, run 'curl http://localhost:8000/os/version' and
>> observe curl never receive the response.
>>
>> Let us assume we have an app that binds to 192.168.122.15 and listens
>> on some port. When a call to connect on a non-loopback address comes from
>> the app itself, it goes through the layers of the networking stack and
>> eventually calls route_cache::lookup() to find a route entry for
>> 192.168.122.15 as in the stack trace below.
>>
>>   in route_cache::lookup (dst=dst@entry=0xffff8000014bb84c, 
>> fibnum=<optimized out>, ret=ret@entry=0xffff8000014bb860) at 
>> ./bsd/sys/net/routecache.hh:197
>>   in_pcbladdr (inp=inp@entry=0xffffa0000155a200, 
>> faddr=faddr@entry=0xffff8000014bb98c, laddr=laddr@entry=0xffff8000014bb988, 
>> cred=0x0) at bsd/sys/netinet/in_pcb.cc:881
>>   in_pcbconnect_setup (inp=0xffffa0000155a200, nam=0xffffa000008e0510, 
>> laddrp=0xffff8000014bb9e4, lportp=0xffff8000014bb9e2, 
>> faddrp=0xffffa0000155a264, fportp=0xffffa0000155a254, 
>> oinpp=0xffff8000014bb9e8, cred=0x0) at bsd/sys/netinet/in_pcb.cc:1056
>>   tcp_connect (tp=tp@entry=0xffffa00000f76800, 
>> nam=nam@entry=0xffffa000008e0510, td=<optimized out>) at 
>> bsd/sys/netinet/tcp_usrreq.cc:1089
>>   tcp_usr_connect (td=<optimized out>, nam=0xffffa000008e0510, 
>> so=<optimized out>) at bsd/sys/netinet/tcp_usrreq.cc:463
>>   tcp_usr_connect (so=<optimized out>, nam=0xffffa000008e0510, 
>> td=<optimized out>) at bsd/sys/netinet/tcp_usrreq.cc:436
>>   kern_connect (fd=<optimized out>, sa=0xffffa000008e0510) at 
>> bsd/sys/kern/uipc_syscalls.cc:374
>>
>> Initially the route cache is empty so it adds new entry that associates 
>> the
>> address 192.168.122.15, the loopback device and its netmask 255.0.0.0.
>> Once all subsequent networking stack calls to handle this particular
>> HTTP request complete, at some point later external client calls the
>> same HTTP endpoint from outside of OSv guest. While handling this call,
>> at some point we get to this point as illustrated by the stack trace 
>> below:
>>
>>   in route_cache::lookup (dst=<optimized out>, fibnum=<optimized out>, 
>> ret=0xffff800000c4ead0) at ./bsd/sys/net/routecache.hh:197
>>   in tcp_maxmtu (inc=<optimized out>, flags=0x0) at 
>> bsd/sys/netinet/tcp_subr.cc:1690
>>   in tcp_mssopt (inc=0xffffa00000f7b910) at 
>> bsd/sys/netinet/tcp_input.cc:3131
>>   in syncookie_generate (flowlabel=<synthetic pointer>, 
>> sc=0xffffa00000f7b900, sch=<optimized out>) at 
>> bsd/sys/netinet/tcp_syncache.cc:1536
>>   in _syncache_add (inc=<optimized out>, to=<optimized out>, 
>> th=<optimized out>, inp=<optimized out>, lsop=0xffff800000c4edc8, 
>> m=0xffffa00000f7bf00, toepcb=0x0, tu=0x0) at 
>> bsd/sys/netinet/tcp_syncache.cc:1210
>>   in tcp_input (m=0xffffa00000f7bf00, off0=<optimized out>) at 
>> bsd/sys/netinet/tcp_input.cc:941
>>   in ip_input (m=<optimized out>) at bsd/sys/netinet/ip_input.cc:774
>>   in netisr_dispatch_src (proto=1, source=<optimized out>, 
>> m=0xffffa00000f7bf00) at bsd/sys/net/netisr.cc:769
>>   in netisr_dispatch_src (proto=9, source=<optimized out>, 
>> m=0xffffa00000f7bf00) at bsd/sys/net/netisr.cc:769
>>   in virtio::net::receiver (this=0xffff900000ba3000) at 
>> drivers/virtio-net.cc:545
>>
>> This time the destination address (dst) is 192.168.122.1 which is a
>> gateway for the non-loopback (eth0) network. However at this point we have
>> a single route entry created when handling the first request and the 
>> search()
>> method used by route_cache::lookup() simply compares this entry network
>> with the network of the IP address in question by applying netmask
>> 255.0.0.0 and it matches so that entry is returned. The problem is that
>> this and all subsequent packets get routed to the loopback device
>> instead of the virtio device in this case. And the client never receives
>> relevant packets to complete the handshake and TCP connection does
>> not get established from the client perspective. Effectively the
>> application will never be able to receive any external requests, however
>> it can still handle any internal ones like the 1st one in this example.
>>
>> What is interesting if we revert the sequence in this example and send
>> external request before the app makes an internal request to itself
>> everything works fine. It is because when handling the external call,
>> the initial and only route entry gets created for IP 192.168.122.1 and
>> netmask 255.255.255.0 and non-loopback (eth0) device. The subsequent
>> internal request would match the netmask and go through the ifloop()
>> like this illustrates:
>>
>> 0xffff80000119b040 /httpclient.so   0        15.368111615 net_packet_in  
>>       b'IP 192.168.122.15.46890 > 192.168.122.15.8000: Flags [S], seq 
>> 744996567, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 15368 
>> ecr 0], length 0'
>>   log_packet_in(mbuf*, int) core/net_trace.cc:143
>>   netisr_queue_workstream(netisr_workstream*, unsigned int, netisr_work*, 
>> mbuf*, int*) [clone .constprop.0] bsd/sys/net/netisr.cc:633
>>   netisr_queue_src bsd/sys/net/netisr.cc:684
>>   netisr_queue_src bsd/sys/net/netisr.cc:712
>>   if_simloop bsd/sys/net/if_loop.cc:291
>>   ether_output bsd/sys/net/if_ethersubr.cc:256
>>   ip_output(mbuf*, mbuf*, route*, int, ip_moptions*, inpcb*) 
>> bsd/sys/netinet/ip_output.cc:621
>>   tcp_output bsd/sys/netinet/tcp_output.cc:1385
>>   tcp_usr_connect bsd/sys/netinet/tcp_usrreq.cc:465
>>   tcp_usr_connect bsd/sys/netinet/tcp_usrreq.cc:436
>>   kern_connect bsd/sys/kern/uipc_syscalls.cc:374
>>
>> So this patch fixes this problem by changing the search() method
>> to handle the first scenario. In essence, instead of simply
>> comparing the networks of the entry IP address and dst, it first
>> identifies type of the device the entry is associated with. If
>> non-loopback it does the same as before, if device is loopback if checks
>> if dst is loopback in which case it is a match otherwise it compares
>> full IP addresses. With the patch applied, in first scenario
>> when the second external call is handled the entry device is
>> checked which is loopback, and new logic returns null as full addresses
>> would not match.
>>
>> This patch has been backported from the original one by Jan-Michael Kho
>> contributed to the Spirent fork of OSv -
>>
>> https://github.com/SpirentOrion/osv/commit/f6e6e54ba14a7e9c1e04574769d0dd1832d66b92
>> .
>>
>> Co-authored-by: "Jan-Michael Kho" <jan...@spirent.com>
>> Co-authored-by: Waldemar Kozaczuk <jwkoz...@gmail.com>
>> Signed-off-by: Waldemar Kozaczuk <jwkoz...@gmail.com>
>> ---
>>  bsd/sys/net/routecache.hh | 25 +++++++++++++++++++++++--
>>  1 file changed, 23 insertions(+), 2 deletions(-)
>>
>> diff --git a/bsd/sys/net/routecache.hh b/bsd/sys/net/routecache.hh
>> index 0c5c1f40..f71ac8f6 100644
>> --- a/bsd/sys/net/routecache.hh
>> +++ b/bsd/sys/net/routecache.hh
>> @@ -54,6 +54,7 @@
>>  #include <bsd/sys/net/if.h>
>>  #include <bsd/sys/net/if_dl.h>
>>  #include <bsd/sys/netinet/in_var.h>
>> +#include <bsd/sys/netinet/in.h>
>>
>>  #include <bsd/sys/net/route.h>
>>
>> @@ -88,6 +89,10 @@ public:
>>  #endif
>>          return *this;
>>      }
>> +
>> +    bool is_loopback(void) const {
>> +        return (rt_ifp && (rt_ifp->if_flags & IFF_LOOPBACK)) ? true : 
>> false;
>> +    }
>>  };
>>
>>  // Silly routing table implementation, allowing search given address in 
>> list
>> @@ -116,10 +121,26 @@ public:
>>          }
>>          entries.emplace_front(a, n, r);
>>      }
>> +    // address should be in host order
>> +    bool is_loopback_net(u32 address) const {
>> +        return ((address >> IN_CLASSA_NSHIFT) == IN_LOOPBACKNET) ? true 
>> : false;
>> +    }
>>      nonlockable_rtentry *search(u32 address) {
>>          for (silly_rtable_entry &e : entries) {
>> -            if ((e.address & e.netmask) == (address & e.netmask)) {
>> -                return &e.rte;
>> +            if (e.rte.is_loopback() == false) {
>> +                if ((e.address & e.netmask) == (address & e.netmask)) {
>> +                    return &e.rte;
>> +                }
>> +            } else {
>> +                if (is_loopback_net(address)) {
>> +                    return &e.rte;
>> +                }
>> +                // We shouldn't use this entry on IP addresses just 
>> because they're
>> +                // on the same network as our non-loopback address. So 
>> match the entire
>> +                // address.
>> +                if (e.address == address) {
>> +                    return &e.rte;
>> +                }
>>              }
>>          }
>>          return nullptr;
>> -- 
>> 2.35.1
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "OSv Development" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to osv-dev+u...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/osv-dev/20220723181515.293695-1-jwkozaczuk%40gmail.com
>> .
>>
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/9a5ecfc8-6b89-4088-8506-153276ab5c1bn%40googlegroups.com.

Reply via email to