Hello Community!

I finished investigating the options we have regarding network
(interface) development in future linux-vserver versions, and I'd
like to get your opinion on several issues and/or ideas ...

this is going to be a little longer, so I'd suggest reading it
thoroughly and thinking about it before replying (but you probably
do that anyway ;)


I'll do this in several parts, so I can accumulate questions,
suggestions, answers, etc., and respond to them as I proceed.
so do not expect this to be something final, and do not hesitate
to ask questions and/or provide feedback ...

------------

first, a short overview of the basic principles in use and the
'building blocks' I identified and researched.

  Network Interfaces [ip link]
    - provide a handle to a physical or virtual device
    - have a physical address (eg. MAC for ethernet)
    - do traffic accounting (rx/tx, errors, drops, ...)
 
  IP Addresses [ip addr]
    - provide an internet address ipv4/ipv6/...
    - associated with a link (interface) as primary/secondary
    - have/define a network (address/netmask)
 
  Network Sockets [netstat -atuw]
    - provide an interface to send/receive messages
    - associated with an address (not an interface)
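that last point is easy to observe from userspace; a minimal Python
sketch (the loopback address is just a convenient example):

```python
import socket

# a socket is bound to an address, not to an interface: the kernel
# decides which link actually carries the traffic for that address
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.bind(("127.0.0.1", 0))        # port 0: let the kernel pick one
addr, port = s.getsockname()
print(addr, port)               # only an address -- no interface name anywhere
s.close()
```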

what we currently use in linux-vserver:

  Network Context [chbind]
    - limits the addresses to a given set of addresses
    - is inherited from parent to child process
    - is applied to socket operations
    - limits the visibility of addresses
    - doesn't know anything about interfaces
    - cannot be modified or migrated into


how does this differ from the UML/QEMU/VMware network device?

  basically, a network interface is the point where a packet
  enters or leaves the host (server), and that is exactly what
  the tun/tap device on the host and the network driver in the
  UML/QEMU/VMware client provide.
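  for reference, attaching to such a tun device from userspace looks
  roughly like this (a sketch; the constants come from <linux/if_tun.h>,
  and actually running it needs root / CAP_NET_ADMIN):

```python
import fcntl, os, struct

TUNSETIFF = 0x400454ca   # ioctl from <linux/if_tun.h>
IFF_TUN   = 0x0001       # IP-level device (no ethernet header)
IFF_NO_PI = 0x1000       # no extra packet-info header on read/write

def open_tun(name="tun0"):
    # every packet the host routes to 'name' can then be read() from fd,
    # and whatever is written to fd enters the host's network stack --
    # exactly the entry/exit point described above
    fd = os.open("/dev/net/tun", os.O_RDWR)
    ifr = struct.pack("16sH", name.encode(), IFF_TUN | IFF_NO_PI)
    fcntl.ioctl(fd, TUNSETIFF, ifr)
    return fd
```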
  
  consider the following setup:
  
    host:   eth0: <some-network-ip>
            tun0: 10.0.0.1/24
            lo:   127.0.0.1/8

    client: eth0: 10.0.0.2/24
            lo:   127.0.0.1/8
    
  what happens on a 'ping -c 1 10.0.0.2' issued on the host?

  H# ping -c 1 10.0.0.2
  PING 10.0.0.2 (10.0.0.2) from 10.0.0.1 : 56(84) bytes of data.
  64 bytes from 10.0.0.2: icmp_seq=0 ttl=64 time=44.554 msec


  HOST (MAC-H, 10.0.0.1)                (MAC-C, 10.0.0.2) CLIENT  
  |                                                            | 
  | arp: who-has 10.0.0.2 tell 10.0.0.1 ---------------------> |
  | <------------------------- arp: reply 10.0.0.2 is-at MAC-C |
  |                                                            |
  | icmp: 10.0.0.1 > 10.0.0.2: echo request -----------------> |
  |                                                            |
  | <--------------------- arp: who-has 10.0.0.1 tell 10.0.0.2 |
  | arp: reply 10.0.0.1 is-at MAC-H -------------------------> |
  |                                                            |
  | <------------------- icmp: 10.0.0.2 > 10.0.0.1: echo reply |


  and the ifconfig on the client (and on the host, except for
  some differences in the packet sizes[1]) now shows:

  eth0  Link encap:Ethernet  HWaddr MAC-C
        inet addr:10.0.0.2  Bcast:10.0.0.255  Mask: ...
        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
        RX packets:3 errors:0 dropped:0 overruns:0 frame:0
        TX packets:3 errors:0 dropped:0 overruns:0 carrier:0
        RX bytes:218 (218.0 b)  TX bytes:218 (218.0 b)
   
  what did UML/QEMU/VMware do in that process? simple: the
  application received 3 packets from the host via tun0 and
  transmitted them to the client kernel via eth0, and it also
  received 3 packets from the client, which it delivered via
  the tun0 device to the network stack of the host.
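  the 218 bytes on eth0 can be cross-checked arithmetically; a Python
  sketch (the placeholder MAC and the padding of ARP frames to the
  60-byte ethernet minimum are assumptions worth verifying with a dump):

```python
import socket, struct

# build the ARP who-has frame the host sends (values from the trace;
# the all-zero MAC stands in for the real MAC-H)
MAC_H     = bytes(6)
BROADCAST = b"\xff" * 6

eth_hdr = struct.pack("!6s6sH", BROADCAST, MAC_H, 0x0806)   # EtherType ARP
arp_req = struct.pack("!HHBBH6s4s6s4s",
                      1, 0x0800, 6, 4, 1,                   # ethernet/IPv4, who-has
                      MAC_H, socket.inet_aton("10.0.0.1"),
                      bytes(6), socket.inet_aton("10.0.0.2"))
arp_len = len(eth_hdr + arp_req)         # 42 bytes of actual content

ETH_MIN    = 60                          # minimum ethernet frame (sans FCS)
icmp_frame = 14 + 20 + 8 + 56            # eth + IP + ICMP headers + 56 data = 98
                                         # (the "56(84)" in the ping output)

# eth0 saw 3 packets each way: 2 padded ARP frames + 1 ICMP frame
total = 2 * max(arp_len, ETH_MIN) + icmp_frame
print(total)    # 218 -- matches the RX/TX byte counters above
```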
  
  now, let's have a look at the same ping on the client side:
  
  C# ping -c 1 10.0.0.2
  PING 10.0.0.2 (10.0.0.2) from 10.0.0.2 : 56(84) bytes of data.
  64 bytes from 10.0.0.2: icmp_seq=0 ttl=64 time=4.391 msec
  
  CLIENT (MAC-C, 10.0.0.2)              (MAC-C, 10.0.0.2) CLIENT  
  |                                                            | 
  | icmp: 10.0.0.2 > 10.0.0.2: echo request -----------------> |
  | <------------------- icmp: 10.0.0.2 > 10.0.0.2: echo reply |

  and the ifconfig on the client shows:
  
  eth0  Link encap:Ethernet  HWaddr MAC-C
        inet addr:10.0.0.2  Bcast:10.0.0.255  Mask: ...
        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
        RX packets:0 errors:0 dropped:0 overruns:0 frame:0
        TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
        RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

  lo    Link encap:Local Loopback
        inet addr:127.0.0.1  Mask:255.0.0.0
        UP LOOPBACK RUNNING  MTU:16436  Metric:1
        RX packets:2 errors:0 dropped:0 overruns:0 frame:0
        TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
        RX bytes:168 (168.0 b)  TX bytes:168 (168.0 b)
  
  what was the part of UML/QEMU/VMware in that process?
  nothing network-related, at least: the entire ping was handled
  on the client, which used the loopback interface to reach one
  of its local addresses; disabling the lo device would cause
  the ping to fail.
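  the lo counters fit that picture, assuming lo accounts at the IP
  level (a quick check, with sizes taken from the ping output above):

```python
IP_HDR    = 20          # IPv4 header, no options
ICMP_HDR  = 8           # ICMP echo header
PING_DATA = 56          # default ping payload -- the "56(84)" above

ip_packet = IP_HDR + ICMP_HDR + PING_DATA    # 84 bytes per packet
total = 2 * ip_packet                        # two packets each way
print(total)    # 168 -- matches the lo RX/TX byte counters
```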
  
interesting things to spend a second thought on:

  - why does the host->client ping take ~10 times longer?
  - why does lo show 2 packets received and 2 transmitted?
  - why does lo account a different size than tun0?
  - why does tun0 account a different size than eth0?
  
next part:  routing and netfilter (probably)

best,
Herbert


[1] if you do a detailed dump and have a close look at the 
    accounted network data, you will find that the client 
    receives more data than the host transmits (via tun0)

_______________________________________________
Vserver mailing list
[EMAIL PROTECTED]
http://list.linux-vserver.org/mailman/listinfo/vserver
