Hi!

I think I broke every (negative) record as far as response latency is
concerned! Really sorry about that.

Lots of fixes and urgent things got in the way and made me postpone it every
time.

Since you are working on related things right now (I just saw the IP
reservations made by Petr), I would like the design to be clear and not to
conflict with the existing implementation.

Let's resurrect this thread, shall we?

* Jose A. Lopes <[email protected]> [2014-05-13 09:50:23 +0200]:

.. 50+ days ago  :)

> On Apr 23 16:22, Dimitris Aragiorgis wrote:
> > This design doc describes how to extend the existing network
> > management and make it more flexible and able to deal with more
> > generic use cases. It proposes support for:
> > 
> >  - Networks with multiple subnets
> >  - Subnets with multiple IP pools
> >  - NICs with multiple IPs from various subnets of a single network
> > 
> > Signed-off-by: Dimitris Aragiorgis <[email protected]>
> > ---
> > 
> > Hello team,
> > 
> > After our discussions during GanetiCon 2013 and a recent discussion with
> > Jose, I'm sending the revised design document for networks, incorporating
> > all your comments.
> > 
> > Looking forward to your feedback,
> > dimara
> > 
> >  Makefile.am             |    1 +
> >  doc/design-draft.rst    |    1 +
> >  doc/design-network2.rst |  400 +++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 402 insertions(+)
> >  create mode 100644 doc/design-network2.rst
> > 
> > diff --git a/Makefile.am b/Makefile.am
> > index f2589e6..140608f 100644
> > --- a/Makefile.am
> > +++ b/Makefile.am
> > @@ -586,6 +586,7 @@ docinput = \
> >     doc/design-multi-reloc.rst \
> >     doc/design-multi-version-tests.rst \
> >     doc/design-network.rst \
> > +   doc/design-network2.rst \
> >     doc/design-node-add.rst \
> >     doc/design-oob.rst \
> >     doc/design-openvswitch.rst \
> > diff --git a/doc/design-draft.rst b/doc/design-draft.rst
> > index 55bed7c..926f35b 100644
> > --- a/doc/design-draft.rst
> > +++ b/doc/design-draft.rst
> > @@ -23,6 +23,7 @@ Design document drafts
> >     design-node-security.rst
> >     design-systemd.rst
> >     design-cpu-speed.rst
> > +   design-network2.rst
> >  
> >  .. vim: set textwidth=72 :
> >  .. Local Variables:
> > diff --git a/doc/design-network2.rst b/doc/design-network2.rst
> > new file mode 100644
> > index 0000000..84a44e8
> > --- /dev/null
> > +++ b/doc/design-network2.rst
> > @@ -0,0 +1,400 @@
> > +============================
> > +Network Management (revised)
> > +============================
> > +
> > +.. contents:: :depth: 4
> > +
> > +This is a design document detailing how to extend the existing network
> > +management and make it more flexible and able to deal with more generic
> > +use cases.
> > +
> > +
> > +Current state and shortcomings
> > +------------------------------
> > +
> > +Currently in Ganeti, networks are tightly connected with IP pools,
> > +since creation of a network implies the existence of one subnet
> > +and the corresponding IP pool. This design does not allow common
> > +scenarios like:
> > +
> > +- L2 only networks
> > +- IPv6 only networks
> > +- Networks without an IP pool
> > +- Networks with an IPv6 pool
> > +- Networks with multiple IP pools (alternative to externally reserving
> > +  IPs)
> > +
> > +Additionally one cannot have multiple IP pools inside one network.
> > +Finally, from the instance perspective, a NIC cannot get more than one
> > +IP (e.g. both an IPv4 and an IPv6 address).
> > +
> > +
> > +Proposed changes
> > +----------------
> > +
> > +In order to deal with the above shortcomings, we propose to extend
> > +the existing networks in Ganeti and support:
> > +
> > +a) Networks with multiple subnets
> > +b) Subnets with multiple IP pools
> > +c) NICs with multiple IPs from various subnets of a single network
> > +
> > +These changes bring up some design and implementation issues that we
> > +discuss in the following sections.
> > +
> > +Semantics
> > +++++++++++
> > +
> > +Quoting the initial network management design doc "an IP pool consists
> > +of two bitarrays. Specifically the ``reservations`` bitarray which holds
> > +all IP addresses reserved by Ganeti instances and the ``external
> > +reservations`` bitarray with all IPs that are excluded from the IP pool
> > +and cannot be assigned automatically by Ganeti to instances (via
> > +ip=pool)".
> > +
> > +Without violating those semantics, here, we clarify the following
> > +definitions.
> > +
> > +**network**: A cluster level taggable configuration object with a
> > +user-provided name (e.g. network1, network2), a UUID and a MAC prefix.
> > +
> > +**L2**: The `mode` and `link` with which we connect a network to a
> > +nodegroup. A NIC attached to a network will inherit this info, just like
> > +connecting an Ethernet cable to a physical NIC. In this sense we only
> > +have one L2 info per NIC.
> > +
> > +**L3**: A CIDR and a gateway related to the network. Since a NIC can
> > +have multiple IPs on the same cable, each network can have multiple L3
> > +entries, with the restriction that they do not overlap with each other.
> 
> Hi,
> 
> Great design document.  Great job!
> 

Thanks.

> I would like to ask a few things.
> 
> Is the gateway optional?
> 

Yes. The gateway will be optional just like it currently is. The use case is
private networks that do not have a default route.

> > +
> > +**subnet**: A subnet is the above L3 info plus some additional information
> > +(see below).
> > +
> > +**ip**: A valid IP should reside in a network's subnet, and should not
> > +be used by more than one instance. An IP can be either obtained dynamically
> > +from a pool or requested explicitly from a subnet (or a pool).
> > +
> > +**range**: Sequential IPs inside one subnet calculated either from the
> > +first IP and a size (e.g. start=192.0.2.10, size=10) or the first IP and
> > +the last IP (e.g. start=192.0.2.10, end=192.0.2.19). A single IP can
> > +also be thought of as an IP range with size=1 (see configuration
> > +changes).
> > +
> > +**reservations**: All IPs that are used by instances in the cluster at
> > +any time.
> > +
> > +**external reservations**: All IPs that the admin has reserved, either
> > +for some external component or for specific instances.
> > +If an instance requests an external IP explicitly (ip=192.0.2.100),
> > +Ganeti will allow the operation only if ``--force`` is given. Still, the
> > +admin can externally reserve an IP that is already in use by an
> > +instance, as happens now. This helps to reserve an IP for future use and
> > +at the same time prevent any possible race between the instance that
> > +releases this IP and another that tries to retrieve it.
> > +
> > +**pool**: A (range, reservations, name) tuple from which instances can
> > +dynamically obtain an IP. `reservations` is a bitarray whose length is
> > +the size of the range; it is needed so that we know which IPs
> > +are available at any time without querying all instances. The use of
> > +name is explained below. A subnet can have multiple pools.
> > +
> > +
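To make the range and pool definitions above concrete, here is a rough Python sketch using the stdlib `ipaddress` module. `IPRange` and its methods are hypothetical names chosen for illustration, not existing Ganeti code. The sketch only shows that (start, size) and (start, end) describe the same range, and that a single IP is a range of size 1.

```python
import ipaddress


class IPRange:
    """Hypothetical IP range built from (start, size) or (start, end)."""

    def __init__(self, start, size=None, end=None):
        self.start = ipaddress.ip_address(start)
        if (size is None) == (end is None):
            raise ValueError("pass exactly one of size or end")
        if end is not None:
            end = ipaddress.ip_address(end)
            if end.version != self.start.version or end < self.start:
                raise ValueError("end must follow start in the same IP version")
            size = int(end) - int(self.start) + 1
        if size < 1:
            raise ValueError("size must be at least 1")
        self.size = size

    @property
    def end(self):
        # Redundant with start and size, so it is derived on demand.
        return self.start + (self.size - 1)

    def __contains__(self, ip):
        ip = ipaddress.ip_address(ip)
        return ip.version == self.start.version and self.start <= ip <= self.end

    def as_triple(self):
        return (str(self.start), self.size, str(self.end))
```

For example, `IPRange("192.0.2.10", size=10)` and `IPRange("192.0.2.10", end="192.0.2.19")` both yield the triple `("192.0.2.10", 10, "192.0.2.19")`.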
> > +Split L2 from L3
> > +++++++++++++++++
> > +
> > +Currently networks in Ganeti do not separate L2 from L3. This means
> > +that one cannot use L2 only networks. The reason is that the CIDR
> > +(passed currently with the ``--network`` option) and the derived IP pool
> > +are mandatory. This design makes L3 info optional. This way we can have
> > +an L2 only network just by connecting a Ganeti network to a nodegroup
> > +with the desired `mode` and `link`. Then one could add one or more subnets
> > +to the existing network.
> > +
> > +
> > +Multiple Subnets per Network
> > +++++++++++++++++++++++++++++
> > +
> > +Currently the IPv4 CIDR is mandatory for a network. Also a network can
> > +have at most one IPv4 CIDR and one IPv6 CIDR. These restrictions will
> > +be lifted.
> > +
> > +This design doc introduces support for multiple subnets per network. The
> > +L3 info will be moved inside the subnet. A subnet will have a `name` and
> > +a `uuid` just like NIC and Disk config objects. Additionally it will
> > +contain the `dhcp` flag which is explained below, and the `pools` and
> > +`external` fields which are mentioned in the next section. Only the
> > +`cidr` will be mandatory.
> > +
> > +Any subnet related actions will be done via the new ``--subnet`` option.
> > +Its syntax will be similar to ``--net``.
> > +
> > +The network's subnets must not overlap with each other. Logic will
> > +validate any operations related to reserving/releasing of IPs and check
> > +whether a requested IP is included inside one of the network's subnets.
> > +Just like currently, the L3 info will be exported to NIC configuration
> > +hooks and scripts as environment variables. The example below adds
> > +subnets to a network:
> > +
> > +::
> > +
> > +  gnt-network modify --subnet add:cidr=10.0.0.0/24,gateway=10.0.0.1,dhcp=true net1
> > +  gnt-network modify --subnet add:cidr=2001::/64,gateway=2001::1,dhcp=true net1
> > +
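Here is a minimal sketch of the two subnet checks described above: a network's subnets must not overlap, and a requested IP must be located in the subnet that holds it. It assumes subnets are kept as stdlib `ipaddress` network objects. `add_subnet` and `find_subnet` are hypothetical helpers.

```python
import ipaddress


def add_subnet(subnets, cidr):
    """Hypothetical helper: append cidr unless it overlaps an existing subnet."""
    new = ipaddress.ip_network(cidr)  # strict parsing: host bits must be zero
    for existing in subnets:
        # Only networks of the same IP version can overlap.
        if new.version == existing.version and new.overlaps(existing):
            raise ValueError("%s overlaps %s" % (new, existing))
    subnets.append(new)
    return new


def find_subnet(subnets, ip):
    """Hypothetical helper: return the subnet containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for subnet in subnets:
        if addr in subnet:  # False across IP versions
            return subnet
    return None
```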
> > +To remove a subnet from a network one should use:
> > +
> > +::
> > +
> > +  gnt-network modify --subnet some-ident:remove network1
> > +
> > +where some-ident can be either a CIDR, a name or a UUID. Ganeti will
> > +allow this operation only if no instances use IPs from this subnet.
> > +
> > +Since DHCP is allowed only for a single CIDR on the same cable, the
> > +subnet must have a `dhcp` flag. Logic must not allow more than one
> > +subnet of the same version in the same network to have dhcp enabled. To
> > +modify a subnet's name or dhcp flag one could use:
> > +
> > +::
> > +
> > +  gnt-network modify --subnet some-ident:modify,dhcp=false,name=foo network1
> > +
> > +This would search for a registered subnet that matches the identifier,
> > +disable DHCP on it and change its name. If ``dhcp=true`` is passed,
> > +logic will first check if another subnet of the same version has dhcp
> > +enabled.
> 
> Could you please help me understand what 'subnet of the same version'
> means?  I am not familiar with this terminology.
> 

I mean IPv4 vs. IPv6. As far as I know, we cannot have DHCP enabled on
multiple subnets of the same version, so we should allow at most one IPv4
subnet and one IPv6 subnet with DHCP enabled. Maybe I should rephrase that to
be clearer.
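For example, the check could look roughly like the following. This is a hypothetical Python sketch: `Subnet` and `check_dhcp` are illustrative names. The check only validates the flag and does not start any DHCP service.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Subnet:
    # Hypothetical stand-in for the proposed Subnet config object.
    cidr: str
    dhcp: bool = False


def check_dhcp(subnets):
    """Raise if more than one subnet per IP version has dhcp enabled."""
    seen = set()
    for subnet in subnets:
        if not subnet.dhcp:
            continue
        version = ipaddress.ip_network(subnet.cidr).version  # 4 or 6
        if version in seen:
            raise ValueError("only one IPv%d subnet may have dhcp=true" % version)
        seen.add(version)
```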

> Also, is it the case that the 'dhcp' parameter is meant only for
> validation purposes?  In other words, is Ganeti enabling DHCP here?
> If this parameter is only used for validation purposes, we have to be
> careful not to mislead people into thinking that Ganeti is actually
> starting a DHCP service.
> 

This parameter is to be exported to ifup scripts and hooks. Ganeti will
just check the aforementioned constraint. And yes, we have to make it clear
that Ganeti has nothing to do with a DHCP service. I will add a line
here noting it.

> > +
> > +Changing the CIDR or the gateway of a subnet should also be supported.
> > +
> > +::
> > +
> > +  gnt-network modify --subnet some-ident:modify,cidr=192.0.0.0/22 net1
> > +  gnt-network modify --subnet some-ident:modify,cidr=192.0.2.32/28 net1
> > +  gnt-network modify --subnet some-ident:modify,gateway=192.0.2.40 net1
> > +
> > +Before expanding a subnet, logic should check for overlapping
> > +subnets. Shrinking the subnet should be allowed only if the ranges
> > +that are about to be trimmed are not included in either pool
> > +reservations or external ranges.
> > +
> > +
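Here is a rough sketch of the resize rule above. It assumes pools and external reservations are given as (first, last) address pairs. `can_resize` is a hypothetical helper. Both ranges and subnets are contiguous, so checking the two endpoints of each range is enough.

```python
import ipaddress


def can_resize(subnets, old_cidr, new_cidr, ranges):
    """Hypothetical helper: may old_cidr become new_cidr?

    subnets: all subnets of the network (ipaddress network objects).
    ranges: (first, last) pairs of the subnet's pools and external ranges.
    """
    old = ipaddress.ip_network(old_cidr)
    new = ipaddress.ip_network(new_cidr)
    # Expanding: the new CIDR must not overlap any *other* subnet.
    for other in subnets:
        if other != old and other.version == new.version and new.overlaps(other):
            return False
    # Shrinking: no pool or external range may be trimmed.
    for first, last in ranges:
        if (ipaddress.ip_address(first) not in new
                or ipaddress.ip_address(last) not in new):
            return False
    return True
```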
> > +Multiple IP pools per Subnet
> > +++++++++++++++++++++++++++++
> > +
> > +Currently IP pools are automatically created during network creation and
> > +include the whole subnet. Some IPs can be excluded from the pool by
> > +passing them explicitly with ``--add-reserved-ips`` option.
> > +
> > +Still, for IPv6 subnets or even large IPv4 ones this might be
> > +insufficient. It is impossible to have two bitarrays for a /64 prefix.
> > +Even for IPv4 networks, a /20 subnet currently requires two 4K-long
> > +bitarrays (8K bits in total), and the second one is practically useless
> > +since the external reservations are far fewer than the actual
> > +reservations.
> > +
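The sizes quoted above follow directly from the prefix length. Here is a quick check with the stdlib `ipaddress` module. The helper name is hypothetical.

```python
import ipaddress


def pool_bitarray_bits(cidr, arrays=2):
    """Hypothetical helper: total bits if `arrays` bitarrays span the subnet.

    Today a pool keeps two such bitarrays (reservations and external
    reservations), hence the default of 2.
    """
    return arrays * ipaddress.ip_network(cidr).num_addresses
```

For a /20 this gives 8192 bits. For a /64 it gives 2**65 bits, i.e. 2**62 bytes (4 EiB), which is why bitarrays spanning whole IPv6 subnets are not an option.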
> > +This design extracts IP pool management from the network logic, and pools
> > +will become optional. Currently the pool is created based on the
> > +network's CIDR. With multiple subnets per network, we should be able to
> > +create and add IP pools to a network (and eventually to the
> > +corresponding subnet). Each pool will have an optional user friendly
> > +`name` so that the end user can refer to it (see instance related
> > +operations).
> > +
> > +The user will be able to obtain an IP dynamically only if we have
> > +already defined a pool for a network's subnet. One would use ``ip=pool``
> > +for the first available IP of the first available pool, or
> > +``ip=some-pool-name`` for the first available IP of a specific pool.
> > +
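Here is a rough Python sketch of how `ip=pool` and `ip=some-pool-name` could be resolved. It makes three assumptions for illustration:

* The dict layout for subnets and pools is hypothetical, not the actual config format.
* A list of booleans stands in for the bitarray.
* External ranges are skipped, as the network module changes described later require.

```python
import ipaddress


def allocate(subnets, request):
    """Reserve and return the first free IP for request ("pool" or a pool name).

    Hypothetical layout: each subnet is a dict with "pools" (dicts with
    "name", "start" and "reservations", a list of bools) and "external"
    (a list of (first, last) pairs).
    """
    for subnet in subnets:
        external = [(ipaddress.ip_address(a), ipaddress.ip_address(b))
                    for a, b in subnet["external"]]
        for pool in subnet["pools"]:
            # ip=pool accepts any pool; otherwise only the named one.
            if request != "pool" and pool.get("name") != request:
                continue
            start = ipaddress.ip_address(pool["start"])
            for offset, used in enumerate(pool["reservations"]):
                ip = start + offset
                if used or any(a <= ip <= b for a, b in external):
                    continue  # taken by an instance or externally reserved
                pool["reservations"][offset] = True
                return str(ip)
    raise LookupError("no free IP for %r" % request)
```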
> > +Any pool related actions will be done via the new ``--pool`` option.
> > +
> > +In order to add a pool, a relevant subnet must already exist. Overlapping
> > +pools won't be allowed. For example:
> > +
> > +::
> > +
> > +  gnt-network modify --pool add:192.0.2.10-192.0.2.100,name=pool1 net1
> > +  gnt-network modify --pool add:10.0.0.7-10.0.0.20 net1
> > +  gnt-network modify --pool add:10.0.0.100 net1
> 
> This is very cool.  We can simplify things by making reserved IPs just
> another IP pool.  Some examples:
> 
>   gnt-network modify --pool add:192.0.2.10-192.0.2.100,name=pool1,reserved=true net1
>   gnt-network modify --pool add:10.0.0.7-10.0.0.20,reserved=true net1
>   gnt-network modify --pool add:10.0.0.100,reserved=true net1
> 
> This way we don't have to have 2 places internally to keep track of
> internal and external reservations.  We can just reuse the same
> concepts and the same code.  Naturally, for the case of reserved IP
> pools we would not construct the bitarray.
> 
> What do you think?  Do you see any problems with this?
> 


Well, I am not very fond of the `reserved` attribute in the `--pool`
option. I would prefer to keep the old interface, i.e.
--add|remove-reserved-ips, and enhance it with IP range support.
In other words, the user interface would be:

gnt-network modify --pool add:192.0.2.10-192.0.2.15,name=pool1 \
                   --pool add:10.0.0.8/29,name=pool2 \
                   --pool add:10.0.0.40-10.0.0.45,name=pool3 \
                   --add-reserved-ips 192.0.2.15,10.0.0.8-10.0.0.15,10.2.4.5 \
                   net1

This will create something like:

net1 {
  subnets [
    uuid1 {
        name: subnet1
        cidr: 192.0.2.0/24
        pools: [
          {range: Range(192.0.2.10, 192.0.2.15), reservations: 000000, name: pool1}
          ]
        reserved: [192.0.2.15]
        }
    uuid2 {
        name: subnet2
        cidr: 10.0.0.0/24
        pools: [
          {range: Range(10.0.0.8, 10.0.0.15), reservations: 00000000, name: pool2}
          {range: Range(10.0.0.40, 10.0.0.45), reservations: 000000, name: pool3}
          ]
        reserved: [Range(10.0.0.8, 10.0.0.15), 10.2.4.5]
        }
    ]
}

Range(start, end) will be some JSON representation of an IPRange().

This way I see the following advantages:

 1) Keep the existing semantics for pools and external reservations
 2) Each list has similar entries: one has pools, the other has ranges.
    The pool must have a bitarray, and has an optional name.
    It is meaningless to add a name and a bitarray (as you said) to
    external ranges.
 3) Each list must not have overlapping ranges. Still external
    reservations can overlap with pools.
 4) The --pool option supports add|remove|modify commands just like
    `--net` and `--disk` and operates on single entities (a restriction
    that is not needed for external reservations). Plus, the modify
    command is meaningless with reserved=true.
 5) Another thing, and probably the most important, is that in order to
    get the first available IP, only the reserved list must be checked
    for conflicts. The ipaddr.summarize_address_range(first, last) could
    be very helpful.
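For instance, Python's stdlib `ipaddress` module provides the same `summarize_address_range` function as `ipaddr`. With it, an external range can be turned into a short list of CIDR blocks that are cheap to test against. The two helper names below are hypothetical:

```python
import ipaddress


def range_to_cidrs(first, last):
    """Hypothetical helper: cover first..last (inclusive) with minimal CIDRs."""
    return list(ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)))


def is_externally_reserved(ip, reserved):
    """Hypothetical helper; reserved is a list of (first, last) pairs."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net
               for first, last in reserved
               for net in range_to_cidrs(first, last))
```

For example, 10.0.0.8-10.0.0.15 collapses to the single block 10.0.0.8/29, while 192.0.2.10-192.0.2.19 needs three blocks (/31, /30, /30).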


If everything was under pools, like you say, for any operation (ip
reservation, pool creation, pool removal, etc) we would have to parse
the whole list and add logic (several if's) to distinguish between
actual pools and external ranges.

Moreover --add|remove-reserved-ips should still exist for backwards
compatibility.

As far as the code reuse you mention, I have in mind an IpRange class
that will implement basic validation and check methods, while the
Pool class will extend it with bitarray arithmetic. Helper methods
will act on lists of either Pools or IpRanges. Additionally, we should
add logic to split ranges (and the corresponding pools).
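Something along these lines. This is a hypothetical Python sketch and the names are not final. A list of booleans stands in for bitarray to avoid the dependency.

```python
import ipaddress


class IpRange:
    """Hypothetical base class: validation and membership checks only."""

    def __init__(self, first, last):
        self.first = ipaddress.ip_address(first)
        self.last = ipaddress.ip_address(last)
        if self.first.version != self.last.version or self.first > self.last:
            raise ValueError("invalid range %s-%s" % (first, last))

    @property
    def size(self):
        return int(self.last) - int(self.first) + 1

    def __contains__(self, ip):
        ip = ipaddress.ip_address(ip)
        return ip.version == self.first.version and self.first <= ip <= self.last

    def overlaps(self, other):
        return (self.first.version == other.first.version
                and self.first <= other.last and other.first <= self.last)


class Pool(IpRange):
    """Adds a name and a reservations bitmap (a list of bools here)."""

    def __init__(self, first, last, name=None):
        super().__init__(first, last)
        self.name = name
        self.reservations = [False] * self.size

    def reserve(self, ip):
        if ip not in self:
            raise ValueError("%s is not in the pool" % ip)
        offset = int(ipaddress.ip_address(ip)) - int(self.first)
        if self.reservations[offset]:
            raise ValueError("%s is already reserved" % ip)
        self.reservations[offset] = True


def check_no_overlap(ranges, candidate):
    """Shared helper: works on lists of IpRange or Pool objects alike."""
    for existing in ranges:
        if existing.overlaps(candidate):
            raise ValueError("range overlaps an existing one")
```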

So, what do you think?

> > +
> > +will first parse and find the ranges. Then for each range, Ganeti will
> > +try to find a matching subnet, meaning that a pool must be a subrange of
> > +the subnet. If found, the range with empty reservations will be appended
> > +to the list of the subnet's pools. Moreover, logic must be added to
> > +reserve the IPs that are currently in use by instances of this network.
> > +
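The add-pool steps above can be sketched in Python under an assumed layout:

* Subnets are a `{cidr: [pool, ...]}` dict.
* Pools are dicts.
* A list of booleans replaces the bitarray.

`add_pool` is a hypothetical helper, not Ganeti code:

```python
import ipaddress


def add_pool(subnets, first, last, in_use, name=None):
    """Attach the pool first..last to the subnet that fully contains it.

    subnets: hypothetical {cidr: [pool, ...]} mapping, where a pool is a dict
    with "first", "last", "name" and "reservations" (a list of bools).
    in_use: IPs currently held by instances of this network.
    """
    lo, hi = ipaddress.ip_address(first), ipaddress.ip_address(last)
    if lo.version != hi.version or lo > hi:
        raise ValueError("invalid range %s-%s" % (first, last))
    for cidr, pools in subnets.items():
        net = ipaddress.ip_network(cidr)
        if lo not in net or hi not in net:
            continue  # a pool must be a subrange of the subnet
        for pool in pools:
            if lo <= pool["last"] and pool["first"] <= hi:
                raise ValueError("overlaps pool %s" % pool["name"])
        reservations = [False] * (int(hi) - int(lo) + 1)
        for ip in map(ipaddress.ip_address, in_use):
            # Mark IPs already used by instances as reserved.
            if ip.version == lo.version and lo <= ip <= hi:
                reservations[int(ip) - int(lo)] = True
        pools.append({"first": lo, "last": hi, "name": name,
                      "reservations": reservations})
        return cidr
    raise ValueError("no subnet contains %s-%s" % (first, last))
```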
> > +During pool removal, logic should be added to split pools if the given
> > +ranges overlap existing ones. For example:
> > +
> > +::
> > +
> > +  gnt-network modify --pool remove:192.0.2.20-192.0.2.50 net1
> > +
> > +will split the pool previously added (10-100) into two new ones:
> > +10-19 and 51-100. The corresponding bitarrays will be trimmed
> > +accordingly. The name will be preserved.
> > +
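Here is a minimal sketch of the split described above. It uses a hypothetical pool dict layout with a list of booleans in place of the bitarray. `remove_range` is an illustrative name:

```python
import ipaddress


def remove_range(pool, first, last):
    """Return what is left of `pool` after removing first..last.

    pool: hypothetical dict with "first", "last", "name", "reservations"
    (a list of bools). The result holds zero, one or two pools whose
    reservation lists are trimmed accordingly; the name is preserved.
    This sketch does not check whether removed IPs are still reserved.
    """
    p_lo = ipaddress.ip_address(pool["first"])
    p_hi = ipaddress.ip_address(pool["last"])
    r_lo, r_hi = ipaddress.ip_address(first), ipaddress.ip_address(last)
    if r_hi < p_lo or r_lo > p_hi:
        return [pool]  # no overlap, nothing to remove
    bits, pieces = pool["reservations"], []
    if r_lo > p_lo:  # keep the part before the removed range
        pieces.append({"first": str(p_lo), "last": str(r_lo - 1),
                       "name": pool["name"],
                       "reservations": bits[:int(r_lo) - int(p_lo)]})
    if r_hi < p_hi:  # keep the part after the removed range
        pieces.append({"first": str(r_hi + 1), "last": str(p_hi),
                       "name": pool["name"],
                       "reservations": bits[int(r_hi) + 1 - int(p_lo):]})
    return pieces
```

Removing 192.0.2.20-192.0.2.50 from a 192.0.2.10-192.0.2.100 pool leaves two pools. The first is 192.0.2.10-192.0.2.19 (10 bits) and the second is 192.0.2.51-192.0.2.100 (50 bits). Both keep the original name.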
> > +The same things apply to external reservations. Just like now,
> > +modifications will take place via the ``--add|remove-reserved-ips``
> > +option. Logic must be added to support IP ranges.
> > +
> > +Based on the aforementioned we propose the following changes:
> > +
> > +1) Change the IP pool representation in config data.
> > +
> > +  Existing `reservations` and `external_reservations` bitarrays will be
> > +  removed. Instead, for each subnet we will have:
> > +
> > +  * `pools`: List of (IP range, reservations bitarray) tuples.
> > +  * `external`: List of IP ranges
> > +
> > +  For external ranges the reservations bitarray is not needed
> > +  since this will be all 1's.
> > +
> > +2) Change the network module logic.
> > +
> > +  The above changes should be done in the network module and be transparent
> > +  to the rest of the Ganeti code. If a random IP from the network is
> > +  requested, Ganeti searches for an available IP in the first pool of
> > +  the first subnet. If that pool is full, it moves on to the next pool,
> > +  then to the next subnet and so on. Of course the `external` IP ranges
> > +  will be excluded. If an IP is explicitly requested, Ganeti will try to
> > +  find a matching subnet. Its pools and external ranges will be checked for
> > +  availability. All this logic will be extracted in a separate class
> > +  with helper methods for easier manipulation of IP ranges and
> > +  bitarrays.
> > +
> > +3) Changes in config module.
> > +
> > +  We should not have instances with the same IP inside the same network.
> > +  We introduce an _AllIPs() helper config method that will return all
> > +  existing (IP, network) tuples. Config logic will check this list as
> > +  well before passing the IP to TemporaryReservationManager.
> > +
> > +4) Change the query mechanism.
> > +
> > +  Since we have more than one subnet, the new `subnets` field will
> > +  include a list of:
> > +
> > +  * cidr: IPv4 or IPv6 CIDR
> > +  * gateway: IPv4 or IPv6 address
> > +  * dhcp: True or False
> > +  * name: The user friendly name for the subnet
> > +
> > +  Since we want to support small pools inside big subnets, current query
> > +  results are not practical as far as the `map` field is concerned. It
> > +  should be replaced with the new `pools` field for each subnet, which will
> > +  contain:
> > +
> > +  * start: The first IP of the pool
> > +  * end: The last IP of the pool
> > +  * map: A string with 'X' for reserved IPs (either external or not) and
> > +    with '.' for all available ones inside the pool
> > +
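For the new per-pool query entries, the `map` string could be rendered roughly as follows. This is a hypothetical helper. Representing reservations as a list of booleans and external ranges as (first, last) pairs are assumptions.

```python
import ipaddress


def pool_query_entry(first, reservations, external):
    """Build a hypothetical per-pool query entry with start, end and map.

    first: first IP of the pool; reservations: list of bools;
    external: list of (first, last) pairs of externally reserved IPs.
    """
    start = ipaddress.ip_address(first)
    ext = [(ipaddress.ip_address(a), ipaddress.ip_address(b)) for a, b in external]
    chars = []
    for offset, used in enumerate(reservations):
        ip = start + offset
        # 'X' for any reservation (instance or external), '.' otherwise.
        chars.append("X" if used or any(a <= ip <= b for a, b in ext) else ".")
    return {"start": str(start),
            "end": str(start + len(reservations) - 1),
            "map": "".join(chars)}
```

Take a 192.0.2.10-192.0.2.15 pool with 192.0.2.11 in use and 192.0.2.15 externally reserved. Its map is `.X...X`.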
> > +
> > +
> > +Multiple IPs per NIC
> > +++++++++++++++++++++
> > +
> > +Currently IP is a simple string inside the NIC object and there is a
> > +one-to-one mapping between the `ip` and the `network` slots. The whole
> > +logic behind this is that a NIC belongs to a network (cable) and
> > +inherits its mode and link. This rationale will not change.
> > +
> > +Since this design adds support for multiple subnets per network, a NIC
> > +must be able to obtain multiple IPs from various subnets of the same
> > +network. Thus we change the `ip` slot into a list.
> > +
> > +During instance related operations, one would use something like:
> > +
> > +::
> > +
> > +  gnt-instance add --net 0:ip=192.0.2.4,ip=pool,ip=some-pool-name,network=network1 inst1
> > +
> > +
> > +This will be parsed, converted to a proper list (e.g. ip = ["192.0.2.4",
> > +"pool", "some-pool-name"]) and finally passed to the corresponding opcode.
> > +Here the first IP will match the subnet that contains 192.0.2.4, the
> > +second IP will be retrieved from the first available pool of the first
> > +available subnet, and the third from the pool named some-pool-name.
> > +
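Here is a sketch of the parsing step, which turns repeated `ip=` keys into a list. `parse_nic_option` is hypothetical and much simpler than Ganeti's real option parsing:

```python
def parse_nic_option(value):
    """Parse "N:key=val,..." into (N, params), collecting repeated ip= keys.

    Hypothetical parser for illustration only.
    """
    index, _, rest = value.partition(":")
    params = {"ip": []}
    for item in rest.split(","):
        key, sep, val = item.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % item)
        if key == "ip":
            params["ip"].append(val)  # keep order: first IP, second IP, ...
        elif key in params:
            raise ValueError("duplicate key %r" % key)
        else:
            params[key] = val
    return int(index), params
```

Since `partition(":")` splits only at the first colon, IPv6 values such as `ip=2001::5` pass through intact.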
> > +During instance modification, the `ip` option will refer to the first IP
> > +of the NIC, whereas `ipX` will refer to the X-th IP.
> > +
> > +
> > +Configuration changes
> > +---------------------
> > +
> > +IPRange config object:
> > +  Introduce new config object that will hold ranges needed by pools, and
> > +  reservations. It will be either a tuple of (start, size, end) or a
> > +  simple string. The `end` is redundant and can be derived from start and
> > +  size at runtime, but will appear in the representation for readability
> > +  and debug reasons.
> 
> This is good.  Internally we keep only (start, end) or (start, size) but
> when we print, we show the actual triple (start, size, end).  Cool!
> 

Good. So as soon as we agree on everything, I will send an interdiff (or maybe
the whole design doc) with all the changes discussed in the thread so that we
can merge the design doc into master, right?

Again sorry for the late response.

Cheers,
dimara

> Cheers,
> Jose
> 
> > +
> > +Pool config object:
> > +  Introduce new config object to represent a single subnet's pool. It
> > +  will have the `range`, `reservations`, `name` slots. The range slot
> > +  will be an IPRange config object, the reservations a bitarray and the
> > +  name a simple string.
> > +
> > +Subnet config object:
> > +  Introduce new config object with slots: `name`, `uuid`, `cidr`,
> > +  `gateway`, `dhcp`, `pools`, `external`. Pools is a list of Pool config
> > +  objects. External is a list of IPRange config objects. All ranges must
> > +  reside inside the subnet's CIDR. Only `cidr` will be mandatory. The
> > +  `dhcp` attribute will be False by default.
> > +
> > +Network config objects:
> > +  The L3 and the IP pool representation will change. Specifically all
> > +  slots besides `name`, `mac_prefix`, and `tag` will be removed. Instead
> > +  the slot `subnets` with a list of Subnet config objects will be added.
> > +
> > +NIC config objects:
> > +  NIC's network slot will be removed and the `ip` slot will be modified
> > +  to a list of strings.
> > +
> > +KVM runtime files:
> > +  Any change done in config data must be done also in KVM runtime files.
> > +  For this purpose the existing _UpgradeSerializedRuntime() can be used.
> > +
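Below is a hypothetical Python mirror of the proposed Subnet object. It only illustrates the stated rules: `cidr` is mandatory, `dhcp` defaults to False, and every pool and external range must lie inside the CIDR. Pools and external ranges are simplified to (first, last) pairs here rather than Pool and IPRange objects.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Subnet:
    """Hypothetical mirror of the proposed Subnet config object."""
    cidr: str                      # the only mandatory slot
    name: Optional[str] = None
    uuid: Optional[str] = None
    gateway: Optional[str] = None
    dhcp: bool = False
    pools: List[Tuple[str, str]] = field(default_factory=list)     # (first, last)
    external: List[Tuple[str, str]] = field(default_factory=list)  # (first, last)

    def validate(self):
        net = ipaddress.ip_network(self.cidr)
        # All ranges must reside inside the subnet's CIDR.
        for first, last in self.pools + self.external:
            if (ipaddress.ip_address(first) not in net
                    or ipaddress.ip_address(last) not in net):
                raise ValueError("range %s-%s outside %s" % (first, last, net))
```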
> > +
> > +Exported variables
> > +------------------
> > +
> > +The variables exported during instance related operations will follow
> > +the naming Linux uses for interface aliases (e.g. eth0:1). Specifically:
> > +
> > +``IP:i`` for the ith IP.
> > +
> > +``NETWORK_*:i`` for the ith subnet, where ``*`` is one of SUBNET,
> > +GATEWAY, DHCP.
> > +
> > +In case of hooks those variables will be prefixed with ``INSTANCE_NICn``
> > +for the nth NIC.
> > +
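Here is a rough sketch of building these variables for one NIC. `nic_env` is a hypothetical helper. It relies on two assumptions that are not stated above:

* Indices start at 0.
* The hook prefix is joined with an underscore (e.g. `INSTANCE_NIC0_IP:0`).

```python
def nic_env(ips, subnets, hook_nic=None):
    """Build IP:i and NETWORK_*:i variables for one NIC (hypothetical helper).

    Assumptions: i starts at 0; for hooks the INSTANCE_NICn prefix is
    joined with an underscore; DHCP is exported as "True"/"False".
    """
    env = {}
    for i, ip in enumerate(ips):
        env["IP:%d" % i] = ip
    for i, subnet in enumerate(subnets):
        env["NETWORK_SUBNET:%d" % i] = subnet["cidr"]
        if subnet.get("gateway"):  # the gateway is optional
            env["NETWORK_GATEWAY:%d" % i] = subnet["gateway"]
        env["NETWORK_DHCP:%d" % i] = str(bool(subnet.get("dhcp")))
    if hook_nic is not None:
        prefix = "INSTANCE_NIC%d_" % hook_nic
        env = {prefix + name: val for name, val in env.items()}
    return env
```

A name such as `IP:0` is a valid environment variable but not a valid shell identifier. A shell ifup script would therefore have to read it with something like `printenv 'IP:0'` rather than `$IP:0`.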
> > +
> > +Backwards Compatibility
> > +-----------------------
> > +
> > +Existing networks will be converted internally: each one will get one
> > +subnet and one pool spanning the whole subnet.
> > +
> > +During `gnt-network add`, passing the deprecated ``--network`` option
> > +will still create a network with one subnet and one IP pool with the
> > +size of the subnet. Otherwise the ``--subnet`` and ``--pool`` options
> > +will be needed.
> > +
> > +The query mechanism will also include the deprecated `map` field. For the
> > +newly created networks this will contain only the mapping of the first
> > +pool. The deprecated `network`, `gateway`, `network6`, `gateway6` fields
> > +will point to the first IPv4 and IPv6 subnet respectively.
> > +
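Here is a sketch of how the deprecated per-network fields could be derived from the subnet list. The helper is hypothetical, and keeping subnets as dicts in config order is an assumption:

```python
import ipaddress


def legacy_fields(subnets):
    """Hypothetical helper for the deprecated network/gateway[6] fields.

    subnets: dicts with "cidr" and an optional "gateway", in config order.
    Each legacy pair points to the first IPv4 / first IPv6 subnet.
    """
    fields = {"network": None, "gateway": None, "network6": None, "gateway6": None}
    for subnet in subnets:
        version = ipaddress.ip_network(subnet["cidr"]).version
        net_key, gw_key = (("network", "gateway") if version == 4
                           else ("network6", "gateway6"))
        if fields[net_key] is None:  # only the first subnet of each version
            fields[net_key] = subnet["cidr"]
            fields[gw_key] = subnet.get("gateway")
    return fields
```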
> > +During instance related operations the `ip` argument of the ``--net``
> > +option will refer to the first IP of the NIC.
> > +
> > +Hooks and scripts will still have the same environment exported in case
> > +of single IP per NIC.
> > +
> > +
> > +.. vim: set textwidth=72 :
> > +.. Local Variables:
> > +.. mode: rst
> > +.. fill-column: 72
> > +.. End:
> > -- 
> > 1.7.10.4
> 
> 
> 
> -- 
> Jose Antonio Lopes
> Ganeti Engineering
> Google Germany GmbH
> Dienerstr. 12, 80331, München
> 
> Registergericht und -nummer: Hamburg, HRB 86891
> Sitz der Gesellschaft: Hamburg
> Geschäftsführer: Graham Law, Christine Elizabeth Flores
> Steuernummer: 48/725/00206
> Umsatzsteueridentifikationsnummer: DE813741370
