Re: [PATCH] [RFC] Linux network namespace support for haproxy [v2]

KOVACS Krisztian Tue, 18 Nov 2014 03:21:14 -0800

Dear Willy,

First of all, thanks for the review and your comments. Please see my reply
below, inlined with the text.


On Tue, Nov 18, 2014 at 11:37 AM, Willy Tarreau <[email protected]> wrote:

> Hi Krisztian,
>
> On Mon, Nov 17, 2014 at 06:08:22PM +0100, KOVACS Krisztian wrote:
> > Dear Willy & list,
> >
> > We've finally had some time to spend on the namespace support previously
> > discussed here on the mailing list.
>
> Great, thanks for this!
>
> > Please find attached the new version of the patch implementing namespace
> > support and hopefully fixing some of your concerns. Changes compared to
> the
> > previous version of the patch:
> >
> > * namespaces have to be explicitly listed on the configuration file, and
> > the file descriptors are opened when loading the configuration file and
> > cached
>
> This raises an interesting question for the list participants here : do
> we want to have an explicit list to limit access to the namespaces or do
> we want the namespaces list to be automatically set up based on use ?
>
> I'm seeing pros and cons for each solution :
>
>    - pros of explicit list : allows an admin to limit access to certain
>      namespaces only, which is convenient in contexts where multiple
>      configs are assembled together from various untrustedA users.
>
>    - cons of explicit list : requires manual addition to the list for
>      each new namespace, which can be really unconvenient when dealing
>      with APIs.
>
> I tend to prefer an automatic list (ie: namespaces are automatically
> registered as soon as they're referenced without having a separate
> section for this), at least because it matches better what we do for
> everything else (eg: we don't declare IP addresses nor ports in a
> separate section). And it would avoid a double declaration. But there
> could be very valid reasons for that list, I don't know.
>

Please see my reply at the end of the mail.


> > * the previous syntax for specifying that the namespace of the server
> > connection should be taken from the namespace of the client connection
> has
> > been changed from 'namespace fromproxy' to 'namespace *'
>
> I think it's a good idea.
>
> > * we've added some documentation describing the flow of network namespace
> > information
>
> That certainly helps :-)
>
> I'm having a few comments inlined about the patch below :
>
> diff --git a/include/common/namespace.h b/include/common/namespace.h
> new file mode 100644
> index 0000000..4457d6c
> --- /dev/null
> +++ b/include/common/namespace.h
> @@ -0,0 +1,24 @@
> +#ifndef NAMESPACE_H
> +
> +#include <stdlib.h>
> +#include <eb32tree.h>
> +
> +struct netns_entry;
> +int socketat(const struct netns_entry *ns, int domain, int type, int
> protocol);
> +
> +#ifdef CONFIG_HAP_NS
> +
> +struct netns_entry
> +{
> +       struct eb32_node node;
> +       const char *name;
> +       size_t name_len;
> +       int fd;
> +};
>
> This one above should use an ebpt_node instead which carries the pointer
> to the string, like this :
>
> struct netns_entry
> {
>         struct ebpt_node node;
>         int fd;
> };
>
> (more info on that below).
>

Thanks. I have to admit that we're not really familiar with the various
ebtree implementations in the code...


>
> diff --git a/src/connection.c b/src/connection.c
> index 3af6d9a..373fb2b 100644
> --- a/src/connection.c
> +++ b/src/connection.c
> @@ -14,6 +14,7 @@
>
>  #include <common/compat.h>
>  #include <common/config.h>
> +#include <common/namespace.h>
>
>  #include <proto/connection.h>
>  #include <proto/fd.h>
> @@ -217,6 +218,14 @@ void conn_update_sock_polling(struct connection *c)
>         c->flags = f;
>  }
>
> +/*
> + * Get data length from tlv
> + */
> +static size_t get_tlv_length(const struct tlv *src)
> +{
> +       return (src->length_hi << 8) | src->length_lo;
> +}
> +
>  /* This handshake handler waits a PROXY protocol header at the beginning
> of the
>   * raw data stream. The header looks like this :
>   *
> @@ -245,6 +254,7 @@ int conn_recv_proxy(struct connection *conn, int flag)
>         char *line, *end;
>         struct proxy_hdr_v2 *hdr_v2;
>         const char v2sig[] = PP2_SIGNATURE;
> +       size_t tlv_length = 0;
>
>         /* we might have been called just after an asynchronous shutr */
>         if (conn->flags & CO_FL_SOCK_RD_SH)
> @@ -431,6 +441,7 @@ int conn_recv_proxy(struct connection *conn, int flag)
>                         ((struct sockaddr_in 
> *)&conn->addr.to)->sin_addr.s_addr
> = hdr_v2->addr.ip4.dst_addr;
>                         ((struct sockaddr_in *)&conn->addr.to)->sin_port
> = hdr_v2->addr.ip4.dst_port;
>                         conn->flags |= CO_FL_ADDR_FROM_SET |
> CO_FL_ADDR_TO_SET;
> +                       tlv_length = ntohs(hdr_v2->len) -
> PP2_ADDR_LEN_INET;
>                         break;
>                 case 0x21:  /* TCPv6 */
>                         ((struct sockaddr_in6
> *)&conn->addr.from)->sin6_family = AF_INET6;
> @@ -440,8 +451,34 @@ int conn_recv_proxy(struct connection *conn, int flag)
>                         memcpy(&((struct sockaddr_in6 
> *)&conn->addr.to)->sin6_addr,
> hdr_v2->addr.ip6.dst_addr, 16);
>                         ((struct sockaddr_in6 *)&conn->addr.to)->sin6_port
> = hdr_v2->addr.ip6.dst_port;
>                         conn->flags |= CO_FL_ADDR_FROM_SET |
> CO_FL_ADDR_TO_SET;
> +                       tlv_length = ntohs(hdr_v2->len) -
> PP2_ADDR_LEN_INET6;
>                         break;
>                 }
> +
> +               /* TLV parsing */
> +               if (tlv_length) {
> +                       size_t tlv_offset = trash.len - tlv_length;
> +                       while (tlv_offset < trash.len) {
> +                               const struct tlv *tlv_packet = (struct tlv
> *) &trash.str[tlv_offset];
> +                               const size_t tlv_len =
> get_tlv_length(tlv_packet);
> +                               tlv_offset += tlv_len + TLV_HEADER_SIZE;
> +
> +                               switch (tlv_packet->type) {
> +#ifdef CONFIG_HAP_NS
> +                               case PP2_TYPE_NETNS: {
> +                                       const struct netns_entry *ns;
> +                                       ns =
> netns_store_lookup((char*)tlv_packet->value, tlv_len);
> +                                       if (ns)
> +                                               conn->proxy_netns = ns;
> +                                       break;
> +                               }
> +#endif
>
> I think some bounds checkings are missing above : nothing prevents tlv_len
> from being larger than trash.len - tlv_offset, which means that a large
> PPv2 header with an NS TLV entry close to the end could cause
> netns_store_lookup() to read beyond the trash buffer's end. At least that's
> how I'm reading it, I could be wrong.
>

Indeed, that seems to be the case.

However, there's one more thing I've found:
  * trash.len >= PP2_HEADER_LEN + hdr_v2->len, or else we'd jump to the
missing label
  * here we have an interesting thing: the code assumes that if this is a
proxy command, then hdr_v2->len >= PP2_ADDR_LEN_INET. I don't quite see
where this is checked, so I'd think we can read past the end of the buffer
while parsing the address.


> diff --git a/src/namespace.c b/src/namespace.c
> new file mode 100644
> index 0000000..5983aca
> --- /dev/null
> +++ b/src/namespace.c
> @@ -0,0 +1,109 @@
> +#define _GNU_SOURCE
> +
> +#include <common/namespace.h>
> +#include <common/compiler.h>
> +#include <common/hash.h>
> +
> +#include <sched.h>
> +#include <stdio.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <sys/types.h>
> +#include <unistd.h>
> +#include <sys/socket.h>
> +
> +#include <string.h>
> +#ifdef CONFIG_HAP_NS
> +
> +/* Opens the namespace <ns_name> and returns the FD or -1 in case of error
> + * (check errno).
> + */
> +static int open_named_namespace(const char *ns_name)
> +{
> +       char buf[512];
> +       int fd;
> +
> +       snprintf(buf, sizeof(buf), "/var/run/netns/%s", ns_name);
> +       fd = open(buf, O_RDONLY);
> +
> +       return fd;
> +}
>
> Here you need to test snprintf's return value otherwise we can open any
> crap
> that remains in the stack if ns_name is too large. So both < 0 and >buf
> need
> to return -1. BTW you don't even need to declare a local buffer for this,
> you
> can use the trash buffer which is made exactly for that purpose, and use
> chunk_printf() with it, it will synthesize the checks and return -1 in case
> of error. That would give something like this :
>
> static int open_named_namespace(const char *ns_name)
> {
>         if (chunk_printf(&trash, "/var/run/netns/%s", ns_name) < 0)
>                 return -1;
>         return open(trash.str, O_RDONLY);
> }
>
>
Agreed.


> +/* Opens the current namespace and returns the FD or -1 in case of error
> + * (check errno). Caches its output so that once called it always returns
> + * the same fd.
> + *
> + * This is supposed to be called before calling any setns() syscalls so
> + * that it can open the file descriptor for the default namespace.
> + */
> +static int get_default_namespace()
> +{
> +       static int fd = -1;
> +
> +       if (fd == -1) {
>
> Please use "unlikely()" here since it will happen only at open time. We
> could do even better, have this done in the constructor so that we never
> have to call a function of this and directly reference the file descriptor.
> That would also have the benefit of being thread-safe (thinking about the
> future :-)).
>
> +               char buf[512];
> +               snprintf(buf, sizeof(buf), "/proc/%d/ns/net", getpid());
> +               fd = open(buf, O_RDONLY);
> +       }
> +
> +       return fd;
> +}
>
> Same comment as above, though in this case, there's no risk of overflow.
>
> +static struct eb_root namespace_tree_root = EB_ROOT;
> +
> +struct netns_entry* netns_store_insert(const char *ns_name, size_t
> ns_name_len)
> +{
> +       struct netns_entry *entry = NULL;
> +       int fd = open_named_namespace(ns_name);
> +       if (fd == -1)
> +               goto out;
> +
> +       entry = (struct netns_entry *)calloc(1, sizeof(struct
> netns_entry));
>
> Maybe return NULL here if !entry ?
>
> +       entry->fd = fd;
> +       entry->name = strdup(ns_name);
> +       entry->name_len = ns_name_len;
> +       entry->node.key = hash_sdbm(entry->name, entry->name_len);
> +
> +       eb32_insert(&namespace_tree_root, &entry->node);
> +out:
> +       return entry;
> +}
>
> Please do not hash for inserting into the tree, it creates collisions and
> needlessly complexifies the operations, better use an indirect string tree
> which works with the ebpt_node instead, and get rid of the name_len which
> is not used anymore :
>
>         entry = (struct netns_entry *)calloc(1, sizeof(struct
> netns_entry));
>         if (!entry)
>                 return entry;
>         entry->fd = fd;
>         entry->node.key = strdup(ns_name);
>         ebis_insert(&namespace_tree_root, &entry->node);
>
> +const struct netns_entry* netns_store_lookup(const char *ns_name, size_t
> ns_name_len)
> +{
> +       unsigned int hash;
> +       struct eb32_node *node;
> +
> +       hash = hash_sdbm(ns_name, ns_name_len);
> +       node = eb32_lookup(&namespace_tree_root, hash);
> +       if (node)
> +               return eb32_entry(node, struct netns_entry, node);
> +       else
> +               return NULL;
> +}
> +#endif
>
> So here with the code simply becomes :
>
> const struct netns_entry* netns_store_lookup(const char *ns_name)
> {
>         struct ebpt_node *node;
>
>         node = ebis_lookup(&namespace_tree_root, ns_name);
>         if (node)
>                 return ebpt_entry(node, struct netns_entry, node);
>         else
>                 return NULL;
> }
>
>
Thanks, this indeed seems much better.


>
> +/* Opens a socket in the namespace described by <ns> with the parameters
> <domain>,
> + * <type> and <protocol> and returns the FD or -1 in case of error (check
> errno).
> + */
> +int socketat(const struct netns_entry *ns, int domain, int type, int
> protocol)
>
> Just thinking about something, in Linux, many syscalls already exist with
> the "at" suffix to indicate a variant working based on a file descriptor
> pointing to a directory. While I'm not seeing any risk that "socketat"
> would one day exist, I think we still have time to invent a less
> conflicting
> name, what do you think ? Maybe something like ns_socket() or any other
> idea ?
>
> +{
> +       int sock;
> +       int current_namespace = get_default_namespace();
> +
> +#ifdef CONFIG_HAP_NS
> +       if (ns && setns(ns->fd, CLONE_NEWNET) == -1)
> +               return -1;
> +#endif
> +       sock = socket(domain, type, protocol);
> +
> +#ifdef CONFIG_HAP_NS
> +       if (ns && setns(current_namespace, CLONE_NEWNET) == -1) {
> +               close(sock);
> +               return -1;
> +       }
> +#endif
>
> Above I suspect that if you build without CONFIG_HAP_NS you'll get
> a warning saying that current_namespace is not used. If you change
> get_default_namespace() to just "static int default_namespace" which
> is initialized by a constructor, this will simplify this part as well.
>

Indeed. We've tried to avoid these warnings elsewhere, so there's certainly
point in keeping it clean here as well.


>
> diff --git a/src/proto_tcp.c b/src/proto_tcp.c
> index cfa62f7..128b5f7 100644
> --- a/src/proto_tcp.c
> +++ b/src/proto_tcp.c
> @@ -33,6 +33,7 @@
>  #include <common/errors.h>
>  #include <common/mini-clist.h>
>  #include <common/standard.h>
> +#include <common/namespace.h>
>
>  #include <types/global.h>
>  #include <types/capture.h>
> @@ -247,6 +248,15 @@ int tcp_bind_socket(int fd, int flags, struct
> sockaddr_storage *local, struct so
>         return 0;
>  }
>
> +static int create_server_socket(struct connection *conn)
>
> This name should probably mention that it's a TCP socket ?
>
> +{
> +       const struct netns_entry *ns = objt_server(conn->target)->netns;
> +
> +       if (objt_server(conn->target)->flags & SRV_F_USE_NS_FROM_PP)
> +               ns = conn->proxy_netns;
> +
> +       return socketat(ns, conn->addr.to.ss_family, SOCK_STREAM,
> IPPROTO_TCP);
> +}
>
>  /*
>   * This function initiates a TCP connection establishment to the target
> assigned
> @@ -2007,6 +2023,30 @@ static int bind_parse_interface(char **args, int
> cur_arg, struct proxy *px, stru
>  }
>  #endif
>
>
> OK overall that's really very very good. After reading all the code, I'm
> still thinking that we should have the namespaces automatically declared.
> It would basically work like this : "bind" and "server" lines referencing
> a namespace would call netns_find_or_init(nsname). This function would
> lookup the requested name, and if not found, try to initialize it exactly
> like it's currently done in the namespace section. Upon error, the error
> would simply be returned in the context of the bind and server lines which
> reference the namespace.
>
> I'm just seeing one point, maybe you created them for the proxy protocol
> only ? Indeed, if you receive a connection mentionning a namespace that
> was not initialized and you want to connect to the server using that
> namespace, it would not exist. Thus maybe we could make the namespaces
> section optional to serve only that purpose. That means that most users
> who use explicit namespaces everywhere will never need it, and those who
> want to support extra namespaces will need to declare them in this section.
>
> What's your opinion ?
>

Indeed, the point of the explicit namespace list was that it seemed the
only way how you can control which namespaces can be used via the proxy
protocol. Your suggestion above makes sense and still makes it possible to
use only pre-declared namespaces with the proxy protocol while keeping the
configuration much simpler for other use cases.


>
> Are you interested that I give you some help on this last-mile polishing
> regarding the review above so that you don't waste your time looking for
> some specific points ?


Sure, and thanks for your help in advance.

-- 
KOVACS Krisztian

Re: [PATCH] [RFC] Linux network namespace support for haproxy [v2]

Reply via email to