Re: [ovs-dev] [PATCH v7 3/7] ovn: Introduce "chassisredirect" port binding

2017-01-08 Thread Mickey Spiegel
On Fri, Jan 6, 2017 at 8:31 PM, Mickey Spiegel 
wrote:

>
> On Fri, Jan 6, 2017 at 4:21 PM, Mickey Spiegel 
> wrote:
>
>>
>> On Fri, Jan 6, 2017 at 4:11 PM, Ben Pfaff  wrote:
>>
>>> On Fri, Jan 06, 2017 at 03:47:03PM -0800, Mickey Spiegel wrote:
>>> > On Fri, Jan 6, 2017 at 3:20 PM, Ben Pfaff  wrote:
>>> >
>>> > > On Fri, Jan 06, 2017 at 12:00:30PM -0800, Mickey Spiegel wrote:
>>> > > > Currently OVN handles all logical router ports in a distributed
>>> manner,
>>> > > > creating instances on each chassis.  The logical router ingress and
>>> > > > egress pipelines are traversed locally on the source chassis.
>>> > > >
>>> > > > In order to support advanced features such as one-to-many NAT (aka
>>> IP
>>> > > > masquerading), where multiple private IP addresses spread across
>>> > > > multiple chassis are mapped to one public IP address, it will be
>>> > > > necessary to handle some of the logical router processing on a
>>> specific
>>> > > > chassis in a centralized manner.
>>> > > >
>>> > > > The goal of this patch is to develop abstractions that allow for a
>>> > > > subset of router gateway traffic to be handled in a centralized
>>> manner
>>> > > > (e.g. one-to-many NAT traffic), while allowing for other subsets of
>>> > > > router gateway traffic to be handled in a distributed manner (e.g.
>>> > > > floating IP traffic).
>>> > > >
>>> > > > This patch introduces a new type of SB port_binding called
>>> > > > "chassisredirect".  A "chassisredirect" port represents a
>>> particular
>>> > > > instance, bound to a specific chassis, of an otherwise distributed
>>> > > > port.  The ovn-controller on that chassis populates the "chassis"
>>> > > > column for this record as an indication for other ovn-controllers
>>> of
>>> > > > its physical location.  Other ovn-controllers do not treat this
>>> port
>>> > > > as a local port.
>>> > > >
>>> > > > A "chassisredirect" port should never be used as an "inport".
>>> When an
>>> > > > ingress pipeline sets the "outport", it may set the value to a
>>> logical
>>> > > > port of type "chassisredirect".  This will cause the packet to be
>>> > > > directed to a specific chassis to carry out the egress logical
>>> router
>>> > > > pipeline, in the same way that a logical switch forwards egress
>>> traffic
>>> > > > to a VIF port residing on a specific chassis.  At the beginning of
>>> the
>>> > > > egress pipeline, the "outport" will be reset to the value of the
>>> > > > distributed port.
>>> > > >
>>> > > > For outbound traffic to be handled in a centralized manner, the
>>> > > > "outport" should be set to the "chassisredirect" port representing
>>> > > > centralized gateway functionality in the otherwise distributed
>>> router.
>>> > > > For outbound traffic to be handled in a distributed manner,
>>> locally on
>>> > > > the source chassis, the "outport" should be set to the existing
>>> "patch"
>>> > > > port representing distributed gateway functionality.
>>> > > >
>>> > > > Inbound traffic will be directed to the appropriate chassis by
>>> > > > restricting source MAC address usage and ARP responses to that
>>> chassis,
>>> > > > or by running dynamic routing protocols.
>>> > > >
>>> > > > Note that "chassisredirect" ports have no associated IP or MAC
>>> addresses.
>>> > > > Any pipeline stages that depend on port specific IP or MAC
>>> addresses
>>> > > > should be carried out in the context of the distributed port.
>>> > > >
>>> > > > Although the abstraction represented by the "chassisredirect" port
>>> > > > binding is generalized, in this patch the "chassisredirect" port
>>> binding
>>> > > > is only created for NB logical router ports that specify the new
>>> > > > "redirect-chassis" option.  There is no explicit notion of a
>>> > > > "chassisredirect" port in the NB database.  The expectation is when
>>> > > > capabilities are implemented that take advantage of
>>> "chassisredirect"
>>> > > > ports (e.g. NAT), the addition of flows specifying a
>>> "chassisredirect"
>>> > > > port as the outport will also be triggered by the presence of the
>>> > > > "redirect-chassis" option.  Such flows are added for NB logical
>>> router
>>> > > > ports that specify the "redirect-chassis" option.
>>> > > >
>>> > > > Signed-off-by: Mickey Spiegel 
>>> > >
>>> > > chassisredirect ports seem incredibly similar to vif ports.  Is the
>>> only
>>> > > difference that the output port is changed at the beginning of the
>>> > > egress pipeline?  That's something that could be implemented in the
>>> > > logical egress pipeline with 'outport = "...";'.  We do say that the
>>> > > outport isn't supposed to be modified in an egress pipeline, but
>>> nothing
>>> > > enforces that and if it's actually useful then we could just change
>>> the
>>> > > documentation.
>>> > >
>>> >
>>> > I don't get the similarity to vif ports.
>>> >
>>> > I need to create two different ports for each logical 

Re: [ovs-dev] [PATCH 2/4] datapath: Limits the number of tx/rx queues for netdev-dummy.

2017-01-08 Thread nickcooper-zhangtonghao
Thanks Daniele,
Yes, it’s a small improvement, but it is necessary for us. I will check it in 
set_config(). One question: should we also check the tx/rx queue counts for 
netdev-dpdk in set_config()?

Now we check it in dpdk_eth_dev_init().

Thanks.



> On Jan 9, 2017, at 11:22 AM, Daniele Di Proietto  wrote:
> 
> 2017-01-08 17:30 GMT-08:00 nickcooper-zhangtonghao  >:
>> This patch prevents ovs_rcu from reporting a WARN, caused by
>> ovs-vswitchd being blocked for a long time when it processes a port
>> with many rx/tx queues.  The chosen limit on tx/rx queues per port
>> should be appropriate, because DPDK uses it as a default maximum.
>> 
>> Signed-off-by: nickcooper-zhangtonghao > >
> 
> I don't think this is a big deal, since netdev-dummy is only used for
> testing, but don't you think it's better to check it in set_config()
> and return an error?
> 
> Also, could you use the prefix netdev-dummy, instead of datapath?
> 
> Thanks,
> 
> Daniele

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 4/4] datapath: Uses the OVS_CORE_UNSPEC instead of magic numbers.

2017-01-08 Thread Daniele Di Proietto
2017-01-08 17:30 GMT-08:00 nickcooper-zhangtonghao :
> This patch uses OVS_CORE_UNSPEC instead of "-1" for an unpinned
> queue.  More importantly, "-1" cast to unsigned int is equal to
> NON_PMD_CORE_ID, so this change makes the distinction explicit.
>
> Signed-off-by: nickcooper-zhangtonghao 

Thanks, this bothered me as well.  In fact I sent a patch for it in
the past as part of a series:

https://mail.openvswitch.org/pipermail/ovs-dev/2016-December/325692.html.

This shouldn't fix any actual problems, because I think we only compared
core_id with pmd threads (not with the non-pmd thread), but I agree that
using -1 for an unsigned is not pretty.

I fixed the title and applied this to master, thanks.

> ---
>  lib/dpif-netdev.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 0b73056..99e4d35 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -1293,7 +1293,7 @@ port_create(const char *devname, const char *type,
>   devname, ovs_strerror(errno));
>  goto out_rxq_close;
>  }
> -port->rxqs[i].core_id = -1;
> +port->rxqs[i].core_id = OVS_CORE_UNSPEC;
>  n_open_rxqs++;
>  }
>
> @@ -1517,7 +1517,7 @@ has_pmd_rxq_for_numa(struct dp_netdev *dp, int numa_id)
>  for (i = 0; i < port->n_rxq; i++) {
>  unsigned core_id = port->rxqs[i].core_id;
>
> -if (core_id != -1
> +if (core_id != OVS_CORE_UNSPEC
>  && ovs_numa_get_numa_id(core_id) == numa_id) {
>  return true;
>  }
> @@ -2704,7 +2704,7 @@ parse_affinity_list(const char *affinity_list, unsigned 
> *core_ids, int n_rxq)
>  int error = 0;
>
>  for (i = 0; i < n_rxq; i++) {
> -core_ids[i] = -1;
> +core_ids[i] = OVS_CORE_UNSPEC;
>  }
>
>  if (!affinity_list) {
> @@ -3617,7 +3617,7 @@ dp_netdev_add_port_rx_to_pmds(struct dp_netdev *dp,
>
>  for (i = 0; i < port->n_rxq; i++) {
>  if (pinned) {
> -if (port->rxqs[i].core_id == -1) {
> +if (port->rxqs[i].core_id == OVS_CORE_UNSPEC) {
>  continue;
>  }
>  pmd = dp_netdev_get_pmd(dp, port->rxqs[i].core_id);
> @@ -3631,7 +3631,7 @@ dp_netdev_add_port_rx_to_pmds(struct dp_netdev *dp,
>  pmd->isolated = true;
>  dp_netdev_pmd_unref(pmd);
>  } else {
> -if (port->rxqs[i].core_id != -1) {
> +if (port->rxqs[i].core_id != OVS_CORE_UNSPEC) {
>  continue;
>  }
>  pmd = dp_netdev_less_loaded_pmd_on_numa(dp, numa_id);
> @@ -3760,7 +3760,7 @@ dp_netdev_reset_pmd_threads(struct dp_netdev *dp)
>  for (i = 0; i < port->n_rxq; i++) {
>  unsigned core_id = port->rxqs[i].core_id;
>
> -if (core_id != -1) {
> +if (core_id != OVS_CORE_UNSPEC) {
>  numa_id = ovs_numa_get_numa_id(core_id);
> hmapx_add(&numas, (void *) numa_id);
>  }
> --
> 1.8.3.1
>
>
>


Re: [ovs-dev] [PATCH 3/4] datapath: Uses the NR_QUEUE instead of magic numbers.

2017-01-08 Thread Daniele Di Proietto
2017-01-08 17:30 GMT-08:00 nickcooper-zhangtonghao :
> NR_QUEUE is defined in "lib/dpif-netdev.h"; netdev-dpdk uses it
> instead of a magic number.  netdev-dummy should do the same.
>
> Signed-off-by: nickcooper-zhangtonghao 

Thanks, I changed the prefix of the commit message and applied to master

> ---
>  lib/netdev-dummy.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/lib/netdev-dummy.c b/lib/netdev-dummy.c
> index d75e597..8d9c805 100644
> --- a/lib/netdev-dummy.c
> +++ b/lib/netdev-dummy.c
> @@ -868,8 +868,8 @@ netdev_dummy_set_config(struct netdev *netdev_, const 
> struct smap *args)
>  goto exit;
>  }
>
> -new_n_rxq = MAX(smap_get_int(args, "n_rxq", 1), 1);
> -new_n_txq = MAX(smap_get_int(args, "n_txq", 1), 1);
> +new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> +new_n_txq = MAX(smap_get_int(args, "n_txq", NR_QUEUE), 1);
>  new_numa_id = smap_get_int(args, "numa_id", 0);
>  if (new_n_rxq != netdev->requested_n_rxq
>  || new_n_txq != netdev->requested_n_txq
> --
> 1.8.3.1
>
>
>


Re: [ovs-dev] [PATCH 2/4] datapath: Limits the number of tx/rx queues for netdev-dummy.

2017-01-08 Thread Daniele Di Proietto
2017-01-08 17:30 GMT-08:00 nickcooper-zhangtonghao :
> This patch prevents ovs_rcu from reporting a WARN, caused by
> ovs-vswitchd being blocked for a long time when it processes a port
> with many rx/tx queues.  The chosen limit on tx/rx queues per port
> should be appropriate, because DPDK uses it as a default maximum.
>
> Signed-off-by: nickcooper-zhangtonghao 

I don't think this is a big deal, since netdev-dummy is only used for
testing, but don't you think it's better to check it in set_config()
and return an error?

Also, could you use the prefix netdev-dummy, instead of datapath?

Thanks,

Daniele

> ---
>  lib/netdev-dummy.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/lib/netdev-dummy.c b/lib/netdev-dummy.c
> index d406cbc..d75e597 100644
> --- a/lib/netdev-dummy.c
> +++ b/lib/netdev-dummy.c
> @@ -897,6 +897,9 @@ netdev_dummy_get_numa_id(const struct netdev *netdev_)
>  return numa_id;
>  }
>
> +
> +#define DUMMY_MAX_QUEUES_PER_PORT 1024
> +
>  /* Sets the number of tx queues and rx queues for the dummy PMD interface. */
>  static int
>  netdev_dummy_reconfigure(struct netdev *netdev_)
> @@ -905,8 +908,8 @@ netdev_dummy_reconfigure(struct netdev *netdev_)
>
>  ovs_mutex_lock(&netdev->mutex);
>
> -netdev_->n_txq = netdev->requested_n_txq;
> -netdev_->n_rxq = netdev->requested_n_rxq;
> +netdev_->n_txq = MIN(DUMMY_MAX_QUEUES_PER_PORT, netdev->requested_n_txq);
> +netdev_->n_rxq = MIN(DUMMY_MAX_QUEUES_PER_PORT, netdev->requested_n_rxq);
>  netdev->numa_id = netdev->requested_numa_id;
>
>  ovs_mutex_unlock(&netdev->mutex);
> --
> 1.8.3.1
>
>
>


[ovs-dev] [PATCH v2] dpdk: Late initialization.

2017-01-08 Thread Daniele Di Proietto
With this commit, we allow the user to set other_config:dpdk-init=true
after the process is started.  This makes it easier to start Open
vSwitch with DPDK using standard init scripts without restarting the
service.

This is still far from ideal, because initializing DPDK might still
abort the process (e.g. if there is not enough memory), so the user must
check the status of the process after setting dpdk-init to true.

Nonetheless, I think this is an improvement, because it doesn't require
restarting the whole unit.

CC: Aaron Conole 
Signed-off-by: Daniele Di Proietto 
---
v1->v2: No change, first non-RFC post.
---
 lib/dpdk-stub.c |  8 
 lib/dpdk.c  | 30 --
 tests/ofproto-macros.at |  2 +-
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
index bd981bb90..daef7291f 100644
--- a/lib/dpdk-stub.c
+++ b/lib/dpdk-stub.c
@@ -27,13 +27,13 @@ VLOG_DEFINE_THIS_MODULE(dpdk);
 void
 dpdk_init(const struct smap *ovs_other_config)
 {
-static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+if (smap_get_bool(ovs_other_config, "dpdk-init", false)) {
+static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
 
-if (ovsthread_once_start(&once)) {
-if (smap_get_bool(ovs_other_config, "dpdk-init", false)) {
+if (ovsthread_once_start(&once)) {
 VLOG_ERR("DPDK not supported in this copy of Open vSwitch.");
+ovsthread_once_done(&once);
 }
-ovsthread_once_done(&once);
 }
 }
 
diff --git a/lib/dpdk.c b/lib/dpdk.c
index ee4360b22..008c6c06d 100644
--- a/lib/dpdk.c
+++ b/lib/dpdk.c
@@ -273,12 +273,6 @@ dpdk_init__(const struct smap *ovs_other_config)
 cpu_set_t cpuset;
 char *sock_dir_subcomponent;
 
-if (!smap_get_bool(ovs_other_config, "dpdk-init", false)) {
-VLOG_INFO("DPDK Disabled - to change this requires a restart.\n");
-return;
-}
-
-VLOG_INFO("DPDK Enabled, initializing");
 if (process_vhost_flags("vhost-sock-dir", xstrdup(ovs_rundir()),
 NAME_MAX, ovs_other_config,
&sock_dir_subcomponent)) {
@@ -413,11 +407,27 @@ dpdk_init__(const struct smap *ovs_other_config)
 void
 dpdk_init(const struct smap *ovs_other_config)
 {
-static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+static bool enabled = false;
+
+if (enabled || !ovs_other_config) {
+return;
+}
+
+if (smap_get_bool(ovs_other_config, "dpdk-init", false)) {
+static struct ovsthread_once once_enable = OVSTHREAD_ONCE_INITIALIZER;
 
-if (ovs_other_config && ovsthread_once_start(&once)) {
-dpdk_init__(ovs_other_config);
-ovsthread_once_done(&once);
+if (ovsthread_once_start(&once_enable)) {
+VLOG_INFO("DPDK Enabled - initializing...");
+dpdk_init__(ovs_other_config);
+VLOG_INFO("DPDK Enabled - initialized");
+ovsthread_once_done(&once_enable);
+}
+} else {
+static struct ovsthread_once once_disable = OVSTHREAD_ONCE_INITIALIZER;
+if (ovsthread_once_start(&once_disable)) {
+VLOG_INFO("DPDK Disabled - Use other_config:dpdk-init to enable");
+ovsthread_once_done(&once_disable);
+}
 }
 }
 
diff --git a/tests/ofproto-macros.at b/tests/ofproto-macros.at
index 547b8..faff5b0a8 100644
--- a/tests/ofproto-macros.at
+++ b/tests/ofproto-macros.at
@@ -331,7 +331,7 @@ m4_define([_OVS_VSWITCHD_START],
 /ofproto|INFO|using datapath ID/d
 /netdev_linux|INFO|.*device has unknown hardware address family/d
 /ofproto|INFO|datapath ID changed to fedcba9876543210/d
-/dpdk|INFO|DPDK Disabled - to change this requires a restart./d']])
+/dpdk|INFO|DPDK Disabled - Use other_config:dpdk-init to enable/d']])
 ])
 
 # OVS_VSWITCHD_START([vsctl-args], [vsctl-output], [=override],
-- 
2.11.0



Re: [ovs-dev] [PATCH 1/4] datapath: Fix formatting typo.

2017-01-08 Thread Daniele Di Proietto
2017-01-08 17:30 GMT-08:00 nickcooper-zhangtonghao :
> Signed-off-by: nickcooper-zhangtonghao 

Thanks, I changed the prefix to netdev-dpdk (instead of datapath) and
pushed this to master

> ---
>  lib/netdev-dpdk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 625f425..376aa4d 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -738,7 +738,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>
>  memset(_addr, 0x0, sizeof(eth_addr));
>  rte_eth_macaddr_get(dev->port_id, _addr);
> -VLOG_INFO_RL(&rl, "Port %d: "ETH_ADDR_FMT"",
> +VLOG_INFO_RL(&rl, "Port %d: "ETH_ADDR_FMT,
>  dev->port_id, ETH_ADDR_BYTES_ARGS(eth_addr.addr_bytes));
>
>  memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
> --
> 1.8.3.1
>
>
>
>


[ovs-dev] [PATCH v3 16/18] dpif-netdev: Use hmap for poll_list in pmd threads.

2017-01-08 Thread Daniele Di Proietto
A future commit will use this to determine if a queue is already
contained in a pmd thread.

To keep the behavior unaltered we now have to sort queues before
printing them in pmd_info_show_rxq().

Also, this commit introduces 'struct polled_queue', which will be used
exclusively in the fast path; makes 'struct rxq_poll' use 'struct
dp_netdev_rxq'; and names variables 'rx' for 'netdev_rxq' and 'rxq' for
'dp_netdev_rxq'.

Signed-off-by: Daniele Di Proietto 
---
 lib/dpif-netdev.c | 168 --
 1 file changed, 112 insertions(+), 56 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index f170f5c96..d996e3c9a 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -280,8 +280,12 @@ enum pmd_cycles_counter_type {
 
 /* Contained by struct dp_netdev_port's 'rxqs' member.  */
 struct dp_netdev_rxq {
-struct netdev_rxq *rxq;
-unsigned core_id;   /* Сore to which this queue is pinned. */
+struct dp_netdev_port *port;
+struct netdev_rxq *rx;
+unsigned core_id;  /* Core to which this queue should be
+  pinned. OVS_CORE_UNSPEC if the
+  queue doesn't need to be pinned to a
+  particular core. */
 };
 
 /* A port in a netdev-based datapath. */
@@ -415,11 +419,15 @@ struct dp_netdev_pmd_cycles {
 atomic_ullong n[PMD_N_CYCLES];
 };
 
+struct polled_queue {
+struct netdev_rxq *rx;
+odp_port_t port_no;
+};
+
 /* Contained by struct dp_netdev_pmd_thread's 'poll_list' member. */
 struct rxq_poll {
-struct dp_netdev_port *port;
-struct netdev_rxq *rx;
-struct ovs_list node;
+struct dp_netdev_rxq *rxq;
+struct hmap_node node;
 };
 
 /* Contained by struct dp_netdev_pmd_thread's 'send_port_cache',
@@ -500,9 +508,7 @@ struct dp_netdev_pmd_thread {
 
 struct ovs_mutex port_mutex;/* Mutex for 'poll_list' and 'tx_ports'. */
 /* List of rx queues to poll. */
-struct ovs_list poll_list OVS_GUARDED;
-/* Number of elements in 'poll_list' */
-int poll_cnt;
+struct hmap poll_list OVS_GUARDED;
 /* Map of 'tx_port's used for transmission.  Written by the main thread,
  * read by the pmd thread. */
 struct hmap tx_ports OVS_GUARDED;
@@ -586,8 +592,8 @@ static void dp_netdev_add_port_to_pmds(struct dp_netdev *dp,
 static void dp_netdev_add_port_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
  struct dp_netdev_port *port);
 static void dp_netdev_add_rxq_to_pmd(struct dp_netdev_pmd_thread *pmd,
- struct dp_netdev_port *port,
- struct netdev_rxq *rx);
+ struct dp_netdev_rxq *rxq)
+OVS_REQUIRES(pmd->port_mutex);
 static struct dp_netdev_pmd_thread *
 dp_netdev_less_loaded_pmd_on_numa(struct dp_netdev *dp, int numa_id);
 static void dp_netdev_reset_pmd_threads(struct dp_netdev *dp)
@@ -783,12 +789,56 @@ pmd_info_clear_stats(struct ds *reply OVS_UNUSED,
 }
 }
 
+static int
+compare_poll_list(const void *a_, const void *b_)
+{
+const struct rxq_poll *a = a_;
+const struct rxq_poll *b = b_;
+
+const char *namea = netdev_rxq_get_name(a->rxq->rx);
+const char *nameb = netdev_rxq_get_name(b->rxq->rx);
+
+int cmp = strcmp(namea, nameb);
+if (!cmp) {
+return netdev_rxq_get_queue_id(a->rxq->rx)
+   - netdev_rxq_get_queue_id(b->rxq->rx);
+} else {
+return cmp;
+}
+}
+
+static void
+sorted_poll_list(struct dp_netdev_pmd_thread *pmd, struct rxq_poll **list,
+ size_t *n)
+{
+struct rxq_poll *ret, *poll;
+size_t i;
+
+*n = hmap_count(&pmd->poll_list);
+if (!*n) {
+ret = NULL;
+} else {
+ret = xcalloc(*n, sizeof *ret);
+i = 0;
+HMAP_FOR_EACH (poll, node, &pmd->poll_list) {
+ret[i] = *poll;
+i++;
+}
+ovs_assert(i == *n);
+}
+
+qsort(ret, *n, sizeof *ret, compare_poll_list);
+
+*list = ret;
+}
+
 static void
 pmd_info_show_rxq(struct ds *reply, struct dp_netdev_pmd_thread *pmd)
 {
 if (pmd->core_id != NON_PMD_CORE_ID) {
-struct rxq_poll *poll;
 const char *prev_name = NULL;
+struct rxq_poll *list;
+size_t i, n;
 
 ds_put_format(reply,
   "pmd thread numa_id %d core_id %u:\n\tisolated : %s\n",
@@ -796,21 +846,23 @@ pmd_info_show_rxq(struct ds *reply, struct 
dp_netdev_pmd_thread *pmd)
   ? "true" : "false");
 
 ovs_mutex_lock(&pmd->port_mutex);
-LIST_FOR_EACH (poll, node, &pmd->poll_list) {
-const char *name = netdev_get_name(poll->port->netdev);
+sorted_poll_list(pmd, &list, &n);
+for (i = 0; i < n; i++) {
+const char *name = netdev_rxq_get_name(list[i].rxq->rx);
 
 if (!prev_name 

[ovs-dev] [PATCH v3 18/18] ovs-numa: Remove unused functions.

2017-01-08 Thread Daniele Di Proietto
ovs-numa doesn't need to keep the state of the pmd threads; that is an
implementation detail of dpif-netdev.

Signed-off-by: Daniele Di Proietto 
---
 lib/ovs-numa.c | 175 -
 lib/ovs-numa.h |   7 ---
 2 files changed, 182 deletions(-)

diff --git a/lib/ovs-numa.c b/lib/ovs-numa.c
index 04225a958..2e038b745 100644
--- a/lib/ovs-numa.c
+++ b/lib/ovs-numa.c
@@ -70,8 +70,6 @@ struct cpu_core {
 struct ovs_list list_node; /* In 'numa_node->cores' list. */
 struct numa_node *numa;/* numa node containing the core. */
 unsigned core_id;  /* Core id. */
-bool available;/* If the core can be pinned. */
-bool pinned;   /* If a thread has been pinned to the core. */
 };
 
 /* Contains all 'struct numa_node's. */
@@ -119,7 +117,6 @@ insert_new_cpu_core(struct numa_node *n, unsigned core_id)
 ovs_list_insert(&n->cores, &c->list_node);
 c->core_id = core_id;
 c->numa = n;
-c->available = true;
 
 return c;
 }
@@ -342,18 +339,6 @@ ovs_numa_core_id_is_valid(unsigned core_id)
 return found_numa_and_core && core_id < ovs_numa_get_n_cores();
 }
 
-bool
-ovs_numa_core_is_pinned(unsigned core_id)
-{
-struct cpu_core *core = get_core_by_core_id(core_id);
-
-if (core) {
-return core->pinned;
-}
-
-return false;
-}
-
 /* Returns the number of numa nodes. */
 int
 ovs_numa_get_n_numas(void)
@@ -398,97 +383,6 @@ ovs_numa_get_n_cores_on_numa(int numa_id)
 return OVS_CORE_UNSPEC;
 }
 
-/* Returns the number of cpu cores that are available and unpinned
- * on numa node.  Returns OVS_CORE_UNSPEC if 'numa_id' is invalid. */
-int
-ovs_numa_get_n_unpinned_cores_on_numa(int numa_id)
-{
-struct numa_node *numa = get_numa_by_numa_id(numa_id);
-
-if (numa) {
-struct cpu_core *core;
-int count = 0;
-
-LIST_FOR_EACH(core, list_node, &numa->cores) {
-if (core->available && !core->pinned) {
-count++;
-}
-}
-return count;
-}
-
-return OVS_CORE_UNSPEC;
-}
-
-/* Given 'core_id', tries to pin that core.  Returns true, if succeeds.
- * False, if the core has already been pinned, or if it is invalid or
- * not available. */
-bool
-ovs_numa_try_pin_core_specific(unsigned core_id)
-{
-struct cpu_core *core = get_core_by_core_id(core_id);
-
-if (core) {
-if (core->available && !core->pinned) {
-core->pinned = true;
-return true;
-}
-}
-
-return false;
-}
-
-/* Searches through all cores for an unpinned and available core.  Returns
- * the 'core_id' if found and sets the 'core->pinned' to true.  Otherwise,
- * returns OVS_CORE_UNSPEC. */
-unsigned
-ovs_numa_get_unpinned_core_any(void)
-{
-struct cpu_core *core;
-
-HMAP_FOR_EACH(core, hmap_node, &all_cpu_cores) {
-if (core->available && !core->pinned) {
-core->pinned = true;
-return core->core_id;
-}
-}
-
-return OVS_CORE_UNSPEC;
-}
-
-/* Searches through all cores on numa node with 'numa_id' for an
- * unpinned and available core.  Returns the core_id if found and
- * sets the 'core->pinned' to true.  Otherwise, returns OVS_CORE_UNSPEC. */
-unsigned
-ovs_numa_get_unpinned_core_on_numa(int numa_id)
-{
-struct numa_node *numa = get_numa_by_numa_id(numa_id);
-
-if (numa) {
-struct cpu_core *core;
-
-LIST_FOR_EACH(core, list_node, &numa->cores) {
-if (core->available && !core->pinned) {
-core->pinned = true;
-return core->core_id;
-}
-}
-}
-
-return OVS_CORE_UNSPEC;
-}
-
-/* Unpins the core with 'core_id'. */
-void
-ovs_numa_unpin_core(unsigned core_id)
-{
-struct cpu_core *core = get_core_by_core_id(core_id);
-
-if (core) {
-core->pinned = false;
-}
-}
-
 static struct ovs_numa_dump *
 ovs_numa_dump_create(void)
 {
@@ -654,75 +548,6 @@ ovs_numa_dump_destroy(struct ovs_numa_dump *dump)
 free(dump);
 }
 
-/* Reads the cpu mask configuration from 'cmask' and sets the
- * 'available' of corresponding cores.  For unspecified cores,
- * sets 'available' to false. */
-void
-ovs_numa_set_cpu_mask(const char *cmask)
-{
-int core_id = 0;
-int i;
-int end_idx;
-
-if (!found_numa_and_core) {
-return;
-}
-
-/* If no mask specified, resets the 'available' to true for all cores. */
-if (!cmask) {
-struct cpu_core *core;
-
-HMAP_FOR_EACH(core, hmap_node, &all_cpu_cores) {
-core->available = true;
-}
-
-return;
-}
-
-/* Ignore leading 0x. */
-end_idx = 0;
-if (!strncmp(cmask, "0x", 2) || !strncmp(cmask, "0X", 2)) {
-end_idx = 2;
-}
-
-for (i = strlen(cmask) - 1; i >= end_idx; i--) {
-char hex = toupper((unsigned char)cmask[i]);
-int bin, j;
-
-if (hex >= '0' && hex <= '9') {
-bin = hex - '0';
-

[ovs-dev] [PATCH v3 11/18] dpctl: Avoid making assumptions on pmd threads.

2017-01-08 Thread Daniele Di Proietto
Currently dpctl depends on the ovs-numa module to delete and create flows
on different pmd threads for pmd devices.

The next commits will move away the pmd threads state from ovs-numa to
dpif-netdev, so the ovs-numa interface will not be supported.

Also, the assignment between ports and threads is an implementation
detail of dpif-netdev; dpctl shouldn't know anything about it.

This commit changes the dpif_flow_put() and dpif_flow_del() calls to
iterate over all the pmd threads, if pmd_id is PMD_ID_NULL.

A simple test is added.

Signed-off-by: Daniele Di Proietto 
---
 lib/dpctl.c   | 107 
 lib/dpif-netdev.c | 180 +-
 lib/dpif.c|   6 +-
 lib/dpif.h|  12 +++-
 tests/pmd.at  |  44 +
 5 files changed, 194 insertions(+), 155 deletions(-)

diff --git a/lib/dpctl.c b/lib/dpctl.c
index 7011b183d..23837ce74 100644
--- a/lib/dpctl.c
+++ b/lib/dpctl.c
@@ -40,7 +40,6 @@
 #include "netlink.h"
 #include "odp-util.h"
 #include "openvswitch/ofpbuf.h"
-#include "ovs-numa.h"
 #include "packets.h"
 #include "openvswitch/shash.h"
 #include "simap.h"
@@ -876,43 +875,12 @@ out_freefilter:
 return error;
 }
 
-/* Extracts the in_port from the parsed keys, and returns the reference
- * to the 'struct netdev *' of the dpif port.  On error, returns NULL.
- * Users must call 'netdev_close()' after finish using the returned
- * reference. */
-static struct netdev *
-get_in_port_netdev_from_key(struct dpif *dpif, const struct ofpbuf *key)
-{
-const struct nlattr *in_port_nla;
-struct netdev *dev = NULL;
-
-in_port_nla = nl_attr_find(key, 0, OVS_KEY_ATTR_IN_PORT);
-if (in_port_nla) {
-struct dpif_port dpif_port;
-odp_port_t port_no;
-int error;
-
-port_no = ODP_PORT_C(nl_attr_get_u32(in_port_nla));
-error = dpif_port_query_by_number(dpif, port_no, &dpif_port);
-if (error) {
-goto out;
-}
-
-netdev_open(dpif_port.name, dpif_port.type, &dev);
-dpif_port_destroy(&dpif_port);
-}
-
-out:
-return dev;
-}
-
 static int
 dpctl_put_flow(int argc, const char *argv[], enum dpif_flow_put_flags flags,
struct dpctl_params *dpctl_p)
 {
 const char *key_s = argv[argc - 2];
 const char *actions_s = argv[argc - 1];
-struct netdev *in_port_netdev = NULL;
 struct dpif_flow_stats stats;
 struct dpif_port dpif_port;
 struct dpif_port_dump port_dump;
@@ -968,39 +936,15 @@ dpctl_put_flow(int argc, const char *argv[], enum 
dpif_flow_put_flags flags,
 goto out_freeactions;
 }
 
-/* For DPDK interface, applies the operation to all pmd threads
- * on the same numa node. */
-in_port_netdev = get_in_port_netdev_from_key(dpif, );
-if (in_port_netdev && netdev_is_pmd(in_port_netdev)) {
-int numa_id;
-
-numa_id = netdev_get_numa_id(in_port_netdev);
-if (ovs_numa_numa_id_is_valid(numa_id)) {
-struct ovs_numa_dump *dump = ovs_numa_dump_cores_on_numa(numa_id);
-struct ovs_numa_info *iter;
-
-FOR_EACH_CORE_ON_NUMA (iter, dump) {
-if (ovs_numa_core_is_pinned(iter->core_id)) {
-error = dpif_flow_put(dpif, flags,
-  key.data, key.size,
-  mask.size == 0 ? NULL : mask.data,
-  mask.size, actions.data,
-  actions.size, ufid_present ? &ufid : NULL,
-  iter->core_id,
-  dpctl_p->print_statistics ? &stats : NULL);
-}
-}
-ovs_numa_dump_destroy(dump);
-} else {
-error = EINVAL;
-}
-} else {
-error = dpif_flow_put(dpif, flags,
-  key.data, key.size,
-  mask.size == 0 ? NULL : mask.data,
-  mask.size, actions.data,
-  actions.size, ufid_present ? &ufid : NULL,
-  PMD_ID_NULL, dpctl_p->print_statistics ? &stats : NULL);
-}
+/* The flow will be added on all pmds currently in the datapath. */
+error = dpif_flow_put(dpif, flags,
+  key.data, key.size,
+  mask.size == 0 ? NULL : mask.data,
+  mask.size, actions.data,
+  actions.size, ufid_present ? &ufid : NULL,
+  PMD_ID_NULL,
+  dpctl_p->print_statistics ? &stats : NULL);
+
 if (error) {
 dpctl_error(dpctl_p, error, "updating flow table");
 goto out_freeactions;
@@ -1021,7 +965,6 @@ out_freekeymask:
 ofpbuf_uninit();
 ofpbuf_uninit();
 dpif_close(dpif);
-netdev_close(in_port_netdev);
 return error;
 }
 
@@ -1110,7 +1053,6 @@ static int
 dpctl_del_flow(int argc, 

[ovs-dev] [PATCH v3 15/18] ovs-numa: Add per numa and global counts in dump.

2017-01-08 Thread Daniele Di Proietto
They will be used by a future commit.

Suggested-by: Ilya Maximets 
Signed-off-by: Daniele Di Proietto 
---
 lib/ovs-numa.c | 96 +-
 lib/ovs-numa.h | 18 +--
 2 files changed, 77 insertions(+), 37 deletions(-)

diff --git a/lib/ovs-numa.c b/lib/ovs-numa.c
index f8a37b1ea..04225a958 100644
--- a/lib/ovs-numa.c
+++ b/lib/ovs-numa.c
@@ -489,25 +489,53 @@ ovs_numa_unpin_core(unsigned core_id)
 }
 }
 
+static struct ovs_numa_dump *
+ovs_numa_dump_create(void)
+{
+struct ovs_numa_dump *dump = xmalloc(sizeof *dump);
+
+hmap_init(&dump->cores);
+hmap_init(&dump->numas);
+
+return dump;
+}
+
+static void
+ovs_numa_dump_add(struct ovs_numa_dump *dump, int numa_id, int core_id)
+{
+struct ovs_numa_info_core *c = xzalloc(sizeof *c);
+struct ovs_numa_info_numa *n;
+
+c->numa_id = numa_id;
+c->core_id = core_id;
+hmap_insert(&dump->cores, &c->hmap_node, hash_2words(numa_id, core_id));
+
+HMAP_FOR_EACH_WITH_HASH (n, hmap_node, hash_int(numa_id, 0),
+ &dump->numas) {
+if (n->numa_id == numa_id) {
+n->n_cores++;
+return;
+}
+}
+
+n = xzalloc(sizeof *n);
+n->numa_id = numa_id;
+n->n_cores = 1;
+hmap_insert(&dump->numas, &n->hmap_node, hash_int(numa_id, 0));
+}
+
 /* Given the 'numa_id', returns dump of all cores on the numa node. */
 struct ovs_numa_dump *
 ovs_numa_dump_cores_on_numa(int numa_id)
 {
-struct ovs_numa_dump *dump = xmalloc(sizeof *dump);
+struct ovs_numa_dump *dump = ovs_numa_dump_create();
 struct numa_node *numa = get_numa_by_numa_id(numa_id);
 
-hmap_init(&dump->dump);
-
 if (numa) {
 struct cpu_core *core;
 
-LIST_FOR_EACH(core, list_node, &numa->cores) {
-struct ovs_numa_info *info = xmalloc(sizeof *info);
-
-info->numa_id = numa->numa_id;
-info->core_id = core->core_id;
-hmap_insert(&dump->dump, &info->hmap_node,
-hash_2words(info->numa_id, info->core_id));
+LIST_FOR_EACH (core, list_node, &numa->cores) {
+ovs_numa_dump_add(dump, numa->numa_id, core->core_id);
 }
 }
 
@@ -517,12 +545,10 @@ ovs_numa_dump_cores_on_numa(int numa_id)
 struct ovs_numa_dump *
 ovs_numa_dump_cores_with_cmask(const char *cmask)
 {
-struct ovs_numa_dump *dump = xmalloc(sizeof *dump);
+struct ovs_numa_dump *dump = ovs_numa_dump_create();
 int core_id = 0;
 int end_idx;
 
-hmap_init(&dump->dump);
-
 /* Ignore leading 0x. */
 end_idx = 0;
 if (!strncmp(cmask, "0x", 2) || !strncmp(cmask, "0X", 2)) {
@@ -547,12 +573,9 @@ ovs_numa_dump_cores_with_cmask(const char *cmask)
 struct cpu_core *core = get_core_by_core_id(core_id);
 
 if (core) {
-struct ovs_numa_info *info = xmalloc(sizeof *info);
-
-info->numa_id = core->numa->numa_id;
-info->core_id = core->core_id;
-hmap_insert(&dump->dump, &info->hmap_node,
-hash_2words(info->numa_id, info->core_id));
+ovs_numa_dump_add(dump,
+  core->numa->numa_id,
+  core->core_id);
 }
 }
 
@@ -566,11 +589,9 @@ ovs_numa_dump_cores_with_cmask(const char *cmask)
 struct ovs_numa_dump *
 ovs_numa_dump_n_cores_per_numa(int cores_per_numa)
 {
-struct ovs_numa_dump *dump = xmalloc(sizeof *dump);
+struct ovs_numa_dump *dump = ovs_numa_dump_create();
 const struct numa_node *n;
 
-hmap_init(&dump->dump);
-
 HMAP_FOR_EACH (n, hmap_node, &all_numa_nodes) {
 const struct cpu_core *core;
 int i = 0;
@@ -580,12 +601,7 @@ ovs_numa_dump_n_cores_per_numa(int cores_per_numa)
 break;
 }
 
-struct ovs_numa_info *info = xmalloc(sizeof *info);
-
-info->numa_id = core->numa->numa_id;
-info->core_id = core->core_id;
-hmap_insert(&dump->dump, &info->hmap_node,
-hash_2words(info->numa_id, info->core_id));
+ovs_numa_dump_add(dump, core->numa->numa_id, core->core_id);
 }
 }
 
@@ -596,10 +612,10 @@ bool
 ovs_numa_dump_contains_core(const struct ovs_numa_dump *dump,
 int numa_id, unsigned core_id)
 {
-struct ovs_numa_info *core;
+struct ovs_numa_info_core *core;
 
 HMAP_FOR_EACH_WITH_HASH (core, hmap_node, hash_2words(numa_id, core_id),
- &dump->dump) {
+ &dump->cores) {
 if (core->core_id == core_id && core->numa_id == numa_id) {
 return true;
 }
@@ -608,20 +624,32 @@ ovs_numa_dump_contains_core(const struct ovs_numa_dump 
*dump,
 return false;
 }
 
+size_t
+ovs_numa_dump_count(const struct ovs_numa_dump *dump)
+{
+return hmap_count(&dump->cores);
+}
+
 void
 ovs_numa_dump_destroy(struct 

[ovs-dev] [PATCH v3 14/18] ovs-numa: Don't use hmap_first_with_hash().

2017-01-08 Thread Daniele Di Proietto
I think it's better to iterate over the hmap bucket than to use
hmap_first_with_hash(), because iterating handles hash collisions correctly.

Signed-off-by: Daniele Di Proietto 
---
 lib/ovs-numa.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/lib/ovs-numa.c b/lib/ovs-numa.c
index 61c31cf69..f8a37b1ea 100644
--- a/lib/ovs-numa.c
+++ b/lib/ovs-numa.c
@@ -241,30 +241,32 @@ discover_numa_and_core(void)
 static struct cpu_core*
 get_core_by_core_id(unsigned core_id)
 {
-struct cpu_core *core = NULL;
+struct cpu_core *core;
 
-if (ovs_numa_core_id_is_valid(core_id)) {
-core = CONTAINER_OF(hmap_first_with_hash(&all_cpu_cores,
- hash_int(core_id, 0)),
-struct cpu_core, hmap_node);
+HMAP_FOR_EACH_WITH_HASH (core, hmap_node, hash_int(core_id, 0),
+ &all_cpu_cores) {
+if (core->core_id == core_id) {
+return core;
+}
 }
 
-return core;
+return NULL;
 }
 
 /* Gets 'struct numa_node' by 'numa_id'. */
 static struct numa_node*
 get_numa_by_numa_id(int numa_id)
 {
-struct numa_node *numa = NULL;
+struct numa_node *numa;
 
-if (ovs_numa_numa_id_is_valid(numa_id)) {
-numa = CONTAINER_OF(hmap_first_with_hash(&all_numa_nodes,
- hash_int(numa_id, 0)),
-struct numa_node, hmap_node);
+HMAP_FOR_EACH_WITH_HASH (numa, hmap_node, hash_int(numa_id, 0),
+ &all_numa_nodes) {
+if (numa->numa_id == numa_id) {
+return numa;
+}
 }
 
-return numa;
+return NULL;
 }
 
 
-- 
2.11.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v3 12/18] ovs-numa: New ovs_numa_dump_contains_core() function.

2017-01-08 Thread Daniele Di Proietto
It will be used by a future commit.  struct ovs_numa_dump now uses an
hmap instead of a list to make ovs_numa_dump_contains_core() more
efficient.

Signed-off-by: Daniele Di Proietto 
---
 lib/ovs-numa.c | 25 ++---
 lib/ovs-numa.h | 10 ++
 2 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/lib/ovs-numa.c b/lib/ovs-numa.c
index e1e7068a2..85f392a91 100644
--- a/lib/ovs-numa.c
+++ b/lib/ovs-numa.c
@@ -494,7 +494,7 @@ ovs_numa_dump_cores_on_numa(int numa_id)
 struct ovs_numa_dump *dump = xmalloc(sizeof *dump);
 struct numa_node *numa = get_numa_by_numa_id(numa_id);
 
-ovs_list_init(&dump->dump);
+hmap_init(&dump->dump);
 
 if (numa) {
 struct cpu_core *core;
@@ -504,13 +504,30 @@ ovs_numa_dump_cores_on_numa(int numa_id)
 
 info->numa_id = numa->numa_id;
 info->core_id = core->core_id;
-ovs_list_insert(&dump->dump, &info->list_node);
+hmap_insert(&dump->dump, &info->hmap_node,
+hash_2words(info->numa_id, info->core_id));
 }
 }
 
 return dump;
 }
 
+bool
+ovs_numa_dump_contains_core(const struct ovs_numa_dump *dump,
+int numa_id, unsigned core_id)
+{
+struct ovs_numa_info *core;
+
+HMAP_FOR_EACH_WITH_HASH (core, hmap_node, hash_2words(numa_id, core_id),
+ &dump->dump) {
+if (core->core_id == core_id && core->numa_id == numa_id) {
+return true;
+}
+}
+
+return false;
+}
+
 void
 ovs_numa_dump_destroy(struct ovs_numa_dump *dump)
 {
@@ -520,10 +537,12 @@ ovs_numa_dump_destroy(struct ovs_numa_dump *dump)
 return;
 }
 
-LIST_FOR_EACH_POP (iter, list_node, &dump->dump) {
+HMAP_FOR_EACH_POP (iter, hmap_node, &dump->dump) {
 free(iter);
 }
 
+hmap_destroy(&dump->dump);
+
 free(dump);
 }
 
diff --git a/lib/ovs-numa.h b/lib/ovs-numa.h
index be836b2ca..c0eae07d8 100644
--- a/lib/ovs-numa.h
+++ b/lib/ovs-numa.h
@@ -21,19 +21,19 @@
 #include 
 
 #include "compiler.h"
-#include "openvswitch/list.h"
+#include "openvswitch/hmap.h"
 
 #define OVS_CORE_UNSPEC INT_MAX
 #define OVS_NUMA_UNSPEC INT_MAX
 
 /* Dump of a list of 'struct ovs_numa_info'. */
 struct ovs_numa_dump {
-struct ovs_list dump;
+struct hmap dump;
 };
 
 /* A numa_id - core_id pair. */
 struct ovs_numa_info {
-struct ovs_list list_node;
+struct hmap_node hmap_node;
 int numa_id;
 unsigned core_id;
 };
@@ -54,10 +54,12 @@ unsigned ovs_numa_get_unpinned_core_any(void);
 unsigned ovs_numa_get_unpinned_core_on_numa(int numa_id);
 void ovs_numa_unpin_core(unsigned core_id);
 struct ovs_numa_dump *ovs_numa_dump_cores_on_numa(int numa_id);
+bool ovs_numa_dump_contains_core(const struct ovs_numa_dump *,
+ int numa_id, unsigned core_id);
 void ovs_numa_dump_destroy(struct ovs_numa_dump *);
 int ovs_numa_thread_setaffinity_core(unsigned core_id);
 
 #define FOR_EACH_CORE_ON_NUMA(ITER, DUMP)\
-LIST_FOR_EACH((ITER), list_node, &(DUMP)->dump)
+HMAP_FOR_EACH((ITER), hmap_node, &(DUMP)->dump)
 
 #endif /* ovs-numa.h */
-- 
2.11.0



[ovs-dev] [PATCH v3 10/18] dpif-netdev: Make 'static_tx_qid' const.

2017-01-08 Thread Daniele Di Proietto
Since previous commit, 'static_tx_qid' doesn't need to be atomic and is
actually never touched (except for initialization), so it can be made
const.

Signed-off-by: Daniele Di Proietto 
---
 lib/dpif-netdev.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 432bac814..436f945b7 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -496,7 +496,7 @@ struct dp_netdev_pmd_thread {
 /* Queue id used by this pmd thread to send packets on all netdevs if
  * XPS disabled for this netdev. All static_tx_qid's are unique and less
  * than 'ovs_numa_get_n_cores() + 1'. */
-atomic_int static_tx_qid;
+const int static_tx_qid;
 
 struct ovs_mutex port_mutex;/* Mutex for 'poll_list' and 'tx_ports'. */
 /* List of rx queues to poll. */
@@ -3286,10 +3286,9 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread 
*pmd, struct dp_netdev *dp,
 pmd->numa_id = numa_id;
 pmd->poll_cnt = 0;
 
-atomic_init(&pmd->static_tx_qid,
-(core_id == NON_PMD_CORE_ID)
-? ovs_numa_get_n_cores()
-: get_n_pmd_threads(dp));
+*CONST_CAST(int *, &pmd->static_tx_qid) = (core_id == NON_PMD_CORE_ID)
+  ? ovs_numa_get_n_cores()
+  : get_n_pmd_threads(dp);
 
 ovs_refcount_init(&pmd->ref_cnt);
 latch_init(&pmd->exit_latch);
@@ -4394,7 +4393,7 @@ dp_execute_cb(void *aux_, struct dp_packet_batch 
*packets_,
 if (dynamic_txqs) {
 tx_qid = dpif_netdev_xps_get_tx_qid(pmd, p, now);
 } else {
-atomic_read_relaxed(&pmd->static_tx_qid, &tx_qid);
+tx_qid = pmd->static_tx_qid;
 }
 
 netdev_send(p->port->netdev, tx_qid, packets_, may_steal,
-- 
2.11.0



[ovs-dev] [PATCH v3 09/18] dpif-netdev: Create pmd threads for every numa node.

2017-01-08 Thread Daniele Di Proietto
A lot of the complexity in the code that handles pmd threads and ports
in dpif-netdev is due to the fact that we postpone the creation of pmd
threads on a numa node until we have a port that needs to be polled on
that particular node.

Since the previous commit, a pmd thread with no ports will not consume
any CPU, so it seems easier to create all the threads at once.

This will also make future commits easier.

Signed-off-by: Daniele Di Proietto 
---
 lib/dpif-netdev.c | 208 ++
 tests/pmd.at  |   2 +-
 2 files changed, 69 insertions(+), 141 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index dc24e72dc..432bac814 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -575,8 +575,8 @@ static struct dp_netdev_pmd_thread 
*dp_netdev_get_pmd(struct dp_netdev *dp,
 static struct dp_netdev_pmd_thread *
 dp_netdev_pmd_get_next(struct dp_netdev *dp, struct cmap_position *pos);
 static void dp_netdev_destroy_all_pmds(struct dp_netdev *dp);
-static void dp_netdev_del_pmds_on_numa(struct dp_netdev *dp, int numa_id);
-static void dp_netdev_set_pmds_on_numa(struct dp_netdev *dp, int numa_id)
+static void dp_netdev_stop_pmds(struct dp_netdev *dp);
+static void dp_netdev_start_pmds(struct dp_netdev *dp)
 OVS_REQUIRES(dp->port_mutex);
 static void dp_netdev_pmd_clear_ports(struct dp_netdev_pmd_thread *pmd);
 static void dp_netdev_del_port_from_all_pmds(struct dp_netdev *dp,
@@ -1092,19 +1092,20 @@ dp_netdev_free(struct dp_netdev *dp)
 
 shash_find_and_delete(_netdevs, dp->name);
 
-dp_netdev_destroy_all_pmds(dp);
-ovs_mutex_destroy(&dp->non_pmd_mutex);
-ovsthread_key_delete(dp->per_pmd_key);
-
-conntrack_destroy(&dp->conntrack);
-
 ovs_mutex_lock(&dp->port_mutex);
 HMAP_FOR_EACH_SAFE (port, next, node, &dp->ports) {
 do_del_port(dp, port);
 }
 ovs_mutex_unlock(&dp->port_mutex);
+dp_netdev_destroy_all_pmds(dp);
 cmap_destroy(&dp->poll_threads);

+ovs_mutex_destroy(&dp->non_pmd_mutex);
+ovsthread_key_delete(dp->per_pmd_key);
+
+conntrack_destroy(&dp->conntrack);
+
+
 seq_destroy(dp->reconfigure_seq);
 
 seq_destroy(dp->port_seq);
@@ -1348,10 +1349,7 @@ do_add_port(struct dp_netdev *dp, const char *devname, 
const char *type,
 }
 
 if (netdev_is_pmd(port->netdev)) {
-int numa_id = netdev_get_numa_id(port->netdev);
-
-ovs_assert(ovs_numa_numa_id_is_valid(numa_id));
-dp_netdev_set_pmds_on_numa(dp, numa_id);
+dp_netdev_start_pmds(dp);
 }
 
 dp_netdev_add_port_to_pmds(dp, port);
@@ -1493,45 +1491,16 @@ get_n_pmd_threads(struct dp_netdev *dp)
 return cmap_count(&dp->poll_threads) - 1;
 }
 
-static int
-get_n_pmd_threads_on_numa(struct dp_netdev *dp, int numa_id)
-{
-struct dp_netdev_pmd_thread *pmd;
-int n_pmds = 0;
-
-CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
-if (pmd->numa_id == numa_id) {
-n_pmds++;
-}
-}
-
-return n_pmds;
-}
-
-/* Returns 'true' if there is a port with pmd netdev and the netdev is on
- * numa node 'numa_id' or its rx queue assigned to core on that numa node . */
+/* Returns 'true' if there is a port with pmd netdev. */
 static bool
-has_pmd_rxq_for_numa(struct dp_netdev *dp, int numa_id)
+has_pmd_port(struct dp_netdev *dp)
 OVS_REQUIRES(dp->port_mutex)
 {
 struct dp_netdev_port *port;
 
 HMAP_FOR_EACH (port, node, &dp->ports) {
 if (netdev_is_pmd(port->netdev)) {
-int i;
-
-if (netdev_get_numa_id(port->netdev) == numa_id) {
-return true;
-}
-
-for (i = 0; i < port->n_rxq; i++) {
-unsigned core_id = port->rxqs[i].core_id;
-
-if (core_id != OVS_CORE_UNSPEC
-&& ovs_numa_get_numa_id(core_id) == numa_id) {
-return true;
-}
-}
+return true;
 }
 }
 
@@ -1549,14 +1518,9 @@ do_del_port(struct dp_netdev *dp, struct dp_netdev_port 
*port)
 dp_netdev_del_port_from_all_pmds(dp, port);
 
 if (netdev_is_pmd(port->netdev)) {
-int numa_id = netdev_get_numa_id(port->netdev);
-
-/* PMD threads can not be on invalid numa node. */
-ovs_assert(ovs_numa_numa_id_is_valid(numa_id));
-/* If there is no netdev on the numa node, deletes the pmd threads
- * for that numa. */
-if (!has_pmd_rxq_for_numa(dp, numa_id)) {
-dp_netdev_del_pmds_on_numa(dp, numa_id);
+/* If there is no pmd netdev, delete the pmd threads */
+if (!has_pmd_port(dp)) {
+dp_netdev_stop_pmds(dp);
 }
 }
 
@@ -3407,18 +3371,22 @@ dp_netdev_del_pmd(struct dp_netdev *dp, struct 
dp_netdev_pmd_thread *pmd)
 dp_netdev_pmd_unref(pmd);
 }
 
-/* Destroys all pmd threads. */
+/* Destroys all pmd threads, but not the non pmd thread. */
 static void
-dp_netdev_destroy_all_pmds(struct dp_netdev *dp)
+dp_netdev_stop_pmds(struct 

[ovs-dev] [PATCH v3 13/18] ovs-numa: Add new dump types.

2017-01-08 Thread Daniele Di Proietto
They will be used by a future commit.

This patch introduces some code duplication which will be removed in a
future commit.

Signed-off-by: Daniele Di Proietto 
---
 lib/ovs-numa.c | 78 ++
 lib/ovs-numa.h |  4 ++-
 2 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/lib/ovs-numa.c b/lib/ovs-numa.c
index 85f392a91..61c31cf69 100644
--- a/lib/ovs-numa.c
+++ b/lib/ovs-numa.c
@@ -512,6 +512,84 @@ ovs_numa_dump_cores_on_numa(int numa_id)
 return dump;
 }
 
+struct ovs_numa_dump *
+ovs_numa_dump_cores_with_cmask(const char *cmask)
+{
+struct ovs_numa_dump *dump = xmalloc(sizeof *dump);
+int core_id = 0;
+int end_idx;
+
+hmap_init(&dump->dump);
+
+/* Ignore leading 0x. */
+end_idx = 0;
+if (!strncmp(cmask, "0x", 2) || !strncmp(cmask, "0X", 2)) {
+end_idx = 2;
+}
+
+for (int i = strlen(cmask) - 1; i >= end_idx; i--) {
+char hex = toupper((unsigned char) cmask[i]);
+int bin, j;
+
+if (hex >= '0' && hex <= '9') {
+bin = hex - '0';
+} else if (hex >= 'A' && hex <= 'F') {
+bin = hex - 'A' + 10;
+} else {
+VLOG_WARN("Invalid cpu mask: %c", cmask[i]);
+bin = 0;
+}
+
+for (j = 0; j < 4; j++) {
+if ((bin >> j) & 0x1) {
+struct cpu_core *core = get_core_by_core_id(core_id);
+
+if (core) {
+struct ovs_numa_info *info = xmalloc(sizeof *info);
+
+info->numa_id = core->numa->numa_id;
+info->core_id = core->core_id;
+hmap_insert(&dump->dump, &info->hmap_node,
+hash_2words(info->numa_id, info->core_id));
+}
+}
+
+core_id++;
+}
+}
+
+return dump;
+}
+
+struct ovs_numa_dump *
+ovs_numa_dump_n_cores_per_numa(int cores_per_numa)
+{
+struct ovs_numa_dump *dump = xmalloc(sizeof *dump);
+const struct numa_node *n;
+
+hmap_init(&dump->dump);
+
+HMAP_FOR_EACH (n, hmap_node, &all_numa_nodes) {
+const struct cpu_core *core;
+int i = 0;
+
+LIST_FOR_EACH (core, list_node, &n->cores) {
+if (i++ >= cores_per_numa) {
+break;
+}
+
+struct ovs_numa_info *info = xmalloc(sizeof *info);
+
+info->numa_id = core->numa->numa_id;
+info->core_id = core->core_id;
+hmap_insert(&dump->dump, &info->hmap_node,
+hash_2words(info->numa_id, info->core_id));
+}
+}
+
+return dump;
+}
+
 bool
 ovs_numa_dump_contains_core(const struct ovs_numa_dump *dump,
 int numa_id, unsigned core_id)
diff --git a/lib/ovs-numa.h b/lib/ovs-numa.h
index c0eae07d8..62bdb225f 100644
--- a/lib/ovs-numa.h
+++ b/lib/ovs-numa.h
@@ -54,12 +54,14 @@ unsigned ovs_numa_get_unpinned_core_any(void);
 unsigned ovs_numa_get_unpinned_core_on_numa(int numa_id);
 void ovs_numa_unpin_core(unsigned core_id);
 struct ovs_numa_dump *ovs_numa_dump_cores_on_numa(int numa_id);
+struct ovs_numa_dump *ovs_numa_dump_cores_with_cmask(const char *cmask);
+struct ovs_numa_dump *ovs_numa_dump_n_cores_per_numa(int n);
 bool ovs_numa_dump_contains_core(const struct ovs_numa_dump *,
  int numa_id, unsigned core_id);
 void ovs_numa_dump_destroy(struct ovs_numa_dump *);
 int ovs_numa_thread_setaffinity_core(unsigned core_id);
 
-#define FOR_EACH_CORE_ON_NUMA(ITER, DUMP)\
+#define FOR_EACH_CORE_ON_DUMP(ITER, DUMP)\
 HMAP_FOR_EACH((ITER), hmap_node, &(DUMP)->dump)
 
 #endif /* ovs-numa.h */
-- 
2.11.0



[ovs-dev] [PATCH v3 08/18] dpif-netdev: Block pmd threads if there are no ports.

2017-01-08 Thread Daniele Di Proietto
There's no reason for a pmd thread to perform its main loop if there are
no queues in its poll_list.

This commit introduces a seq object on which the pmd thread can be
blocked, if there are no queues.

When the main thread wants to reload a pmd thread, it must now change
the seq object (in case the pmd thread is blocked) and set 'reload' to true.

This is useful to avoid wasting CPU cycles and is also necessary for a
future commit.

Signed-off-by: Daniele Di Proietto 
---
 lib/dpif-netdev.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 0d47a3286..dc24e72dc 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -485,6 +485,8 @@ struct dp_netdev_pmd_thread {
 unsigned long long last_cycles;
 
 struct latch exit_latch;/* For terminating the pmd thread. */
+struct seq *reload_seq;
+uint64_t last_reload_seq;
 atomic_bool reload; /* Do we need to reload ports? */
 pthread_t thread;
 unsigned core_id;   /* CPU core id of this pmd thread. */
@@ -1209,6 +1211,7 @@ dp_netdev_reload_pmd__(struct dp_netdev_pmd_thread *pmd)
 }
 
 ovs_mutex_lock(&pmd->cond_mutex);
+seq_change(pmd->reload_seq);
 atomic_store_relaxed(&pmd->reload, true);
 ovs_mutex_cond_wait(&pmd->cond, &pmd->cond_mutex);
 ovs_mutex_unlock(&pmd->cond_mutex);
@@ -3148,6 +3151,14 @@ reload:
 netdev_rxq_get_queue_id(poll_list[i].rx));
 }
 
+if (!poll_cnt) {
+while (seq_read(pmd->reload_seq) == pmd->last_reload_seq) {
+seq_wait(pmd->reload_seq, pmd->last_reload_seq);
+poll_block();
+}
+lc = 1025;
+}
+
 for (;;) {
 for (i = 0; i < poll_cnt; i++) {
 dp_netdev_process_rxq_port(pmd, poll_list[i].port, 
poll_list[i].rx);
@@ -3223,6 +3234,7 @@ dp_netdev_pmd_reload_done(struct dp_netdev_pmd_thread 
*pmd)
 {
 ovs_mutex_lock(&pmd->cond_mutex);
 atomic_store_relaxed(&pmd->reload, false);
+pmd->last_reload_seq = seq_read(pmd->reload_seq);
 xpthread_cond_signal(&pmd->cond);
 ovs_mutex_unlock(&pmd->cond_mutex);
 }
@@ -3317,6 +3329,8 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, 
struct dp_netdev *dp,
 
 ovs_refcount_init(&pmd->ref_cnt);
 latch_init(&pmd->exit_latch);
+pmd->reload_seq = seq_create();
+pmd->last_reload_seq = seq_read(pmd->reload_seq);
 atomic_init(&pmd->reload, false);
 xpthread_cond_init(&pmd->cond, NULL);
 ovs_mutex_init(&pmd->cond_mutex);
@@ -3356,6 +3370,7 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
 cmap_destroy(&pmd->flow_table);
 ovs_mutex_destroy(&pmd->flow_mutex);
 latch_destroy(&pmd->exit_latch);
+seq_destroy(pmd->reload_seq);
 xpthread_cond_destroy(&pmd->cond);
 ovs_mutex_destroy(&pmd->cond_mutex);
 ovs_mutex_destroy(&pmd->port_mutex);
-- 
2.11.0



[ovs-dev] [PATCH v3 07/18] dpif-netdev: Use a boolean instead of pmd->port_seq.

2017-01-08 Thread Daniele Di Proietto
There's no need for a sequence number, since the main thread has to wait
for the pmd thread, so there's no chance that an update will be
undetected.

A seq object will be introduced for another purpose in the next commit,
and changing this to boolean makes the code more readable.

Signed-off-by: Daniele Di Proietto 
---
 lib/dpif-netdev.c | 19 +++
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 004b28dc8..0d47a3286 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -485,7 +485,7 @@ struct dp_netdev_pmd_thread {
 unsigned long long last_cycles;
 
 struct latch exit_latch;/* For terminating the pmd thread. */
-atomic_uint change_seq; /* For reloading pmd ports. */
+atomic_bool reload; /* Do we need to reload ports? */
 pthread_t thread;
 unsigned core_id;   /* CPU core id of this pmd thread. */
 int numa_id;/* numa node id of this pmd thread. */
@@ -526,8 +526,6 @@ struct dp_netdev_pmd_thread {
 uint64_t cycles_zero[PMD_N_CYCLES];
 };
 
-#define PMD_INITIAL_SEQ 1
-
 /* Interface to netdev-based datapath. */
 struct dpif_netdev {
 struct dpif dpif;
@@ -1201,8 +1199,6 @@ dpif_netdev_get_stats(const struct dpif *dpif, struct 
dpif_dp_stats *stats)
 static void
 dp_netdev_reload_pmd__(struct dp_netdev_pmd_thread *pmd)
 {
-int old_seq;
-
 if (pmd->core_id == NON_PMD_CORE_ID) {
 ovs_mutex_lock(&pmd->dp->non_pmd_mutex);
 ovs_mutex_lock(&pmd->port_mutex);
@@ -1213,7 +1209,7 @@ dp_netdev_reload_pmd__(struct dp_netdev_pmd_thread *pmd)
 }
 
 ovs_mutex_lock(&pmd->cond_mutex);
-atomic_add_relaxed(&pmd->change_seq, 1, &old_seq);
+atomic_store_relaxed(&pmd->reload, true);
 ovs_mutex_cond_wait(&pmd->cond, &pmd->cond_mutex);
 ovs_mutex_unlock(&pmd->cond_mutex);
 }
@@ -3131,7 +3127,6 @@ pmd_thread_main(void *f_)
 struct dp_netdev_pmd_thread *pmd = f_;
 unsigned int lc = 0;
 struct rxq_poll *poll_list;
-unsigned int port_seq = PMD_INITIAL_SEQ;
 bool exiting;
 int poll_cnt;
 int i;
@@ -3159,7 +3154,7 @@ reload:
 }
 
 if (lc++ > 1024) {
-unsigned int seq;
+bool reload;
 
 lc = 0;
 
@@ -3169,9 +3164,8 @@ reload:
 emc_cache_slow_sweep(&pmd->flow_cache);
 }
 
-atomic_read_relaxed(&pmd->change_seq, &seq);
-if (seq != port_seq) {
-port_seq = seq;
+atomic_read_relaxed(&pmd->reload, &reload);
+if (reload) {
 break;
 }
 }
@@ -3228,6 +3222,7 @@ static void
 dp_netdev_pmd_reload_done(struct dp_netdev_pmd_thread *pmd)
 {
 ovs_mutex_lock(&pmd->cond_mutex);
+atomic_store_relaxed(&pmd->reload, false);
 xpthread_cond_signal(&pmd->cond);
 ovs_mutex_unlock(&pmd->cond_mutex);
 }
@@ -3322,7 +3317,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, 
struct dp_netdev *dp,
 
 ovs_refcount_init(&pmd->ref_cnt);
 latch_init(&pmd->exit_latch);
-atomic_init(&pmd->change_seq, PMD_INITIAL_SEQ);
+atomic_init(&pmd->reload, false);
 xpthread_cond_init(&pmd->cond, NULL);
 ovs_mutex_init(&pmd->cond_mutex);
 ovs_mutex_init(&pmd->flow_mutex);
-- 
2.11.0



[ovs-dev] [PATCH v3 06/18] netdev-dpdk: Refactor construct and destruct.

2017-01-08 Thread Daniele Di Proietto
Some refactoring for _construct() and _destruct() methods:
* Rename netdev_dpdk_init() to common_construct(). init() has a
  different meaning in the netdev context.
* Remove DPDK_DEV_ETH and DPDK_DEV_VHOST checks in common_construct()
  and move them to different functions
* Introduce common_destruct().
* Avoid taking 'dev->mutex' in construct and destruct: we're guaranteed
  to be the only thread with access to the object.

Signed-off-by: Daniele Di Proietto 
---
 lib/netdev-dpdk.c | 86 ++-
 1 file changed, 41 insertions(+), 45 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index d6315357b..45320e370 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -827,29 +827,20 @@ netdev_dpdk_alloc_txq(unsigned int n_txqs)
 }
 
 static int
-netdev_dpdk_init(struct netdev *netdev, unsigned int port_no,
- enum dpdk_dev_type type)
+common_construct(struct netdev *netdev, unsigned int port_no,
+ enum dpdk_dev_type type, int socket_id)
 OVS_REQUIRES(dpdk_mutex)
 {
 struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
-int sid;
-int err = 0;
 
 ovs_mutex_init(&dev->mutex);
-ovs_mutex_lock(&dev->mutex);

 rte_spinlock_init(&dev->stats_lock);
 
 /* If the 'sid' is negative, it means that the kernel fails
  * to obtain the pci numa info.  In that situation, always
  * use 'SOCKET0'. */
-if (type == DPDK_DEV_ETH && rte_eth_dev_is_valid_port(dev->port_id)) {
-sid = rte_eth_dev_socket_id(port_no);
-} else {
-sid = rte_lcore_to_socket_id(rte_get_master_lcore());
-}
-
-dev->socket_id = sid < 0 ? SOCKET0 : sid;
+dev->socket_id = socket_id < 0 ? SOCKET0 : socket_id;
 dev->requested_socket_id = dev->socket_id;
 dev->port_id = port_no;
 dev->type = type;
@@ -880,21 +871,11 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int 
port_no,
 
 dev->flags = NETDEV_UP | NETDEV_PROMISC;
 
-if (type == DPDK_DEV_VHOST) {
-dev->tx_q = netdev_dpdk_alloc_txq(OVS_VHOST_MAX_QUEUE_NUM);
-if (!dev->tx_q) {
-err = ENOMEM;
-goto unlock;
-}
-}
-
 ovs_list_push_back(&dpdk_list, &dev->list_node);
 
 netdev_request_reconfigure(netdev);
 
-unlock:
-ovs_mutex_unlock(&dev->mutex);
-return err;
+return 0;
 }
 
 /* dev_name must be the prefix followed by a positive decimal number.
@@ -919,6 +900,21 @@ dpdk_dev_parse_name(const char dev_name[], const char 
prefix[],
 }
 
 static int
+vhost_common_construct(struct netdev *netdev)
+OVS_REQUIRES(dpdk_mutex)
+{
+int socket_id = rte_lcore_to_socket_id(rte_get_master_lcore());
+struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+
+dev->tx_q = netdev_dpdk_alloc_txq(OVS_VHOST_MAX_QUEUE_NUM);
+if (!dev->tx_q) {
+return ENOMEM;
+}
+
+return common_construct(netdev, -1, DPDK_DEV_VHOST, socket_id);
+}
+
+static int
 netdev_dpdk_vhost_construct(struct netdev *netdev)
 {
 struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
@@ -952,7 +948,7 @@ netdev_dpdk_vhost_construct(struct netdev *netdev)
 VLOG_INFO("Socket %s created for vhost-user port %s\n",
   dev->vhost_id, name);
 }
-err = netdev_dpdk_init(netdev, -1, DPDK_DEV_VHOST);
+err = vhost_common_construct(netdev);
 
 ovs_mutex_unlock(&dpdk_mutex);
 return err;
@@ -964,7 +960,7 @@ netdev_dpdk_vhost_client_construct(struct netdev *netdev)
 int err;
 
 ovs_mutex_lock(&dpdk_mutex);
-err = netdev_dpdk_init(netdev, -1, DPDK_DEV_VHOST);
+err = vhost_common_construct(netdev);
 ovs_mutex_unlock(&dpdk_mutex);
 return err;
 }
@@ -975,29 +971,36 @@ netdev_dpdk_construct(struct netdev *netdev)
 int err;
 
 ovs_mutex_lock(&dpdk_mutex);
-err = netdev_dpdk_init(netdev, -1, DPDK_DEV_ETH);
+err = common_construct(netdev, -1, DPDK_DEV_ETH, SOCKET0);
 ovs_mutex_unlock(&dpdk_mutex);
 return err;
 }
 
 static void
+common_destruct(struct netdev_dpdk *dev)
+OVS_REQUIRES(dpdk_mutex)
+OVS_EXCLUDED(dev->mutex)
+{
+rte_free(dev->tx_q);
+dpdk_mp_put(dev->dpdk_mp);
+
+ovs_list_remove(&dev->list_node);
+free(ovsrcu_get_protected(struct ingress_policer *,
+  &dev->ingress_policer));
+ovs_mutex_destroy(&dev->mutex);
+}
+
+static void
 netdev_dpdk_destruct(struct netdev *netdev)
 {
 struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
 
 ovs_mutex_lock(&dpdk_mutex);
-ovs_mutex_lock(&dev->mutex);

 rte_eth_dev_stop(dev->port_id);
 free(dev->devargs);
-free(ovsrcu_get_protected(struct ingress_policer *,
-  &dev->ingress_policer));
+common_destruct(dev);

-rte_free(dev->tx_q);
-ovs_list_remove(&dev->list_node);
-dpdk_mp_put(dev->dpdk_mp);
-
-ovs_mutex_unlock(&dev->mutex);
 ovs_mutex_unlock(&dpdk_mutex);
 }
 
@@ -1020,7 +1023,6 @@ netdev_dpdk_vhost_destruct(struct netdev *netdev)
 char *vhost_id;
 
 ovs_mutex_lock(&dpdk_mutex);
-

[ovs-dev] [PATCH v3 03/18] dpif-netdev: Don't try to output on a device without txqs.

2017-01-08 Thread Daniele Di Proietto
Tunnel devices have 0 txqs and don't support netdev_send().  While
netdev_send() simply returns EOPNOTSUPP, the XPS logic is still executed
on output, and that might be confused by devices with no txqs.

It seems better to have different structures in the fast path for ports
that support netdev_{push,pop}_header (tunnel devices), and ports that
support netdev_send.  With this we can also remove a branch in
netdev_send().

This is also necessary for a future commit, which starts DPDK devices
without txqs.

Signed-off-by: Daniele Di Proietto 
---
 lib/dpif-netdev.c | 73 +++
 lib/netdev.c  | 35 ++
 lib/netdev.h  |  1 +
 3 files changed, 73 insertions(+), 36 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index f600cab00..004b28dc8 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -422,7 +422,8 @@ struct rxq_poll {
 struct ovs_list node;
 };
 
-/* Contained by struct dp_netdev_pmd_thread's 'port_cache' or 'tx_ports'. */
+/* Contained by struct dp_netdev_pmd_thread's 'send_port_cache',
+ * 'tnl_port_cache' or 'tx_ports'. */
 struct tx_port {
 struct dp_netdev_port *port;
 int qid;
@@ -504,11 +505,18 @@ struct dp_netdev_pmd_thread {
  * read by the pmd thread. */
 struct hmap tx_ports OVS_GUARDED;
 
-/* Map of 'tx_port' used in the fast path. This is a thread-local copy of
- * 'tx_ports'. The instance for cpu core NON_PMD_CORE_ID can be accessed
- * by multiple threads, and thusly need to be protected by 'non_pmd_mutex'.
- * Every other instance will only be accessed by its own pmd thread. */
-struct hmap port_cache;
+/* These are thread-local copies of 'tx_ports'.  One contains only tunnel
+ * ports (that support push_tunnel/pop_tunnel), the other contains ports
+ * with at least one txq (that support send).  A port can be in both.
+ *
+ * There are two separate maps to make sure that we don't try to execute
+ * OUTPUT on a device which has 0 txqs or PUSH/POP on a non-tunnel device.
+ *
+ * The instances for cpu core NON_PMD_CORE_ID can be accessed by multiple
+ * threads, and thusly need to be protected by 'non_pmd_mutex'.  Every
+ * other instance will only be accessed by its own pmd thread. */
+struct hmap tnl_port_cache;
+struct hmap send_port_cache;
 
 /* Only a pmd thread can write on its own 'cycles' and 'stats'.
  * The main thread keeps 'stats_zero' and 'cycles_zero' as base
@@ -3058,7 +3066,10 @@ pmd_free_cached_ports(struct dp_netdev_pmd_thread *pmd)
 /* Free all used tx queue ids. */
 dpif_netdev_xps_revalidate_pmd(pmd, 0, true);
 
-HMAP_FOR_EACH_POP (tx_port_cached, node, &pmd->port_cache) {
+HMAP_FOR_EACH_POP (tx_port_cached, node, &pmd->tnl_port_cache) {
+free(tx_port_cached);
+}
+HMAP_FOR_EACH_POP (tx_port_cached, node, &pmd->send_port_cache) {
 free(tx_port_cached);
 }
 }
@@ -3072,12 +3083,21 @@ pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd)
 struct tx_port *tx_port, *tx_port_cached;
 
 pmd_free_cached_ports(pmd);
-hmap_shrink(&pmd->port_cache);
+hmap_shrink(&pmd->send_port_cache);
+hmap_shrink(&pmd->tnl_port_cache);
 
 HMAP_FOR_EACH (tx_port, node, &pmd->tx_ports) {
-tx_port_cached = xmemdup(tx_port, sizeof *tx_port_cached);
-hmap_insert(&pmd->port_cache, &tx_port_cached->node,
-hash_port_no(tx_port_cached->port->port_no));
+if (netdev_has_tunnel_push_pop(tx_port->port->netdev)) {
+tx_port_cached = xmemdup(tx_port, sizeof *tx_port_cached);
+hmap_insert(&pmd->tnl_port_cache, &tx_port_cached->node,
+hash_port_no(tx_port_cached->port->port_no));
+}
+
+if (netdev_n_txq(tx_port->port->netdev)) {
+tx_port_cached = xmemdup(tx_port, sizeof *tx_port_cached);
+hmap_insert(&pmd->send_port_cache, &tx_port_cached->node,
+hash_port_no(tx_port_cached->port->port_no));
+}
 }
 }
 
@@ -3312,7 +3332,8 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, 
struct dp_netdev *dp,
 pmd->next_optimization = time_msec() + DPCLS_OPTIMIZATION_INTERVAL;
 ovs_list_init(>poll_list);
 hmap_init(&pmd->tx_ports);
-hmap_init(&pmd->port_cache);
+hmap_init(&pmd->tnl_port_cache);
+hmap_init(&pmd->send_port_cache);
 /* init the 'flow_cache' since there is no
  * actual thread created for NON_PMD_CORE_ID. */
 if (core_id == NON_PMD_CORE_ID) {
@@ -3328,7 +3349,8 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
 struct dpcls *cls;
 
 dp_netdev_pmd_flow_flush(pmd);
-hmap_destroy(&pmd->port_cache);
+hmap_destroy(&pmd->send_port_cache);
+hmap_destroy(&pmd->tnl_port_cache);
 hmap_destroy(&pmd->tx_ports);
 /* All flows (including their dpcls_rules) have been deleted already */
 CMAP_FOR_EACH (cls, node, &pmd->classifiers) {
@@ -3595,7 +3617,9 @@ static void
 

[ovs-dev] [PATCH v3 05/18] netdev-dpdk: Start also dpdkr devices only once on port-add.

2017-01-08 Thread Daniele Di Proietto
Since commit 55e075e65ef9 ("netdev-dpdk: Arbitrary 'dpdk' port naming"),
we don't call rte_eth_start() from netdev_open() anymore, we only call
it from netdev_reconfigure().  This commit does that also for 'dpdkr'
devices, and remove some useless code.

Calling rte_eth_start() also from netdev_open() was unnecessary and
wasteful. Not doing it reduces code duplication and makes adding a port
faster (~900ms before the patch, ~400ms after).

Another reason why this is useful is that some DPDK driver might have
problems with reconfiguration. For example, until DPDK commit
8618d19b52b1 ("net/vmxnet3: reallocate shared memzone on re-config"),
vmxnet3 didn't support being restarted with a different number of
queues.

Technically, the netdev interface changed because before opening rxqs or
calling netdev_send() the user must check if reconfiguration is
required.  This patch also documents that, even though no change to the
userspace datapath (the only user) is required.

Lastly, this patch makes sure the errors returned by ofproto_port_add
(which includes the first port reconfiguration) are reported back to the
database.

Signed-off-by: Daniele Di Proietto 
---
 lib/netdev-dpdk.c | 70 ---
 lib/netdev.c  |  6 -
 vswitchd/bridge.c |  2 ++
 3 files changed, 38 insertions(+), 40 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 2df3e220c..d6315357b 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -746,10 +746,6 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
 int diag;
 int n_rxq, n_txq;
 
-if (!rte_eth_dev_is_valid_port(dev->port_id)) {
-return ENODEV;
-}
-
 rte_eth_dev_info_get(dev->port_id, &info);
 
 n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
@@ -858,30 +854,23 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int 
port_no,
 dev->port_id = port_no;
 dev->type = type;
 dev->flags = 0;
-dev->requested_mtu = dev->mtu = ETHER_MTU;
+dev->requested_mtu = ETHER_MTU;
 dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
 ovsrcu_index_init(&dev->vid, -1);
 dev->vhost_reconfigured = false;
 
-err = netdev_dpdk_mempool_configure(dev);
-if (err) {
-goto unlock;
-}
-
 ovsrcu_init(&dev->qos_conf, NULL);
 
 ovsrcu_init(&dev->ingress_policer, NULL);
 dev->policer_rate = 0;
 dev->policer_burst = 0;
 
-netdev->n_rxq = NR_QUEUE;
-netdev->n_txq = NR_QUEUE;
-dev->requested_n_rxq = netdev->n_rxq;
-dev->requested_n_txq = netdev->n_txq;
-dev->rxq_size = NIC_PORT_DEFAULT_RXQ_SIZE;
-dev->txq_size = NIC_PORT_DEFAULT_TXQ_SIZE;
-dev->requested_rxq_size = dev->rxq_size;
-dev->requested_txq_size = dev->txq_size;
+netdev->n_rxq = 0;
+netdev->n_txq = 0;
+dev->requested_n_rxq = NR_QUEUE;
+dev->requested_n_txq = NR_QUEUE;
+dev->requested_rxq_size = NIC_PORT_DEFAULT_RXQ_SIZE;
+dev->requested_txq_size = NIC_PORT_DEFAULT_TXQ_SIZE;
 
 /* Initialize the flow control to NULL */
 memset(&dev->fc_conf, 0, sizeof dev->fc_conf);
@@ -891,25 +880,18 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int 
port_no,
 
 dev->flags = NETDEV_UP | NETDEV_PROMISC;
 
-if (type == DPDK_DEV_ETH) {
-if (rte_eth_dev_is_valid_port(dev->port_id)) {
-err = dpdk_eth_dev_init(dev);
-if (err) {
-goto unlock;
-}
-}
-dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
-} else {
+if (type == DPDK_DEV_VHOST) {
 dev->tx_q = netdev_dpdk_alloc_txq(OVS_VHOST_MAX_QUEUE_NUM);
-}
-
-if (!dev->tx_q) {
-err = ENOMEM;
-goto unlock;
+if (!dev->tx_q) {
+err = ENOMEM;
+goto unlock;
+}
 }
 
 ovs_list_push_back(&dpdk_list, &dev->list_node);
 
+netdev_request_reconfigure(netdev);
+
 unlock:
 ovs_mutex_unlock(&dev->mutex);
 return err;
@@ -3168,7 +3150,7 @@ out:
 return err;
 }
 
-static void
+static int
 dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
 OVS_REQUIRES(dev->mutex)
 {
@@ -3189,32 +3171,38 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
 }
 }
 
+if (!dev->dpdk_mp) {
+return ENOMEM;
+}
+
 if (netdev_dpdk_get_vid(dev) >= 0) {
 dev->vhost_reconfigured = true;
 }
+
+return 0;
 }
 
 static int
 netdev_dpdk_vhost_reconfigure(struct netdev *netdev)
 {
 struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+int err;
 
 ovs_mutex_lock(&dev->mutex);
-dpdk_vhost_reconfigure_helper(dev);
+err = dpdk_vhost_reconfigure_helper(dev);
 ovs_mutex_unlock(&dev->mutex);
-return 0;
+
+return err;
 }
 
 static int
 netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
 {
 struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
-int err = 0;
+int err;
 
 ovs_mutex_lock(&dev->mutex);
 
-dpdk_vhost_reconfigure_helper(dev);
-
 /* Configure vHost client mode if requested and if the 

[ovs-dev] [PATCH v3 04/18] netdev-dpdk: Don't call rte_dev_stop() in update_flags().

2017-01-08 Thread Daniele Di Proietto
Calling rte_eth_dev_stop() while the device is running causes a crash.

We could use rte_eth_dev_set_link_down(), but not every PMD implements
that, and I found one NIC where that has no effect.

Instead, this commit checks if the device has the NETDEV_UP flag when
transmitting or receiving (similarly to what we do for vhostuser). I
didn't notice any performance difference with this check in case the
device is up.

An alternative would be to remove the device queues from the pmd threads
tx and receive cache, but that requires reconfiguration and I'd prefer
to avoid it, because the change can come from OpenFlow.

Signed-off-by: Daniele Di Proietto 
---
 lib/netdev-dpdk.c | 28 
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 8bb908691..2df3e220c 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -783,8 +783,6 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
 mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
 dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM;
 
-dev->flags = NETDEV_UP | NETDEV_PROMISC;
-
 /* Get the Flow control configuration for DPDK-ETH */
 diag = rte_eth_dev_flow_ctrl_get(dev->port_id, &dev->fc_conf);
 if (diag) {
@@ -890,6 +888,9 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int 
port_no,
 
 /* Initilize the hardware offload flags to 0 */
 dev->hw_ol_features = 0;
+
+dev->flags = NETDEV_UP | NETDEV_PROMISC;
+
 if (type == DPDK_DEV_ETH) {
 if (rte_eth_dev_is_valid_port(dev->port_id)) {
 err = dpdk_eth_dev_init(dev);
@@ -900,8 +901,6 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int 
port_no,
 dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
 } else {
 dev->tx_q = netdev_dpdk_alloc_txq(OVS_VHOST_MAX_QUEUE_NUM);
-/* Enable DPDK_DEV_VHOST device and set promiscuous mode flag. */
-dev->flags = NETDEV_UP | NETDEV_PROMISC;
 }
 
 if (!dev->tx_q) {
@@ -1591,6 +1590,10 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct 
dp_packet_batch *batch)
 int nb_rx;
 int dropped = 0;
 
+if (OVS_UNLIKELY(!(dev->flags & NETDEV_UP))) {
+return EAGAIN;
+}
+
 nb_rx = rte_eth_rx_burst(rx->port_id, rxq->queue_id,
  (struct rte_mbuf **) batch->packets,
  NETDEV_MAX_BURST);
@@ -1821,6 +1824,11 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
struct dp_packet_batch *batch, bool may_steal,
bool concurrent_txq)
 {
+if (OVS_UNLIKELY(!(dev->flags & NETDEV_UP))) {
+dp_packet_delete_batch(batch, may_steal);
+return;
+}
+
 if (OVS_UNLIKELY(concurrent_txq)) {
 qid = qid % dev->up.n_txq;
 rte_spinlock_lock(>tx_q[qid].tx_lock);
@@ -2285,8 +2293,6 @@ netdev_dpdk_update_flags__(struct netdev_dpdk *dev,
enum netdev_flags *old_flagsp)
 OVS_REQUIRES(dev->mutex)
 {
-int err;
-
 if ((off | on) & ~(NETDEV_UP | NETDEV_PROMISC)) {
 return EINVAL;
 }
@@ -2300,20 +2306,10 @@ netdev_dpdk_update_flags__(struct netdev_dpdk *dev,
 }
 
 if (dev->type == DPDK_DEV_ETH) {
-if (dev->flags & NETDEV_UP) {
-err = rte_eth_dev_start(dev->port_id);
-if (err)
-return -err;
-}
-
 if (dev->flags & NETDEV_PROMISC) {
 rte_eth_promiscuous_enable(dev->port_id);
 }
 
-if (!(dev->flags & NETDEV_UP)) {
-rte_eth_dev_stop(dev->port_id);
-}
-
 netdev_change_seq_changed(>up);
 } else {
 /* If DPDK_DEV_VHOST device's NETDEV_UP flag was changed and vhost is
-- 
2.11.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v3 01/18] dpif-netdev: Fix memory leak.

2017-01-08 Thread Daniele Di Proietto
We keep all the per-port classifiers around, since they can be reused,
but when a pmd thread is destroyed we should free them.

Found using valgrind.

Fixes: 3453b4d62a98("dpif-netdev: dpcls per in_port with sorted
subtables")

Signed-off-by: Daniele Di Proietto 
---
 lib/dpif-netdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index d1f9661a2..9003f703d 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -,6 +,7 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
 /* All flows (including their dpcls_rules) have been deleted already */
 CMAP_FOR_EACH (cls, node, &pmd->classifiers) {
 dpcls_destroy(cls);
+ovsrcu_postpone(free, cls);
 }
 cmap_destroy(&pmd->classifiers);
 cmap_destroy(&pmd->flow_table);
-- 
2.11.0



[ovs-dev] [PATCH v3 02/18] dpif-netdev: Take non_pmd_mutex to access tx cached ports.

2017-01-08 Thread Daniele Di Proietto
As documented in dp_netdev_pmd_thread, we must take non_pmd_mutex to
access the tx port caches for the non pmd thread.

Found by inspection.

Signed-off-by: Daniele Di Proietto 
---
 lib/dpif-netdev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 9003f703d..f600cab00 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -3353,8 +3353,10 @@ dp_netdev_del_pmd(struct dp_netdev *dp, struct 
dp_netdev_pmd_thread *pmd)
 /* NON_PMD_CORE_ID doesn't have a thread, so we don't have to synchronize,
  * but extra cleanup is necessary */
 if (pmd->core_id == NON_PMD_CORE_ID) {
+ovs_mutex_lock(&dp->non_pmd_mutex);
 emc_cache_uninit(&pmd->flow_cache);
 pmd_free_cached_ports(pmd);
+ovs_mutex_unlock(&dp->non_pmd_mutex);
 } else {
 latch_set(&pmd->exit_latch);
 dp_netdev_reload_pmd__(pmd);
-- 
2.11.0



[ovs-dev] [PATCH 4/4] datapath: Uses the OVS_CORE_UNSPEC instead of magic numbers.

2017-01-08 Thread nickcooper-zhangtonghao
This patch uses OVS_CORE_UNSPEC for unpinned queues instead
of "-1". More importantly, "-1" cast to unsigned int is equal to
NON_PMD_CORE_ID; this patch makes the distinction between them explicit.

Signed-off-by: nickcooper-zhangtonghao 
---
 lib/dpif-netdev.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 0b73056..99e4d35 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -1293,7 +1293,7 @@ port_create(const char *devname, const char *type,
  devname, ovs_strerror(errno));
 goto out_rxq_close;
 }
-port->rxqs[i].core_id = -1;
+port->rxqs[i].core_id = OVS_CORE_UNSPEC;
 n_open_rxqs++;
 }
 
@@ -1517,7 +1517,7 @@ has_pmd_rxq_for_numa(struct dp_netdev *dp, int numa_id)
 for (i = 0; i < port->n_rxq; i++) {
 unsigned core_id = port->rxqs[i].core_id;
 
-if (core_id != -1
+if (core_id != OVS_CORE_UNSPEC
 && ovs_numa_get_numa_id(core_id) == numa_id) {
 return true;
 }
@@ -2704,7 +2704,7 @@ parse_affinity_list(const char *affinity_list, unsigned 
*core_ids, int n_rxq)
 int error = 0;
 
 for (i = 0; i < n_rxq; i++) {
-core_ids[i] = -1;
+core_ids[i] = OVS_CORE_UNSPEC;
 }
 
 if (!affinity_list) {
@@ -3617,7 +3617,7 @@ dp_netdev_add_port_rx_to_pmds(struct dp_netdev *dp,
 
 for (i = 0; i < port->n_rxq; i++) {
 if (pinned) {
-if (port->rxqs[i].core_id == -1) {
+if (port->rxqs[i].core_id == OVS_CORE_UNSPEC) {
 continue;
 }
 pmd = dp_netdev_get_pmd(dp, port->rxqs[i].core_id);
@@ -3631,7 +3631,7 @@ dp_netdev_add_port_rx_to_pmds(struct dp_netdev *dp,
 pmd->isolated = true;
 dp_netdev_pmd_unref(pmd);
 } else {
-if (port->rxqs[i].core_id != -1) {
+if (port->rxqs[i].core_id != OVS_CORE_UNSPEC) {
 continue;
 }
 pmd = dp_netdev_less_loaded_pmd_on_numa(dp, numa_id);
@@ -3760,7 +3760,7 @@ dp_netdev_reset_pmd_threads(struct dp_netdev *dp)
 for (i = 0; i < port->n_rxq; i++) {
 unsigned core_id = port->rxqs[i].core_id;
 
-if (core_id != -1) {
+if (core_id != OVS_CORE_UNSPEC) {
 numa_id = ovs_numa_get_numa_id(core_id);
 hmapx_add(&numas, (void *) numa_id);
 }
-- 
1.8.3.1





[ovs-dev] [PATCH 3/4] datapath: Uses the NR_QUEUE instead of magic numbers.

2017-01-08 Thread nickcooper-zhangtonghao
NR_QUEUE is defined in "lib/dpif-netdev.h"; netdev-dpdk uses
it instead of a magic number. netdev-dummy should do the same.

Signed-off-by: nickcooper-zhangtonghao 
---
 lib/netdev-dummy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/netdev-dummy.c b/lib/netdev-dummy.c
index d75e597..8d9c805 100644
--- a/lib/netdev-dummy.c
+++ b/lib/netdev-dummy.c
@@ -868,8 +868,8 @@ netdev_dummy_set_config(struct netdev *netdev_, const 
struct smap *args)
 goto exit;
 }
 
-new_n_rxq = MAX(smap_get_int(args, "n_rxq", 1), 1);
-new_n_txq = MAX(smap_get_int(args, "n_txq", 1), 1);
+new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
+new_n_txq = MAX(smap_get_int(args, "n_txq", NR_QUEUE), 1);
 new_numa_id = smap_get_int(args, "numa_id", 0);
 if (new_n_rxq != netdev->requested_n_rxq
 || new_n_txq != netdev->requested_n_txq
-- 
1.8.3.1





[ovs-dev] [PATCH 1/4] datapath: Fix formatting typo.

2017-01-08 Thread nickcooper-zhangtonghao
Signed-off-by: nickcooper-zhangtonghao 
---
 lib/netdev-dpdk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 625f425..376aa4d 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -738,7 +738,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
 
 memset(&eth_addr, 0x0, sizeof(eth_addr));
 rte_eth_macaddr_get(dev->port_id, &eth_addr);
-VLOG_INFO_RL(&rl, "Port %d: "ETH_ADDR_FMT"",
+VLOG_INFO_RL(&rl, "Port %d: "ETH_ADDR_FMT,
 dev->port_id, ETH_ADDR_BYTES_ARGS(eth_addr.addr_bytes));
 
 memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
-- 
1.8.3.1






Re: [ovs-dev] Sync on PTAP, EXT-382 and NSH

2017-01-08 Thread Jan Scheurich

Hi Yi,

I fully agree that support for NSH has been dragging along for too long. 
The prime reason for this (in addition to the dependency on the L3 
tunneling) has been the mentioned conceptual problems with the current 
patches. Our initiative is an attempt to put NSH on a solid basis in 
OF/OVS and to make faster progress by bundling all forces that agree on 
the solution documented in the Google doc.


We wouldn't pursue this if we saw a chance of upstreaming the NSH 
patches as they are now. We still hope that you can agree on the 
proposed approach and help make NSH happen soon.


Please find some more answers below.

BR, Jan


On 2017-01-03 01:59, Yang, Yi Y wrote:


Jan, we can’t keep waiting endlessly. The L3 patchset is already in the 
Linux kernel whether you like it or not; we’re just making sure OVS can 
work with it. VXLAN-GPE is not only for NSH.


If you think the L3 patchset in the Linux kernel has any issue, please 
send out your fix patches as soon as possible; they can be ported to 
OVS. This isn’t an excuse to wait endlessly. I don’t think we need to wait for it.


As for the several NSH-related issues you mentioned, I don’t think they 
are big issues:


Jan: “The NXM fields for NSH are used both as packet match fields and 
as packet metadata fields (after decap). This ambiguity leads to 
problems, at the latest when dealing with NSH in NSH packets.”


Yi: VXLAN tunnel metadata is used the same way, isn’t it? What problems 
do you mean? I believe we can fix them even if they really are 
“problems”.


Tunnel metadata does not have that issue. It is only valid after the 
tunnel packet has been decapsulated or before it is encapsulated, but 
never while it is encapsulated. However, if you use OXM fields both for 
matching present packet headers and for keeping their content as packet 
metadata after decapsulation, the meaning of both matching on them and 
manipulating them is essentially undefined. This is most obvious for a 
packet with two NSH headers in a row (hierarchical SFC).


Jan: “They introduce push/pop_nsh OpenFlow actions without dealing 
with the resulting non-Ethernet packets in the pipeline. The behavior 
is not at all well defined.”


Yi: L3 patchset is just for this, isn’t it? Your new proposal will 
also depend on L3 patchset, right?


No, the L3 patch set merely allows connecting L3 ports to an Ethernet 
only (non-PTAP) OF pipeline. The packets in the OF pipeline are 
logically always L2 packets, even though the representation inside 
ofproto-dpif may temporarily omit the Ethernet header from the packet. 
But the L2 fields dl_src, dl_dst and dl_type can be matched on and 
manipulated by the OF controller at any time. When a packet is sent to 
an Ethernet port, the L2 header containing these fields is always present.


There is nothing in the L3 tunneling patch set that would allow an OF 
controller to modify the packet type in the pipeline. For this we need 
the concept of a packet-type-aware pipeline and the packet_type match field.


The new proposal based on PTAP does indeed depend on "versatile" tunnel 
ports, as e.g. already implemented in the kernel datapath. Zoltan and I 
are very close to having the user-space parts of the L3 tunneling patch 
sets adapted to the packet_type concept. These replace the user-space 
patches in your latest L3 tunneling patch set. Then we'd have work 
packages 1,2,3 and 6 of the document in place and we could focus in 
parallel on 4 (PTAP), 5 (EXT-382) and NSH MD1 (7 and 8).


Jan: “The re-use of the Geneve tunnel metada fields for NSH MD2 TLVs 
is problematic because

a) it again mixes packet metadata and header fields and
b) it couldn't handle NSH MD2 in Geneve tunnels.”

Yi: You have to admit this is the best existing solution for MD type 
2; it is not perfect, but it is ready for use. I don’t think people 
will use GENEVE for NSH now; we can modify it to adapt to such a use 
case if people really want to do it that way.


Using Tunnel TLV Metadata fields for NSH MD2 packet headers is 
conceptually wrong and not general as it would break if one carried NSH 
over Geneve. We should not upstream a broken solution that will have to 
be redone (in an incompatible way) sooner or later.


Jan, I don’t think the new proposal fixes the issues you mentioned 
above; on the contrary, it will make things more complicated. Why 
don’t we go the fast path instead of taking a from-scratch approach?


I don't really see where things are getting more complicated. Large 
parts of the solution actually carry over easily. From the OF controller 
perspective, the NSH pipeline can be just as simple as before, as we 
have tried to demonstrate with the example flows in the document.




[ovs-dev] [PATCH v8 7/8] ovn: avoid snat recirc only on gateway routers

2017-01-08 Thread Mickey Spiegel
Currently, for performance reasons on gateway routers, ct_snat
that does not specify an IP address does not immediately trigger
recirculation.  On gateway routers, ct_snat that does not specify
an IP address happens in the UNSNAT pipeline stage, which is
followed by the DNAT pipeline stage that triggers recirculation
for all packets.  This DNAT pipeline stage recirculation takes
care of the recirculation needs of UNSNAT as well as other cases
such as UNDNAT.

On distributed routers, UNDNAT is handled in the egress pipeline
stage, separately from DNAT in the ingress pipeline stages.  The
DNAT pipeline stage only triggers recirculation for some packets.
Due to this difference in design, UNSNAT needs to trigger its own
recirculation.

This patch restricts the logic that avoids recirculation for
ct_snat, so that it only applies to datapaths representing
gateway routers.

Signed-off-by: Mickey Spiegel 
---
 include/ovn/actions.h  |  3 +++
 ovn/controller/lflow.c | 10 ++
 ovn/lib/actions.c  | 15 +--
 tests/ovn.at   |  2 +-
 4 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/include/ovn/actions.h b/include/ovn/actions.h
index 0bf6145..0451c08 100644
--- a/include/ovn/actions.h
+++ b/include/ovn/actions.h
@@ -417,6 +417,9 @@ struct ovnact_encode_params {
 /* 'true' if the flow is for a switch. */
 bool is_switch;
 
+/* 'true' if the flow is for a gateway router. */
+bool is_gateway_router;
+
 /* A map from a port name to its connection tracking zone. */
 const struct simap *ct_zones;
 
diff --git a/ovn/controller/lflow.c b/ovn/controller/lflow.c
index 3d7633e..281 100644
--- a/ovn/controller/lflow.c
+++ b/ovn/controller/lflow.c
@@ -107,6 +107,15 @@ is_switch(const struct sbrec_datapath_binding *ldp)
 
 }
 
+static bool
+is_gateway_router(const struct sbrec_datapath_binding *ldp,
+  const struct hmap *local_datapaths)
+{
+struct local_datapath *ld =
+get_local_datapath(local_datapaths, ldp->tunnel_key);
+return ld ? ld->has_local_l3gateway : false;
+}
+
 /* Adds the logical flows from the Logical_Flow table to flow tables. */
 static void
 add_logical_flows(struct controller_ctx *ctx, const struct lport_index *lports,
@@ -220,6 +229,7 @@ consider_logical_flow(const struct lport_index *lports,
 .lookup_port = lookup_port_cb,
 .aux = &aux,
 .is_switch = is_switch(ldp),
+.is_gateway_router = is_gateway_router(ldp, local_datapaths),
 .ct_zones = ct_zones,
 .group_table = group_table,
 
diff --git a/ovn/lib/actions.c b/ovn/lib/actions.c
index 686ecc5..3da3dbe 100644
--- a/ovn/lib/actions.c
+++ b/ovn/lib/actions.c
@@ -788,12 +788,15 @@ encode_ct_nat(const struct ovnact_ct_nat *cn,
 ct = ofpacts->header;
 if (cn->ip) {
 ct->flags |= NX_CT_F_COMMIT;
-} else if (snat) {
-/* XXX: For performance reasons, we try to prevent additional
- * recirculations.  So far, ct_snat which is used in a gateway router
- * does not need a recirculation. ct_snat(IP) does need a
- * recirculation.  Should we consider a method to let the actions
- * specify whether an action needs recirculation if there more use
+} else if (snat && ep->is_gateway_router) {
+/* For performance reasons, we try to prevent additional
+ * recirculations.  ct_snat which is used in a gateway router
+ * does not need a recirculation.  ct_snat(IP) does need a
+ * recirculation.  ct_snat in a distributed router needs
+ * recirculation regardless of whether an IP address is
+ * specified.
+ * XXX Should we consider a method to let the actions specify
+ * whether an action needs recirculation if there are more use
  * cases?. */
 ct->recirc_table = NX_CT_RECIRC_NONE;
 }
diff --git a/tests/ovn.at b/tests/ovn.at
index 5e5d5c2..caa9773 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -852,7 +852,7 @@ ct_dnat();
 
 # ct_snat
 ct_snat;
-encodes as ct(zone=NXM_NX_REG12[0..15],nat)
+encodes as ct(table=27,zone=NXM_NX_REG12[0..15],nat)
 has prereqs ip
 ct_snat(192.168.1.2);
 encodes as 
ct(commit,table=27,zone=NXM_NX_REG12[0..15],nat(src=192.168.1.2))
-- 
1.9.1



[ovs-dev] [PATCH v8 6/8] ovn: move load balancing flows after NAT flows

2017-01-08 Thread Mickey Spiegel
This will make it easy for distributed NAT to reuse some of the
existing code for NAT flows, while leaving load balancing and defrag
as functionality specific to gateway routers.  There is no intent to
change any functionality in this patch.

Signed-off-by: Mickey Spiegel 
---
 ovn/northd/ovn-northd.c | 140 
 1 file changed, 70 insertions(+), 70 deletions(-)

diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
index d01c42a..59fd02e 100644
--- a/ovn/northd/ovn-northd.c
+++ b/ovn/northd/ovn-northd.c
@@ -4101,76 +4101,6 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap 
*ports,
 const char *lb_force_snat_ip = get_force_snat_ip(od, "lb",
  _ip);
 
-/* A set to hold all ips that need defragmentation and tracking. */
-struct sset all_ips = SSET_INITIALIZER(&all_ips);
-
-for (int i = 0; i < od->nbr->n_load_balancer; i++) {
-struct nbrec_load_balancer *lb = od->nbr->load_balancer[i];
-struct smap *vips = &lb->vips;
-struct smap_node *node;
-
-SMAP_FOR_EACH (node, vips) {
-uint16_t port = 0;
-
-/* node->key contains IP:port or just IP. */
-char *ip_address = NULL;
-ip_address_and_port_from_lb_key(node->key, &ip_address, &port);
-if (!ip_address) {
-continue;
-}
-
-if (!sset_contains(&all_ips, ip_address)) {
-sset_add(&all_ips, ip_address);
-}
-
-/* Higher priority rules are added for load-balancing in DNAT
- * table.  For every match (on a VIP[:port]), we add two flows
- * via add_router_lb_flow().  One flow is for specific matching
- * on ct.new with an action of "ct_lb($targets);".  The other
- * flow is for ct.est with an action of "ct_dnat;". */
-ds_clear(&actions);
-ds_put_format(&actions, "ct_lb(%s);", node->value);
-
-ds_clear(&match);
-ds_put_format(&match, "ip && ip4.dst == %s",
-  ip_address);
-free(ip_address);
-
-if (port) {
-if (lb->protocol && !strcmp(lb->protocol, "udp")) {
-ds_put_format(&match, " && udp && udp.dst == %d",
-  port);
-} else {
-ds_put_format(&match, " && tcp && tcp.dst == %d",
-  port);
-}
-add_router_lb_flow(lflows, od, &match, &actions, 120,
-   lb_force_snat_ip);
-} else {
-add_router_lb_flow(lflows, od, &match, &actions, 110,
-   lb_force_snat_ip);
-}
-}
-}
-
-/* If there are any load balancing rules, we should send the
- * packet to conntrack for defragmentation and tracking.  This helps
- * with two things.
- *
- * 1. With tracking, we can send only new connections to pick a
- *DNAT ip address from a group.
- * 2. If there are L4 ports in load balancing rules, we need the
- *defragmentation to match on L4 ports. */
-const char *ip_address;
-SSET_FOR_EACH(ip_address, &all_ips) {
-ds_clear(&match);
-ds_put_format(&match, "ip && ip4.dst == %s", ip_address);
-ovn_lflow_add(lflows, od, S_ROUTER_IN_DEFRAG,
-  100, ds_cstr(&match), "ct_next;");
-}
-
-sset_destroy(&all_ips);
-
 for (int i = 0; i < od->nbr->n_nat; i++) {
 const struct nbrec_nat *nat;
 
@@ -4325,6 +4255,76 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap 
*ports,
 * routing in the openflow pipeline. */
 ovn_lflow_add(lflows, od, S_ROUTER_IN_DNAT, 50,
   "ip", "flags.loopback = 1; ct_dnat;");
+
+/* A set to hold all ips that need defragmentation and tracking. */
+struct sset all_ips = SSET_INITIALIZER(&all_ips);
+
+for (int i = 0; i < od->nbr->n_load_balancer; i++) {
+struct nbrec_load_balancer *lb = od->nbr->load_balancer[i];
+struct smap *vips = &lb->vips;
+struct smap_node *node;
+
+SMAP_FOR_EACH (node, vips) {
+uint16_t port = 0;
+
+/* node->key contains IP:port or just IP. */
+char *ip_address = NULL;
+ip_address_and_port_from_lb_key(node->key, &ip_address, &port);
+if (!ip_address) {
+continue;
+}
+
+if (!sset_contains(&all_ips, ip_address)) {
+sset_add(&all_ips, ip_address);
+}
+
+/* Higher priority rules are added for load-balancing in DNAT
+ 

[ovs-dev] [PATCH v8 4/8] ovn: Introduce distributed gateway port and "chassisredirect" port binding

2017-01-08 Thread Mickey Spiegel
Currently OVN distributed logical routers achieve reachability to
physical networks by passing through a "join" logical switch to a
centralized gateway router, which then connects to another logical
switch that has a localnet port connecting to the physical network.

This patch adds logical port and port binding abstractions that allow
an OVN distributed logical router to connect directly to a logical
switch that has a localnet port connecting to the physical network.
In this patch, this logical router port is called a "distributed
gateway port".

The primary design goal of distributed gateway ports is to allow as
much traffic as possible to be handled locally on the hypervisor
where a VM or container resides.  Whenever possible, packets from
the VM or container to the outside world should be processed
completely on that VM's or container's hypervisor, eventually
traversing a localnet port instance on that hypervisor to the
physical network.  Whenever possible, packets from the outside
world to a VM or container should be directed through the physical
network directly to the VM's or container's hypervisor, where the
packet will enter the integration bridge through a localnet port.

However, due to the implications of the use of L2 learning in the
physical network, as well as the need to support advanced features
such as one-to-many NAT (aka IP masquerading), where multiple
logical IP addresses spread across multiple chassis are mapped to
one external IP address, it will be necessary to handle some of the
logical router processing on a specific chassis in a centralized
manner.  For this reason, the user must associate a chassis with
each distributed gateway port.

In order to allow for the distributed processing of some packets,
distributed gateway ports need to be logical patch ports that
effectively reside on every hypervisor, rather than "l3gateway"
ports that are bound to a particular chassis.  However, the flows
associated with distributed gateway ports often need to be
associated with physical locations.  This is implemented in this
patch (and subsequent patches) by adding "is_chassis_resident()"
match conditions to several logical router flows.

While most of the physical location dependent aspects of distributed
gateway ports can be handled by restricting some flows to specific
chassis, one additional mechanism is required.  When a packet
leaves the ingress pipeline and the logical egress port is the
distributed gateway port, one of two different sets of actions is
required at table 32:
- If the packet can be handled locally on the sender's hypervisor
  (e.g. one-to-one NAT traffic), then the packet should just be
  resubmitted locally to table 33, in the normal manner for
  distributed logical patch ports.
- However, if the packet needs to be handled on the chassis
  associated with the distributed gateway port (e.g. one-to-many
  SNAT traffic or non-NAT traffic), then table 32 must send the
  packet on a tunnel port to that chassis.
In order to trigger the second set of actions, the
"chassisredirect" type of southbound port_binding is introduced.
Setting the logical egress port to the type "chassisredirect"
logical port is simply a way to indicate that although the packet
is destined for the distributed gateway port, it needs to be
redirected to a different chassis.  At table 32, packets with this
logical egress port are sent to a specific chassis, in the same
way that table 32 directs packets whose logical egress port is a
VIF or a type "l3gateway" port to different chassis.  Once the
packet arrives at that chassis, table 33 resets the logical egress
port to the value representing the distributed gateway port.  For
each distributed gateway port, there is one type "chassisredirect"
port, in addition to the distributed logical patch port
representing the distributed gateway port.

A "chassisredirect" port represents a particular instance, bound
to a specific chassis, of an otherwise distributed port.  A
"chassisredirect" port is associated with a chassis in the same
manner as a "l3gateway" port.  However, unlike "l3gateway" ports,
"chassisredirect" ports have no associated IP or MAC addresses,
and "chassisredirect" ports should never be used as the "inport".
Any pipeline stages that depend on port specific IP or MAC addresses
should be carried out in the context of the distributed gateway
port's logical patch port.

Although the abstraction represented by the "chassisredirect" port
binding is generalized, in this patch the "chassisredirect" port binding
is only created for NB logical router ports that specify the new
"redirect-chassis" option.  There is no explicit notion of a
"chassisredirect" port in the NB database.  The expectation is when
capabilities are implemented that take advantage of "chassisredirect"
ports (e.g. distributed gateway ports), flows specifying a
"chassisredirect" port as the outport will be added as part of that
capability.

Signed-off-by: Mickey Spiegel 

[ovs-dev] [PATCH v8 5/8] ovn: add egress loopback capability

2017-01-08 Thread Mickey Spiegel
This patch adds the capability to force loopback at the end of the
egress pipeline.  A new flags.force_egress_loopback symbol is defined,
along with corresponding flags bits.  When flags.force_egress_loopback
is set, at OFTABLE_LOG_TO_PHY, instead of the packet being sent out to
the peer patch port or out the outport, the packet is forced back to
the beginning of the ingress pipeline with inport = outport.  All
other registers are cleared, as if the packet just arrived on that
inport.

This capability is needed in order to implement some of the east/west
distributed NAT flows.

Note: The existing flags.loopback allows a packet to go from the end
of the ingress pipeline to the beginning of the egress pipeline with
outport = inport, which is different.

Initially, there are no tests incorporated in this patch.  This
functionality is tested in a subsequent distributed NAT flows patch.
Tests specific to egress loopback may be added once the capability
to inject a packet with one of the flags bits set is added.

Signed-off-by: Mickey Spiegel 
---
 ovn/controller/physical.c   | 38 ++
 ovn/lib/logical-fields.c|  8 
 ovn/lib/logical-fields.h| 14 ++
 ovn/northd/ovn-northd.8.xml |  4 +++-
 ovn/northd/ovn-northd.c |  2 ++
 ovn/ovn-sb.xml  |  2 +-
 ovn/utilities/ovn-trace.c   | 41 ++---
 7 files changed, 92 insertions(+), 17 deletions(-)

diff --git a/ovn/controller/physical.c b/ovn/controller/physical.c
index 07b7cd5..35f761d 100644
--- a/ovn/controller/physical.c
+++ b/ovn/controller/physical.c
@@ -183,7 +183,7 @@ get_zone_ids(const struct sbrec_port_binding *binding,
 }
 
 static void
-put_local_common_flows(uint32_t dp_key, uint32_t port_key,
+put_local_common_flows(uint32_t dp_key, uint32_t port_key, ofp_port_t ofport,
bool nested_container, const struct zone_ids *zone_ids,
struct ofpbuf *ofpacts_p, struct hmap *flow_table)
 {
@@ -259,6 +259,36 @@ put_local_common_flows(uint32_t dp_key, uint32_t port_key,
 put_stack(MFF_IN_PORT, ofpact_put_STACK_POP(ofpacts_p));
 ofctrl_add_flow(flow_table, OFTABLE_SAVE_INPORT, 100, 0,
&match, ofpacts_p);
+
+/* Table 65, Priority 150.
+ * ===
+ *
+ * Send packets with MLF_FORCE_EGRESS_LOOPBACK flag back to the
+ * ingress pipeline with inport = outport. */
+
+match_init_catchall(&match);
+ofpbuf_clear(ofpacts_p);
+match_set_metadata(&match, htonll(dp_key));
+match_set_reg(&match, MFF_LOG_OUTPORT - MFF_REG0, port_key);
+match_set_reg_masked(&match, MFF_LOG_FLAGS - MFF_REG0,
+ MLF_FORCE_EGRESS_LOOPBACK, MLF_FORCE_EGRESS_LOOPBACK);
+
+size_t clone_ofs = ofpacts_p->size;
+struct ofpact_nest *clone = ofpact_put_CLONE(ofpacts_p);
+put_load(ofport, MFF_IN_PORT, 0, 16, ofpacts_p);
+put_load(port_key, MFF_LOG_INPORT, 0, 32, ofpacts_p);
+put_load(0, MFF_LOG_OUTPORT, 0, 32, ofpacts_p);
+put_load(MLF_EGRESS_LOOPBACK_OCCURRED, MFF_LOG_FLAGS, 0, 32, ofpacts_p);
+for (int i = 0; i < MFF_N_LOG_REGS; i++) {
+put_load(0, MFF_LOG_REG0 + i, 0, 32, ofpacts_p);
+}
+put_resubmit(OFTABLE_LOG_INGRESS_PIPELINE, ofpacts_p);
+clone = ofpbuf_at_assert(ofpacts_p, clone_ofs, sizeof *clone);
+ofpacts_p->header = clone;
+ofpact_finish_CLONE(ofpacts_p, &clone);
+
+ofctrl_add_flow(flow_table, OFTABLE_LOG_TO_PHY, 150, 0,
+&match, ofpacts_p);
 }
 
 static void
@@ -321,7 +351,7 @@ consider_port_binding(enum mf_field_id mff_ovn_geneve,
 }
 
 struct zone_ids binding_zones = get_zone_ids(binding, ct_zones);
-put_local_common_flows(dp_key, port_key, false, &binding_zones,
+put_local_common_flows(dp_key, port_key, 0, false, &binding_zones,
ofpacts_p, flow_table);
 
match_init_catchall(&match);
@@ -490,8 +520,8 @@ consider_port_binding(enum mf_field_id mff_ovn_geneve,
  */
 
 struct zone_ids zone_ids = get_zone_ids(binding, ct_zones);
-put_local_common_flows(dp_key, port_key, nested_container, &zone_ids,
-   ofpacts_p, flow_table);
+put_local_common_flows(dp_key, port_key, ofport, nested_container,
+   &zone_ids, ofpacts_p, flow_table);
 
 /* Table 0, Priority 150 and 100.
  * ==
diff --git a/ovn/lib/logical-fields.c b/ovn/lib/logical-fields.c
index fa134d6..c056e41 100644
--- a/ovn/lib/logical-fields.c
+++ b/ovn/lib/logical-fields.c
@@ -96,6 +96,14 @@ ovn_init_symtab(struct shash *symtab)
  MLF_FORCE_SNAT_FOR_LB_BIT);
 expr_symtab_add_subfield(symtab, "flags.force_snat_for_lb", NULL,
  flags_str);
+snprintf(flags_str, sizeof flags_str, "flags[%d]",
+ MLF_FORCE_EGRESS_LOOPBACK_BIT);
+expr_symtab_add_subfield(symtab, "flags.force_egress_loopback", NULL,
+ flags_str);

[ovs-dev] [PATCH v8 1/8] ovn: specify addresses of type "router" lsps as "router"

2017-01-08 Thread Mickey Spiegel
Currently in OVN, when a logical switch port of type "router" is
created, the MAC and optionally IP addresses of the peer logical
router port must be specified again as the addresses of the logical
switch port.

This patch allows the logical switch port's addresses to be
specified as the string "router", rather than explicitly copying the
logical router port's MAC and optionally IP addresses.  The router
addresses are used to populate the logical switch's destination
lookup, and to populate op->lsp_addrs in ovn-northd.c, which in turn
is used to generate logical switch ARP and ND replies.  Since ipam
already looks at logical router ports, the only ipam modification
necessary is to skip logical switch ports with addresses "router".

Signed-off-by: Mickey Spiegel 
Acked-by: Ben Pfaff 
---
 ovn/northd/ovn-northd.c   | 31 +--
 ovn/ovn-nb.xml| 22 ++
 ovn/utilities/ovn-nbctl.8.xml |  9 +
 ovn/utilities/ovn-nbctl.c |  1 +
 tests/ovn.at  |  3 ++-
 5 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
index a28327b..5ad544d 100644
--- a/ovn/northd/ovn-northd.c
+++ b/ovn/northd/ovn-northd.c
@@ -815,7 +815,7 @@ ipam_insert_lsp_addresses(struct ovn_datapath *od, struct ovn_port *op,
   char *address)
 {
 if (!od || !op || !address || !strcmp(address, "unknown")
-|| is_dynamic_lsp_address(address)) {
+|| !strcmp(address, "router") || is_dynamic_lsp_address(address)) {
 return;
 }
 
@@ -1205,7 +1205,8 @@ join_logical_ports(struct northd_context *ctx,
 op->lsp_addrs
 = xmalloc(sizeof *op->lsp_addrs * nbsp->n_addresses);
 for (size_t j = 0; j < nbsp->n_addresses; j++) {
-if (!strcmp(nbsp->addresses[j], "unknown")) {
+if (!strcmp(nbsp->addresses[j], "unknown")
+|| !strcmp(nbsp->addresses[j], "router")) {
 continue;
 }
 if (is_dynamic_lsp_address(nbsp->addresses[j])) {
@@ -1323,6 +1324,18 @@ join_logical_ports(struct northd_context *ctx,
 op->od->router_ports,
 sizeof *op->od->router_ports * (op->od->n_router_ports + 1));
 op->od->router_ports[op->od->n_router_ports++] = op;
+
+/* Fill op->lsp_addrs for op->nbsp->addresses[] with
+ * contents "router", which was skipped in the loop above. */
+for (size_t j = 0; j < op->nbsp->n_addresses; j++) {
+if (!strcmp(op->nbsp->addresses[j], "router")) {
+if (extract_lrp_networks(peer->nbrp,
+&op->lsp_addrs[op->n_lsp_addrs])) {
+op->n_lsp_addrs++;
+}
+break;
+}
+}
 } else if (op->nbrp && op->nbrp->peer) {
 struct ovn_port *peer = ovn_port_find(ports, op->nbrp->peer);
 if (peer) {
@@ -3123,6 +3136,20 @@ build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
 ds_put_format(, "outport = %s; output;", op->json_key);
 ovn_lflow_add(lflows, op->od, S_SWITCH_IN_L2_LKUP, 50,
   ds_cstr(), ds_cstr());
+} else if (!strcmp(op->nbsp->addresses[i], "router")) {
+if (!op->peer || !op->peer->nbrp
+|| !ovs_scan(op->peer->nbrp->mac,
+ETH_ADDR_SCAN_FMT, ETH_ADDR_SCAN_ARGS(mac))) {
+continue;
+}
+ds_clear(&match);
+ds_put_format(&match, "eth.dst == "ETH_ADDR_FMT,
+  ETH_ADDR_ARGS(mac));
+
+ds_clear(&actions);
+ds_put_format(&actions, "outport = %s; output;", op->json_key);
+ovn_lflow_add(lflows, op->od, S_SWITCH_IN_L2_LKUP, 50,
+  ds_cstr(&match), ds_cstr(&actions));
 } else {
 static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
 
diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml
index a3dc916..e52b29e 100644
--- a/ovn/ovn-nb.xml
+++ b/ovn/ovn-nb.xml
@@ -519,6 +519,28 @@
 
   
 
+  router
+  
+
+  Accepted only when  is router.
+  This indicates that the Ethernet, IPv4, and IPv6 addresses for
+  this logical switch port should be obtained from the connected
+  logical router port, as specified by router-port in
+  .
+
+
+
+  The resulting addresses are used to populate the logical
+  switch's destination lookup, and also for the logical switch
+  to generate ARP and ND replies.
+
+
+
+  Supported only 

[ovs-dev] [PATCH v8 2/8] ovn: document logical routers and logical patch ports in ovn-architecture

2017-01-08 Thread Mickey Spiegel
This patch adds a description of logical routers and logical patch ports,
including gateway routers, to ovn/ovn-architecture.7.xml.

Signed-off-by: Mickey Spiegel 
---
 ovn/ovn-architecture.7.xml | 148 ++---
 1 file changed, 140 insertions(+), 8 deletions(-)

diff --git a/ovn/ovn-architecture.7.xml b/ovn/ovn-architecture.7.xml
index b049e51..d92f878 100644
--- a/ovn/ovn-architecture.7.xml
+++ b/ovn/ovn-architecture.7.xml
@@ -381,6 +381,36 @@
   switch.  Logical switches and routers are both implemented as logical
   datapaths.
 
+
+
+  
+Logical ports represent the points of connectivity in and
+out of logical switches and logical routers.  Some common types of
+logical ports are:
+  
+
+  
+
+  Logical ports representing VIFs.
+
+
+
+  Localnet ports represent the points of connectivity
+  between logical switches and the physical network.  They are
+  implemented as OVS patch ports between the integration bridge
+  and the separate Open vSwitch bridge that underlay physical
+  ports attach to.
+
+
+
+  Logical patch ports represent the points of
+  connectivity between logical switches and logical routers, and
+  in some cases between peer logical routers.  There is a pair of
+  logical patch ports at each such point of connectivity, one on
+  each side.
+
+  
+
   
 
   Life Cycle of a VIF
@@ -1040,17 +1070,119 @@
 is a container nested with a VM, then before sending the packet the
 actions push on a VLAN header with an appropriate VLAN ID.
   
-
-  
-If the logical egress port is a logical patch port, then table 65
-outputs to an OVS patch port that represents the logical patch port.
-The packet re-enters the OpenFlow flow table from the OVS patch port's
-peer in table 0, which identifies the logical datapath and logical
-input port based on the OVS patch port's OpenFlow port number.
-  
 
   
 
+  Logical Routers and Logical Patch Ports
+
+  
+Typically logical routers and logical patch ports do not have a
+physical location and effectively reside on every hypervisor.  This is
+the case for logical patch ports between logical routers and logical
+switches behind those logical routers, to which VMs (and VIFs) attach.
+  
+
+  
+Consider a packet sent from one virtual machine or container to another
+VM or container that resides on a different subnet.  The packet will
+traverse tables 0 to 65 as described in the previous section
+Architectural Physical Life Cycle of a Packet, using the
+logical datapath representing the logical switch that the sender is
+attached to.  At table 32, the packet will use the fallback flow that
+resubmits locally to table 33 on the same hypervisor.  In this case,
+all of the processing from table 0 to table 65 occurs on the hypervisor
+where the sender resides.
+  
+
+  
+When the packet reaches table 65, the logical egress port is a logical
+patch port.  The behavior at table 65 differs depending on the OVS
+version:
+  
+
+  
+
+  In OVS versions 2.6 and earlier, table 65 outputs to an OVS patch
+  port that represents the logical patch port.  The packet re-enters
+  the OpenFlow flow table from the OVS patch port's peer in table 0,
+  which identifies the logical datapath and logical input port based
+  on the OVS patch port's OpenFlow port number.
+
+
+
+  In OVS versions 2.7 and later, the packet is cloned and resubmitted
+  directly to OpenFlow flow table 16, setting the logical ingress
+  port to the peer logical patch port, and using the peer logical
+  patch port's logical datapath (that represents the logical router).
+
+  
+
+  
+The packet re-enters the ingress pipeline in order to traverse tables
+16 to 65 again, this time using the logical datapath representing the
+logical router.  The processing continues as described in the previous
+section Architectural Physical Life Cycle of a Packet.
+When the packet reaches table 65, the logical egress port will once
+again be a logical patch port.  In the same manner as described above,
+this logical patch port will cause the packet to be resubmitted to
+OpenFlow tables 16 to 65, this time using the logical datapath
+representing the logical switch that the destination VM or container
+is attached to.
+  
+
+  
+The packet traverses tables 16 to 65 a third and final time.  If the
+destination VM or container resides on a remote hypervisor, then table
+32 will send the packet on a tunnel port from the sender's hypervisor
+to the remote hypervisor.  Finally table 65 will output the packet
+directly to the destination VM or