[dpdk-dev] [PATCH] mbuf: fix atomic refcnt update synchronization

2016-09-03 Thread Ananyev, Konstantin
Hi,

> -----Original Message-----
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Linzhe Lee
> Sent: Saturday, September 3, 2016 3:05 AM
> To: Stephen Hemminger 
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] mbuf: fix atomic refcnt update synchronization
> 
> Yes, Stephen.
> 
> my config file here: http://pastebin.com/N0RKGArh
> 
> 2016-09-03 0:51 GMT+08:00 Stephen Hemminger :
> > On Sat, 3 Sep 2016 00:31:50 +0800
> > Linzhe Lee  wrote:
> >
> >> Thanks for the reply, Stephen.
> >>
> >>
> >>
> >> I'm in x86-64, my cpu is `Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz`.
> >>
> >>
> >>
> >> When program1 allocates an mbuf and transfers it to program2 to be freed
> >> via a ring, program1 sometimes hits an assert when allocating an mbuf.
> >> (`RTE_ASSERT(rte_mbuf_refcnt_read(m) == 0);`)

If you believe there is a problem inside rte_mbuf code,
please provide a test program to reproduce the issue.
So far, I personally don't see any issue in the rte_mbuf code.
Konstantin


> >>
> >>
> >>
> >> But when I check it with gdb, the refcnt field of the mbuf is already
> >> zero, so I believe the problem comes from a cache-line issue or
> >> incorrect optimization.
> >>
> >>
> >>
> >> When this patch is applied, the problem seems to be solved. I'm submitting it
> >> for your comments.
> >
> > Are you sure you have REFCNT_ATOMIC set?
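
For reference, a minimal reproducer for the scenario described above could look
roughly like the sketch below. It is only an approximation: it uses two lcores in
a single process rather than two separate programs sharing a ring, and the pool/ring
names and sizes are illustrative, not taken from the original report.

/*
 * Hypothetical sketch: lcore A allocates mbufs and passes them over a ring,
 * lcore B frees them, so allocation and free happen on different cores.
 */
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_ring.h>

static struct rte_mempool *pool;
static struct rte_ring *ring;

static int
free_loop(__attribute__((unused)) void *arg)
{
    void *m;

    for (;;) {
        if (rte_ring_dequeue(ring, &m) == 0)
            rte_pktmbuf_free((struct rte_mbuf *)m);
    }
    return 0;
}

int
main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    pool = rte_pktmbuf_pool_create("repro_pool", 8191, 256, 0,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    ring = rte_ring_create("repro_ring", 1024, rte_socket_id(), 0);
    if (pool == NULL || ring == NULL)
        return -1;

    /* free on another lcore */
    rte_eal_remote_launch(free_loop, NULL, rte_get_next_lcore(-1, 1, 0));

    /* allocate on this lcore; rte_pktmbuf_alloc() is where the
     * RTE_ASSERT(rte_mbuf_refcnt_read(m) == 0) is reported to fire */
    for (;;) {
        struct rte_mbuf *m = rte_pktmbuf_alloc(pool);

        if (m != NULL && rte_ring_enqueue(ring, m) != 0)
            rte_pktmbuf_free(m);
    }
    return 0;
}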


[dpdk-dev] [RFC][PATCH 2/3] examples/vhost: Add vswitch command line options

2016-09-03 Thread Pankaj Chauhan
On 9/2/2016 8:14 PM, Maxime Coquelin wrote:
>
>
> On 08/27/2016 06:26 PM, Pankaj Chauhan wrote:
>> Add command line options for selecting the switch implementation
>> and the maximum number of ports for the vswitch. Following are the two new
>> command line options:
>>
>> --switch  [char string, Selects the switch implementation]
>> --max-ports [int, selects maximum number of ports to support]
>>
>> For example:
>>
>> $ ./vhost-switch -c 3 -n 2 --socket-mem 1024 --huge-dir /hugetlbfs -- -p
>> 0x1 --dev-basename sock1 --switch "vmdq" --max-ports 3
>>
>> Signed-off-by: Pankaj Chauhan 
>> ---
>>  examples/vhost/main.c | 43 +++
>>  1 file changed, 43 insertions(+)
>>
>> diff --git a/examples/vhost/main.c b/examples/vhost/main.c
>> index 92a9823..59cddb8 100644
>> --- a/examples/vhost/main.c
>> +++ b/examples/vhost/main.c
>> @@ -142,6 +142,10 @@ static uint32_t burst_rx_retry_num =
>> BURST_RX_RETRIES;
>>  /* Character device basename. Can be set by user. */
>>  static char dev_basename[MAX_BASENAME_SZ] = "vhost-net";
>>
>> +/* vswitch device name and maximum number of ports */
>> +static char switch_dev[MAX_BASENAME_SZ] = "vmdq";
>> +static uint32_t switch_max_ports = MAX_DEVICES;
>> +
>>  /* empty vmdq configuration structure. Filled in programatically */
>>  static struct rte_eth_conf vmdq_conf_default = {
>>  .rxmode = {
>> @@ -408,6 +412,22 @@ us_vhost_parse_basename(const char *q_arg)
>>  }
>>
>>  /*
>> + * Set switch device name.
>> + */
>> +static int
>> +us_vhost_parse_switch_name(const char *q_arg)
>> +{
>> +/* parse number string */
>> +
>> +if (strnlen(q_arg, MAX_BASENAME_SZ) > MAX_BASENAME_SZ)
>> +return -1;
>> +else
>> +snprintf((char*)&switch_dev, MAX_BASENAME_SZ, "%s", q_arg);
> why casting?

Yes, not required, will remove it.
>> +
>> +return 0;
>> +}
>> +
>> +/*
>>   * Parse the portmask provided at run time.
>>   */
>>  static int
>> @@ -501,6 +521,8 @@ us_vhost_parse_args(int argc, char **argv)
>>  {"tx-csum", required_argument, NULL, 0},
>>  {"tso", required_argument, NULL, 0},
>>  {"client", no_argument, _mode, 1},
>> +{"switch", required_argument, NULL, 0},
>> +{"max-ports", required_argument, NULL, 0},
>>  {NULL, 0, 0, 0},
>>  };
>>
>> @@ -655,6 +677,27 @@ us_vhost_parse_args(int argc, char **argv)
>>  }
>>  }
>>
>> +/* Set vswitch_driver name */
>> +if (!strncmp(long_option[option_index].name, "switch",
>> MAX_LONG_OPT_SZ)) {
>> +if (us_vhost_parse_switch_name(optarg) == -1) {
>> +RTE_LOG(INFO, VHOST_CONFIG, "Invalid argument for
>> character switch dev (Max %d characters)\n", MAX_BASENAME_SZ);
> ERR may be more appropriate.

I didn't get the comment; could you please help me understand?
> And the message may be a little too long.

I will shorten it, thanks.
>
>> +us_vhost_usage(prgname);
>> +return -1;
>> +}
>> +}
>> +
>> +/* Specify Max ports in vswitch. */
>> +if (!strncmp(long_option[option_index].name, "max-ports",
>> MAX_LONG_OPT_SZ)) {
>> +ret = parse_num_opt(optarg, INT32_MAX);
>> +if (ret == -1) {
>> +RTE_LOG(INFO, VHOST_CONFIG, "Invalid argument for
>> switch max ports [0-N]\n");
>> +us_vhost_usage(prgname);
>> +return -1;
>> +} else {
>> +switch_max_ports = ret;
>> +}
> The else is not needed as the 'if' returns.

Agreed, will fix it.
>> +}
>> +
>>  break;
>>
>>  /* Invalid option - print options. */
>>
>
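
For what it's worth, the "max-ports" handling with both review comments applied
(ERR log level, no else branch) could end up looking roughly like this sketch; it
reuses the names from the patch above (parse_num_opt, us_vhost_usage,
switch_max_ports) and is not the final code:

/* Specify max ports in vswitch (sketch with review comments applied). */
if (!strncmp(long_option[option_index].name, "max-ports",
        MAX_LONG_OPT_SZ)) {
    ret = parse_num_opt(optarg, INT32_MAX);
    if (ret == -1) {
        RTE_LOG(ERR, VHOST_CONFIG,
            "Invalid argument for max-ports [0-N]\n");
        us_vhost_usage(prgname);
        return -1;
    }
    switch_max_ports = ret;
}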




[dpdk-dev] [RFC][PATCH 1/3] examples/vhost: Add vswitch (generic switch) framework

2016-09-03 Thread Pankaj Chauhan
On 9/2/2016 7:47 PM, Maxime Coquelin wrote:
>
> Hi Pankaj,
>
>   Sorry for the late review.

Hi Maxime,

No problem :) Thanks for the review, and it came just in time. I am
almost done with testing the patchset for Intel (with vmdq) and I was
about to send the next version tomorrow.

Now that I have your review and comments, I will incorporate them and send the
tested patchset soon.

Thanks again for your time and effort on the review.

>
> On 08/27/2016 06:26 PM, Pankaj Chauhan wrote:
>> Indroduce support for a generic framework for handling of switching
>> between
> s/Indroduce/Introduce/
>> physical and virtio devices. The vswitch framework introduces the
>> following
>> concept:
> Shouldn't you use vhost term instead of virtio?

Agreed, I will use the vhost term in the next version.
>>
>> 1. vswitch_dev: Vswitch device is a logical switch which can have
>> Phsyical and
> s/Phsyical/physical/
>> virtio devices. The devices are operated/used using standard DPDK API for
>> devices.
> Maybe mention that today, we don't use PMD API for vhost devices but
> use directly the vhost library API?

Agreed, will do it.
>>
>> 2. vswitch_port: Any physical or virtio device that is added to
>> vswitch. The
>> port can have it's own tx/rx functions for doing data transfer, which
>> are exposed
> s/it's/its/

Sorry for the typo, will fix it.
>> to the framework using generic function pointers
>> (vs_port->do_tx/do_rx). This way
>> the generic code can do tx/rx without understanding the type of device
>> (Physical or
>> virtio).
> Very good.
>>
>> 3. vswitch_ops: The ops is set of function pointers which are used to
>> do operations
>> like learning, unlearning, add/delete port, lookup_and_forward. The
>> user of vswitch
>> framework (vhost/main.[c,h])uses these function pointers to perform
>> above mentioned
>> operations, thus it remains agnostic of the underlying implmentation.
> s/implmentation/implementation/

Typo again, my bad! Will fix it.
>
>
>>
>> Different switching logics can implement their vswitch_device and
>> vswitch_ops, and
>> register with the framework. This framework makes vhost-switch
>> application scalable
>> in terms of:
>>
>> 1. Different switching logics (one of them is vmdq, vhost/vmdq.[c,h])
>> 2. Number of ports.
>> 3. Policies of selecting ports for rx and tx.
>>
>> Signed-off-by: Pankaj Chauhan 
>> ---
>>  examples/vhost/Makefile |   2 +-
>>  examples/vhost/vswitch_common.c | 467
>> 
>>  examples/vhost/vswitch_common.h | 175 +++
>>  3 files changed, 643 insertions(+), 1 deletion(-)
>>  create mode 100644 examples/vhost/vswitch_common.c
>>  create mode 100644 examples/vhost/vswitch_common.h
>>
>> diff --git a/examples/vhost/Makefile b/examples/vhost/Makefile
>> index e95c68a..458d166 100644
>> --- a/examples/vhost/Makefile
>> +++ b/examples/vhost/Makefile
>> @@ -48,7 +48,7 @@ else
>>  APP = vhost-switch
>>
>>  # all source are stored in SRCS-y
>> -SRCS-y := main.c
>> +SRCS-y := main.c vswitch_common.c
>>
>>  CFLAGS += -O2 -D_FILE_OFFSET_BITS=64
>>  CFLAGS += $(WERROR_FLAGS)
>> diff --git a/examples/vhost/vswitch_common.c
>> b/examples/vhost/vswitch_common.c
>> new file mode 100644
>> index 000..f0e07f2
>> --- /dev/null
>> +++ b/examples/vhost/vswitch_common.c
>> @@ -0,0 +1,467 @@
>> +/*-
>> + *   BSD LICENSE
>> + *
>> + *   Copyright(c) 2016 Freescale Semiconductor. All rights reserved.
>> + *   All rights reserved.
>> + *
>> + *   Redistribution and use in source and binary forms, with or without
>> + *   modification, are permitted provided that the following conditions
>> + *   are met:
>> + *
>> + * * Redistributions of source code must retain the above copyright
>> + *   notice, this list of conditions and the following disclaimer.
>> + * * Redistributions in binary form must reproduce the above
>> copyright
>> + *   notice, this list of conditions and the following disclaimer in
>> + *   the documentation and/or other materials provided with the
>> + *   distribution.
>> + * * Neither the name of Freescale Semiconductor nor the names of
>> its
>> + *   contributors may be used to endorse or promote products derived
>> + *   from this software without specific prior written permission.
>> + *
>> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
>> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
>> FITNESS FOR
>> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
>> COPYRIGHT
>> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
>> INCIDENTAL,
>> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
>> USE,
>> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
>> ON ANY
>> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
>> + *   (INCLUDING 
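
To make the three concepts described above concrete, here is a rough sketch of what
the port, ops and device structures could look like; the field and callback names
are illustrative and not necessarily those used in vswitch_common.h:

#include <stdint.h>
#include <rte_mbuf.h>

struct vswitch_dev;

/* A port added to the vswitch: physical or vhost, with its own rx/tx
 * callbacks so the generic code never needs to know the device type. */
struct vswitch_port {
    uint16_t port_id;
    struct vswitch_dev *vs_dev;
    uint16_t (*do_rx)(struct vswitch_port *port, uint16_t queue,
              struct rte_mbuf **pkts, uint16_t count);
    uint16_t (*do_tx)(struct vswitch_port *port, uint16_t queue,
              struct rte_mbuf **pkts, uint16_t count);
};

/* Operations a switching implementation (e.g. vmdq) registers with the
 * framework; main.c calls these without knowing the implementation. */
struct vswitch_ops {
    int (*add_port)(struct vswitch_dev *vs_dev, struct vswitch_port *port);
    int (*del_port)(struct vswitch_dev *vs_dev, struct vswitch_port *port);
    int (*learn)(struct vswitch_dev *vs_dev, struct rte_mbuf *m,
             struct vswitch_port *in_port);
    int (*unlearn)(struct vswitch_dev *vs_dev, struct vswitch_port *port);
    int (*lookup_and_fwd)(struct vswitch_dev *vs_dev, struct rte_mbuf **pkts,
              uint16_t count, struct vswitch_port *in_port);
};

/* The logical switch itself: a name, a set of ports and the ops table. */
struct vswitch_dev {
    char name[32];
    struct vswitch_port *ports;
    uint16_t max_ports;
    const struct vswitch_ops *ops;
};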

[dpdk-dev] Loading external PMD

2016-09-03 Thread faust1002
Hello,
I want to write my own PMD for testing and debugging purposes. I
compiled it as a shared library and was going to load it using the "-d"
option. Unfortunately, it didn't work.
Walking through the DPDK source code, I found out that the "-d" option does
hardly anything (please correct me if I am wrong). Could you explain to me
what the initial purpose of the "-d" option was?
If the option is obsolete / not implemented / whatever, this information
should be placed in the documentation. The current description is IMHO misleading.
I wonder if there is any other way of using my own PMD except linking the
binary with it.
Best regards


[dpdk-dev] [dpdk-users] ixgbe drop all the packet

2016-09-03 Thread wei wang
It seems that setting the fdir conf in the port conf causes the problem: just
adding the fdir conf to port_conf, without adding any fdir rules, is enough.

Set the port conf like this in the l2fwd code:

static const struct rte_eth_conf port_conf = {
.rxmode = {
.split_hdr_size = 0,
.header_split   = 0, /**< Header Split disabled */
.hw_ip_checksum = 0, /**< IP checksum offload disabled */
.hw_vlan_filter = 0, /**< VLAN filtering disabled */
.jumbo_frame= 0, /**< Jumbo Frame Support disabled */
.hw_strip_crc   = 0, /**< CRC stripped by hardware */
},

.txmode = {
.mq_mode = ETH_MQ_TX_NONE,
},

.fdir_conf = {
 .mode = RTE_FDIR_MODE_PERFECT,
 .pballoc = RTE_FDIR_PBALLOC_64K,
 .status = RTE_FDIR_REPORT_STATUS,
 .mask = {
 .vlan_tci_mask = 0x0,
 .ipv4_mask = {
 .src_ip = 0x0,
 .dst_ip = 0x,
 },
 .ipv6_mask = {
 .src_ip = {0x0, 0x0, 0x0, 0x0},
 .dst_ip = {0x, 0x, 0x, 0x},
 },
 .src_port_mask = 0x0,
 .dst_port_mask = 0x0,
 .mac_addr_byte_mask = 0xFF,
 .tunnel_type_mask = 0x0,
 .tunnel_id_mask = 0x0,
 },
 .drop_queue = 0,

 },
}

If there is no fdir conf, the program in this test case works normally.

2016-09-02 18:16 GMT+08:00 wei wang :
> The issue can be reproduced with the example program l2fwd by modifying it
> to sleep 1 second before launching the lcore threads.
>
>
> code like this:
> 
> check_all_ports_link_status(nb_ports, l2fwd_enabled_port_mask);
>
> sleep(1);
> /* launch per-lcore init on every lcore */
> rte_eal_mp_remote_launch(l2fwd_launch_one_lcore, NULL, CALL_MASTER);
> RTE_LCORE_FOREACH_SLAVE(lcore_id) {
> if (rte_eal_wait_lcore(lcore_id) < 0)
> return -1;
> }
> 


[dpdk-dev] [PATCH] mbuf: fix atomic refcnt update synchronization

2016-09-03 Thread Linzhe Lee
Yes, Stephen.

my config file here: http://pastebin.com/N0RKGArh

2016-09-03 0:51 GMT+08:00 Stephen Hemminger :
> On Sat, 3 Sep 2016 00:31:50 +0800
> Linzhe Lee  wrote:
>
>> Thanks for the reply, Stephen.
>>
>>
>>
>> I'm in x86-64, my cpu is `Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz`.
>>
>>
>>
>> When program1 allocates an mbuf and transfers it to program2 to be freed
>> via a ring, program1 sometimes hits an assert when allocating an mbuf.
>> (`RTE_ASSERT(rte_mbuf_refcnt_read(m) == 0);`)
>>
>>
>>
>> But when I check it with gdb, the refcnt field of the mbuf is already
>> zero, so I believe the problem comes from a cache-line issue or
>> incorrect optimization.
>>
>>
>>
>> When this patch is applied, the problem seems to be solved. I'm submitting it
>> for your comments.
>
> Are you sure you have REFCNT_ATOMIC set?


[dpdk-dev] [PATCH] mbuf: fix atomic refcnt update synchronization

2016-09-03 Thread Linzhe Lee
Thanks for the reply, Stephen.



I'm in x86-64, my cpu is `Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz`.



When program1 allocates an mbuf and transfers it to program2 to be freed
via a ring, program1 sometimes hits an assert when allocating an mbuf.
(`RTE_ASSERT(rte_mbuf_refcnt_read(m) == 0);`)



But when I check it with gdb, the refcnt field of the mbuf is already
zero, so I believe the problem comes from a cache-line issue or
incorrect optimization.



When this patch is applied, the problem seems to be solved. I'm submitting it
for your comments.


2016-09-03 0:12 GMT+08:00 Stephen Hemminger :
> On Fri,  2 Sep 2016 13:25:06 +0800
> lilinzhe  wrote:
>
>> From: ??? 
>>
>> change atomic ref update to always call atomic_add
>>
>> When an mbuf is allocated by cpu1 and freed by cpu2, the cpu1 cache may not be
>> updated by such a set operation,
>> which causes refcnt to read incorrect values.
>
> What architecture are you dealing with? On X86 memory is cache coherent.
>
> Doing an atomic operation all the time on each mbuf free would significantly
> slow down performance.
>


[dpdk-dev] [PATCH v2 4/4] hash: modify lookup bulk pipeline

2016-09-03 Thread Pablo de Lara
From: Byron Marohn 

This patch replaces the pipelined rte_hash lookup mechanism with a
loop-and-jump model, which performs significantly better,
especially for smaller table sizes and smaller table occupancies.

Signed-off-by: Byron Marohn 
Signed-off-by: Saikrishna Edupuganti 
Signed-off-by: Pablo de Lara 
---
 lib/librte_hash/rte_cuckoo_hash.c | 377 --
 lib/librte_hash/rte_cuckoo_hash.h |   3 +-
 2 files changed, 117 insertions(+), 263 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c 
b/lib/librte_hash/rte_cuckoo_hash.c
index eab28a1..47b5beb 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -913,43 +913,8 @@ rte_hash_get_key_with_position(const struct rte_hash *h, 
const int32_t position,
return 0;
 }

-/* Lookup bulk stage 0: Prefetch input key */
 static inline void
-lookup_stage0(unsigned *idx, uint64_t *lookup_mask,
-   const void * const *keys)
-{
-   *idx = __builtin_ctzl(*lookup_mask);
-   if (*lookup_mask == 0)
-   *idx = 0;
-
-   rte_prefetch0(keys[*idx]);
-   *lookup_mask &= ~(1llu << *idx);
-}
-
-/*
- * Lookup bulk stage 1: Calculate primary/secondary hashes
- * and prefetch primary/secondary buckets
- */
-static inline void
-lookup_stage1(unsigned idx, hash_sig_t *prim_hash, hash_sig_t *sec_hash,
-   const struct rte_hash_bucket **primary_bkt,
-   const struct rte_hash_bucket **secondary_bkt,
-   hash_sig_t *hash_vals, const void * const *keys,
-   const struct rte_hash *h)
-{
-   *prim_hash = rte_hash_hash(h, keys[idx]);
-   hash_vals[idx] = *prim_hash;
-   *sec_hash = rte_hash_secondary_hash(*prim_hash);
-
-   *primary_bkt = &h->buckets[*prim_hash & h->bucket_bitmask];
-   *secondary_bkt = &h->buckets[*sec_hash & h->bucket_bitmask];
-
-   rte_prefetch0(*primary_bkt);
-   rte_prefetch0(*secondary_bkt);
-}
-
-static inline void
-compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
+compare_signatures(uint32_t *prim_hash_matches, uint32_t *sec_hash_matches,
const struct rte_hash_bucket *prim_bkt,
const struct rte_hash_bucket *sec_bkt,
hash_sig_t prim_hash, hash_sig_t sec_hash,
@@ -960,11 +925,11 @@ compare_signatures(unsigned *prim_hash_matches, unsigned 
*sec_hash_matches,
switch (sig_cmp_fn) {
 #ifdef RTE_MACHINE_CPUFLAG_AVX2
case RTE_HASH_COMPARE_AVX2:
-   *prim_hash_matches |= 
_mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+   *prim_hash_matches = 
_mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
_mm256_load_si256(
(__m256i const *)prim_bkt->sig_current),
_mm256_set1_epi32(prim_hash)));
-   *sec_hash_matches |= 
_mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+   *sec_hash_matches = 
_mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
_mm256_load_si256(
(__m256i const *)sec_bkt->sig_current),
_mm256_set1_epi32(sec_hash)));
@@ -973,7 +938,7 @@ compare_signatures(unsigned *prim_hash_matches, unsigned 
*sec_hash_matches,
 #ifdef RTE_MACHINE_CPUFLAG_SSE2
case RTE_HASH_COMPARE_SSE:
/* Compare the first 4 signatures in the bucket */
-   *prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+   *prim_hash_matches = _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
_mm_load_si128(
(__m128i const *)prim_bkt->sig_current),
_mm_set1_epi32(prim_hash)));
@@ -982,7 +947,7 @@ compare_signatures(unsigned *prim_hash_matches, unsigned 
*sec_hash_matches,
(__m128i const 
*)&prim_bkt->sig_current[4]),
_mm_set1_epi32(prim_hash))) << 4);
/* Compare the first 4 signatures in the bucket */
-   *sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+   *sec_hash_matches = _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
_mm_load_si128(
(__m128i const *)sec_bkt->sig_current),
_mm_set1_epi32(sec_hash)));
@@ -1003,244 +968,134 @@ compare_signatures(unsigned *prim_hash_matches, 
unsigned *sec_hash_matches,

 }

-/*
- * Lookup bulk stage 2:  Search for match hashes in primary/secondary locations
- * and prefetch first key slot
- */
+#define PREFETCH_OFFSET 4
 static inline void
-lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
-   const struct rte_hash_bucket *prim_bkt,
-   const struct rte_hash_bucket *sec_bkt,
- 

[dpdk-dev] [PATCH v2 3/4] hash: add vectorized comparison

2016-09-03 Thread Pablo de Lara
From: Byron Marohn 

In lookup bulk function, the signatures of all entries
are compared against the signature of the key that is being looked up.
Now that all the signatures are together, they can be compared
with vector instructions (SSE, AVX2), achieving higher lookup performance.

Also, entries per bucket are increased to 8 when using processors
with AVX2, as 256 bits can be compared at once, which is the size of
8x32-bit signatures.

Signed-off-by: Byron Marohn 
Signed-off-by: Saikrishna Edupuganti 
Signed-off-by: Pablo de Lara 
---
 lib/librte_hash/rte_cuckoo_hash.c | 73 ---
 lib/librte_hash/rte_cuckoo_hash.h | 12 ++-
 2 files changed, 79 insertions(+), 6 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c 
b/lib/librte_hash/rte_cuckoo_hash.c
index 9d507b6..eab28a1 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -283,6 +283,15 @@ rte_hash_create(const struct rte_hash_parameters *params)
h->free_slots = r;
h->hw_trans_mem_support = hw_trans_mem_support;

+#if defined(RTE_ARCH_X86)
+   if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+   h->sig_cmp_fn = RTE_HASH_COMPARE_AVX2;
+   else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE2))
+   h->sig_cmp_fn = RTE_HASH_COMPARE_SSE;
+   else
+#endif
+   h->sig_cmp_fn = RTE_HASH_COMPARE_SCALAR;
+
/* Turn on multi-writer only with explicit flat from user and TM
 * support.
 */
@@ -939,6 +948,61 @@ lookup_stage1(unsigned idx, hash_sig_t *prim_hash, 
hash_sig_t *sec_hash,
rte_prefetch0(*secondary_bkt);
 }

+static inline void
+compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
+   const struct rte_hash_bucket *prim_bkt,
+   const struct rte_hash_bucket *sec_bkt,
+   hash_sig_t prim_hash, hash_sig_t sec_hash,
+   enum rte_hash_sig_compare_function sig_cmp_fn)
+{
+   unsigned i;
+
+   switch (sig_cmp_fn) {
+#ifdef RTE_MACHINE_CPUFLAG_AVX2
+   case RTE_HASH_COMPARE_AVX2:
+   *prim_hash_matches |= 
_mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+   _mm256_load_si256(
+   (__m256i const *)prim_bkt->sig_current),
+   _mm256_set1_epi32(prim_hash)));
+   *sec_hash_matches |= 
_mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+   _mm256_load_si256(
+   (__m256i const *)sec_bkt->sig_current),
+   _mm256_set1_epi32(sec_hash)));
+   break;
+#endif
+#ifdef RTE_MACHINE_CPUFLAG_SSE2
+   case RTE_HASH_COMPARE_SSE:
+   /* Compare the first 4 signatures in the bucket */
+   *prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+   _mm_load_si128(
+   (__m128i const *)prim_bkt->sig_current),
+   _mm_set1_epi32(prim_hash)));
+   *prim_hash_matches |= (_mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+   _mm_load_si128(
+   (__m128i const 
*)&prim_bkt->sig_current[4]),
+   _mm_set1_epi32(prim_hash))) << 4);
+   /* Compare the first 4 signatures in the bucket */
+   *sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+   _mm_load_si128(
+   (__m128i const *)sec_bkt->sig_current),
+   _mm_set1_epi32(sec_hash)));
+   *sec_hash_matches |= (_mm_movemask_ps((__m128)_mm_cmpeq_epi16(
+   _mm_load_si128(
+   (__m128i const 
*)&sec_bkt->sig_current[4]),
+   _mm_set1_epi32(sec_hash))) << 4);
+   break;
+#endif
+   default:
+   for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
+   *prim_hash_matches |=
+   ((prim_hash == prim_bkt->sig_current[i]) << i);
+   *sec_hash_matches |=
+   ((sec_hash == sec_bkt->sig_current[i]) << i);
+   }
+   }
+
+}
+
 /*
  * Lookup bulk stage 2:  Search for match hashes in primary/secondary locations
  * and prefetch first key slot
@@ -951,15 +1015,14 @@ lookup_stage2(unsigned idx, hash_sig_t prim_hash, 
hash_sig_t sec_hash,
uint64_t *extra_hits_mask, const void *keys,
const struct rte_hash *h)
 {
-   unsigned prim_hash_matches, sec_hash_matches, key_idx, i;
+   unsigned prim_hash_matches, sec_hash_matches, key_idx;
unsigned total_hash_matches;

prim_hash_matches = 1 << 

[dpdk-dev] [PATCH v2 0/4] Cuckoo hash lookup enhancements

2016-09-03 Thread Pablo de Lara
This patchset improves lookup performance of the current hash library
by replacing the existing lookup bulk pipeline with an improved pipeline,
based on a loop-and-jump model, instead of the current 4-stage 2-entry pipeline.
Also, x86 vectorized intrinsics are used to improve performance when comparing
signatures.

The first patch modifies the layout of the bucket structure.
Currently, the buckets store all the signatures together (current and
alternative).
In order to be able to perform a vectorized signature comparison,
all current signatures have to be contiguous, so the bucket layout has been
changed, separating all the current signatures from the alternative signatures.

The second patch introduces x86 vectorized intrinsics.
When performing a bulk lookup operation, all current signatures in a bucket
are compared against the signature of the key being looked up.
Now that they are all contiguous, a vectorized comparison can be performed,
which takes fewer instructions to carry out.
On machines with AVX2, the number of entries per bucket is
increased from 4 to 8, as AVX2 allows comparing two 256-bit values, i.e.
8x32-bit integers, which are the 8 signatures in the bucket.

The third (and last) patch modifies the current pipeline of the lookup bulk
function.
The new pipeline is based on a loop-and-jump model. The two key improvements
are:

- Better prefetching: in this case, the first 4 keys to be looked up are prefetched
  up front, and after that, the rest of the keys are prefetched while the
  signature calculations are being performed (see the sketch after this list).
  This gives the CPU more time to prefetch the requested data before it is
  actually needed, which results in fewer cache misses and therefore higher
  throughput.

- Lower performance penalty when using the fallback: the lookup bulk algorithm
  assumes that most of the time there will not be a collision in a bucket, but it
  might happen that two or more signatures are equal, which means that more than
  one key comparison might be necessary. In that case, only the key of the first
  hit is prefetched, as in the current implementation. The difference now is that
  if this comparison results in a miss, the information about the other keys to
  be compared has already been stored, unlike in the current implementation,
  which needs to perform an entire simple lookup again.
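
The prefetch pattern mentioned in the first point is, roughly, the following
simplified sketch; it reuses the internal names visible in the patches above
(rte_hash_hash, rte_hash_secondary_hash, h->buckets, h->bucket_bitmask) and is not
the exact code of the lookup bulk function:

#define PREFETCH_OFFSET 4

/* Prefetch the first few keys up front... */
for (i = 0; i < PREFETCH_OFFSET && i < num_keys; i++)
    rte_prefetch0(keys[i]);

/* ...then, while hashing key i, prefetch key i + PREFETCH_OFFSET and the
 * two buckets that key i will hit. */
for (i = 0; i < (num_keys - PREFETCH_OFFSET); i++) {
    rte_prefetch0(keys[i + PREFETCH_OFFSET]);

    prim_hash[i] = rte_hash_hash(h, keys[i]);
    sec_hash[i] = rte_hash_secondary_hash(prim_hash[i]);

    primary_bkt[i] = &h->buckets[prim_hash[i] & h->bucket_bitmask];
    secondary_bkt[i] = &h->buckets[sec_hash[i] & h->bucket_bitmask];

    rte_prefetch0(primary_bkt[i]);
    rte_prefetch0(secondary_bkt[i]);
}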

This patchset depends on the following patchset:
"Hash library fixes" (http://dpdk.org/ml/archives/dev/2016-August/045780.html)

Changes in v2:
- Increased entries per bucket from 4 to 8 for all cases,
  so it is not architecture dependent any longer.
- Replaced compile-time signature comparison function selection
  with run-time selection, so the best optimization available
  will be used from a single binary.
- Reordered the hash structure, so all the fields used by lookup
  are in the same cache line (first).


Byron Marohn (3):
  hash: reorganize bucket structure
  hash: add vectorized comparison
  hash: modify lookup bulk pipeline

Pablo de Lara (1):
  hash: reorder hash structure

 lib/librte_hash/rte_cuckoo_hash.c | 455 ++
 lib/librte_hash/rte_cuckoo_hash.h |  44 ++--
 lib/librte_hash/rte_cuckoo_hash_x86.h |  20 +-
 3 files changed, 221 insertions(+), 298 deletions(-)

-- 
2.7.4