On 24.11.2017 19:08, Bodireddy, Bhanuprakash wrote:
>> On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>>>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>>>>
>>>> Padding and aligning of dp_netdev_pmd_thread structure members is
>>>> useless, broken in several ways and only greatly degrades
>>>> maintainability and extensibility of the structure.
>>>
>>> The idea of my earlier patch was to mark the cache lines and reduce
>>> the holes while still maintaining the grouping of related members in
>>> this structure.
>>
>> Some of the grouping aspects look strange. For example, it looks
>> illogical that 'exit_latch' is grouped with 'flow_table' but not with
>> 'reload_seq' and the other reload-related stuff. It looks strange that
>> statistics and counters are spread across different groups. So, IMHO,
>> it's not well grouped.
>
> I had to strike a fine balance, and some members may be placed in a
> different group due to their sizes and importance. Let me think if I
> can make it better.
>
>>
>>> Also, cache line marking is a good practice to make someone extra
>>> cautious when extending or editing important structures.
>>> Most importantly, I was experimenting with prefetching on this
>>> structure and needed cache line markers for it.
>>>
>>> I see that you are on ARM (I don't have HW to test) and want to know
>>> if this commit has any negative effect; any numbers would be
>>> appreciated.
>>
>> Basic VM-VM testing shows a stable 0.5% performance improvement with
>> the revert applied.
>
> I did P2P, PVP and PVVP with IXIA and haven't noticed any drop on x86.
>
>> Padding adds 560 additional bytes of holes.
>
> As the cache line on ARM is 128 bytes, it created holes; I can find a
> workaround to handle this.
>
>>
>>> More comments inline.
>>>
>>>>
>>>> Issues:
>>>>
>>>>   1. It's not working, because all the instances of struct
>>>>      dp_netdev_pmd_thread are allocated by the usual malloc. The
>>>>      memory is not aligned to cache lines -> the structure almost
>>>>      never starts at an aligned memory address. This means that any
>>>>      further paddings and alignments inside the structure are
>>>>      completely useless. For example:
>>>>
>>>>      Breakpoint 1, pmd_thread_main
>>>>      (gdb) p pmd
>>>>      $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>>>>      (gdb) p &pmd->cacheline1
>>>>      $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>>>>      (gdb) p &pmd->cacheline0
>>>>      $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>>>>      (gdb) p &pmd->flow_cache
>>>>      $53 = (struct emc_cache *) 0x1b1afe0
>>>>
>>>>      All of the above addresses are shifted from the cache line
>>>>      start by 32B.
>>>
>>> If you see below, all the addresses are 64-byte aligned.
>>>
>>> (gdb) p pmd
>>> $1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
>>> (gdb) p &pmd->cacheline0
>>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
>>> (gdb) p &pmd->cacheline1
>>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
>>> (gdb) p &pmd->flow_cache
>>> $4 = (struct emc_cache *) 0x7fc1e9b1a0c0
>>> (gdb) p &pmd->flow_table
>>> $5 = (struct cmap *) 0x7fc1e9fba100
>>> (gdb) p &pmd->stats
>>> $6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
>>> (gdb) p &pmd->port_mutex
>>> $7 = (struct ovs_mutex *) 0x7fc1e9fba180
>>> (gdb) p &pmd->poll_list
>>> $8 = (struct hmap *) 0x7fc1e9fba1c0
>>> (gdb) p &pmd->tnl_port_cache
>>> $9 = (struct hmap *) 0x7fc1e9fba200
>>> (gdb) p &pmd->stats_zero
>>> $10 = (unsigned long long (*)[5]) 0x7fc1e9fba240
>>>
>>> I tried using xzalloc_cacheline instead of the default xzalloc()
>>> here.
>>> I tried tens of times and always found that the address is 64-byte
>>> aligned, and it should start at the beginning of a cache line on x86.
>>> Not sure why the comment "(The memory returned will not be at the
>>> start of a cache line, though, so don't assume such alignment.)" says
>>> otherwise?
>>
>> Yes, you will always get aligned addresses on your x86 Linux system
>> that supports the posix_memalign() call. The comment says what it says
>> because the function uses some memory allocation tricks in case
>> posix_memalign() is not available (Windows, some MacOS, maybe some
>> Linux systems (not sure)), and the address will not be aligned in
>> this case.
>
> I also verified the other case, when posix_memalign isn't available,
> and even in that case it returns an address aligned on a
> CACHE_LINE_SIZE boundary. I will send out a patch to use
> xzalloc_cacheline for allocating the memory.
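[For illustration: the verification described above amounts to a check
like the following minimal sketch. It assumes xzalloc_cacheline(),
free_cacheline() and CACHE_LINE_SIZE from OVS's lib/util.h; the test
function itself is hypothetical and not part of any patch in this
thread.]

    #include <stdint.h>
    #include <stdio.h>
    #include "util.h"  /* xzalloc_cacheline(), free_cacheline(),
                        * CACHE_LINE_SIZE. */

    static void
    check_cacheline_alignment(void)
    {
        /* Allocate a dummy block the way dpif-netdev would if it
         * switched to the cache-line-aware allocator. */
        void *p = xzalloc_cacheline(4096);

        /* Holds on systems with posix_memalign(); the reply below
         * explains why it does NOT hold on the fallback path. */
        printf("%p is%s cache-line aligned\n", p,
               (uintptr_t) p % CACHE_LINE_SIZE == 0 ? "" : " NOT");

        free_cacheline(p);
    }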
I don't know how you tested this, because it is impossible:

  1. OVS allocates some memory:
         base = xmalloc(...);
  2. Rounds it up to the cache line start:
         payload = (void **) ROUND_UP((uintptr_t) base, CACHE_LINE_SIZE);
  3. Returns the pointer increased by 8 bytes:
         return (char *) payload + MEM_ALIGN;

So, unless you redefined MEM_ALIGN to zero, you will never get an
aligned memory address while allocating by xmalloc_cacheline() on a
system without posix_memalign().

>
>>
>>>
>>>>
>>>>      Can we fix it properly? NO.
>>>>      OVS currently doesn't have an appropriate API to allocate
>>>>      aligned memory. The best candidate is 'xmalloc_cacheline()',
>>>>      but it clearly states that "The memory returned will not be at
>>>>      the start of a cache line, though, so don't assume such
>>>>      alignment". And also, this function will never return aligned
>>>>      memory on Windows or MacOS.
>>>>
>>>>   2. CACHE_LINE_SIZE is not constant. Different architectures have
>>>>      different cache line sizes, but the code assumes that
>>>>      CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
>>>>      members are grouped by 64 bytes and padded to CACHE_LINE_SIZE.
>>>>      This leads to huge holes in the structures if CACHE_LINE_SIZE
>>>>      differs from 64. This is the opposite of portability. If I want
>>>>      good performance of cmap I need to have CACHE_LINE_SIZE equal
>>>>      to the real cache line size, but then I will have huge holes in
>>>>      the structures. If you take a look at struct rte_mbuf from
>>>>      DPDK, you'll see that it uses 2 defines: RTE_CACHE_LINE_SIZE
>>>>      and RTE_CACHE_LINE_MIN_SIZE to avoid holes in the mbuf
>>>>      structure.
>>>
>>> I understand that ARM and a few other processors (like OCTEON) have
>>> 128-byte cache lines.
>>> But again, I am curious about the performance impact in your case
>>> with this new alignment.
>>>
>>>>
>>>>   3. Sizes of system/libc defined types are not constant across
>>>>      systems. For example, sizeof(pthread_mutex_t) == 48 on my
>>>>      ARMv8 machine, but only 40 on x86. The difference could be
>>>>      much bigger on Windows or MacOS systems. But the code assumes
>>>>      that sizeof(struct ovs_mutex) is always 48 bytes. This may lead
>>>>      to broken alignment or big holes in case of padding, and to
>>>>      wrong comments about the amount of free pad bytes.
>>>
>>> This isn't an issue, as you have already mentioned; it is more about
>>> the comment that states the pad bytes.
>>> In case of ARM it would be just 8 pad bytes instead of 16 on x86.
>>>
>>>     union {
>>>         struct {
>>>             struct ovs_mutex port_mutex;    /* 4849984    48 */
>>>         };                                  /*            48 */
>>>         uint8_t pad13[64];                  /*            64 */
>>>     };
>>>
>>
>> It's not only about 'port_mutex'. If you take a look at 'flow_mutex',
>> you will see that it is not even padded. So, increasing the size of
>> 'flow_mutex' shifts all the following padded blocks, and no other
>> block will be properly aligned even if the structure is allocated on
>> aligned memory.
>>
>>>>
>>>>   4. Sizes of many fields in the structure depend on defines like
>>>>      DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so on.
>>>>      Any change in these defines, or any change in any structure
>>>>      contained by the thread structure, leads to a not-so-simple
>>>>      refactoring of the whole dp_netdev_pmd_thread structure. This
>>>>      greatly reduces maintainability and complicates development of
>>>>      new features.
>>>
>>> I don't think it complicates development; instead, I feel the commit
>>> gives a clear indication to the developer that the members are
>>> grouped and aligned and marked with cacheline markers.
>>> This makes the developer extra cautious when adding new members so
>>> that holes can be avoided.
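[For illustration: a minimal sketch of the grouping style under
discussion, using the PADDED_MEMBERS_CACHELINE_MARKER and
PADDED_MEMBERS macros from OVS's lib/util.h. The structure and its
members are hypothetical, not from the patch.]

    #include <stdint.h>
    #include "util.h"  /* PADDED_MEMBERS*, CACHE_LINE_SIZE. */

    /* 'cacheline0' is a zero-sized marker member that can be inspected
     * from gdb (as in the dumps above); each group is padded out to
     * CACHE_LINE_SIZE bytes, so the next group starts at the next
     * cache-line-sized offset, provided the structure itself is
     * allocated at an aligned address. */
    struct example_counters {
        PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
            uint64_t hits;          /* Hot: written on every lookup. */
            uint64_t misses;        /* Hot: written on every lookup. */
        );
        PADDED_MEMBERS(CACHE_LINE_SIZE,
            uint64_t hits_zero;     /* Cold: baseline for reporting. */
            uint64_t misses_zero;   /* Cold: baseline for reporting. */
        );
    };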
>>
>> Starting rebase of the output batching patch-set I figured out that I
>> need to remove 'unsigned long long last_cycles' and add
>> 'struct dp_netdev_pmd_thread_ctx ctx', which is 8 bytes larger. Could
>> you, please, suggest where I should place that new structure member
>> and what to do with the hole from 'last_cycles'?
>>
>> This is not a trivial question, because the already poor grouping
>> will become worse in almost any case.
>
> Aah, realized now that the batching series doesn't cleanly apply on
> master. Let me check this and will send across the changes that should
> fix this.
>
> - Bhanuprakash
>
>>>
>>> Cacheline marking the structure is a good practice, and I am sure
>>> this structure is significant and should be carefully extended in
>>> the future.
>>
>> Not so sure about that.
>>
>>>
>>>>
>>>>   5. There is no reason to align the flow_cache member, because
>>>>      it's too big and we usually access random entries, from a
>>>>      single thread only.
>>>>
>>>
>>> I see your point. This patch wasn't done for performance; it was
>>> more to have some order in this ever-growing structure. During
>>> testing I found that for some test cases aligning the flow_cache was
>>> giving me 100k+ performance consistently, and so it was added.
>>
>> This was a random performance boost. You achieved it without aligned
>> memory allocation. Just luck with your system environment. Using
>> xzalloc_cacheline will likely eliminate this performance difference
>> or even degrade the performance.
>>
>>>
>>>> So, the padding/alignment only creates the appearance of a
>>>> performance optimization but does nothing useful in reality. It
>>>> only complicates maintenance and adds huge holes for non-x86
>>>> architectures and non-Linux systems. The performance improvement
>>>> stated in the original commit message is likely random and not
>>>> valuable. I see no performance difference.
>>>
>>> I understand that this is causing issues with architectures having
>>> different cache line sizes, but unfortunately the majority of them
>>> have 64-byte cache lines, so this change makes sense.
>>
>> I understand that 64-byte cache lines are spread a lot wider. I also
>> have x86 as a target arch, but still, IMHO, OVS is a cross-platform
>> application and it should not have platform-dependent stuff which
>> makes one architecture/platform better and worsens others.
>>
>>>
>>> If you have performance data to prove that this causes severe perf
>>> degradation, I can think of workarounds for ARM.
>>>
>>> - Bhanuprakash.
>>
>>
>> P.S.: If you want to test with CACHE_LINE_SIZE=128 you will have to
>> apply the following patch to avoid a build-time assert (I'll send it
>> formally later):
>>
>> -----------------------------------------------------------------------------
>> diff --git a/lib/cmap.c b/lib/cmap.c
>> index 35decea..5b15ecd 100644
>> --- a/lib/cmap.c
>> +++ b/lib/cmap.c
>> @@ -123,12 +123,11 @@ COVERAGE_DEFINE(cmap_shrink);
>>  /* Number of entries per bucket: 7 on 32-bit, 5 on 64-bit. */
>>  #define CMAP_K ((CACHE_LINE_SIZE - 4) / CMAP_ENTRY_SIZE)
>>
>> -/* Pad to make a bucket a full cache line in size: 4 on 32-bit, 0 on 64-bit. */
>> -#define CMAP_PADDING ((CACHE_LINE_SIZE - 4) - (CMAP_K * CMAP_ENTRY_SIZE))
>> -
>>  /* A cuckoo hash bucket.  Designed to be cache-aligned and exactly one cache
>>   * line long. */
>>  struct cmap_bucket {
>> +    /* Padding to make cmap_bucket exactly one cache line long. */
>> +    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>      /* Allows readers to track in-progress changes.  Initially zero, each
>>       * writer increments this value just before and just after each change (see
>>       * cmap_set_bucket()).  Thus, a reader can ensure that it gets a consistent
>> @@ -145,11 +144,7 @@ struct cmap_bucket {
>>       * slots. */
>>      uint32_t hashes[CMAP_K];
>>      struct cmap_node nodes[CMAP_K];
>> -
>> -    /* Padding to make cmap_bucket exactly one cache line long. */
>> -#if CMAP_PADDING > 0
>> -    uint8_t pad[CMAP_PADDING];
>> -#endif
>> +    );
>>  };
>>  BUILD_ASSERT_DECL(sizeof(struct cmap_bucket) == CACHE_LINE_SIZE);
>>
>> diff --git a/lib/util.h b/lib/util.h
>> index 3c43c2c..514fdaa 100644
>> --- a/lib/util.h
>> +++ b/lib/util.h
>> @@ -61,7 +61,7 @@ struct Bad_arg_to_ARRAY_SIZE {
>>
>>  /* This system's cache line size, in bytes.
>>   * Being wrong hurts performance but not correctness. */
>> -#define CACHE_LINE_SIZE 64
>> +#define CACHE_LINE_SIZE 128
>>  BUILD_ASSERT_DECL(IS_POW2(CACHE_LINE_SIZE));
>>
>>  /* Cacheline marking is typically done using zero-sized array. */
>> -----------------------------------------------------------------------------
>>
>>
>> Best regards, Ilya Maximets.
>>
>>>
>>>>
>>>> Most of the above issues are also true for some other padded/aligned
>>>> structures like 'struct netdev_dpdk'. They will be treated
>>>> separately.
>>>>
>>>> CC: Bhanuprakash Bodireddy <bhanuprakash.bodire...@intel.com>
>>>> CC: Ben Pfaff <b...@ovn.org>
>>>> Signed-off-by: Ilya Maximets <i.maxim...@samsung.com>
>>>> ---
>>>>  lib/dpif-netdev.c | 160 +++++++++++++++++++++++-----------------------------
>>>>  1 file changed, 69 insertions(+), 91 deletions(-)
>>>>
>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>> index 0a62630..6784269 100644
>>>> --- a/lib/dpif-netdev.c
>>>> +++ b/lib/dpif-netdev.c
>>>> @@ -547,31 +547,18 @@ struct tx_port {
>>>>   * actions in either case.
>>>>   * */
>>>>  struct dp_netdev_pmd_thread {
>>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
>>>> -        struct dp_netdev *dp;
>>>> -        struct cmap_node node;          /* In 'dp->poll_threads'. */
>>>> -        pthread_cond_t cond;            /* For synchronizing pmd thread
>>>> -                                           reload. */
>>>> -    );
>>>> -
>>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>>>> -        struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>>> -        pthread_t thread;
>>>> -        unsigned core_id;               /* CPU core id of this pmd thread. */
>>>> -        int numa_id;                    /* numa node id of this pmd thread. */
>>>> -    );
>>>> +    struct dp_netdev *dp;
>>>> +    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>>>> +    struct cmap_node node;          /* In 'dp->poll_threads'. */
>>>> +
>>>> +    pthread_cond_t cond;            /* For synchronizing pmd thread reload. */
>>>> +    struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>>>
>>>>      /* Per thread exact-match cache.  Note, the instance for cpu core
>>>>       * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
>>>>       * need to be protected by 'non_pmd_mutex'.  Every other instance
>>>>       * will only be accessed by its own pmd thread. */
>>>> -    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
>>>> -    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>>>> -
>>>> -    /* Queue id used by this pmd thread to send packets on all netdevs if
>>>> -     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>>> -     * than 'cmap_count(dp->poll_threads)'. */
>>>> -    uint32_t static_tx_qid;
>>>> +    struct emc_cache flow_cache;
>>>>
>>>>      /* Flow-Table and classifiers
>>>>       *
>>>> @@ -580,77 +567,68 @@ struct dp_netdev_pmd_thread {
>>>>       * 'flow_mutex'.
>>>>       */
>>>>      struct ovs_mutex flow_mutex;
>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> -        struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>> -
>>>> -        /* One classifier per in_port polled by the pmd */
>>>> -        struct cmap classifiers;
>>>> -        /* Periodically sort subtable vectors according to hit frequencies */
>>>> -        long long int next_optimization;
>>>> -        /* End of the next time interval for which processing cycles
>>>> -           are stored for each polled rxq. */
>>>> -        long long int rxq_next_cycle_store;
>>>> -
>>>> -        /* Cycles counters */
>>>> -        struct dp_netdev_pmd_cycles cycles;
>>>> -
>>>> -        /* Used to count cycles. See 'cycles_counter_end()'. */
>>>> -        unsigned long long last_cycles;
>>>> -        struct latch exit_latch;        /* For terminating the pmd thread. */
>>>> -    );
>>>> -
>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> -        /* Statistics. */
>>>> -        struct dp_netdev_pmd_stats stats;
>>>> -
>>>> -        struct seq *reload_seq;
>>>> -        uint64_t last_reload_seq;
>>>> -        atomic_bool reload;             /* Do we need to reload ports? */
>>>> -        bool isolated;
>>>> -
>>>> -        /* Set to true if the pmd thread needs to be reloaded. */
>>>> -        bool need_reload;
>>>> -        /* 5 pad bytes. */
>>>> -    );
>>>> -
>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> -        struct ovs_mutex port_mutex;    /* Mutex for 'poll_list'
>>>> -                                           and 'tx_ports'. */
>>>> -        /* 16 pad bytes. */
>>>> -    );
>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> -        /* List of rx queues to poll. */
>>>> -        struct hmap poll_list OVS_GUARDED;
>>>> -        /* Map of 'tx_port's used for transmission.  Written by the main
>>>> -         * thread, read by the pmd thread. */
>>>> -        struct hmap tx_ports OVS_GUARDED;
>>>> -    );
>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> -        /* These are thread-local copies of 'tx_ports'. One contains only
>>>> -         * tunnel ports (that support push_tunnel/pop_tunnel), the other
>>>> -         * contains ports with at least one txq (that support send).
>>>> -         * A port can be in both.
>>>> -         *
>>>> -         * There are two separate maps to make sure that we don't try to
>>>> -         * execute OUTPUT on a device which has 0 txqs or PUSH/POP on a
>>>> -         * non-tunnel device.
>>>> -         *
>>>> -         * The instances for cpu core NON_PMD_CORE_ID can be accessed by
>>>> -         * multiple threads and thusly need to be protected by 'non_pmd_mutex'.
>>>> -         * Every other instance will only be accessed by its own pmd thread. */
>>>> -        struct hmap tnl_port_cache;
>>>> -        struct hmap send_port_cache;
>>>> -    );
>>>> -
>>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> -        /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>>> -         * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>>> -         * values and subtracts them from 'stats' and 'cycles' before
>>>> -         * reporting to the user */
>>>> -        unsigned long long stats_zero[DP_N_STATS];
>>>> -        uint64_t cycles_zero[PMD_N_CYCLES];
>>>> -        /* 8 pad bytes. */
>>>> -    );
>>>> +    struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>> +
>>>> +    /* One classifier per in_port polled by the pmd */
>>>> +    struct cmap classifiers;
>>>> +    /* Periodically sort subtable vectors according to hit frequencies */
>>>> +    long long int next_optimization;
>>>> +    /* End of the next time interval for which processing cycles
>>>> +       are stored for each polled rxq. */
>>>> +    long long int rxq_next_cycle_store;
>>>> +
>>>> +    /* Statistics. */
>>>> +    struct dp_netdev_pmd_stats stats;
>>>> +
>>>> +    /* Cycles counters */
>>>> +    struct dp_netdev_pmd_cycles cycles;
>>>> +
>>>> +    /* Used to count cycles. See 'cycles_counter_end()'. */
>>>> +    unsigned long long last_cycles;
>>>> +
>>>> +    struct latch exit_latch;        /* For terminating the pmd thread. */
>>>> +    struct seq *reload_seq;
>>>> +    uint64_t last_reload_seq;
>>>> +    atomic_bool reload;             /* Do we need to reload ports? */
>>>> +    pthread_t thread;
>>>> +    unsigned core_id;               /* CPU core id of this pmd thread. */
>>>> +    int numa_id;                    /* numa node id of this pmd thread. */
>>>> +    bool isolated;
>>>> +
>>>> +    /* Queue id used by this pmd thread to send packets on all netdevs if
>>>> +     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>>> +     * than 'cmap_count(dp->poll_threads)'. */
>>>> +    uint32_t static_tx_qid;
>>>> +
>>>> +    struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and 'tx_ports'. */
>>>> +    /* List of rx queues to poll. */
>>>> +    struct hmap poll_list OVS_GUARDED;
>>>> +    /* Map of 'tx_port's used for transmission.  Written by the main thread,
>>>> +     * read by the pmd thread. */
>>>> +    struct hmap tx_ports OVS_GUARDED;
>>>> +
>>>> +    /* These are thread-local copies of 'tx_ports'. One contains only tunnel
>>>> +     * ports (that support push_tunnel/pop_tunnel), the other contains ports
>>>> +     * with at least one txq (that support send).  A port can be in both.
>>>> +     *
>>>> +     * There are two separate maps to make sure that we don't try to execute
>>>> +     * OUTPUT on a device which has 0 txqs or PUSH/POP on a non-tunnel
>>>> +     * device.
>>>> +     *
>>>> +     * The instances for cpu core NON_PMD_CORE_ID can be accessed by multiple
>>>> +     * threads, and thusly need to be protected by 'non_pmd_mutex'.  Every
>>>> +     * other instance will only be accessed by its own pmd thread. */
>>>> +    struct hmap tnl_port_cache;
>>>> +    struct hmap send_port_cache;
>>>> +
>>>> +    /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>>> +     * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>>> +     * values and subtracts them from 'stats' and 'cycles' before
>>>> +     * reporting to the user */
>>>> +    unsigned long long stats_zero[DP_N_STATS];
>>>> +    uint64_t cycles_zero[PMD_N_CYCLES];
>>>> +
>>>> +    /* Set to true if the pmd thread needs to be reloaded. */
>>>> +    bool need_reload;
>>>>  };
>>>>
>>>>  /* Interface to netdev-based datapath. */
>>>> --
>>>> 2.7.4