On 04/10/2021 16:58, Han Zhou wrote:


On Mon, Oct 4, 2021 at 2:31 AM Anton Ivanov <anton.iva...@cambridgegreys.com> wrote:
>
>
> On 03/10/2021 23:45, Han Zhou wrote:
> > The current implementation of parallel build in northd with dp-groups
> > enabled results in bad performance when the below assumption is not
> > true:
> >
> >   * 3. Most RW ops are not flow adds, but changes to the
> >   * od groups.
> >
> > In fact most (if not all) of the ovn port based flows don't share
> > dp-groups, so the assumption doesn't hold in reality, and in a scale
> > test environment with ovn-k8s topology of 800 nodes, the parallel build
> > shows 5 times more time spent for one northd iteration than without
> > parallel build on a test machine with 24 cores (24 northd worker
> > threads). This is because of lock contention on the global rw lock
> > protecting the lflow hmap.
> >
> > This patch optimizes it by using an array of hash based locks instead of
> > a single global lock. It is similar to the approach prior to the commit
> > 8c980ce6, but with two major differences:
> >
> > 1) It uses a fixed length mutex array instead of the dynamic array of
> >     the struct hashrow_locks. It is equally efficient considering the low
> >     chance of contention in a large array of locks, but without the burden
> >     of resizing every time the hmap size changes. The uniqueness of the
> >     lock is guaranteed by combining the masks of both the hmap and the
> >     mutex array.
> >
> > 2) It fixes the corrupted hmap size after each round of parallel flow
> >     build. The hash based lock protects the list in each bucket, but
> >     doesn't protect the hmap size. The patch uses thread-local counters
> >     and aggregates them at the end of each iteration, which is lock free.
> >     This approach has lower cost than the alternative of atomically
> >     incrementing a global counter.
> >
> > This patch ends up with an 8x speedup over the current parallel build
> > with dp-groups enabled for the same scale test (which is 30% faster than
> > without parallel).
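
For illustration, here is a minimal standalone sketch of the scheme described in 1) and 2)
(hypothetical names; the actual patch uses lflow_hash_locks, thread_lflow_counter and
fix_flow_map_size(), as shown in the diff below):

    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>

    #define LOCK_MASK 0xFFFF                        /* fixed-size lock array */
    static pthread_mutex_t bucket_locks[LOCK_MASK + 1];
    static _Thread_local size_t thread_n_inserted;  /* per-thread insert count */

    static void
    bucket_locks_init(void)
    {
        for (size_t i = 0; i <= LOCK_MASK; i++) {
            pthread_mutex_init(&bucket_locks[i], NULL);
        }
    }

    /* Hashes that fall into the same hmap bucket (hash & hmap_mask) always map
     * to the same mutex, so concurrent inserts into one bucket are serialized.
     * Different buckets may share a mutex, which is harmless. */
    static pthread_mutex_t *
    lock_for_hash(uint32_t hash, uint32_t hmap_mask)
    {
        return &bucket_locks[hash & hmap_mask & LOCK_MASK];
    }

    /* Each worker increments thread_n_inserted for every hmap_insert_fast() it
     * does and copies the value into a shared per-thread slot when it finishes;
     * the main thread then sums the slots and writes the total back as the
     * hmap's element count, so no atomics are needed on the insert path. */

Because the lock array has a fixed size, it never needs resizing when the hmap grows, unlike
the dynamic hashrow_locks approach used before commit 8c980ce6.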
> >
> > Test command: ovn-nbctl --print-wait-time --wait=sb sync
> >
> > Before:
> >
> > no parallel:
> > ovn-northd completion: 7807ms
> >
> > parallel:
> > ovn-northd completion: 41267ms
> >
> > After:
> >
> > no parallel: (no change)
> >
> > parallel:
> > ovn-northd completion: 5081ms
> > (8x faster than before, 30% faster than no parallel)
> >
> > Note: all the above tests are with dp-groups enabled.
> >
> > Signed-off-by: Han Zhou <hz...@ovn.org>
> > ---
> > v1 -> v2: Addressed comments from Anton
> >      - Fixes the hmap size after each round of parallel flow building.
> >      - Refactored lflow_hash_lock and removed unnecessary functions to avoid
> >        clang warnings.
> >
> >   northd/northd.c | 140 ++++++++++++++++++++++++------------------------
> >   1 file changed, 70 insertions(+), 70 deletions(-)
> >
> > diff --git a/northd/northd.c b/northd/northd.c
> > index afd812700..88ab0c929 100644
> > --- a/northd/northd.c
> > +++ b/northd/northd.c
> > @@ -4319,41 +4319,44 @@ ovn_dp_group_add_with_reference(struct ovn_lflow *lflow_ref,
> >       return true;
> >   }
> >
> > -/* Adds a row with the specified contents to the Logical_Flow table.
> > - * Version to use with dp_groups + parallel - when locking is required.
> > - *
> > - * Assumptions:
> > - *
> > - * 1. A large proportion of the operations are lookups (reads).
> > - * 2. RW operations are a small proportion of overall adds.
> > - * 3. Most RW ops are not flow adds, but changes to the
> > - * od groups.
> > +/* The lflow_hash_lock is a mutex array that protects updates to the shared
> > + * lflow table across threads when parallel lflow build and dp-group are both
> > + * enabled. To avoid high contention between threads, a big array of mutexes
> > + * is used instead of just one. This is possible because when parallel build
> > + * is used we only use hmap_insert_fast() to update the hmap, which would not
> > + * touch the bucket array but only the list in a single bucket. We only need
> > + * to make sure that when adding lflows to the same hash bucket, the same
> > + * lock is used, so that no two threads can add to the bucket at the same
> > + * time. It is ok that the same lock is used to protect multiple buckets, so
> > + * a fixed sized mutex array is used instead of a 1-1 mapping to the hash
> > + * buckets. This simplifies the implementation while effectively reducing
> > + * lock contention, because the chance of different threads contending for
> > + * the same lock among the big number of locks is very low. */
> > +#define LFLOW_HASH_LOCK_MASK 0xFFFF
> > +static struct ovs_mutex lflow_hash_locks[LFLOW_HASH_LOCK_MASK + 1];
> > +
> > +static void
> > +lflow_hash_lock_init(void)
> > +{
> > +    for (size_t i = 0; i < LFLOW_HASH_LOCK_MASK + 1; i++) {
> > +        ovs_mutex_init(&lflow_hash_locks[i]);
> > +    }
> > +}
> > +
> > +/* This thread-local var is used for parallel lflow building when dp-groups
> > + * is enabled. It maintains the number of lflows inserted by the current
> > + * thread to the shared lflow hmap in the current iteration. It is needed
> > + * because the lflow_hash_lock cannot protect concurrent updates of the
> > + * hmap's size (hmap->n) by different threads.
> >    *
> > - * Principles of operation:
> > - * 1. All accesses to the flow table are protected by a rwlock.
> > - * 2. By default, everyone grabs a rd lock so that multiple threads
> > - * can do lookups simultaneously.
> > - * 3. If a change to the lflow is needed, the rd lock is released and
> > - * a wr lock is acquired instead (the fact that POSIX does not have an
> > - * "upgrade" on locks is a major pain, but there is nothing we can do
> > - * - it's not available).
> > - * 4. WR lock operations in rd/wr locking have LOWER priority than RD.
> > - * That is by design and spec. So the code after a request for WR lock
> > - * may wait for a considerable amount of time until it is given a
> > - * change to run. That means that another thread may get there in the
> > - * meantime and change the data. Hence all wr operations MUST be coded
> > - * to ensure that they are not vulnerable to "someone pulled this from
> > - * under my feet". Re- reads, checks for presense, etc.
> > - * 5. Operations on the actual od_group hash map are protected by
> > - * per-flow locks. There is no need for these to be rd, mutex is more
> > - * appropriate. They are low contention as each protects only its flow
> > - * and only during modification which happen while holding a rd lock on
> > - * the flow table.
> > - */
> > -static struct ovs_rwlock flowtable_lock;
> > + * When all threads complete the tasks of an iteration, the counters of all
> > + * the threads are collected to fix the lflow hmap's size (by the function
> > + * fix_flow_map_size()).
> > + * */
> > +static thread_local size_t thread_lflow_counter = 0;
> >
> >   /* Adds a row with the specified contents to the Logical_Flow table.
> > - * Version to use when locking is NOT required.
> > + * Version to use when hash bucket locking is NOT required.
> >    */
> >   static struct ovn_lflow *
> >   do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
> > @@ -4361,7 +4364,6 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
> >                    const char *match, const char *actions, const char *io_port,
> >                    const struct ovsdb_idl_row *stage_hint,
> >                    const char *where, const char *ctrl_meter)
> > -                 OVS_NO_THREAD_SAFETY_ANALYSIS
> >   {
> >
> >       struct ovn_lflow *old_lflow;
> > @@ -4390,14 +4392,14 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
> >           hmap_insert(lflow_map, &lflow->hmap_node, hash);
> >       } else {
> >           hmap_insert_fast(lflow_map, &lflow->hmap_node, hash);
> > +        thread_lflow_counter++;
> >       }
> >       return lflow;
> >   }
> >
> >   /* Adds a row with the specified contents to the Logical_Flow table.
> > - * Version to use when locking IS required.
> > + * Version to use when hash bucket locking IS required.
> >    */
> > -
> >   static struct ovn_lflow *
> >   do_ovn_lflow_add_pd(struct hmap *lflow_map, struct ovn_datapath *od,
> >                       uint32_t hash, enum ovn_stage stage, uint16_t priority,
> > @@ -4405,44 +4407,16 @@ do_ovn_lflow_add_pd(struct hmap *lflow_map, struct ovn_datapath *od,
> >                       const char *io_port,
> >                       const struct ovsdb_idl_row *stage_hint,
> >                       const char *where, const char *ctrl_meter)
> > -                    OVS_NO_THREAD_SAFETY_ANALYSIS
> >   {
> >
> > -    struct ovn_lflow *old_lflow;
> >       struct ovn_lflow *lflow;
> > +    struct ovs_mutex *hash_lock =
> > +        &lflow_hash_locks[hash & lflow_map->mask & LFLOW_HASH_LOCK_MASK];
> >
> > -    /* Fast Path - try to amend an existing flow without asking
> > -     * for WR access to the whole flow table. Locking the actual
> > -     * hmapx for the particular flow's odg is low overhead as its
> > -     * contention is much lower.
> > -     */
> > -
> > -    ovs_rwlock_rdlock(&flowtable_lock);
> > -    old_lflow = ovn_lflow_find(lflow_map, NULL, stage, priority, match,
> > -                               actions, ctrl_meter, hash);
> > -    if (old_lflow) {
> > -        ovs_mutex_lock(&old_lflow->odg_lock);
> > -        hmapx_add(&old_lflow->od_group, od);
> > -        ovs_mutex_unlock(&old_lflow->odg_lock);
> > -    }
> > -    ovs_rwlock_unlock(&flowtable_lock);
> > -
> > -    if (old_lflow) {
> > -        return old_lflow;
> > -    }
> > -
> > -    ovs_rwlock_wrlock(&flowtable_lock);
> > -
> > -    /* We need to rerun the "if in flowtable" steps, because someone
> > -     * could have inserted it while we were waiting to acquire an
> > -     * wr lock. As we are now holding a wr lock on it nobody else is
> > -     * in the * "fast" portion of the code which is protected by the
> > -     * rwlock.
> > -     */
> > +    ovs_mutex_lock(hash_lock);
> >       lflow = do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
> > -                                 actions, io_port, stage_hint, where,
> > -                                 ctrl_meter);
> > -    ovs_rwlock_unlock(&flowtable_lock);
> > +                             actions, io_port, stage_hint, where, ctrl_meter);
> > +    ovs_mutex_unlock(hash_lock);
> >       return lflow;
> >   }
> >
> > @@ -12767,6 +12741,7 @@ struct lswitch_flow_build_info {
> >       char *svc_check_match;
> >       struct ds match;
> >       struct ds actions;
> > +    size_t thread_lflow_counter;
> >   };
> >
> >   /* Helper function to combine all lflow generation which is iterated by
> > @@ -12892,6 +12867,7 @@ build_lflows_thread(void *arg)
> >           if (stop_parallel_processing()) {
> >               return NULL;
> >           }
> > +        thread_lflow_counter = 0;
> >           if (lsi && workload) {
> >               /* Iterate over bucket ThreadID, ThreadID+size, ... */
> >               for (bnum = control->id;
> > @@ -12952,6 +12928,7 @@ build_lflows_thread(void *arg)
> >                   }
> >               }
> >           }
> > +        lsi->thread_lflow_counter = thread_lflow_counter;
> >           post_completed_work(control);
> >       }
> >       return NULL;
> > @@ -12994,6 +12971,26 @@ noop_callback(struct worker_pool *pool OVS_UNUSED,
> >       /* Do nothing */
> >   }
> >
> > +/* Fixes the hmap size (hmap->n) after parallel building the lflow_map when
> > + * dp-groups is enabled, because in that case all threads are updating the
> > + * global lflow hmap. Although the lflow_hash_lock prevents concurrent
> > + * inserts to the same hash bucket, the hmap->n is updated concurrently by
> > + * all threads and may not be accurate at the end of each iteration. This
> > + * function collects the thread-local lflow counters maintained by each
> > + * thread and updates the hmap size with the aggregated value. It must be
> > + * called immediately after the worker threads complete the tasks in each
> > + * iteration, before any further operations on the lflow map. */
> > +static void
> > +fix_flow_map_size(struct hmap *lflow_map,
> > +                  struct lswitch_flow_build_info *lsiv,
> > +                  size_t n_lsiv)
> > +{
> > +    size_t total = 0;
> > +    for (size_t i = 0; i < n_lsiv; i++) {
> > +        total += lsiv[i].thread_lflow_counter;
> > +    }
> > +    lflow_map->n = total;
> > +}
> >
> >   static void
> >   build_lswitch_and_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
> > @@ -13046,6 +13043,7 @@ build_lswitch_and_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
> >               lsiv[index].lbs = lbs;
> >               lsiv[index].bfd_connections = bfd_connections;
> >               lsiv[index].svc_check_match = svc_check_match;
> > +            lsiv[index].thread_lflow_counter = 0;
> >               ds_init(&lsiv[index].match);
> >               ds_init(&lsiv[index].actions);
> >
> > @@ -13054,7 +13052,9 @@ build_lswitch_and_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
> >
> >           /* Run thread pool. */
> >           if (use_logical_dp_groups) {
> > -            run_pool_callback(build_lflows_pool->pool, NULL, NULL, noop_callback);
> > +            run_pool_callback(build_lflows_pool->pool, NULL, NULL,
> > +                              noop_callback);
> > +            fix_flow_map_size(lflows, lsiv, build_lflows_pool->pool->size);
> >           } else {
> >               run_pool_hash(build_lflows_pool->pool, lflows, lflow_segs);
> >           }
> > @@ -13221,7 +13221,7 @@ build_lflows(struct northd_context *ctx, struct hmap *datapaths,
> >
> >       if (use_parallel_build && use_logical_dp_groups &&
> >           needs_parallel_init) {
> > -        ovs_rwlock_init(&flowtable_lock);
> > +        lflow_hash_lock_init();
> >           needs_parallel_init = false;
> >           /* Disable parallel build on first run with dp_groups
> >            * to determine the correct sizing of hashes. */
>
>
> +1
>
> As I said before - I would like to get to the bottom of why we get an improvement on ovn-heater with the rwlock despite these numbers.
>
Yes, there are still unknowns regarding the difference in our test results. It is possible that
there are some differences in our test topology which result in different proportions of lflow
updates with DP groups vs. with a single DP. However, this still doesn't explain why we observed
such different results for the same "make check-perf" test. Those test cases have a bigger number
of per-LSP flows, which all have single DPs, so the lock contention should be high. My result
shows that the original parallel build spent 4x more time than without parallel, while in yours
parallel was better than without. Just to confirm, I was always using
TESTSUITEFLAGS="--rebuild". Were you doing the same?

> However, a fine grained lock is always better than a global lock (even a rw one). So a definite +1
>
Thanks for the review! May I consider your +1 for the series as "Acked-by"?

I am not a committer, hence using +1 (good for me) instead of an official Acked-by, which is
counted by patchwork and AFAIK changes the patch state.

Brgds,

Han

> Brgds,
>
> --
> Anton R. Ivanov
> Cambridgegreys Limited. Registered in England. Company Number 10273661
> https://www.cambridgegreys.com/
>

--
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/
