On Thu, Feb 08, 2024 at 08:37AM +0100, Marco Elver wrote:
> On Thu, 8 Feb 2024 at 00:58, Yonghong Song <yonghong.s...@linux.dev> wrote:
> > On 2/7/24 4:26 AM, Marco Elver wrote:
> > > In various performance profiles of kernels with BPF programs attached,
> > > bpf_local_storage_lookup() appears as a significant portion of CPU
> > > cycles spent. To enable the compiler to generate more optimal code,
> > > turn bpf_local_storage_lookup() into a static inline function, where
> > > only the cache insertion code path is outlined.
> > >
> > > Notably, outlining cache insertion helps avoid bloating callers by
> > > duplicating the setup of calls to raw_spin_{lock,unlock}_irq{save,restore}()
> > > (on architectures which do not inline spin_lock/unlock, such as x86),
> > > which would cause the compiler to produce worse code by deciding to
> > > outline otherwise inlinable functions. The call overhead is neutral,
> > > because we make 2 calls either way: either calling
> > > raw_spin_lock_irqsave() and raw_spin_unlock_irqrestore(); or calling
> > > __bpf_local_storage_insert_cache(), which calls raw_spin_lock_irqsave(),
> > > followed by a tail-call to raw_spin_unlock_irqrestore() where the
> > > compiler can perform TCO and (in optimized uninstrumented builds) turn
> > > it into a plain jump. The call to __bpf_local_storage_insert_cache()
> > > can be elided entirely if cacheit_lockit is a false constant
> > > expression.
> > >
> > > Based on results from './benchs/run_bench_local_storage.sh' (21 trials,
> > > reboot between each trial; x86 defconfig + BPF, clang 16) this produces
> > > improvements in throughput and latency in the majority of cases, with
> > > an average (geomean) improvement of 8%:
> [...]
> > >  include/linux/bpf_local_storage.h             | 30 ++++++++++-
> > >  kernel/bpf/bpf_local_storage.c                | 52 +++++--------------
> > >  .../bpf/prog_tests/task_local_storage.c       |  6 ---
> > >  .../selftests/bpf/progs/cgrp_ls_recursion.c   | 26 ----------
> > >  .../selftests/bpf/progs/task_ls_recursion.c   | 17 ------
> > >  5 files changed, 41 insertions(+), 90 deletions(-)
> > >
> > > diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> > > index 173ec7f43ed1..dcddb0aef7d8 100644
> > > --- a/include/linux/bpf_local_storage.h
> > > +++ b/include/linux/bpf_local_storage.h
> > > @@ -129,10 +129,36 @@ bpf_local_storage_map_alloc(union bpf_attr *attr,
> > >  			    struct bpf_local_storage_cache *cache,
> > >  			    bool bpf_ma);
> > >
> > > -struct bpf_local_storage_data *
> > > +void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
> > > +				      struct bpf_local_storage_map *smap,
> > > +				      struct bpf_local_storage_elem *selem);
> > > +/* If cacheit_lockit is false, this lookup function is lockless */
> > > +static inline struct bpf_local_storage_data *
> > >  bpf_local_storage_lookup(struct bpf_local_storage *local_storage,
> > >  			 struct bpf_local_storage_map *smap,
> > > -			 bool cacheit_lockit);
> > > +			 bool cacheit_lockit)
> > > +{
> > > +	struct bpf_local_storage_data *sdata;
> > > +	struct bpf_local_storage_elem *selem;
> > > +
> > > +	/* Fast path (cache hit) */
> > > +	sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx],
> > > +				      bpf_rcu_lock_held());
> > > +	if (sdata && rcu_access_pointer(sdata->smap) == smap)
> > > +		return sdata;
> >
> > I think we should focus on the fast path (your v1 patch),
> > and I suppose most production environments
> > want to hit the fast path most of the time.
> > In your production environment, did
> > you see more than 16 local storage maps per entity (task/sk/inode)?
>
> I think having more than 16 local storage maps isn't entirely unlikely
> as eBPF usage grows. But at the moment, it should be rare.
>
> > In the fast path, the memory accesses are
> > two from local_storage->cache[smap->cache_idx] and
> > one from sdata->smap.
> >
> > > +
> > > +	/* Slow path (cache miss) */
> > > +	hlist_for_each_entry_rcu(selem, &local_storage->list, snode,
> > > +				 rcu_read_lock_trace_held())
> > > +		if (rcu_access_pointer(SDATA(selem)->smap) == smap)
> > > +			break;
> >
> > But if we reach the slow path here, which means we have more than 16
> > local storage maps, then traversing the list and getting
> > SDATA(selem)->smap will be very expensive, in addition to the memory
> > accesses in the fast path.
> >
> > I suppose here we mostly care about socket local storage, since it is
> > totally possible for a production workload to have millions of sockets.
> > To improve performance, the fast path should hit in most cases.
> > If there are too many sk local storage maps, some kind of sharing
> > could be done so multiple applications might use a single sk
> > local storage.
> >
> > Your above inlining/outlining analysis also shows how tricky it is
> > for compilation optimization. Without profiling, it is totally
> > possible that the compiler might optimize differently in
> > the future.
>
> Sure, but it's usually the case that we have to help the compiler a
> little to produce more optimal code - if the compiler becomes stupid
> in future, we need to either fix the compiler or help it some more.
>
> > So here is my suggestion: let us do inlining
> > for the fast path and focus on the performance of the fast path.
>
> The slow path (iterate the list w/o cache insertion) is still relatively
> small (it's a pointer-chasing loop and a compare), and I decided that
> inlining it can be justified. Martin asked in v1 why there were
> slowdowns above 16 local maps; I analyzed it, and concluded that
> inlining most of the function is needed to fix that and does not hurt
> performance: in fact, the current version is better than v1 in all
> cases (even for 16 maps or below).
>
> Let me know which version you prefer, and I'll change it. However,
> based on the results, I would prefer the current version.
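For reference, the hunk quoted above only shows the
include/linux/bpf_local_storage.h side; the outlined cache-insertion
helper itself lives in kernel/bpf/bpf_local_storage.c and is not quoted
here. A rough sketch of its shape (a sketch only; the "is this selem
still linked to the storage?" re-check under the lock is paraphrased) is:

void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
				      struct bpf_local_storage_map *smap,
				      struct bpf_local_storage_elem *selem)
{
	unsigned long flags;

	/* Take the storage lock so we never publish an selem that a
	 * concurrent delete is about to free.
	 */
	raw_spin_lock_irqsave(&local_storage->lock, flags);
	if (selem_linked_to_storage(selem))	/* re-check is paraphrased */
		rcu_assign_pointer(local_storage->cache[smap->cache_idx],
				   SDATA(selem));
	/* Final call in the function: in optimized uninstrumented builds
	 * the compiler can tail-call optimize this into a plain jump.
	 */
	raw_spin_unlock_irqrestore(&local_storage->lock, flags);
}

Keeping this out of line is what avoids duplicating the lock/unlock call
setup into every bpf_local_storage_lookup() caller.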
FTR, these were the results going from v1 (before) -> v2 (after):

+---- Local Storage ----------------------
|
| + num_maps: 1
| :                                       <before>               | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
| +- hits throughput                      | 38.593 M ops/s       | 39.068 M ops/s (+1.2%)
| +- hits latency                         | 25.913 ns/op         | 25.598 ns/op (-1.2%)
| +- important_hits throughput            | 38.593 M ops/s       | 39.068 M ops/s (+1.2%)
| :
| :                                       <before>               | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput                      | 44.406 M ops/s       | 44.926 M ops/s (+1.2%)
| +- hits latency                         | 22.521 ns/op         | 22.259 ns/op (-1.2%)
| +- important_hits throughput            | 44.406 M ops/s       | 44.926 M ops/s (+1.2%)
|
| + num_maps: 10
| :                                       <before>               | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
| +- hits throughput                      | 37.583 M ops/s       | 38.099 M ops/s (+1.4%)
| +- hits latency                         | 26.609 ns/op         | 26.248 ns/op (-1.4%)
| +- important_hits throughput            | 3.758 M ops/s        | 3.810 M ops/s (+1.4%)
| :
| :                                       <before>               | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput                      | 40.698 M ops/s       | 41.145 M ops/s (+1.1%)
| +- hits latency                         | 24.573 ns/op         | 24.307 ns/op (-1.1%)
| +- important_hits throughput            | 14.535 M ops/s       | 14.695 M ops/s (+1.1%)
|
| + num_maps: 16
| :                                       <before>               | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
| +- hits throughput                      | 38.061 M ops/s       | 38.341 M ops/s ( ~ )
| +- hits latency                         | 26.275 ns/op         | 26.083 ns/op ( ~ )
| +- important_hits throughput            | 2.379 M ops/s        | 2.396 M ops/s ( ~ )
| :
| :                                       <before>               | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput                      | 40.890 M ops/s       | 41.338 M ops/s (+1.1%)
| +- hits latency                         | 24.458 ns/op         | 24.193 ns/op (-1.1%)
| +- important_hits throughput            | 13.010 M ops/s       | 13.153 M ops/s (+1.1%)
|
| + num_maps: 17
| :                                       <before>               | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
| +- hits throughput                      | 31.799 M ops/s       | 32.756 M ops/s (+3.0%)
| +- hits latency                         | 31.448 ns/op         | 30.530 ns/op (-2.9%)
| +- important_hits throughput            | 1.873 M ops/s        | 1.929 M ops/s (+3.0%)
| :
| :                                       <before>               | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput                      | 35.284 M ops/s       | 36.110 M ops/s (+2.3%)
| +- hits latency                         | 28.343 ns/op         | 27.697 ns/op (-2.3%)
| +- important_hits throughput            | 10.742 M ops/s       | 10.993 M ops/s (+2.3%)
|
| + num_maps: 24
| :                                       <before>               | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
| +- hits throughput                      | 17.947 M ops/s       | 19.937 M ops/s (+11.1%)
| +- hits latency                         | 55.725 ns/op         | 50.166 ns/op (-10.0%)
| +- important_hits throughput            | 0.748 M ops/s        | 0.831 M ops/s (+11.1%)
| :
| :                                       <before>               | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput                      | 21.379 M ops/s       | 23.332 M ops/s (+9.1%)
| +- hits latency                         | 46.775 ns/op         | 42.865 ns/op (-8.4%)
| +- important_hits throughput            | 6.014 M ops/s        | 6.564 M ops/s (+9.1%)
|
| + num_maps: 32
| :                                       <before>               | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
| +- hits throughput                      | 13.279 M ops/s       | 14.626 M ops/s (+10.1%)
| +- hits latency                         | 75.317 ns/op         | 68.381 ns/op (-9.2%)
| +- important_hits throughput            | 0.416 M ops/s        | 0.458 M ops/s (+10.2%)
| :
| :                                       <before>               | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput                      | 16.444 M ops/s       | 17.906 M ops/s (+8.9%)
| +- hits latency                         | 60.816 ns/op         | 55.865 ns/op (-8.1%)
| +- important_hits throughput            | 4.590 M ops/s        | 4.998 M ops/s (+8.9%)
|
| + num_maps: 100
| :                                       <before>               | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
| +- hits throughput                      | 4.912 M ops/s        | 5.528 M ops/s (+12.5%)
| +- hits latency                         | 207.291 ns/op        | 183.059 ns/op (-11.7%)
| +- important_hits throughput            | 0.049 M ops/s        | 0.055 M ops/s (+12.7%)
| :
| :                                       <before>               | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput                      | 6.039 M ops/s        | 6.498 M ops/s (+7.6%)
| +- hits latency                         | 167.325 ns/op        | 152.877 ns/op (-8.6%)
| +- important_hits throughput            | 1.577 M ops/s        | 1.697 M ops/s (+7.6%)
|
| + num_maps: 1000
| :                                       <before>               | <after>
| +-+ local_storage cache sequential get  +----------------------+----------------------
| +- hits throughput                      | 0.342 M ops/s        | 0.354 M ops/s (+3.6%)
| +- hits latency                         | 2930.550 ns/op       | 2827.139 ns/op (-3.5%)
| +- important_hits throughput            | 0.000 M ops/s        | 0.000 M ops/s ( ~ )
| :
| :                                       <before>               | <after>
| +-+ local_storage cache interleaved get +----------------------+----------------------
| +- hits throughput                      | 0.413 M ops/s        | 0.403 M ops/s (-2.5%)
| +- hits latency                         | 2427.830 ns/op       | 2487.555 ns/op (+2.5%)
| +- important_hits throughput            | 0.104 M ops/s        | 0.101 M ops/s (-2.6%)
|
| Geomean:
| hits throughput: 102.93%
| hits latency: 97.11%
| important_hits throughput: 102.77%
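The "Geomean" lines summarize the per-row after/before ratios with a
geometric mean. A minimal standalone sketch of that aggregation
(illustrative only: just a few of the hits-throughput ratios from the
table above are included; build with -lm) would be:

#include <math.h>
#include <stdio.h>

/* Geometric mean of n ratios: exp(mean(log(r_i))). */
static double geomean(const double *ratios, int n)
{
	double log_sum = 0.0;
	int i;

	for (i = 0; i < n; i++)
		log_sum += log(ratios[i]);
	return exp(log_sum / n);
}

int main(void)
{
	/* after/before hits-throughput ratios, e.g. 39.068/38.593 for the
	 * first sequential-get row above; the reported summary presumably
	 * aggregates all rows in the same way.
	 */
	const double hits_tput[] = {
		39.068 / 38.593, 38.099 / 37.583, 38.341 / 38.061,
		32.756 / 31.799, 19.937 / 17.947, 14.626 / 13.279,
	};
	const int n = sizeof(hits_tput) / sizeof(hits_tput[0]);

	printf("hits throughput geomean: %.2f%%\n",
	       100.0 * geomean(hits_tput, n));
	return 0;
}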