Looks good to me, thanks!

ben

> -----Original Message-----
> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Damjan Marion
> via lists.fd.io
> Sent: Wednesday, 14 July 2021 19:48
> To: vpp-dev <vpp-dev@lists.fd.io>
> Cc: Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>; Govindarajan
> Mohandoss <govindarajan.mohand...@arm.com>; Tianyu Li <tianyu...@arm.com>;
> Nitin Saxena <nsax...@marvell.com>; Lijian Zhang <lijian.zh...@arm.com>
> Subject: Re: [vpp-dev] Prefetches improvement for VPP Arm generic image
> 
> 
> I spent a bit of time looking at this and trying to come up with a
> reasonable solution.
> 
> First, the 128-byte cacheline is not dead: the recently announced Marvell
> Octeon 10 has a 128-byte cacheline.
> 
> In the current code, the cacheline size defines both the amount of data a
> prefetch instruction prefetches and the alignment of data structures needed
> to avoid false sharing.
> 
> So I think ideally we should have the following:
> 
> - on x86:
>   - number of bytes a prefetch instruction prefetches set to 64
>   - data structures should be aligned to 64 bytes
>   - due to the fact that there is an adjacent-cacheline prefetcher on x86,
>     it may be worth investigating whether aligning to 128 brings some value
>
> - on AArch64:
>   - number of bytes a prefetch instruction prefetches set to 64 or 128,
>     based on the multiarch variant running
>   - data structures should be aligned to 128 bytes, as that value prevents
>     false sharing on both 64- and 128-byte cacheline systems
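>
> As an illustration only (a sketch, not final code; CLIB_CACHE_PREFETCH_BYTES
> is the new name proposed further below, the rest is existing vppinfra), the
> two concerns could be split into two separate constants:
>
>   /* alignment used for data structures, to avoid false sharing */
>   #if defined(__x86_64__)
>   #define CLIB_CACHE_LINE_BYTES 64
>   #elif defined(__aarch64__)
>   #define CLIB_CACHE_LINE_BYTES 128 /* safe for both 64B and 128B parts */
>   #endif
>
>   /* bytes covered by one prefetch instruction, per multiarch variant */
>   #ifndef CLIB_CACHE_PREFETCH_BYTES
>   #define CLIB_CACHE_PREFETCH_BYTES 64
>   #endif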
> 
> The main problem is the abuse of the CLIB_PREFETCH() macro in our codebase.
> The original idea behind it was good: somebody wanted to provide a macro
> which transparently emits 1-4 prefetch instructions based on the data size,
> recognising that there may be systems with different cacheline sizes.
> 
> Like:
>   CLIB_PREFETCH (p, sizeof (ip6_header_t), LOAD);
> 
> But in reality, most of the time we have:
>   CLIB_PREFETCH (p, CLIB_CACHE_LINE_BYTES, LOAD);
> 
> Here it is assumed that the cacheline size is 64, which just wastes
> resources on systems with a 128-byte cacheline.
> 
> Also, most places in our codebase are perfectly fine with whatever the
> cacheline size is, so I’m thinking about the following:
> 
> 1. set CLIB_CACHE_LINE_BYTES to 64 on x86 and 128 on ARM; that will make
> sure false sharing does not happen
>
> 2. introduce CLIB_CACHE_PREFETCH_BYTES, which can be set to a different
> value for each multiarch variant (64 for N1, 128 for ThunderX2)
>
> 3. modify the CLIB_PREFETCH macro to use CLIB_CACHE_PREFETCH_BYTES to emit
> the proper number of prefetch instructions for cases where the data size is
> specified (a rough sketch follows this list)
>
> 4. go through the codebase and replace all `CLIB_PREFETCH (p,
> CLIB_CACHE_LINE_BYTES, LOAD);` with `clib_prefetch_load (p);`.
>    There may be exceptions, but those lines typically mean: "I want to
> prefetch a few (<=64) bytes at this address and I really don’t care what the
> cache line size is."
>
> 5. analyse the remaining few cases where CLIB_PREFETCH() is used with a
> size specified via CLIB_CACHE_LINE_BYTES.
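>
> To make points 3 and 4 concrete, here is a rough sketch of what the
> reworked helpers could look like (signatures and placement are
> illustrative, not final; CLIB_PREFETCH_##type and static_always_inline are
> existing vppinfra constructs):
>
>   static_always_inline void
>   clib_prefetch_load (void *p)
>   {
>     __builtin_prefetch (p, /* read */ 0, /* locality */ 3);
>   }
>
>   /* emit one prefetch per CLIB_CACHE_PREFETCH_BYTES chunk of 'size' */
>   #define CLIB_PREFETCH(addr, size, type)                                 \
>   do {                                                                    \
>     void *_addr = (addr);                                                 \
>     for (int _i = 0; _i < (int) (size); _i += CLIB_CACHE_PREFETCH_BYTES)  \
>       __builtin_prefetch ((u8 *) _addr + _i, CLIB_PREFETCH_##type,        \
>                           /* locality */ 3);                              \
>   } while (0)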
> 
> Thoughts?
> 
> —
> Damjan
> 
> > On 06.07.2021., at 03:48, Lijian Zhang <lijian.zh...@arm.com> wrote:
> >
> > Thanks Damjan for your comments. Some replies in lines.
> > <snip>
> > Hi Lijian,
> >
> > It would be good to know whether the 128-byte cacheline is something ARM
> > platforms will be using in the future or whether it is just historical.
> > [Lijian] Existing ThunderX1 and OcteonTX2 CPUs use a 128-byte cache line.
> > To my knowledge, there may be more CPUs with a 128-byte cache line in the
> > future.
> >
> > The cacheline size problem is not just about prefetching; an even bigger
> > issue is false sharing, so we need to address both.
> > [Lijian] Yes, there may be a false-sharing issue when running a VPP image
> > with the 64B definition on 128B cache-line CPUs. We will do some
> > scalability testing for that case and check the multi-core performance.
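> >
> > For illustration, a hedged example (a hypothetical structure, not taken
> > from the VPP tree): with CLIB_CACHE_LINE_BYTES set to 64 but a 128-byte
> > physical cache line, adjacent per-worker entries can land in the same
> > physical line and ping-pong between cores.
> >
> >   typedef struct
> >   {
> >     CLIB_CACHE_LINE_ALIGN_MARK (cacheline0); /* aligns to CLIB_CACHE_LINE_BYTES */
> >     u64 rx_packets; /* updated by one worker thread */
> >     u64 rx_bytes;
> >   } per_worker_counters_t;
> >
> >   /* With 64B alignment the struct occupies 64 bytes, so on a 128B
> >      cache-line CPU entries i and i+1 of a per-worker vector share one
> >      physical line, i.e. false sharing. */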
> >
> > Probably the best solution is to have two VPP images, one for 128- and
> > one for 64-byte cacheline size.
> > [Lijian] For a natively built image, that’s fine. But I’m not sure it’s
> > possible for cloud binaries installed via "apt-get install".
> >
> > Going across the whole codebase and replacing prefetch macros is
> > something we should definitely avoid.
> > [Lijian] I understand your concern about a large-scope replacement. My
> > concern is that when CLIB_PREFETCH() is used to prefetch packet content
> > into the cache, as in the example below, the cache-line size
> > (CLIB_CACHE_LINE_BYTES) always seems to be assumed to be 64 bytes.
> > CLIB_PREFETCH (p2->data, 3 * CLIB_CACHE_LINE_BYTES, LOAD);
> >
> > —
> > Damjan
> >
> >
> > On 05.07.2021., at 07:28, Lijian Zhang <lijian.zh...@arm.com> wrote:
> >
> > Hi Damjan,
> > I committed several patches to address some issues around cache-line
> definitions in VPP.
> >
> > Patch [1.1] resolves the build error [2] on 64-byte cache-line Arm CPUs,
> > e.g., ThunderX2 and Neoverse N1, caused by the commit
> > (https://gerrit.fd.io/r/c/vpp/+/32996, build: remove unused files and
> > sections).
> > It also supports building the Arm generic image (with the command "make
> > build-release") with a 128-byte cache-line definition, and building a
> > native image with a 64-byte cache-line definition on some Arm CPUs, e.g.,
> > ThunderX2 and Neoverse N1 (with the command "make build-release
> > TARGET_PLATFORM=native").
> >
> > Patch [1.5] changes the default cache-line definition in the Arm generic
> > image from 128 bytes to 64 bytes.
> > Setting the cache-line definition to 128 bytes for the Arm generic image
> > is required for ThunderX1 (with its 128-byte physical cache line), which
> > is also the build machine in the FD.io lab. I’m proposing to set a 64-byte
> > cache-line definition in VPP for the Arm image, which will affect
> > ThunderX1 and OcteonTX2 CPUs, so it requires confirmation from Marvell.
> >
> > Arm architecture CPUs have either 128-byte or 64-byte physical cache
> > lines, so whether the cache-line definition in the VPP source code is 128
> > bytes or 64 bytes, the prefetch functions in the generic image will not
> > work properly on all Arm CPUs. Patches [1.2], [1.3] and [1.4] resolve
> > this issue.
> >
> > Consider, for example, running the Arm generic image (cache-line size is
> > defined as 128B in the Makefile for all Arm architectures) on 64-byte
> > cache-line CPUs, e.g., Neoverse N1, Ampere Altra, ThunderX2.
> >
> > [3] shows the prefetch macro definitions in VPP. Using CLIB_PREFETCH(),
> > you can prefetch data that resides in multiple cache lines.
> > [4] shows some usage examples of the prefetch macros in VPP. When running
> > the Arm generic image (128B cache-line definition) on 64B cache-line CPUs
> > (N1SDP for example), 4.2, 4.3 and 4.4 have issues.
> >        • For 4.2, the input for the size parameter is 68. On N1SDP with a
> > 64B cache-line size, two prefetch instructions should be executed, but
> > because 68 is less than CLIB_CACHE_LINE_BYTES (defined as 128 bytes in
> > VPP), only the first prefetch instruction is executed.
> >        • For 4.3, if sizeof (ip0[0]) equals 68 or any other value larger
> > than 64B, there is the same issue as in 4.2.
> >        • For 4.4, the code is trying to prefetch the first 128B of packet
> > content and assumes that CLIB_CACHE_LINE_BYTES is always 64B. In the Arm
> > generic image, the input for the size parameter is 256B, which executes
> > prefetches on unexpected cache lines (the expected prefetches are on
> > 64B-0 and 64B-1, but they actually land on 64B-0 and 64B-2).
> > Packet content: [64B-0][64B-1][64B-2][64B-3]
> >
> > Our proposal is to introduce a macro, CLIB_N_CACHELINE_BYTES, via the VPP
> > multi-arch feature (see patch [1.2]), to reflect the runtime CPU
> > cache-line size in the Arm generic image, so that the prefetch
> > instructions are executed correctly.
> > Then, for 4.4, we will need to change the size parameter from
> > 2*CLIB_CACHE_LINE_BYTES to 128B, to reflect the actual intention.
> >
> > Some additional macros [1.3] can be added so that users can prefetch by a
> > number of cache lines as well as by a number of bytes.
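> >
> > For example (a sketch with a hypothetical name, not necessarily what the
> > patch implements), a line-count based helper could look like:
> >
> >   /* prefetch 'n' cache lines starting at 'addr', using the cache-line
> >      size of the multiarch variant that is actually running */
> >   #define CLIB_PREFETCH_LINES(addr, n, type)                           \
> >   do {                                                                 \
> >     u8 *_a = (u8 *) (addr);                                            \
> >     for (int _i = 0; _i < (int) (n); _i++)                             \
> >       __builtin_prefetch (_a + _i * CLIB_N_CACHELINE_BYTES,            \
> >                           CLIB_PREFETCH_##type, /* locality */ 3);     \
> >   } while (0)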
> >
> > Could you please comment on the issue and the proposal?
> >
> > [1]. Patches
> > [1.1] build: support 128B/64B cache line size in Arm image,
> https://gerrit.fd.io/r/c/vpp/+/32968/2
> > [1.2] vppinfra: refactor prefetch macro,
> https://gerrit.fd.io/r/c/vpp/+/32969/3
> > [1.3] vppinfra: fix functions to prefetch single line,
> https://gerrit.fd.io/r/c/vpp/+/32970/2
> > [1.4] misc: correct prefetch macro usage,
> https://gerrit.fd.io/r/c/vpp/+/32971/3
> > [1.5] build: set 64B cache line size in Arm image,
> https://gerrit.fd.io/r/c/vpp/+/32972/2
> >
> > [2]. Error message
> > src/plugins/dpdk/device/init.c:1916:3: error: static_assert failed due
> to requirement '128 == 1 << 6' "DPDK RTE CACHE LINE SIZE does not match
> with 1<<CLIB_LOG2_CACHE_LINE_BYTES"
> >   STATIC_ASSERT (RTE_CACHE_LINE_SIZE == 1 << CLIB_LOG2_CACHE_LINE_BYTES,
> >   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > /home/lijian/tasks/plsremove/src/vppinfra/error_bootstrap.h:111:34:
> note: expanded from macro 'STATIC_ASSERT'
> > #define STATIC_ASSERT(truth,...) _Static_assert(truth, __VA_ARGS__)
> >                                  ^              ~~~~~
> > 1 error generated.
> >
> > [3]. Prefetch macro definitions in VPP source code.
> > '''
> > #define _CLIB_PREFETCH(n,size,type)                                  \
> >   if ((size) > (n)*CLIB_CACHE_LINE_BYTES)                            \
> >     __builtin_prefetch (_addr + (n)*CLIB_CACHE_LINE_BYTES,           \
> >                         CLIB_PREFETCH_##type,                        \
> >                         /* locality */ 3);
> >
> > #define CLIB_PREFETCH(addr,size,type)                                \
> > do {                                                                 \
> >   void * _addr = (addr);                                             \
> >                                                                      \
> >   ASSERT ((size) <= 4*CLIB_CACHE_LINE_BYTES);                        \
> >   _CLIB_PREFETCH (0, size, type);                                    \
> >   _CLIB_PREFETCH (1, size, type);                                    \
> >   _CLIB_PREFETCH (2, size, type);                                    \
> >   _CLIB_PREFETCH (3, size, type);                                    \
> > } while (0)
> > '''
> >
> > [4]
> > 4.1 CLIB_PREFETCH (p2->pre_data, 32, STORE);
> > 4.2 CLIB_PREFETCH (p2->pre_data, 68, STORE);
> > 4.3 CLIB_PREFETCH (b[4]->data, sizeof (ip0[0]), LOAD);
> > 4.4 CLIB_PREFETCH (p2->data, 2*CLIB_CACHE_LINE_BYTES, LOAD);
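> >
> > For 4.4, the change proposed above would make the intent explicit in
> > bytes (an illustrative before/after, not the literal patch content):
> >
> >   /* before: 256B prefetched on a 128B-cache-line build, 128B on a 64B one */
> >   CLIB_PREFETCH (p2->data, 2 * CLIB_CACHE_LINE_BYTES, LOAD);
> >   /* after: always prefetch the first 128 bytes of packet data */
> >   CLIB_PREFETCH (p2->data, 128, LOAD);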
> >
> > Thanks.
> 
> 
> 
> 
> 
