> On 21 Mar 2019, at 16:04, Nitin Saxena <nsax...@marvell.com> wrote:
> 
> Hi Damjan,
> 
> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> on behalf of Damjan Marion via 
> Lists.Fd.Io <dmarion=me....@lists.fd.io>
> Sent: Thursday, March 21, 2019 6:03 PM
> To: Nitin Saxena
> Cc: vpp-dev@lists.fd.io
> Subject: Re: [EXT] Re: [vpp-dev] 128 byte cache line support
>  
> 
> 
> > On 21 Mar 2019, at 06:51, Nitin Saxena <nsax...@marvell.com> wrote:
> > 
> > Hi,
> > 
> > First of all, sorry for responding late to this mail chain. Please see my 
> > answers inline, marked with [Nitin].
> > 
> > Thanks,
> > Nitin
> > 
> > 
> > From: Damjan Marion <dmar...@me.com>
> > Sent: Monday, March 18, 2019 4:48 PM
> > To: Honnappa Nagarahalli
> > Cc: vpp-dev; Nitin Saxena
> > Subject: [EXT] Re: [vpp-dev] 128 byte cache line support
> >  
> > 
> > 
> >> On 15 Mar 2019, at 04:52, Honnappa Nagarahalli 
> >> <honnappa.nagaraha...@arm.com> wrote:
> >> 
> >>  
> >> 
> >> Related to change 18278[1], I was wondering if there is really a benefit in 
> >> dealing with 128-byte cachelines the way we do today.
> >> Compiling VPP with the cacheline size set to 128 basically just adds 64 
> >> unused bytes at the end of each cacheline, so vlib_buffer_t, for example, 
> >> grows from 128 bytes to 256 bytes, but we still need to prefetch 2 
> >> cachelines like we do by default.
> >> 
> >> [Nitin]: This is the existing model. In the forwarding case, mainly the 
> >> first vlib cache line is used. We are utilising the existing hole (in the 
> >> first vlib cache line) by putting packet parsing info (size == 64B) there. 
> >> This has many benefits; one of them is avoiding the software checks in 
> >> ipv4-input-no-chksum(). It gives us ~20 cycles of benefit on our platform, 
> >> so I do not want to lose that gain.
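
To make the trade-off concrete, here is a rough sketch of the kind of layout 
being described; all field names below are hypothetical and this is not the 
actual vlib_buffer_t definition:

#include <stdint.h>

/* Hypothetical illustration: on a 128B cache line target, the generic 64B of
 * buffer metadata and a platform-specific 64B parse-info block share a single
 * hardware cache line, so consuming nodes get the pre-parsed fields "for free"
 * with the same prefetch. */
typedef struct
{
  /* first 64 bytes: generic buffer metadata (stand-in fields) */
  uint32_t current_data;
  uint32_t current_length;
  uint32_t flags;
  uint8_t pad0[52];

  /* the 64-byte "hole" on 128B cache line targets, filled with pre-parsed
   * packet info (fields made up for illustration) */
  struct
  {
    uint16_t l3_offset;
    uint16_t l4_offset;
    uint8_t is_ipv4;
    uint8_t ip_csum_ok; /* lets ip4 input skip software checksum checks */
    uint8_t pad1[58];
  } parse_info;
} example_buffer_hdr_t;

/* both halves must fit in one 128-byte cache line */
_Static_assert (sizeof (example_buffer_hdr_t) == 128, "one 128B line");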
> 
> That sounds like a terribly bad idea, and it likely will never be upstreamed.
> vlib_buffer_t is a 128-byte data structure, and it is a perfect fit for 
> systems with a 128-byte cacheline. I don't see a point in growing it to 256 bytes.
> If you need more space, you can always grow the headroom space by an additional 
> cacheline and store whatever you want there.
> [Nitin]: In the current upstream code, VLIB_BUFFER_HDR_SIZE is 256 bytes for a 
> 128B cache line target, not 128 bytes. So we haven't made any changes to the 
> upstream code except introducing a CLIB_MIN_CACHE_LINE_BYTES == 64 macro 
> and using it as a MARKER across vlib_buffer_t. In other words, we are utilising 
> the existing (unused) hole in the first cache line of vlib_buffer_t.

I understood that; I'm proposing that we fix it so the header size is always 128 
bytes, as it should be.
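
As a sketch of that constraint (the macro and type names here are illustrative, 
not the actual VPP definitions), the header stays 128 bytes on every target and 
a 64B cache line system simply treats it as two lines:

/* Illustrative only: pin the buffer header to 128 bytes regardless of the
 * configured cache line size, and let the compiler enforce it. */
#define EXAMPLE_BUFFER_HDR_SIZE 128

typedef struct __attribute__ ((aligned (EXAMPLE_BUFFER_HDR_SIZE)))
{
  unsigned char opaque[EXAMPLE_BUFFER_HDR_SIZE]; /* buffer metadata lives here */
} example_fixed_hdr_t;

_Static_assert (sizeof (example_fixed_hdr_t) == EXAMPLE_BUFFER_HDR_SIZE,
                "buffer header must stay 128 bytes on every target");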

> 
> >> What will happen if we just leave that at 64?
> >> 
> >> [Nitin]: This will create L1D holes on 128B targets, right?
> >> 
> >> Unutilized holes are not acceptable, as they waste L1D space and thereby 
> >> affect performance. On the contrary, we want to pack structures from 2x64B 
> >> into 1x128B cache lines to reduce the number of pending prefetches in the 
> >> core pipeline. VPP heavily issues 64B LOAD/STORE prefetches, and our effort 
> >> is to reduce them for our target.
> >> 
> 
> Not sure what you mean by L1D holes. My proposal is that we align all 
> per-thread data structures to 128 bytes, not that we grow anything.
> [Nitin]: OK, I misunderstood your comment when you said "if we just leave that 
> to be 64". 
> 
> >> [Honnappa] Currently, ThunderX1 and Octeon TX have 128B cache lines. What I 
> >> have heard from Marvell folks is that the 64B cache line setting in DPDK 
> >> does not work. I have not gone into the details of what exactly does not 
> >> work. Maybe Nitin can elaborate.
> >> 
> > I’m curious to hear details…
> >> 
> >> 1. sometimes (and not very frequently) we will issue 2 prefetch 
> >> instructions for the same cacheline, but I hope the hardware is smart 
> >> enough to just ignore the 2nd one
> >> 
> >> 2. we may face false sharing issues if the first 64 bytes are touched by 
> >> one thread and the other 64 bytes are touched by another one
> >> 
> >> The second one sounds to me like a real problem, but it can be solved by 
> >> aligning all per-thread data structures to 2x the cacheline size.
> >> 
> >> [Honnappa] Sorry, I don’t understand you here. Even if the data structure 
> >> is aligned on 128B (2 X 64B), 2 contiguous blocks of 64B data would be on 
> >> a single cache line.
> >> 
> > I wanted to say that we can align all per-thread data structures to 128 
> > bytes, even on systems which have a 64-byte cacheline.
> >> 
> >> Actually, if I remember correctly, even on x86 some of the hardware 
> >> prefetchers deal with blocks of 2 cachelines.
> >> 
> >> So unless I missed something, my proposal here is: instead of maintaining 
> >> special 128-byte images for some ARM64 machines, let's just align all 
> >> per-thread data structures to 128 bytes and have a single ARM image.
> >> 
> >> [Honnappa] When we run VPP compiled with 128B cache line size on platforms 
> >> with 64B cache line size, there is a performance degradation.
> >> 
> > Yeah, sure, what I'm suggesting here is how to address that perf 
> > degradation.
> > [Nitin]: Is this proposal for Intel as well? If yes, then I am fine with the 
> > proposal, but I think it will decrease performance on 64B architectures with 
> > the existing code. 
> 
> I'm curious to hear why you think so…
> [Nitin]: If all per-thread data structures have a single cache line MARKER and 
> they are always prefetched from this marker (with CLIB_CACHE_LINE_BYTES as the 
> prefetch size argument rather than the size of the structure), then yeah, it 
> should be fine.

I'm proposing that all per-thread data structures be aligned to 128 bytes 
independently of the cache line size.
Code should prefetch what it is interested in; in the worst case it will issue 
two prefetches for the same cacheline, but I expect your hardware is smart 
enough to just ignore the 2nd prefetch.
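
A minimal sketch of what that would look like (the macro and helper names below 
are assumptions, not existing VPP code):

/* Align per-thread data to 128 bytes even when the target has 64B cache
 * lines, and prefetch by the amount of data actually touched. */
#define PER_THREAD_ALIGN 128 /* fixed, independent of CLIB_CACHE_LINE_BYTES */

typedef struct __attribute__ ((aligned (PER_THREAD_ALIGN)))
{
  unsigned long counters[8]; /* first 64 bytes, touched by the owning thread */
  unsigned long scratch[8];  /* second 64 bytes, same owner: no false sharing */
} example_per_thread_data_t;

static inline void
example_per_thread_prefetch (example_per_thread_data_t *d)
{
  /* Prefetch both 64-byte halves. On a 128B cache line CPU the second
   * prefetch targets the same line and can simply be ignored. */
  __builtin_prefetch ((char *) d);
  __builtin_prefetch ((char *) d + 64);
}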
