On Tue, Apr 18, 2017 at 10:35:02PM +0530, Sunil Kovvuri wrote:
> On Tue, Apr 18, 2017 at 8:18 PM, Catalin Marinas
> <catalin.mari...@arm.com> wrote:
> > On Mon, Apr 17, 2017 at 04:08:52PM +0530, Sunil Kovvuri wrote:
> >> >> >> Do you have an explanation on the performance variation when
> >> >> >> L1_CACHE_BYTES is changed? We'd need to understand how the
> >> >> >> network stack is affected by L1_CACHE_BYTES, in which context
> >> >> >> it uses it (is it for non-coherent DMA?).
> >> >> >
> >> >> > The network stack uses SKB_DATA_ALIGN to align:
> >> >> > ---
> >> >> > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
> >> >> >                            ~(SMP_CACHE_BYTES - 1))
> >> >> >
> >> >> > #define SMP_CACHE_BYTES L1_CACHE_BYTES
> >> >> > ---
> >> >> > I think this is the reason for the performance regression.
> >> >>
> >> >> Yes, this is the reason for the performance regression. Due to the
> >> >> increased L1 cache alignment, the object comes from the next
> >> >> kmalloc slab and skb->truesize changes from 2304 bytes to 4352
> >> >> bytes. This in turn increases sk_wmem_alloc, which causes fewer
> >> >> send buffers to be queued.
> >>
> >> With what traffic did you check 'skb->truesize'? An increase from
> >> 2304 to 4352 bytes doesn't seem to be real. I checked with ICMP
> >> packets of the maximum size possible with a 1500-byte MTU and I
> >> don't see such a bump. If the bump is observed with iperf sending
> >> TCP packets, then I suggest checking whether TSO is playing a part
> >> here.
>
> > I haven't checked truesize but I added some printks to __alloc_skb() (on
> > a Juno platform) and the size argument to this function is 1720 on many
> > occasions. With sizeof(struct skb_shared_info) of 320, the actual data
> > allocation is exactly 2048 when using a 64-byte L1_CACHE_SIZE. With a
> > 128-byte cache size, it goes slightly over 2K, hence the 4K slab
> > allocation.
>
> Understood, but still, in my opinion this '4K slab allocation' cannot
> be considered an issue with cache line size; there are many network
> drivers out there which do receive buffer or page recycling to
> minimize (sometimes almost to zero) the cost of buffer allocation.
The slab allocation shouldn't make much difference (unless you are
running on a memory constrained system), but I don't understand how
skb->truesize (which is almost half unused) affects sk_wmem_alloc and
its interaction with other bits in the network stack (e.g.
tcp_limit_output_bytes). However, I do think it's worth investigating
further to fully understand the issue.

> > The 1720 figure surprised me a bit as well since I was expecting
> > something close to 1500.
> >
> > The thing that worries me is that skb->data may be used as a buffer
> > to DMA into. If that's the case, skb_shared_info is wrongly aligned
> > based on SMP_CACHE_BYTES only and can lead to corruption on a
> > non-DMA-coherent platform. It should really be ARCH_DMA_MINALIGN.
>
> I didn't get this. If you look at __alloc_skb():
>
> 229         size = SKB_DATA_ALIGN(size);
> 230         size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>
> both the DMA buffer and skb_shared_info are aligned to a cacheline
> separately. Considering that 128-byte alignment guarantees 64-byte
> alignment as well, how will this lead to corruption?

It's the other way around: you align only to 64 bytes while running on
a platform with 128-byte cache lines and non-coherent DMA.

> And if the platform is non-DMA-coherent, then again it's the driver
> which should take care of coherency by using the appropriate
> map/unmap APIs, and it should avoid any cacheline sharing between the
> DMA buffer and skb_shared_info.

The problem is that the streaming DMA API can only work correctly on
cacheline-aligned buffers (because of the cache invalidation it performs
for DMA ops; even with clean&invalidate, the operation isn't always safe
if a cacheline is shared between DMA and CPU buffers). In the skb case,
the data could end up sharing the last cacheline of the DMA buffer with
struct skb_shared_info.
We may be able to get away with SKB_DATA_ALIGN not using
ARCH_DMA_MINALIGN *if* skb_shared_info is *not* written before or during
an inbound DMA transfer (though such tricks are arch specific).

> > IIUC, the Cavium platform has coherent DMA, so it shouldn't be an
> > issue if we go back to 64 byte cache lines.
>
> Yes, the Cavium platform is DMA coherent and there is no issue with
> reverting back to 64-byte cachelines. But do we want to do this just
> because some platform has a performance issue and this is an easy way
> to solve it? IMHO there seem to be many ways to solve the performance
> degradation within the driver itself, and if those don't work, then it
> probably makes sense to revert this.

My initial thought was to revert the change because it was causing a
significant performance regression on a certain SoC. But given that it
took over a year for people to follow up, it doesn't seem too urgent, so
we should rather try to understand the issue and the potential side
effects of moving back to a 64-byte cache line.

-- 
Catalin