On Mon, Oct 10, 2022 at 10:58:57AM +0200, Mattias Rönnblom wrote:
> On 2022-10-10 09:35, Morten Brørup wrote:
> > Mattias, Konstantin, Honnappa, Stephen,
> >
> > In my patch for non-temporal memcpy, I have been aiming for using as much
> > non-temporal store as possible. E.g. copying 16 byte to a 16 byte aligned
> > address will be done using non-temporal store instructions.
> >
> > Now, I am seriously considering this alternative:
> >
> > Only using non-temporal stores for complete cache lines, and using normal
> > stores for partial cache lines.
> >
>
> This is how I've done it in the past, in DPDK applications. That was both to
> simplify (and potentially optimize) the code somewhat, and because I had my
> doubt there was any actual benefits from using non-temporal stores for the
> beginning or the end of the memory block.
>
> That latter reason however, was pure conjecture. I think it would be great
> if Intel, ARM, AMD, IBM etc. DPDK developers could dig in the manuals or go
> find the appropriate CPU expert, to find out if that is true.
>
> More specifically, my question is:
>
> A) Consider a scenario where a core does a regular store against some cache
> line, and then pretty much immediately does a non-temporal store against a
> different address in the same cache line. How will this cache line be
> treated?
>
> B) Consider the same scenario, but where no regular stores preceded (or
> followed) the non-temporal store, and the non-temporal stores performed did
> not cover the entirety of the cache line.
>

The best reference I am aware of for this for Intel CPUs is section 10.4.6.2
in Vol 1 of the Software Developers Manual[1].
The bit relevant to your scenarios above is:

"If a program specifies a non-temporal store with one of these instructions
and the memory type of the destination region is write back (WB), write
through (WT), or write combining (WC), the processor will do the following:
 • If the memory location being written to is present in the cache hierarchy,
   the data in the caches is evicted.
 • The non-temporal data is written to memory with WC semantics"

Hope this helps a little.

Regards,
/Bruce

[1] https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-1-manual.pdf#G11.44032
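
For illustration only, the alternative discussed above (non-temporal stores
for complete cache lines, regular stores for the partial head and tail) could
be sketched roughly as follows. This is not the code from the patch: the name
nt_memcpy_sketch and the 64-byte CACHE_LINE value are assumptions made for
the example, and on targets without SSE2 it simply falls back to memcpy. It
does not answer scenarios A and B; it only shows the copy split.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

#define CACHE_LINE 64u /* assumed cache line size, not queried from the CPU */

/* Hypothetical helper for illustration; not an existing DPDK function. */
static void
nt_memcpy_sketch(void *dst, const void *src, size_t len)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	assert(dst != NULL && src != NULL);

	/* Head: regular stores up to the first cache line boundary of dst. */
	size_t head = (size_t)(-(uintptr_t)d & (CACHE_LINE - 1));
	if (head > len)
		head = len;
	memcpy(d, s, head);
	d += head;
	s += head;
	len -= head;

	/* Body: complete, cache-line-aligned destination lines only. */
	while (len >= CACHE_LINE) {
#if defined(__SSE2__)
		for (size_t i = 0; i < CACHE_LINE; i += 16) {
			/* src may be unaligned; dst is 16-byte aligned here. */
			__m128i v = _mm_loadu_si128((const __m128i *)(s + i));
			_mm_stream_si128((__m128i *)(d + i), v);
		}
#else
		memcpy(d, s, CACHE_LINE); /* fallback: regular stores */
#endif
		d += CACHE_LINE;
		s += CACHE_LINE;
		len -= CACHE_LINE;
	}

	/* Tail: regular stores for the remaining partial cache line. */
	memcpy(d, s, len);

#if defined(__SSE2__)
	/* Order the weakly-ordered streaming stores before later stores. */
	_mm_sfence();
#endif
}
```

With this split, a non-temporal store never shares a destination cache line
with a regular store issued by the same copy, so the copy itself does not
create scenario A.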

