> On Jan 4, 2018, at 4:03 AM, David Chisnall <thera...@freebsd.org> wrote:
> 
> On 3 Jan 2018, at 22:12, Nathan Whitehorn <nwhiteh...@freebsd.org> wrote:
>> 
>> On 01/03/18 13:37, Ed Schouten wrote:
>>> 2018-01-01 11:36 GMT+01:00 Konstantin Belousov <kostik...@gmail.com>:
>>>>>>> On x86, the CPUID instruction leaf 0x1 returns the information in
>>>>>>> %ebx register.
>>>>>> Hm, weird. Why don't we extend sysctl to include this info?
>>>> For the same reason we do not provide a sysctl to add two integers.
>>> I strongly agree with Kostik on this one. Why add stuff to the kernel,
>>> if userspace is already capable of extracting this? Adding that stuff
>>> to sysctl has the downside that it will effectively introduce yet
>>> another FreeBSDism, whereas something generic already exists.
>>> 
>> 
>> Well, kind of. The userspace version is platform-dependent and not always 
>> available: for example, on PPC, you can't do this from userland and we 
>> provide a sysctl machdep.cacheline_size to userland. It would be nice to 
>> have an MI API.
> 
> On ARMv8, similarly, sometimes the kernel needs to advertise the wrong size.  
> A few big.LITTLE cores have 64-byte cache lines on one cluster and 32-byte on 
> the other.  If you query the size from userspace while running on a 64-byte 
> cluster, then issue the zero-cache-line instruction while migrated to the 
> 32-byte cluster, you only clear half the intended range.  Linux works around this by 
> trapping and emulating the instruction to query the cache size and always 
> reporting the size for the smallest cache lines.  ARM tells people not to 
> build systems like this, but it doesn’t always stop them.  Trapping and 
> emulating is much slower than just providing the information in a shared 
> page, elf aux args vector, or even (often) a system call.
> 
> To give another example, Linux provides a very cheap way for a userspace 
> process to enquire which core it’s running on.  Some more recent 
> high-performance mallocs use this to have a second-layer per-core cache after 
> the per-thread cache for free blocks.  Unlike the per-thread cache, the 
> per-core cache does need a lock, but it’s very unlikely to be contended (it 
> will only be contended if either a thread is migrated in between checking and 
> locking, so acquires the wrong CPU’s lock, or if a thread is preempted in the 
> middle of the very brief fill operation).  The author of the 
> SuperMalloc paper tried doing this with CPUID and found that it was slower by 
> a sufficient margin to almost entirely offset the benefits of the extra layer 
> of caching.  
> 
> Just because userspace can get at the information directly from the hardware 
> doesn’t mean that this is the most efficient or best way for userspace to get 
> at it.
> 
> Oh, and some of these things are useful in portable code, so having to write 
> some assembly for every target to get information that the kernel already 
> knows is wasteful.
> 
> David
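
For concreteness, here is a rough sketch of what a machine-independent
wrapper might look like from userland. It is only an illustration under
stated assumptions, not an existing FreeBSD API: cacheline_size() and
cacheline_from_cpuid_ebx() are hypothetical names, the type of the
machdep.cacheline_size sysctl on PPC is assumed to be int, the
_SC_LEVEL1_DCACHE_LINESIZE sysconf name is a glibc extension, and the final
fallback of 64 bytes is a guess. The x86 path decodes bits 15:8 of %ebx from
CPUID leaf 0x1, which give the CLFLUSH line size in 8-byte units.

```c
#include <stddef.h>
#include <unistd.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>              /* GCC/Clang wrapper for the CPUID instruction */
#endif
#if defined(__FreeBSD__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/*
 * Decode %ebx as returned by CPUID leaf 0x1: bits 15:8 hold the CLFLUSH
 * line size in units of 8 bytes.
 */
static size_t cacheline_from_cpuid_ebx(unsigned int ebx)
{
    return (size_t)((ebx >> 8) & 0xffu) * 8u;
}

/*
 * Hypothetical MI helper: try whatever the platform offers, in order,
 * and fall back to an assumed default if nothing works.
 */
static size_t cacheline_size(void)
{
#if defined(__FreeBSD__) && defined(__powerpc__)
    {
        /* Assumption: the sysctl mentioned in the thread is an int. */
        int value = 0;
        size_t len = sizeof(value);
        if (sysctlbyname("machdep.cacheline_size", &value, &len, NULL, 0) == 0
            && value > 0)
            return (size_t)value;
    }
#endif
#if defined(_SC_LEVEL1_DCACHE_LINESIZE)
    {
        /* glibc extension; may report 0 or -1 when the size is unknown. */
        long reported = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        if (reported > 0)
            return (size_t)reported;
    }
#endif
#if defined(__x86_64__) || defined(__i386__)
    {
        unsigned int eax, ebx, ecx, edx;
        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            size_t decoded = cacheline_from_cpuid_ebx(ebx);
            if (decoded > 0)
                return decoded;
        }
    }
#endif
    return 64;                  /* assumed default, not a measured value */
}
```

A caller would simply use cacheline_size() for padding or alignment
decisions. Note that this sketch does nothing about the big.LITTLE case
described above: a value read on one cluster can still be wrong after
migration, which is part of the argument for having the kernel report it.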

This idea of Arm big.LITTLE systems having cache lines of different lengths 
really, really bothers me - how on earth is the cache coherency supposed to 
work in such a system? I doubt the usual cache coherency protocols would work - 
probably need a really MESSY protocol to deal with this config :-)

Jon.
