Hello,

I was looking at the rw lock code out of curiosity and noticed you always do
membar_enter, which on an MP-enabled amd64 kernel translates to mfence.
This makes the entire business a little slower than it needs to be.

Interestingly, you already have the relevant macros for amd64:
#define membar_enter_after_atomic() __membar("")
#define membar_exit_before_atomic() __membar("")

And there is even a default variant for archs which don't provide one.
I guess the switch should be easy.
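
For illustration, a sketch of what I mean on the read fast path (not the
actual code, just assuming the usual rwl_owner field, rw_cas() returning
non-zero on failure, and the RWLOCK_* flag bits):

----------------
	unsigned long owner = rwl->rwl_owner;

	if ((owner & (RWLOCK_WRLOCK | RWLOCK_WAIT)) == 0 &&
	    rw_cas(&rwl->rwl_owner, owner, owner + RWLOCK_READ_INCR) == 0) {
		/*
		 * The locked cmpxchg already orders everything on amd64,
		 * so the compiler-only barrier is enough here.
		 */
		membar_enter_after_atomic();	/* instead of membar_enter() */
		return;
	}
	/* fall through to the slow path */
----------------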

Grabbing the lock for reading is an rw_cas to a higher value. On failure you
explicitly re-read from the lock. This is slower than necessary in the
presence of concurrent read lock/unlock, since cas returns the found
value and you can use that instead.

Also, the read lock fast path does not have to descend to the slow path
after a single cas failure, but this probably does not matter right now.
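
Both points together would look roughly like this (again only a sketch,
using the underlying atomic_cas_ulong() directly so the found value is
available):

----------------
	unsigned long owner, found;

	owner = rwl->rwl_owner;
	for (;;) {
		if (owner & (RWLOCK_WRLOCK | RWLOCK_WAIT))
			break;			/* contended, go to the slow path */
		found = atomic_cas_ulong(&rwl->rwl_owner, owner,
		    owner + RWLOCK_READ_INCR);
		if (found == owner) {
			membar_enter_after_atomic();
			return;			/* read lock acquired */
		}
		owner = found;			/* reuse the found value, no re-read */
	}
	/* slow path */
----------------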

The actual question I have here is whether you have played with adaptive
spinning instead of instantly putting threads to sleep, at least for cases
where the kernel lock is not held. This can be as simplistic as spinning
as long as the lock is owned by a currently running thread.
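
In the simplest form, something along these lines in the contended path
(a sketch; RW_SPINS_MAX and the owner extraction are made-up placeholders,
and it assumes a write owner is stored as a proc pointer with the flag
bits masked off, plus CPU_BUSY_CYCLE() or an equivalent pause):

----------------
	unsigned long owner;
	struct proc *p;
	int spins = 0;

	while (spins++ < RW_SPINS_MAX) {
		owner = rwl->rwl_owner;
		if ((owner & RWLOCK_WRLOCK) == 0)
			break;			/* lock released, retry the fast path */
		p = (struct proc *)(owner & ~RWLOCK_MASK);
		if (p->p_stat != SONPROC)
			break;			/* owner went off CPU, go to sleep */
		CPU_BUSY_CYCLE();		/* pause */
	}
----------------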

For cases where curproc holds the kernel lock, you could perhaps drop it
for spinning purposes and reacquire it later, although I have no idea
whether this will help anything. Definitely worth testing imo.
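
i.e. something like this around the spin loop (hypothetical; assumes
_kernel_lock_held() or an equivalent test is available):

----------------
	int held = _kernel_lock_held();

	if (held)
		KERNEL_UNLOCK();
	/* ... adaptive spin as above ... */
	if (held)
		KERNEL_LOCK();
----------------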

A side note is that the global locks I looked at are not annotated in any
manner with respect to exclusive cacheline placement. In particular,
netlock in a 6.2 kernel shares its cacheline with if_input_task_locked:

----------------
ffffffff81aca608 D if_input_task_locked
ffffffff81aca630 D netlock
----------------

The standard boilerplate to deal with it is to annotate the variable with
aligned() and place it in a dedicated section. FreeBSD and NetBSD
contain the necessary crappery to copy-paste, including linker script
support.
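
For illustration, something like the following (a sketch assuming 64-byte
cache lines on amd64; the section name is made up and needs matching
linker script support, which is the part worth lifting from FreeBSD/NetBSD):

----------------
/*
 * Hypothetical annotation: align the lock and give it a dedicated
 * section so the linker does not pack another hot symbol onto the
 * same cache line.
 */
struct rwlock netlock
    __attribute__((__aligned__(64), __section__(".data.exclusive_cache_line")))
    = RWLOCK_INITIALIZER("netlock");
----------------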

Cheers,
-- 
Mateusz Guzik <mjguzik gmail.com>
