On 4/24/19 1:01 PM, Peter Zijlstra wrote:
> On Wed, Apr 24, 2019 at 12:49:05PM -0400, Waiman Long wrote:
>> On 4/24/19 3:09 AM, Peter Zijlstra wrote:
>>> On Tue, Apr 23, 2019 at 03:12:16PM -0400, Waiman Long wrote:
>>>> That is true in general, but doing preempt_disable/enable across a
>>>> function boundary is ugly and prone to further problems down the road.
>>> We do worse things in this code, and the thing Linus proposes is
>>> actually quite simple, something like so:
>>>
>>> ---
>>> --- a/kernel/locking/rwsem.c
>>> +++ b/kernel/locking/rwsem.c
>>> @@ -912,7 +904,7 @@ rwsem_down_read_slowpath(struct rw_semap
>>>  			raw_spin_unlock_irq(&sem->wait_lock);
>>>  			break;
>>>  		}
>>> -		schedule();
>>> +		schedule_preempt_disabled();
>>>  		lockevent_inc(rwsem_sleep_reader);
>>>  	}
>>>
>>> @@ -1121,6 +1113,7 @@ static struct rw_semaphore *rwsem_downgr
>>>   */
>>>  inline void __down_read(struct rw_semaphore *sem)
>>>  {
>>> +	preempt_disable();
>>>  	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
>>>  			&sem->count) & RWSEM_READ_FAILED_MASK)) {
>>>  		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
>>> @@ -1129,10 +1122,12 @@ inline void __down_read(struct rw_semaph
>>>  	} else {
>>>  		rwsem_set_reader_owned(sem);
>>>  	}
>>> +	preempt_enable();
>>>  }
>>>
>>>  static inline int __down_read_killable(struct rw_semaphore *sem)
>>>  {
>>> +	preempt_disable();
>>>  	if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
>>>  			&sem->count) & RWSEM_READ_FAILED_MASK)) {
>>>  		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
>>> @@ -1142,6 +1137,7 @@ static inline int __down_read_killable(s
>>>  	} else {
>>>  		rwsem_set_reader_owned(sem);
>>>  	}
>>> +	preempt_enable();
>>>  	return 0;
>>>  }
>>>
>> Making that change will help the slowpath to have fewer preemption points.
> That doesn't matter, right? Either it blocks or it goes through quickly.
>
> If you're worried about a particular spot we can easily put in explicit
> preemption points.
>
>> For an uncontended rwsem, this offers no real benefit. Adding
>> preempt_disable() is more complicated than I originally thought.
> I'm not sure I get your objection?
>
>> Maybe we are too paranoid about the possibility of a large number of
>> preemptions happening just at the right moment. If p is the probability
>> of a preemption in the middle of the inc-check-dec sequence, which I
>> have already moved as close together as possible, then we are talking
>> about a probability of p^32768. Since p will be really small, the
>> compound probability will be infinitesimally small.
> Sure; but we run on many millions of machines every second, so the
> actual accumulated chance of it happening eventually is still fairly
> significant.
>
>> So I would like to not do preemption now for the current patchset. We
>> can restart the discussion later on if there is a real concern that it
>> may actually happen. Please let me know if you still want to add
>> preempt_disable() for the read lock.
> I like provably correct schemes over prayers.
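Just to put a number on the p^32768 figure: even with a fairly
pessimistic per-window preemption probability of, say, p = 10^-3, the
compound probability works out to (10^-3)^32768 = 10^-98304. Still, I
take your point that a provably correct scheme is preferable.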
I am fine with adding preempt_disable(). I just want confirmation that
you want to have that.

> As you noted, distros don't usually ship with PREEMPT=y and therefore
> will not be bothered much by any of this.
>
> The old scheme basically worked by the fact that the total supported
> reader count was higher than the number of addressable pages in the
> system and therefore the overflow could not happen.
>
> We now transition to the number of CPUs, and for that we pay a little
> price with PREEMPT=y kernels. Either that or cmpxchg.

I also thought about switching to a cmpxchg loop for PREEMPT=y kernels.
Let's start with just preempt_disable() for now; we can evaluate the
cmpxchg-loop alternative later on.
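For reference, below is a rough sketch of the kind of cmpxchg loop I
have in mind. It is untested and purely illustrative: the
__down_read_cmpxchg() name is made up, and the real
rwsem_down_read_slowpath() currently expects the reader bias to have
been added before it is called, so the slowpath side would need
adjusting as well.

static inline void __down_read_cmpxchg(struct rw_semaphore *sem)
{
	long cnt = atomic_long_read(&sem->count);

	for (;;) {
		if (cnt & RWSEM_READ_FAILED_MASK) {
			rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
			return;
		}
		/*
		 * Publish the reader bias only when the current count is
		 * valid, so a preempted reader never leaves a transient
		 * overcount behind.
		 */
		if (atomic_long_try_cmpxchg_acquire(&sem->count, &cnt,
					cnt + RWSEM_READER_BIAS)) {
			rwsem_set_reader_owned(sem);
			return;
		}
		/* cnt was reloaded by the failed cmpxchg; just retry */
	}
}

Cheers,
Longman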