On 05/30/2016 02:06 PM, Peter Zijlstra wrote:
>> We never rehash the hash. To rehash the hash on runtime we would need
>> an empty futex hash and some locking in the fast path.
>
> Nah; you can do lockless rehashing. Little tricky, but entirely doable.
> At worst we'd need an rcu_read_lock(), but with a little extra work we
> can extend an existing preempt_disable section to cover the hash lookup
> if it isn't already covered.
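The lockless rehash Peter describes could, roughly, publish a
replacement table via an RCU-style pointer swap. A minimal userspace
stand-in (a hedged sketch, not the actual kernel code: C11 atomics in
place of the RCU API, all names invented, buckets and waiter migration
elided):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Invented stand-in for the futex hash table; real buckets elided. */
struct ftable {
	unsigned nslots;
};

static _Atomic(struct ftable *) cur_table;

/* Reader side: in the kernel this load would be rcu_dereference()
 * under rcu_read_lock() (or an existing preempt_disable section). */
static unsigned lookup_slot(uintptr_t uaddr)
{
	struct ftable *t = atomic_load_explicit(&cur_table,
						memory_order_acquire);
	return uaddr % t->nslots;
}

/* Writer side: publish the bigger table, the analogue of
 * rcu_assign_pointer(). Freeing the old table would have to wait for
 * a grace period (synchronize_rcu()), elided here. */
static void rehash(unsigned nslots)
{
	struct ftable *t = malloc(sizeof(*t));

	t->nslots = nslots;
	atomic_store_explicit(&cur_table, t, memory_order_release);
}
```

Old lookups keep using whichever table they loaded; the hard part is
migrating waiters already queued on the old buckets.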
You also need to ensure that all new users use the new hash while all
old users go for the old one until you can switch them over. And it is
likely we would miss a corner case, because it is futex after all. And
still, do we want all that? You need an upper limit on how often / up to
which size you resize. And if you want to avoid collisions at all costs,
go for the max value right away if you think you need it, so there is no
need to resize. And if the task does not pre-allocate, why care at all?

>> And rehash seems not to be required since we tried to come up with a
>> sane default value and the user/RT task can set it to the current max
>> value.
>
> I'm not sure this statement is true; given that the hash function
> reduces the entire address space down to a few bits you're guaranteed to
> have collisions at some point. A good hash will not have more than given
> by the birthday paradox, but you cannot have less.
>
> So while the sizing can certainly reduce the chance on collisions,
> nothing guarantees you will not have them, quite the opposite.

Yes, I did not try to discuss the hash collision away. From RT's point
of view the problem I am aware of is the following scenario: task #1 is
pinned to CPU1, task #2 is pinned to CPU2, but they share the same
bucket. Task #2 gets a wakeup and should run but is blocked on the
bucket lock - otherwise it could run. Usually it would PI-boost task #1,
but task #1 itself got preempted by another task, and since task #1's
priority is lower it can't boost its way free towards the lock, so
CPU #2 may remain idle for some time. The same thing can happen within a
single process if you take the story from above and replace task with
thread. Completely understood.

>> So back to when does the collision happen. Since glibc visits the
>> kernel only on contention we might learn about the collision when it
>> is too late. We could have a lock operation by thread1 followed by a
>> lock operation by thread2 on different uaddr resulting in the same
>> bucket.
>> In this case we learn about this once the spin_lock() operation
>> blocks.
>
> Yep, so if this is two ondemand futexes hitting the same bucket, we
> don't care. But if this is an ondemand futex hitting a prealloc one, we
> do care.
>
> And yes, this is late, but its the only possible time. And notification
> is better than silent weird behaviour.

We don't distinguish right now between "the user asked to preallocate
the hash" and "we created this hash because we had a locking job in the
kernel". The latter is the behaviour we used to have, and the former is
something like "this new preallocate thingy does not always work".

>> Also marking a bucket as taken (on contention) might give false
>> positive results since we know nothing about the lock's lifetime (i.e.
>> the lock might have been free()ed).
>>
>> But if I may bring some ideas from v1. In v1 we had "tickets / IDs"
>> for the futex per thread. In v2 we don't have them anymore. We still
>> have the "private" futex hash buckets but per process this time.
>> We could introduce the "tickets / IDs" back and make them process
>> wide. We could hide them in pthread_mutex_init() and
>> pthread_mutex_destroy() since their IDs are no longer thread unique.
>> I think I had something in glibc's pthread variable where we could
>> store 16bit if I split another 32bit variable.
>
> Not sure you need a ticket (and I never got around to reading v1), a

So, v1 in short. To lock you do:

	futex(uaddr, FUTEX_LOCK_PI | PRIVATE, …)

Instead of uaddr, we used tickets and called the mode attached:

	futex(ticket, FUTEX_LOCK_PI | PRIVATE | ATTACHED, …)

and the ticket was the index into the array where we kept the hash
buckets. That means each user space futex has its own hash bucket
without contention. But since the tickets were unique per thread it was
hard to handle. Keeping them the same process wide should make things
simple again.

> simple twin to PREALLOCATE_HASH to call on mutex_destroy would be
> sufficient I think.
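The process-wide ticket table, hidden behind pthread_mutex_init() /
pthread_mutex_destroy() as described above, could be as simple as a
free-slot allocator. A hedged sketch (names invented, only the 16-bit
limit comes from the discussion; a real glibc version would need
atomics or locking for thread safety):

```c
#include <stdint.h>

/* Hypothetical sketch, not the v1 patch: one ticket per mutex,
 * allocated once in pthread_mutex_init() and then passed by every
 * thread as futex(ticket, FUTEX_LOCK_PI | PRIVATE | ATTACHED, ...),
 * so each mutex gets a dedicated bucket and no collision within the
 * process is possible. */

#define TICKET_MAX 65536		/* fits the 16 bits mentioned above */

static uint8_t ticket_used[TICKET_MAX];

static int ticket_alloc(void)		/* from pthread_mutex_init() */
{
	for (int i = 0; i < TICKET_MAX; i++) {
		if (!ticket_used[i]) {
			ticket_used[i] = 1;
			return i;
		}
	}
	return -1;		/* table full: fall back to the hashed mode */
}

static void ticket_free(int t)		/* from pthread_mutex_destroy() */
{
	ticket_used[t] = 0;
}
```

Tickets freed in mutex_destroy() get reused by later init calls, so the
table only needs to cover the number of live mutexes, not the total
ever created.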
It could free the hash bucket. You do PREALLOCATE_HASH _once_ on
startup, not for every mutex. But yes, you would mark each mutex as
gone in mutex_destroy(). So this would "work": you would find a
collision during a slot lookup. You could increase the coverage if you
add another marker in pthread_mutex_init() (well, except for static
initializers). That means we need glibc changes for this to work.

And how do you suggest reporting such a collision? A trace point, a
file in proc? WARN_ON_ONCE() would not result in a CVE but is neither
pretty nor informative. Not sure if glibc is all for a return code
which is not part of POSIX. Also I'm not sure what to do with this
information unless it happens during development. That is why I suggest
the approach above, which is guaranteed hash clash free :)

>> That would be guaranteed collision free and hidden in glibc.
>
> I'm still not seeing how you can guarantee anything with it; ondemand
> local hashing can still collide, right?

Since we have a process local hash table for private futexes, yes,
collisions can happen.

> Or are you thinking two local hashtables, one for ondemand things and
> one for prealloc/free type thingies?

I think one hash table for private futexes as we have now, and if
people still complain about collisions, get the dedicated "struct
futex_hash_bucket" for each pthread_mutex_t which we had in v1 and then
use the returned ticket for every lock operation. But this time keep
the ticket number the same within a process. This would guarantee that
no collision happens within a task, and requires changes in glibc. So
the entry barrier is a little higher, but we could merge this (more or
less) as is and add the collision-free extension on top if there is
demand for it.

>> My point was during development / testing to figure out which initial
>> value is sane / reasonable. Start your APP with 32 slots. Collisions.
>> Again with 128 slots. Oh, better.
>
> Ah, right, but given the above and ASLR, you cannot exhaustively test
> this.
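The detection discussed above - an ondemand futex landing in a bucket
that PREALLOCATE_HASH reserved for a different uaddr - could be checked
at slot-lookup time. A toy illustration (invented field names and a toy
hash, not the kernel's jhash; only meant to show where the check and
the report would sit):

```c
#include <stddef.h>
#include <stdint.h>

/* Invented layout: each bucket remembers which uaddr (if any) it was
 * reserved for - set when pthread_mutex_init() marks the mutex,
 * cleared again in pthread_mutex_destroy(). */
struct fhb {
	void *owner;	/* reserved uaddr, or NULL for ondemand use */
};

#define NSLOTS 256
static struct fhb table[NSLOTS];

static unsigned hash_uaddr(void *uaddr)
{
	return ((uintptr_t)uaddr >> 4) & (NSLOTS - 1);	/* toy, not jhash */
}

/* Returns 1 if this lookup collides with a reserved bucket; this is
 * the spot where a tracepoint / WARN_ON_ONCE() style report could
 * hook in. */
static int lookup_reports_collision(void *uaddr)
{
	struct fhb *b = &table[hash_uaddr(uaddr)];

	return b->owner && b->owner != uaddr;
}
```

As noted above, this only fires once contention actually reaches the
kernel, and a free()ed lock that was never destroyed leaves a stale
reservation behind.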
Correct. You could disable ASLR, but security-wise you might not want
to do that (then again, if everything runs as root anyway, disabling
ASLR is the next logical step :)).

Sebastian