I have been debugging some test failures seen in Buildbot: https://mariadb.atlassian.net/browse/MDEV-7026
After some debugging, this is what I think is going on. I will refer to MariaDB 5.5 sources, though the problem is also in 10.0. InnoDB/XtraDB mutexes are taken in mutex_enter_func() and unlocked in mutex_exit_func(). Simplifying for clarity, here are the relevant parts of the logic: ENTER: # First spin wait, hoping to get the mutex without yielding. os_compare_and_swap_ulint(&mutex->waiters, 0, 1); if (os_atomic_test_and_set_byte(&mutex->lock_word, 1)) sync_array_wait_event() RELEASE: os_atomic_lock_release_byte(&mutex->lock_word); os_rmb; waiters = *(volatile *)&mutex->waiters; if (waiters) mutex_signal_object(mutex); The interesting point here is how to ensure that we do not lose the wakeup of a waiting thread when the holding thread releases the mutex. The idea is that the waiting thread first sets mutex->waiters=1. Then it checks if the mutex is free - if not, it goes to sleep. The thread that releases the lock first marks the mutex free. Then it checks if there are any waiters, and if so wakes them up. For this to work, we need to ensure that the setting of waiters=1 has completed before the load of lock_word can start - a full memory barrier. Likewise, we need the setting of lock_word to 0 to complete before the load of mutex->waiters - again a full memory barrier. Otherwise we can get something like this: ENTER RELEASE Start store to mutex->waiters Start store to lock_word Load mutex->lock_word Store to lock_word becomes visible Load mutex->waiters Store to waiters becomes visible Check lock word==1 Check waiters==0, skip wakeup Sleep (forever) Or this: ENTER RELEASE Start store to lock_word Start store to mutex->waiters Load mutex->waiters Store to waiters becomes visible Load mutex->lock_word Store to lock_word becomes visible Check lock word==1 Sleep (forever) Check waiters==0, skip wakeup (So both full barriers are needed, the one in ENTER to prevent the first example, the one in RELEASE to prevent the second). It looks like for x86, the full barriers were broken in MariaDB 5.5.40 and 10.0.13. os_compare_and_swap_ulint() is defined as either __sync_bool_compare_and_swap() or atomic_cas_ulong(), both of which implies a full barrier. So ENTER looks ok. os_atomic_lock_release_byte() however is now defined as __sync_lock_release(). This is _not_ a full barrier, just a release barrier. So the RELEASE case seems to be incorrect, allowing the second deadlock example to occur. Previously, RELEASE used os_atomic_test_and_set_byte(), which is __sync_lock_test_and_set(). This is defined in GCC docs as an "aquire" barrier only, which is also wrong, obviously. But it happens to generate an XCHG instruction on x86, which is a full barrier. This is probably why it works on x86 (and probably why it did not work on PowerPC). Here is a small test program, and the assembly from GCC 5.7 on amd64: void test(unsigned *p, unsigned *q) { while (__sync_lock_test_and_set(q, 1)) { } *p++; __sync_lock_release(q); } test: movl $1, %edx .L2: movl %edx, %eax xchgl (%rsi), %eax testl %eax, %eax jne .L2 movl $0, (%rsi) ret As can be seen, the __sync_lock_test_and_set() generates a full barrier, but the __sync_lock_release() generates nothing (as this is sufficient for release semantics for the rather strong memory order on x86). So the InnoDB mutexes look completely broken from MariaDB 5.5.40 and 10.0.13 :-( This is consistent with the observations in MDEV-7026. A thread is waiting for wakeup on a mutex that is already marked free. Apparently this is not a new problem. This comment in mutex_exit_func() makes me want to cry: /* A problem: we assume that mutex_reset_lock word is a memory barrier, that is when we read the waiters field next, the read must be serialized in memory after the reset. A speculative processor might perform the read first, which could leave a waiting thread hanging indefinitely. Our current solution call every second sync_arr_wake_threads_if_sema_free() to wake up possible hanging threads if they are missed in mutex_signal_object. */ So someone found the mutex implementation broken, but instead of fixing it, they implemented a horrible workaround to try and break any deadlocks that might occur. (The reason that the deadlocks are not broken in the MDEV-7026 seems to be that it occurs during startup, before starting the srv_error_monitor_thread() that does the sync_arr_wake_threads_if_sema_free()). Now, what is also interesting is that this seems to explain very nicely the many server lockups we have seen for users that upgrade to recent MariaDB versions. I was doing a very detailed investigation of one such case where the user was able to supply unusually detailed information. I found a thread that was stuck waiting on a mutex, but no other threads seemed to be holding that mutex. This is exactly what would be seen in case of a lost wakeup due to this issue. Furthermore, the sync_arr_wake_threads_if_sema_free() mechanism, as well as the normal InnoDB "long semaphore wait", was itself blocked on that same mutex wait, which explains why they were not effective for the user in that particular case. So it looks like we need to revert the incorrect patch and release new 5.5.40a and 10.0.15 releases ASAP, to solve these hangs for users. And do a proper patch for PowerPC with proper full memory barriers. (Incidentally, the os_rmb in RELEASE seems to also be part of some PowerPC patch, it also looks completely wrong). However, more importantly, this issue clearly shows some fundamental problems in our development procedures that we need to fix ASAP. Just look at it. We have users (and customers!) that suffer from strange hangs. This is their production systems that are down. Support and developers are unable to help them due to the difficulty of debugging rare race conditions in production systems. And all the time the bug was staring us in the face, easily reproducable in our own Buildbot! But we deliberately chose to ignore those failures. The only thing that matters is "when can you push", and getting more releases out. Never mind what kind of crap is then pushed out to users. I am deeply ashamed of treating our users like this. And extremely unhappy. I even wrote about this (the importance of not ignoring Buildbot failures) just a few weeks ago. This was completely ignored, I got zero replies, and the weekly calls continued with mindless juggling of Jira numbers and buggy releases. As a result, our users suffer. We need to stop this. Please? - Kristian. _______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp