Hmmm, it seems to me that there's some kind of misunderstanding here. Please bear with me if I sound a bit condescending; I intend no offence if the following is just the result of a vocabulary mismatch, and the general multithreading code of libev suggests to me that the difference is actually well understood and used. But in the event volatiles really are being misused, I'd rather sound condescending than be sorry.
In a nutshell, what I want to say is that to POSIX - and Visual C's default mode behaves the same way here - volatile is merely an un-optimization request from the programmer to the compiler. It in no way affects the internal state of the CPU. And that was the crux of the problem here: internal CPU caches. I will reiterate this several times over the course of this e-mail, because it's really important in my opinion. Compiler barriers and CPU barriers, while sounding very similar, aren't issued the same way. C's volatiles, or Visual C's _ReadWriteBarrier(), are only compiler barriers. C11's _Atomics, or Visual C's MemoryBarrier(), are CPU barriers.

Now I'll go over what was discussed previously, in detail.

The article http://msdn.microsoft.com/en-us/library/12a04hfd%28v=vs.90%29.aspx says that volatiles work as memory barriers, which is true, but they work as compiler memory barriers, not as CPU memory barriers. While volatile asks the compiler not to optimize around the variable, and not to re-order or cache reads and writes to it, it won't affect the CPU's internal caches, and reading or writing a volatile variable will not issue CPU synchronization instructions.

In the example, they implement a spin-lock semaphore using a volatile boolean. Thread1 busy-waits on the boolean to become true, and Thread2 eventually sets it to true. The trick, however, is that Thread1 is busy-looping, so it will eventually see the new value once the CPU caches catch up in an SMP environment. In that example, the volatile only has two roles: ensuring that Thread1's generated assembly doesn't cache the boolean in a register, which would otherwise leave Thread1 in an infinite loop, and ensuring that Thread2 accomplishes everything it needs to do before flipping the boolean to true, which would otherwise defeat the purpose of the critical section. And that's it. When CPU2, which runs Thread2, finishes writing the boolean to memory, CPU1, which runs Thread1, may still see the boolean as false in its cache for a little longer. That's how this code works: it just busy-waits until the CPU caches have properly propagated. And since the CPU doesn't re-order these memory accesses either, CPU1's cache will be synchronized in the order CPU2 wrote.

What I am saying is that volatiles are never enough to provide SMP synchronization. They are of course enough in a hardware configuration with a single single-core CPU, but on a modern multi-core CPU, each core with its own cache, extra work is needed to ensure proper synchronization.

In the case of libev, here is what was happening. First, let us not forget that the ECB_MEMORY_FENCE* macros were compiler barriers only, being defined with Visual C's _ReadWriteBarrier(). Meaning they wouldn't flush any CPU cache: they only ask the compiler not to re-order reads and writes, and do nothing to synchronize the CPUs. In Thread1, on CPU1, we start executing ev_run. ev_run first sets pipe_write_wanted to 1, and then, because of the volatility of the variables and because of the memory fence, goes on to read various other variables, including pipe_write_skipped, for the "if" condition there. Again, I insist: because of the nature of the memory fence used, no extra code is generated to synchronize the CPUs. Then, in Thread2, on CPU2, we start executing evpipe_write. It first sets pipe_write_skipped to 1, then heads a few lines below to read pipe_write_wanted, in that order, because of the volatility and the compiler barrier. Still without any CPU synchronization.

The end result, the actual race condition, relies on the CPU caches: when CPU1 and CPU2 execute their code in parallel, they both have pipe_write_skipped and pipe_write_wanted in cache, probably because of previous iterations. When CPU1 writes pipe_write_wanted and CPU2 writes pipe_write_skipped, they both go on to read the other variable at the same time, from their respective caches. And they both read 0, resulting in the race condition - which would be a hard deadlock, if not for libev's MAX_BLOCKTIME.
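To make the failure concrete, here is a minimal sketch of that store/load pattern, reduced to plain C. The variable names mirror libev's, but this is simplified illustration code under my own assumptions, not libev's actual source; the commented-out bodies stand in for the real poll/pipe work:

  /* Both variables are volatile, so the compiler will neither re-order
     nor cache the accesses -- but no CPU fence instruction is emitted. */
  static volatile int pipe_write_wanted;
  static volatile int pipe_write_skipped;

  /* Thread1 on CPU1 -- the ev_run side: */
  void thread1 (void)
  {
    pipe_write_wanted = 1;    /* store can linger in CPU1's store buffer */

    if (!pipe_write_skipped)  /* load can still see a stale 0 */
      {
        /* block in the poll syscall, trusting the pipe write to wake us */
      }
  }

  /* Thread2 on CPU2 -- the evpipe_write side: */
  void thread2 (void)
  {
    pipe_write_skipped = 1;   /* store can linger in CPU2's store buffer */

    if (!pipe_write_wanted)   /* load can still see a stale 0 */
      {
        /* skip the pipe write, trusting ev_run to notice the flag */
      }
  }

If both stores are still in flight when both loads execute, Thread1 blocks and Thread2 skips the wakeup: the lost wakeup described above.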
Here are disassembly dumps from Visual Studio, with the previous macro definition and with the newer one.

Without MemoryBarrier():

--- d:\igor\balau\libev\ev.c ---------------------------------------------------
inline_speed void
evpipe_write (EV_P_ EV_ATOMIC_T *flag)
{
003BFF20  push        ebp
003BFF21  mov         ebp,esp
003BFF23  sub         esp,10h
003BFF26  mov         eax,dword ptr ds:[0079A000h]
003BFF2B  xor         eax,ebp
003BFF2D  mov         dword ptr [ebp-4],eax
003BFF30  push        ebx
003BFF31  mov         ebx,ecx
  ECB_MEMORY_FENCE; /* push out the write before this function was called, acquire flag */
  if (expect_true (*flag))
003BFF33  cmp         dword ptr [edx],0
003BFF36  jne         evpipe_write+80h (03BFFA0h)
    return;

  *flag = 1;
003BFF38  mov         dword ptr [edx],1
  ECB_MEMORY_FENCE_RELEASE; /* make sure flag is visible before the wakeup */

  pipe_write_skipped = 1;
003BFF3E  mov         dword ptr [ebx+0E4h],1
  ECB_MEMORY_FENCE; /* make sure pipe_write_skipped is visible before we check pipe_write_wanted */

  if (pipe_write_wanted)
003BFF48  cmp         dword ptr [ebx+0E0h],0
003BFF4F  je          evpipe_write+80h (03BFFA0h)

It's pretty visible here that the ECB_MEMORY_FENCE macros emitted absolutely no instruction whatsoever, and the writes and reads of pipe_write_skipped and pipe_write_wanted are simple movs and cmps, without any memory bus lock.

With MemoryBarrier():

--- d:\igor\balau\libev\ev.c ---------------------------------------------------
inline_speed void
evpipe_write (EV_P_ EV_ATOMIC_T *flag)
{
0132FF20  push        ebp
0132FF21  mov         ebp,esp
0132FF23  sub         esp,20h
0132FF26  mov         eax,dword ptr ds:[0170A000h]
0132FF2B  xor         eax,ebp
0132FF2D  mov         dword ptr [ebp-4],eax
0132FF30  push        ebx
0132FF31  mov         ebx,ecx
  ECB_MEMORY_FENCE; /* push out the write before this function was called, acquire flag */
0132FF33  xor         ecx,ecx
0132FF35  lea         eax,[ebp-18h]
0132FF38  lock or     dword ptr [eax],ecx
  if (expect_true (*flag))
0132FF3B  cmp         dword ptr [edx],ecx
0132FF3D  jne         evpipe_write+94h (0132FFB4h)
    return;

  *flag = 1;
0132FF3F  mov         dword ptr [edx],1
  ECB_MEMORY_FENCE_RELEASE; /* make sure flag is visible before the wakeup */
0132FF45  lea         eax,[ebp-10h]
0132FF48  lock or     dword ptr [eax],ecx
  pipe_write_skipped = 1;
0132FF4B  mov         dword ptr [ebx+0E4h],1
  ECB_MEMORY_FENCE; /* make sure pipe_write_skipped is visible before we check pipe_write_wanted */
0132FF55  lea         eax,[ebp-8]
0132FF58  lock or     dword ptr [eax],ecx
  if (pipe_write_wanted)
0132FF5B  cmp         dword ptr [ebx+0E0h],ecx
0132FF61  je          evpipe_write+94h (0132FFB4h)

Here, the code comes out a little different. Visual C decided to optimize the "zero" by clearing the ecx register, and subsequently uses it to issue "lock or" on various locations on the stack, to force CPU cache synchronization. The writes and reads of pipe_write_skipped and pipe_write_wanted are still done using simple movs and cmps, but they are now preceded by bus-locking operations, making sure the various cores all have the same view of memory - and since Intel CPUs are pretty scorched-earth when it comes to cache-line purging, a simple lock anywhere will do.
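For reference, that "lock or" on a dummy stack slot is a classic way to get a full CPU fence on 32-bit x86 without an mfence instruction. Here is a hand-written sketch of the same idea, using the documented _InterlockedOr intrinsic; the helper name is mine, purely for illustration:

  #include <intrin.h>

  /* Hypothetical helper, in the spirit of what MemoryBarrier() generated
     above: a lock-prefixed read-modify-write drains the store buffer and
     orders the surrounding loads and stores.  Which location gets locked
     doesn't matter -- a dummy stack slot will do. */
  static __forceinline void full_cpu_fence (void)
  {
    volatile long dummy = 0;
    _InterlockedOr (&dummy, 0);  /* emits a "lock or" instruction */
  }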
So, again, I want to emphasize that volatiles aren't atomics. They don't guarantee CPU cache synchronization; they are merely a contract between the programmer and the compiler not to re-order operations as a result of optimizations. Which is why C11 and C++11 introduced <stdatomic.h> and <atomic> respectively. And Visual C's _ReadWriteBarrier() is the same kind of compiler hint as a volatile: since the 2008 release it no longer issues CPU cache flushes, it simply is a broader volatile. Since Visual Studio 2008, MemoryBarrier() is what should be seen as the equivalent of gcc's __sync_synchronize().
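To illustrate those equivalences, here is how a real CPU fence could be selected per compiler. This is a purely illustrative mapping under a hypothetical macro name - libev's actual ECB_MEMORY_FENCE definitions in ecb.h are more involved than this:

  /* Hypothetical CPU_FENCE macro -- not libev's actual ecb.h code. */
  #if defined _MSC_VER
    #include <windows.h>
    #define CPU_FENCE() MemoryBarrier ()       /* real CPU fence */
    /* note: _ReadWriteBarrier () would be a compiler barrier only */
  #elif defined __GNUC__
    #define CPU_FENCE() __sync_synchronize ()  /* real CPU fence */
  #elif __STDC_VERSION__ >= 201112L && !defined __STDC_NO_ATOMICS__
    #include <stdatomic.h>
    #define CPU_FENCE() atomic_thread_fence (memory_order_seq_cst)
  #endif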
As I said in the preamble, sorry if I sound condescending, but I would rather make sure we are on the same page than leave any doubt around.

--
  Nicolas Noble

On Sat, Jan 4, 2014 at 1:45 AM, Marc Lehmann <schm...@schmorp.de> wrote:

> On Sat, Jan 04, 2014 at 01:35:44AM -0800, Nicolas Noble <nico...@grumpycoder.net> wrote:
> > Basically, if you refer to the MSDN, (from this article, more precisely:
> > http://msdn.microsoft.com/en-us/library/f20w0x5e(v=vs.90).aspx ) the
> > _ReadWriteBarrier() intrinsic is now only a compiler intrinsic that only
>
> This should be fine, as the required variables are marked as volatile, and
> according to this:
>
> http://msdn.microsoft.com/en-us/library/12a04hfd%28v=vs.90%29.aspx
>
> volatile variables, as a compiler extension, already act as memory
> barriers (and the article explicitly says using it to implement locks in
> multithreaded apps etc. is ok).
>
> > For more details, if needed, I can provide a fully-featured reproduction
> > case tomorrow - it's 1:30am right now, and I'm only glad I finally found
> > this one - but here is some pseudo-code that exacerbates and demonstrates
> > the problem:
>
> This would then either be a compiler bug (as the required vars are
> volatile), or another bug unrelated to memory barriers (e.g. in libev
> itself, which would be scary, but looks very unlikely at this point).
>
> > Another workaround I first found, that confirms the problem, is that I've
> > changed pipe_write_wanted to always be 1. Same exact code using the old
> > barrier code, just writing 1 in pipe_write_wanted instead of 0, in order
> > to force writing into the pipe. That should demonstrate that
> > pipe_write_wanted is getting desynchronised between the thread that runs
> > evpipe_write and the main thread that runs ev_run.
>
> Which it shouldn't be, as pipe_write_wanted is also marked as volatile,
> and gets acquire/release semantics.
>
> > Let me know if you have further questions or concerns about this, or if
> > you really want a short reproduction code to demonstrate this - right now
> > my code is a full-blown application that wouldn't be very easy to use as
> > a test case.
>
> I don't really mind making libev slower for visual C to work around bugs
> in the compiler, so I'll gladly just apply your patch, whether it's needed
> in a future version of the compiler or not :)
>
> Thanks for finding this bug, and providing a workaround.
>
> --
>                 The choice of a Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schm...@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\